diff --git a/docs/source/index.rst b/docs/source/index.rst
index 62987ae03..47d728956 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -11,30 +11,57 @@
 spaCy is a library for industrial-strength text processing in Python and
 Cython. It features extremely efficient, up-to-date algorithms, and a rethink
 of how those algorithms should be accessed.

-Most text-processing libraries give you APIs that look like this:
+A typical text-processing API looks something like this:

     >>> import nltk
     >>> nltk.pos_tag(nltk.word_tokenize('''Some string of language.'''))
     [('Some', 'DT'), ('string', 'VBG'), ('of', 'IN'), ('language', 'NN'), ('.', '.')]

-A list of strings is good for poking around, or for printing the annotation to
-evaluate it. But to actually *use* the output, you have to jump through some
-hoops. If you're doing some machine learning, all the strings have to be
-mapped to integers, and you have to save and load the mapping at training and
-runtime. If you want to display mark-up based on the annotation, you have to
-realign the tokens to your original string.
+This API often leaves you with a lot of busy-work. If you're doing some machine
+learning or information extraction, all the strings have to be mapped to
+integers, and you have to save and load the mapping at training and runtime.
+If you want to display mark-up based on the annotation, you have to realign
+the tokens to your original string.

-With spaCy, you should never have to do any string processing at all:
+I've been writing NLP systems for almost ten years now, so I've done these
+things dozens of times. When designing spaCy, I thought carefully about how to
+make the right thing easy.
+
+We begin by initializing a global vocabulary store:

     >>> from spacy.en import EN
-    >>> from spacy.en import feature_names as fn
-    >>> tokens = EN.tokenize('''Some string of language.''')
-    >>> tokens.to_array((fn.WORD, fn.SUFFIX, fn.CLUSTER, fn.POS, fn.LEMMA))
+    >>> EN.load()

-A range of excellent features are pre-computed for you, and by default the
-words are part-of-speech tagged and lemmatized. We do this by default because
-even with these extra processes, spaCy is still several times faster than
-most tokenizers:
+The vocabulary reads in a data file with all sorts of pre-computed lexical
+features. You can load anything you like here, but by default I give you:
+
+* String IDs for the word's string, its prefix, suffix and "shape";
+* Length (in unicode code points);
+* A cluster ID, representing distributional similarity;
+* A cluster ID, representing its typical POS tag distribution;
+* Good-Turing smoothed unigram probability;
+* 64 boolean flags, for assorted orthographic and distributional features.
+
+With so many features pre-computed, you usually don't have to do any string
+processing at all. You give spaCy your string, and tell it to give you either
+a numpy array or a counts dictionary:
+
+    >>> from spacy.en import feature_names as fn
+    >>> tokens = EN.tokenize(u'''Some string of language.''')
+    >>> tokens.to_array((fn.WORD, fn.SUFFIX, fn.CLUSTER))
+    ...
+    >>> tokens.count_by(fn.WORD)
+
+If you do need strings, you can simply iterate over the Tokens object:
+
+    >>> for token in tokens:
+    ...
+
+I mostly use this for debugging and testing.
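+
+To make this concrete, here's a rough sketch of how the counts dictionary
+might feed a simple bag-of-words model. The ``term_frequencies`` helper below
+is just for illustration, and I'm assuming ``count_by`` returns a dictionary
+mapping integer word IDs to frequencies, as described above:
+
+.. code-block:: python
+
+    from __future__ import division
+
+    from spacy.en import EN
+    from spacy.en import feature_names as fn
+
+    EN.load()
+
+    def term_frequencies(string):
+        # Work entirely with integer word IDs; no string processing needed.
+        tokens = EN.tokenize(string)
+        counts = tokens.count_by(fn.WORD)
+        total = sum(counts.values())
+        return {word_id: freq / total for word_id, freq in counts.items()}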
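+
+The numpy array is just as easy to work with: each row is a token, and each
+column holds one of the features you asked for, as an integer ID. Another
+quick sketch, assuming the columns come back in the order you requested them:
+
+.. code-block:: python
+
+    from spacy.en import EN
+    from spacy.en import feature_names as fn
+
+    EN.load()
+
+    tokens = EN.tokenize(u'''Some string of language.''')
+    # One row per token; columns are (word, suffix, cluster) IDs.
+    features = tokens.to_array((fn.WORD, fn.SUFFIX, fn.CLUSTER))
+    # e.g. the cluster IDs for the whole document, ready to feed into a
+    # machine learning library that expects integer-valued features.
+    cluster_ids = features[:, 2]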
+
+spaCy returns these rich Tokens objects much faster than most other tokenizers
+can give you a list of strings --- in fact, spaCy's POS tagger is *4 times
+faster* than CoreNLP's tokenizer:

 +----------+----------+---------------+----------+
 | System   | Tokenize | POS Tag       |          |
@@ -48,157 +75,7 @@ most tokenizers:
 | ZPar     |          | ~1,500s       |          |
 +----------+----------+---------------+----------+

-spaCy is designed to **make the right thing easy**, where the right thing is to:
-
-* **Use rich distributional and orthographic features**. Without these, your model
-  will be very brittle and domain dependent.
-
-* **Compute features per type, not per token**. Because of Zipf's law, you can
-  expect this to be exponentially more efficient.
-
-* **Minimize string processing**, and instead compute with arrays of ID ints.
-
-Tokenization done right
-=======================
-
-Most tokenizers rely on complicated regular expressions. Often, they leave you
-with no way to align the tokens back to the original string --- a vital feature
-if you want to display some mark-up, such as spelling correction. The regular
-expressions also interact, making it hard to accommodate special cases.
-
-spaCy introduces a **novel tokenization algorithm** that's much faster and much
-more flexible:
-
-.. code-block:: python
-
-    def tokenize(string, prefixes={}, suffixes={}, specials={}):
-        '''Sketch of spaCy's tokenization algorithm.'''
-        tokens = []
-        cache = {}
-        for chunk in string.split():
-            # Because of Zipf's law, the cache serves the majority of "chunks".
-            if chunk in cache:
-                tokens.extend(cache[chunk])
-                continue
-            key = chunk
-
-            subtokens = []
-            # Process a chunk by splitting off prefixes e.g. ( " { and suffixes e.g. , . :
-            # If we split one off, check whether we're left with a special case,
-            # e.g. contractions (can't, won't, etc), emoticons, abbreviations, etc.
-            # This makes the tokenization easy to update and customize.
-            while chunk:
-                prefix, chunk = _consume_prefix(chunk, prefixes)
-                if prefix:
-                    subtokens.append(prefix)
-                    if chunk in specials:
-                        subtokens.extend(specials[chunk])
-                        break
-                suffix, chunk = _consume_suffix(chunk, suffixes)
-                if suffix:
-                    subtokens.append(suffix)
-                    if chunk in specials:
-                        subtokens.extend(specials[chunk])
-                        break
-            cache[key] = subtokens
-            tokens.extend(subtokens)
-        return tokens
-
-Your data is going to have its own quirks, so it's really useful to have
-a tokenizer you can easily control. To see the limitations of the standard
-regex-based approach, check out `CMU's recent work on tokenizing tweets `_.
-Despite a lot of careful attention, they can't handle all of their
-known emoticons correctly --- doing so would interfere with the way they
-process other punctuation. This isn't a problem for spaCy: we just add them
-all to the special tokenization rules.
-
-
-Comparison with NLTK
-====================
-
-`NLTK `_ provides interfaces to a wide variety of NLP
-tools and resources, and its own implementations of a few algorithms. It comes
-with comprehensive documentation, and a book introducing concepts in NLP. For
-these reasons, it's very widely known. However, if you're trying to make money
-or do cutting-edge research, NLTK is not a good choice.
-
-The `list of stuff in NLTK `_ looks impressive,
-but almost none of it is useful for real work. You're not going to make any money,
-or do top research, by using the NLTK chat bots, theorem provers, toy CCG implementation,
-etc. Most of NLTK is there to assist in the explanation of ideas in computational
-linguistics, at roughly an undergraduate level.
-
-But it also claims to support serious work, by wrapping external tools.
-
-In a pretty well-known essay, Joel Spolsky discusses the pain of dealing with
-`leaky abstractions `_.
-An abstraction tells you to not care about implementation
-details, but sometimes the implementation matters after all. When it
-does, you have to waste time revising your assumptions.
-
-NLTK's wrappers call external tools via subprocesses, and wrap this up so
-that it looks like a native API. This abstraction leaks *a lot*. The system
-calls impose far more overhead than a normal Python function call, which makes
-the most natural way to program against the API infeasible.
-
-
-Case study: POS tagging
------------------------
-
-Here's a quick comparison of the following POS taggers:
-
-* **Stanford (CLI)**: The Stanford POS tagger, invoked once as a batch process
-  from the command-line;
-* **nltk.tag.stanford**: The Stanford tagger, invoked document-by-document via
-  NLTK's wrapper;
-* **nltk.pos_tag**: NLTK's own POS tagger, invoked document-by-document;
-* **spacy.en.pos_tag**: spaCy's POS tagger, invoked document-by-document.
-
-+-------------------+-------------+--------+
-| System            | Speed (w/s) | % Acc. |
-+-------------------+-------------+--------+
-| spaCy             | 107,000     | 96.7   |
-+-------------------+-------------+--------+
-| Stanford (CLI)    | 8,000       | 96.7   |
-+-------------------+-------------+--------+
-| nltk.pos_tag      | 543         | 94.0   |
-+-------------------+-------------+--------+
-| nltk.tag.stanford | 209         | 96.7   |
-+-------------------+-------------+--------+
-
-Experimental details TODO. Three things are apparent from this comparison:
-
-1. The native NLTK tagger, nltk.pos_tag, is both slow and inaccurate;
-
-2. Calling the Stanford tagger document-by-document via NLTK is **40x** slower
-   than invoking the model once as a batch process, via the command-line;
-
-3. spaCy is over 10x faster than the Stanford tagger, even when called
-   **document-by-document**.
-
-The problem is that NLTK simply wraps the command-line
-interfaces of these tools, so communication is via a subprocess. NLTK does not
-even hold open a pipe for you --- the model is reloaded, again and again.
-
-To use the wrapper effectively, you should batch up your text as much as possible.
-This probably isn't how you would like to structure your pipeline, and you
-might not be able to batch up much text at all, e.g. if serving a single
-request means processing a single document.
-
-Technically, NLTK does give you Python functions to access lots of different
-systems --- but you can't use them as you would expect to use a normal Python
-function. The abstraction leaks.
-
-Here's the bottom line: the Stanford tools are written in Java, so using them
-from Python sucks. You shouldn't settle for this. It's a problem that springs
-purely from the tooling, rather than the domain.
-
-Summary
--------
-
-NLTK is a well-known Python library for NLP, but for the important bits, you
-don't get actual Python modules. You get wrappers which throw to external
-tools, via subprocesses. This is not at all the same thing.
-
-spaCy is implemented in Cython, just like numpy, scikit-learn, lxml and other
-high-performance Python libraries. So you get a native Python API, but the
-performance you expect from a program written in C.

 .. toctree::