diff --git a/docs/source/howworks.rst b/docs/source/howworks.rst
new file mode 100644
index 000000000..a5669cf95
--- /dev/null
+++ b/docs/source/howworks.rst
@@ -0,0 +1,256 @@
How spaCy Works
===============

The following are some hasty preliminary notes on how spaCy works. The short
story is, there are no new killer algorithms. The way that the tokenizer works
is novel and a bit neat, and the parser has a new feature set, but otherwise
the key algorithms are well known in the recent literature.

Some might also wonder how I get Python code to run so fast. I don't --- spaCy
is written in `Cython`_, an optionally statically-typed language that compiles
to C or C++, which is then loaded as a C extension module. This makes it `easy
to achieve the performance of native C code`_, while still allowing the use of
Python language features via the Python C API. The Python unicode library was
particularly useful to me. I think it would have been much more difficult to
write spaCy in another language.

.. _Cython: http://cython.org/

.. _easy to achieve the performance of native C code: https://honnibal.wordpress.com/2014/10/21/writing-c-in-cython/

Tokenizer and Lexicon
---------------------

Tokenization is the task of splitting a string into meaningful pieces, called
tokens, which you can then compute with. In practice, the task is usually to
match the tokenization performed in some treebank, or other corpus. If we want
to apply a tagger, entity recogniser, parser etc., then we want our run-time
text to match the training conventions. If we want to use a model that's been
trained to expect "isn't" to be split into two tokens, ["is", "n't"], then
that's how we need to prepare our data.

In order to train spaCy's models with the best data available, I therefore
tokenize English according to the Penn Treebank scheme. It's not perfect, but
it's what everybody is using, and it's good enough.

What we don't do
################

The Penn Treebank was distributed with a script called tokenizer.sed, which
tokenizes ASCII newswire text roughly according to the Penn Treebank standard.
Almost all tokenizers are based on these regular expressions, with various
updates to account for unicode characters, and the fact that it's no longer
1986 --- today's text has URLs, emails, emoji, etc.

Usually, the resulting regular expressions are applied in multiple passes,
which is quite inefficient. Often no care is taken to preserve indices into
the original string. If you lose these indices, it'll be difficult to
calculate mark-up based on your annotations.

Tokenizer Algorithm
###################

spaCy's tokenizer assumes that no tokens will cross whitespace --- there will
be no multi-word tokens. If we want these, we can post-process the
token-stream later, merging as necessary. This assumption allows us to deal
only with small chunks of text. We can cache the processing of these, and
simplify our expressions somewhat.

Here is what the outer loop would look like in Python. (You can see the
production implementation, in Cython, here.)

.. code:: python

    cache = {}  # maps a whitespace-delimited substring to its tokens

    def tokenize(text):
        tokens = []
        for substring in text.split(' '):
            if substring in cache:
                tokens.extend(cache[substring])
            else:
                subtokens = _tokenize_substring(substring)
                tokens.extend(subtokens)
                cache[substring] = subtokens
        return tokens
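As an aside, the sketch above discards the information you would need to
recover character offsets, which the previous section pointed out makes it
hard to compute mark-up from annotations. A variant of the same loop that
tracks offsets into the original string might look like the following. This is
purely illustrative (tokenize_with_offsets is not spaCy's API), and it assumes
each substring's sub-tokens concatenate back to the substring they came from.

.. code:: python

    cache = {}

    def tokenize_with_offsets(text):
        # Illustrative only; returns (token, start) pairs, where start
        # indexes into the original text.
        tokens = []
        offset = 0
        for substring in text.split(' '):
            if substring not in cache:
                cache[substring] = _tokenize_substring(substring)
            start = offset
            for subtoken in cache[substring]:
                tokens.append((subtoken, start))
                # Assumes the sub-tokens concatenate back to the substring.
                start += len(subtoken)
            offset += len(substring) + 1  # skip past the space we split on
        return tokens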
The actual work is performed in _tokenize_substring. For this, I divide the
tokenization rules into three pieces:

1. A prefixes expression, which matches from the start of the string;
2. A suffixes expression, which matches from the end of the string;
3. A special-cases table, which matches the whole string.

The algorithm then proceeds roughly like this (consider this like pseudo-code;
this was written quickly and has not been executed):

.. code:: python

    # Tokens which can be attached at the beginning or end of another
    prefix_re = _make_re([",", '"', '(', ...])
    suffix_re = _make_re([",", "'", ":", "'s", ...])

    # Contractions etc. are simply enumerated, since they're a finite set. We
    # can also specify anything we like here, which is nice --- different data
    # has different quirks, so we want to be able to add ad hoc exceptions.
    special_cases = {
        "can't": ("ca", "n't"),
        "won't": ("wo", "n't"),
        "he'd've": ("he", "'d", "'ve"),
        ...
        ":)": (":)",)  # We can add any arbitrary thing to this list.
    }

    def _tokenize_substring(substring):
        prefixes = []
        suffixes = []
        while substring and substring not in special_cases:
            prefix, substring = _apply_re(substring, prefix_re)
            if prefix:
                prefixes.append(prefix)
            else:
                suffix, substring = _apply_re(substring, suffix_re)
                if suffix:
                    suffixes.append(suffix)
                else:
                    break
        middle = special_cases.get(substring, (substring,)) if substring else ()
        return prefixes + list(middle) + list(reversed(suffixes))

This procedure splits off tokens from the start and end of the string, at each
point checking whether the remaining string is in our special-cases table. If
it is, we stop splitting, and return the tokenization at that point.

The advantage of this design is that the prefixes, suffixes and special-cases
can be declared separately, in easy-to-understand files. If a new entry is
added to the special-cases, you can be sure that it won't have some unforeseen
consequence for a complicated regular-expression grammar.

Coupling the Tokenizer and Lexicon
##################################

As mentioned above, the tokenizer is designed to support easy caching. If all
we were caching were the matched substrings, this would not be so advantageous.
Instead, what we do is create a struct which houses all of our lexical
features, and cache *that*. The tokens are then simply pointers to these rich
lexical types.

In a sample of text, vocabulary size grows much more slowly than the word
count. So any computations we can perform once per vocabulary item, and then
reuse across the running text, are very efficient.
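In plain Python terms, the idea looks something like the sketch below. The
names LexicalEntry and get_lexeme are invented for illustration; in spaCy the
equivalent is a C-level struct managed from Cython, with many more features.

.. code:: python

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class LexicalEntry:
        # Illustrative stand-in for the C struct of lexical features;
        # not spaCy's API.
        orth: str
        lower: str
        is_alpha: bool
        is_digit: bool
        like_url: bool

    _lexicon = {}

    def get_lexeme(string):
        # Compute the lexical features once per vocabulary item...
        entry = _lexicon.get(string)
        if entry is None:
            entry = LexicalEntry(
                orth=string,
                lower=string.lower(),
                is_alpha=string.isalpha(),
                is_digit=string.isdigit(),
                like_url=string.startswith(("http://", "https://", "www.")),
            )
            _lexicon[string] = entry
        return entry

    # ...so that tokens are just references to shared, pre-computed entries.
    tokens = [get_lexeme(s) for s in "the cat sat on the mat".split(' ')]
    assert tokens[0] is tokens[4]  # both "the" tokens share one entry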
Part-of-speech Tagger
---------------------

.. _how to write a good part of speech tagger: https://honnibal.wordpress.com/2013/09/11/a-good-part-of-speechpos-tagger-in-about-200-lines-of-python/

In 2013, I wrote a blog post describing `how to write a good part of speech
tagger`_. My recommendation then was to use greedy decoding with the averaged
perceptron. I think this is still the best approach, so it's what I
implemented in spaCy.

The tutorial also recommends the use of Brown cluster features and case
normalization features, as these make the model more robust and
domain-independent. spaCy's tagger makes heavy use of these features.

Dependency Parser
-----------------

.. _2014 blog post: https://honnibal.wordpress.com/2013/12/18/a-simple-fast-algorithm-for-natural-language-dependency-parsing/

The parser uses the algorithm described in my `2014 blog post`_. This
algorithm, shift-reduce dependency parsing, is becoming widely adopted due to
its compelling speed/accuracy trade-off.

Here are some quick details about spaCy's take on this, for those who happen
to know these models well. I'll write up a better description shortly.

1. I use greedy decoding, not beam search;
2. I use the arc-eager transition system;
3. I use the Goldberg and Nivre (2012) dynamic oracle;
4. I use the non-monotonic update from my CoNLL 2013 paper (Honnibal, Goldberg
   and Johnson 2013).

So far, this is exactly the configuration from the CoNLL 2013 paper, which
scored 91.0. So how have I gotten it to 92.4? The following tweaks:

1. I use Brown cluster features --- these help a lot;
2. I redesigned the feature set. I've long known that the Zhang and Nivre
   (2011) feature set was suboptimal, but a few features don't make a very
   compelling publication. Still, they're important.
3. When I do the dynamic oracle training, I also make the update
   cost-sensitive: if the oracle determines that the move the parser took has
   a cost of N, then the weights for the gold class are incremented by N, and
   the weights for the predicted class are decremented by N. This only made a
   small (0.1-0.2%) difference.

Implementation
##############

I don't do anything algorithmically novel to improve the efficiency of the
parser. However, I was very careful in the implementation.

A greedy shift-reduce parser with a linear model boils down to the following
loop:

.. code:: python

    def parse(words, model, feature_funcs, n_classes):
        state = init_state(words)
        for _ in range(len(words) * 2):
            features = [templ(state) for templ in feature_funcs]
            scores = [0 for _ in range(n_classes)]
            for feat in features:
                weights = model[feat]
                for i, weight in enumerate(weights):
                    scores[i] += weight
            class_ = max(enumerate(scores), key=lambda item: item[1])[0]
            transition(state, class_)

The parser makes 2N transitions for a sentence of length N. In order to select
each transition, it extracts a vector of K features from the state. Each
feature is used as a key into a hash table managed by the model, and maps to a
vector of weights, of length C. We then add that weight vector to the scores
vector we are building for that instance.

The inner-most loop here is not so bad: we only have a few dozen classes, so
it's just a short loop over contiguous weights. Both of the vectors are in the
cache, so this is a snack to a modern CPU.

The bottleneck in this algorithm is the 2NK look-ups into the hash table that
we must make, as these almost always have to hit main memory. The feature set
is enormously large, because all of our features are one-hot boolean
indicators. Some of the features will be common, so they'll lurk around in the
CPU's cache hierarchy. But a lot of them won't be, and accessing main memory
takes a lot of cycles.

.. _Jeff Preshing's excellent post: http://preshing.com/20130107/this-hash-table-is-faster-than-a-judy-array/

I used to use the Google dense_hash_map implementation. This seemed a solid
choice: it came from a big brand, it was in C++, and it seemed very
complicated. Later, I read `Jeff Preshing's excellent post`_ on open-addressing
with linear probing. This really spoke to me. I had assumed that a fast hash
table implementation would necessarily be very complicated, but no --- this is
another situation where the simple strategy wins.

I've packaged my Cython implementation separately from spaCy, in the package
`preshed`_ --- for "pre-hashed", but also as a nod to Preshing. I've also taken
great care over the feature extraction and perceptron code, which I'm
distributing in a package named `thinc`_ (since it's for learning very sparse
models with Cython).

.. _preshed: https://github.com/syllog1sm/preshed

.. _thinc: https://github.com/honnibal/thinc
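For readers who haven't seen the technique, here is a toy Python sketch of
open addressing with linear probing. It is only meant to illustrate the
strategy; preshed's Cython implementation is a separate, more careful piece of
code.

.. code:: python

    class LinearProbingTable:
        # Toy open-addressing hash table with linear probing; None marks an
        # empty slot. Not preshed's implementation.
        def __init__(self, size=8):
            self.size = size
            self.filled = 0
            self.keys = [None] * size
            self.values = [None] * size

        def _find_slot(self, key):
            i = hash(key) % self.size
            while self.keys[i] is not None and self.keys[i] != key:
                i = (i + 1) % self.size  # collision: probe the next slot
            return i

        def get(self, key, default=None):
            i = self._find_slot(key)
            return self.values[i] if self.keys[i] == key else default

        def set(self, key, value):
            i = self._find_slot(key)
            if self.keys[i] is None:
                self.keys[i] = key
                self.filled += 1
            self.values[i] = value
            if self.filled * 4 > self.size * 3:  # grow once load passes 0.75
                self._resize()

        def _resize(self):
            items = [(k, v) for k, v in zip(self.keys, self.values)
                     if k is not None]
            self.size *= 2
            self.filled = 0
            self.keys = [None] * self.size
            self.values = [None] * self.size
            for key, value in items:
                self.set(key, value)

    table = LinearProbingTable()
    table.set("cat", 3)
    assert table.get("cat") == 3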
By the way: from comparing notes with a few people, it seems common to
implement linear models in a way that's suboptimal for multi-class
classification. The mistake is to store in the hash table one weight per
(feature, class) pair, rather than mapping the feature to a vector of weights,
for all of the classes. This is bad because it means you need to hit the table
C times, one per class, as you always need to evaluate a feature against all
of the classes. In the case of the parser, this means the hash table is
accessed 2NKC times, instead of the 2NK times needed if each feature maps to a
weights vector. You should also be careful to store the weights contiguously
in memory --- you don't want a linked list here. I use a block-sparse format,
because my problems tend to have a few dozen classes.
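To make the contrast concrete, here is a schematic Python version of the two
layouts. Plain dictionaries and lists stand in for the hash table and the
weight vectors; the names are illustrative, not spaCy's.

.. code:: python

    N_CLASSES = 40  # a few dozen classes, as in the parser

    # Layout A: one entry per (feature, class) pair, so C hash look-ups per
    # feature.
    pair_weights = {}  # (feature, class) -> float

    def score_pairs(features):
        scores = [0.0] * N_CLASSES
        for feat in features:
            for clas in range(N_CLASSES):
                scores[clas] += pair_weights.get((feat, clas), 0.0)
        return scores

    # Layout B: one entry per feature, mapping to a contiguous vector of C
    # weights, so a single hash look-up per feature and a cheap in-cache loop.
    feature_weights = {}  # feature -> list of N_CLASSES floats

    def score_vectors(features):
        scores = [0.0] * N_CLASSES
        for feat in features:
            weights = feature_weights.get(feat)
            if weights is not None:
                for clas, weight in enumerate(weights):
                    scores[clas] += weight
        return scores

With a few dozen classes, the second layout turns 2NKC scattered look-ups into
2NK, which is exactly the saving described above.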