From ea19850a69291df678aefc4bab4deb7a143aa42b Mon Sep 17 00:00:00 2001
From: Matthew Honnibal
Date: Tue, 2 Dec 2014 04:39:12 +1100
Subject: [PATCH] * Add tokenizer section

---
 docs/source/index.rst | 74 ++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 69 insertions(+), 5 deletions(-)

diff --git a/docs/source/index.rst b/docs/source/index.rst
index dbadd9fc3..b0dd08417 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -8,11 +8,11 @@ spaCy NLP Tokenizer and Lexicon
 ================================
 
 spaCy is a library for industrial-strength NLP in Python and Cython. It
-assumes that NLP is mostly about solving machine learning problems, and that
+assumes that NLP is mostly about solving large machine learning problems, and that
 solving these problems is mostly about feature extraction. So, spaCy helps you
-do feature extraction --- it helps you represent a linguistic context as
-a vector of numbers. It's also a great way to create an inverted index,
-particularly if you want to index documents on fancier properties.
+do feature extraction --- it includes an excellent set of distributional and
+orthographic features, memoizes them efficiently, and maps strings to
+consecutive integer values.
 
 For commercial users, a trial license costs $0, with a one-time license fee of
 $1,000 to use spaCy in production. For non-commercial users, a GPL license is
@@ -20,6 +20,70 @@ available. To quickly get the gist of the license terms, check out the license
 user stories.
 
+
+Tokenization done right
+=======================
+
+Most tokenizers rely on complicated regular expressions. Often, they give you
+no way to align the tokens back to the original string --- a vital feature if
+you want to display mark-up, such as spelling corrections. The regular
+expressions also interact, making it hard to accommodate special cases.
+
+spaCy introduces a **novel tokenization algorithm** that's much faster and much
+more flexible:
+
+.. code-block:: python
+
+    def tokenize(string, prefixes={}, suffixes={}, specials={}):
+        '''Sketch of spaCy's tokenization algorithm.'''
+        tokens = []
+        cache = {}
+        for chunk in string.split():
+            # Because of Zipf's law, the cache serves the majority of "chunks".
+            if chunk in cache:
+                tokens.extend(cache[chunk])
+                continue
+            key = chunk
+
+            subtokens = []
+            suffixes_seen = []
+            # Process a chunk by splitting off prefixes e.g. ( " { and suffixes e.g. , . :
+            # If we split one off, check whether we're left with a special-case,
+            # e.g. contractions (can't, won't, etc), emoticons, abbreviations, etc.
+            # This makes the tokenization easy to update and customize.
+            while chunk:
+                prefix, chunk = _consume_prefix(chunk, prefixes)
+                if prefix:
+                    subtokens.append(prefix)
+                    if chunk in specials:
+                        subtokens.extend(specials[chunk])
+                        break
+                suffix, chunk = _consume_suffix(chunk, suffixes)
+                if suffix:
+                    suffixes_seen.append(suffix)
+                    if chunk in specials:
+                        subtokens.extend(specials[chunk])
+                        break
+                if not prefix and not suffix:
+                    # Nothing left to split off: the remainder is a single token.
+                    subtokens.append(chunk)
+                    break
+            # Suffixes were split off right-to-left; restore their surface order.
+            subtokens.extend(reversed(suffixes_seen))
+            cache[key] = subtokens
+            tokens.extend(subtokens)
+        return tokens
+
+Your data is going to have its own quirks, so it's really useful to have
+a tokenizer you can easily control. To see the limitations of the standard
+regex-based approach, check out `CMU's recent work on tokenizing tweets `_.
+Despite a lot of careful attention, they can't handle all of their known
+emoticons correctly --- doing so would interfere with the way they process
+other punctuation. This isn't a problem for spaCy: we just add them all to the
+special tokenization rules.
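+
+The helpers ``_consume_prefix`` and ``_consume_suffix`` are left undefined in the
+sketch above. Purely for illustration, here is one minimal way they could be
+filled in, together with a toy rule set (these are not spaCy's actual prefix,
+suffix and special-case tables):
+
+.. code-block:: python
+
+    def _consume_prefix(chunk, prefixes):
+        # Peel off one leading character if it is a known prefix, e.g. ( " {
+        if chunk and chunk[0] in prefixes:
+            return chunk[0], chunk[1:]
+        return None, chunk
+
+    def _consume_suffix(chunk, suffixes):
+        # Peel off one trailing character if it is a known suffix, e.g. , . :
+        if chunk and chunk[-1] in suffixes:
+            return chunk[-1], chunk[:-1]
+        return None, chunk
+
+    specials = {
+        "can't": ["ca", "n't"],   # contraction split into two tokens
+        ":-)": [":-)"],           # emoticon kept whole, even next to punctuation
+    }
+
+    print(tokenize("(Wait, you can't... :-)!)",
+                   prefixes=set('(["'), suffixes=set(')]".,;:!'),
+                   specials=specials))
+    # ['(', 'Wait', ',', 'you', 'ca', "n't", '.', '.', '.', ':-)', '!', ')']
+
+As in the sketch, the special cases are only consulted once a prefix or suffix
+has been split off, and the rules are plain Python data: adding a new emoticon
+or abbreviation is a one-line change.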
+
+spaCy's tokenizer is also incredibly efficient:
+
++--------+---------------+--------------+
+| System | Tokens/second | Speed Factor |
++========+===============+==============+
+| NLTK   | 89 000        | 1.00         |
++--------+---------------+--------------+
+| spaCy  | 3 093 000     | 38.30        |
++--------+---------------+--------------+
+
+spaCy can create an inverted index of the 1.8 billion word Gigaword corpus,
+keyed by lemmas, in under half an hour --- on a MacBook Air.
+
 Unique Lexicon-centric design
 =============================
 
@@ -114,7 +178,7 @@ Here's a quick comparison of the following POS taggers:
 | nltk.tag.stanford |         209 |   96.7 |
 +-------------------+-------------+--------+
 
-Experimental details here. Three things are apparent from this comparison:
+Experimental details TODO. Three things are apparent from this comparison:
 
 1. The native NLTK tagger, nltk.pos_tag, is both slow and inaccurate;