From 77dd7a212a3b0012047a47913345cf6d3f4f0da9 Mon Sep 17 00:00:00 2001
From: Matthew Honnibal
Date: Mon, 15 Dec 2014 09:19:29 +1100
Subject: [PATCH] * More thoughts on intro

---
 docs/source/index.rst | 89 +++++++++++++++++--------------------------
 1 file changed, 34 insertions(+), 55 deletions(-)

diff --git a/docs/source/index.rst b/docs/source/index.rst
index fb738aa32..62987ae03 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -4,59 +4,49 @@
    contain the root `toctree` directive.

 ================================
-spaCy NLP Tokenizer and Lexicon
+spaCy: Industrial-strength NLP
 ================================

-spaCy is a library for industrial-strength NLP in Python and Cython. spaCy's
-take on NLP is that it's mostly about feature extraction --- that's the part
-that's specific to NLP, so that's what an NLP library should focus on.
+spaCy is a library for industrial-strength text processing in Python and Cython.
+It features extremely efficient, up-to-date algorithms, and a rethink of how those
+algorithms should be accessed.

-spaCy also believes that for NLP, **efficiency is critical**. If you're
-running batch jobs, you probably have an enormous amount of data; if you're
-serving requests one-by-one, you want lower latency and fewer servers. Even if
-you're doing exploratory research on relatively small samples, you should still
-value efficiency, because it means you can run more experiments.
+Most text-processing libraries give you APIs that look like this:

-Depending on the task, spaCy is between 10 and 200 times faster than NLTK,
-often with much better accuracy. See Benchmarks for details, and
-Why is spaCy so fast? for a discussion of the algorithms and implementation
-that makes this possible.
+    >>> import nltk
+    >>> nltk.pos_tag(nltk.word_tokenize('''Some string of language.'''))
+    [('Some', 'DT'), ('string', 'VBG'), ('of', 'IN'), ('language', 'NN'), ('.', '.')]

-+---------+----------+-------------+----------+
-| System  | Tokenize | --> Counts  | --> Stem |
-+---------+----------+-------------+----------+
-| spaCy   | 1m42s    | 1m59s       | 1m59s    |
-+---------+----------+-------------+----------+
-| NLTK    | 20m2s    | 28m24s      | 52m28    |
-+---------+----------+-------------+----------+
+A list of strings is good for poking around, or for printing the annotation to
+evaluate it. But to actually *use* the output, you have to jump through some
+hoops. If you're doing some machine learning, all the strings have to be
+mapped to integers, and you have to save and load the mapping at training and
+runtime. If you want to display mark-up based on the annotation, you have to
+realign the tokens to your original string.

-Times for 100m words of text.
-
-
-Unique Lexicon-centric design
-=============================
-
-spaCy helps you build models that generalise better, by making it easy to use
-more robust features. Instead of a list of strings, the tokenizer returns
-references to rich lexical types. Features which ask about the word's Brown cluster,
-its typical part-of-speech tag, how it's usually cased etc require no extra effort:
+With spaCy, you should never have to do any string processing at all:

     >>> from spacy.en import EN
-    >>> from spacy.feature_names import *
-    >>> feats = (
-        SIC,      # ID of the original word form
-        STEM,     # ID of the stemmed word form
-        CLUSTER,  # ID of the word's Brown cluster
-        IS_TITLE, # Was the word title-cased?
-        POS_TYPE  # A cluster ID describing what POS tags the word is usually assigned
-    )
-    >>> tokens = EN.tokenize(u'Split words, punctuation, emoticons etc.! ^_^')
-    >>> tokens.to_array(feats)[:5]
-    array([[ 1, 2, 3, 4],
-           [...],
-           [...],
-           [...]])
+    >>> from spacy.en import feature_names as fn
+    >>> tokens = EN.tokenize('''Some string of language.''')
+    >>> tokens.to_array((fn.WORD, fn.SUFFIX, fn.CLUSTER, fn.POS, fn.LEMMA))

+A range of excellent features are pre-computed for you, and by default the
+words are part-of-speech tagged and lemmatized. We do this by default because
+even with these extra processes, spaCy is still several times faster than
+most tokenizers:
+
++----------+----------+---------------+
+| System   | Tokenize | POS Tag       |
++----------+----------+---------------+
+| spaCy    | 37s      | 98s           |
++----------+----------+---------------+
+| NLTK     | 626s     | 44,310s (12h) |
++----------+----------+---------------+
+| CoreNLP  | 420s     | 1,300s (22m)  |
++----------+----------+---------------+
+| ZPar     |          | ~1,500s       |
++----------+----------+---------------+


 spaCy is designed to **make the right thing easy**, where the right thing is to:
@@ -67,10 +57,6 @@ spaCy is designed to **make the right thing easy**, where the right thing is to:
   expect this to be exponentially more efficient.

 * **Minimize string processing**, and instead compute with arrays of ID ints.
-
-For the current list of lexical features, see `Lexical Features`_.
-
-.. _lexical features: features.html

 Tokenization done right
 =======================
@@ -123,13 +109,6 @@ known emoticons correctly --- doing so would interfere with the way they
 process other punctuation. This isn't a problem for spaCy: we just add them
 all to the special tokenization rules.

-spaCy's tokenizer is also incredibly efficient:
-
-spaCy can create an inverted index of the 1.8 billion word Gigaword corpus,
-in under half an hour --- on a Macbook Air. See the `inverted
-index tutorial`_.
-
-.. _inverted index tutorial: index_tutorial.html

 Comparison with NLTK
 ====================
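
The "mapped to integers" point in the new intro is the crux of the API argument: with a
string-based tagger, the integer bookkeeping is left to the user, and the mapping has to
be saved at training time and reloaded at runtime. A minimal sketch of that bookkeeping,
for illustration only (plain Python around the same nltk calls quoted in the new text;
the ``string_to_id`` dict, ``feature_id`` helper and ``vocab.json`` file name are
hypothetical, not part of NLTK or spaCy, and the snippet assumes NLTK's tokenizer and
tagger models are installed)::

    # Illustrative sketch only: the manual string-to-integer bookkeeping that
    # a string-based tagging API leaves to the user.
    import json

    import nltk

    string_to_id = {}

    def feature_id(string):
        # Assign the next free integer the first time a string is seen.
        if string not in string_to_id:
            string_to_id[string] = len(string_to_id)
        return string_to_id[string]

    # Training time: turn each (word, tag) string pair into integer IDs.
    tagged = nltk.pos_tag(nltk.word_tokenize("Some string of language."))
    features = [(feature_id(word), feature_id(tag)) for word, tag in tagged]

    # The mapping has to be persisted and reloaded at runtime, or the IDs
    # computed later will not line up with the ones used during training.
    with open("vocab.json", "w") as file_:
        json.dump(string_to_id, file_)

The ``to_array`` example in the new text returns integer IDs directly, which is the step
this kind of glue code otherwise has to provide.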