* More thoughts on intro

Matthew Honnibal 2014-12-15 09:19:29 +11:00
parent 792802b2b9
commit 77dd7a212a
1 changed file with 34 additions and 55 deletions

@@ -4,59 +4,49 @@
    contain the root `toctree` directive.
 
 ================================
-spaCy NLP Tokenizer and Lexicon
+spaCy: Industrial-strength NLP
 ================================
 
-spaCy is a library for industrial-strength NLP in Python and Cython. spaCy's
-take on NLP is that it's mostly about feature extraction --- that's the part
-that's specific to NLP, so that's what an NLP library should focus on.
-
-spaCy also believes that for NLP, **efficiency is critical**. If you're
-running batch jobs, you probably have an enormous amount of data; if you're
-serving requests one-by-one, you want lower latency and fewer servers. Even if
-you're doing exploratory research on relatively small samples, you should still
-value efficiency, because it means you can run more experiments.
-
-Depending on the task, spaCy is between 10 and 200 times faster than NLTK,
-often with much better accuracy. See Benchmarks for details, and
-Why is spaCy so fast? for a discussion of the algorithms and implementation
-that makes this possible.
-
-+---------+----------+-------------+----------+
-| System  | Tokenize | --> Counts  | --> Stem |
-+---------+----------+-------------+----------+
-| spaCy   | 1m42s    | 1m59s       | 1m59s    |
-+---------+----------+-------------+----------+
-| NLTK    | 20m2s    | 28m24s      | 52m28    |
-+---------+----------+-------------+----------+
-
-Times for 100m words of text.
-
-Unique Lexicon-centric design
-=============================
-
-spaCy helps you build models that generalise better, by making it easy to use
-more robust features. Instead of a list of strings, the tokenizer returns
-references to rich lexical types. Features which ask about the word's Brown cluster,
-its typical part-of-speech tag, how it's usually cased etc require no extra effort:
-
->>> from spacy.en import EN
->>> from spacy.feature_names import *
->>> feats = (
-        SIC,      # ID of the original word form
-        STEM,     # ID of the stemmed word form
-        CLUSTER,  # ID of the word's Brown cluster
-        IS_TITLE, # Was the word title-cased?
-        POS_TYPE  # A cluster ID describing what POS tags the word is usually assigned
-    )
->>> tokens = EN.tokenize(u'Split words, punctuation, emoticons etc.! ^_^')
->>> tokens.to_array(feats)[:5]
-array([[ 1, 2, 3, 4],
-       [...],
-       [...],
-       [...]])
-
-A range of excellent features are pre-computed for you, and by default the
-words are part-of-speech tagged and lemmatized. We do this by default because
-even with these extra processes, spaCy is still several times faster than
-most tokenizers:
-
-+----------+----------+---------------+----------+
-| System   | Tokenize | POS Tag       |          |
-+----------+----------+---------------+----------+
-| spaCy    | 37s      | 98s           |          |
-+----------+----------+---------------+----------+
-| NLTK     | 626s     | 44,310s (12h) |          |
-+----------+----------+---------------+----------+
-| CoreNLP  | 420s     | 1,300s (22m)  |          |
-+----------+----------+---------------+----------+
-| ZPar     |          | ~1,500s       |          |
-+----------+----------+---------------+----------+
+spaCy is a library for industrial-strength text processing in Python and Cython.
+It features extremely efficient, up-to-date algorithms, and a rethink of how those
+algorithms should be accessed.
+
+Most text-processing libraries give you APIs that look like this:
+
+>>> import nltk
+>>> nltk.pos_tag(nltk.word_tokenize('''Some string of language.'''))
+[('Some', 'DT'), ('string', 'VBG'), ('of', 'IN'), ('language', 'NN'), ('.', '.')]
+
+A list of strings is good for poking around, or for printing the annotation to
+evaluate it. But to actually *use* the output, you have to jump through some
+hoops. If you're doing some machine learning, all the strings have to be
+mapped to integers, and you have to save and load the mapping at training and
+runtime. If you want to display mark-up based on the annotation, you have to
+realign the tokens to your original string.
+
+With spaCy, you should never have to do any string processing at all:
+
+>>> from spacy.en import EN
+>>> from spacy.en import feature_names as fn
+>>> tokens = EN.tokenize('''Some string of language.''')
+>>> tokens.to_array((fn.WORD, fn.SUFFIX, fn.CLUSTER, fn.POS, fn.LEMMA))
 
 spaCy is designed to **make the right thing easy**, where the right thing is to:
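The added paragraph's complaint about string-based APIs is concrete: with a list of string pairs like the NLTK output above, the integer mapping and its persistence are left to you. As a minimal sketch of that bookkeeping, for illustration only (the ``vocab`` dict and the ``vocab.json`` file name are hypothetical, not part of NLTK or spaCy)::

    import json

    import nltk

    vocab = {}  # string -> integer ID, maintained by hand

    def to_ids(text):
        # Tag the text, then replace every string with an integer ID.
        ids = []
        for word, tag in nltk.pos_tag(nltk.word_tokenize(text)):
            for string in (word, tag):
                if string not in vocab:
                    vocab[string] = len(vocab)
            ids.append((vocab[word], vocab[tag]))
        return ids

    train_ids = to_ids("Some string of language.")

    # The mapping must be saved at training time, reloaded at runtime,
    # and kept consistent between the two.
    with open('vocab.json', 'w') as f:
        json.dump(vocab, f)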
@@ -68,10 +58,6 @@ spaCy is designed to **make the right thing easy**, where the right thing is to:
 
 * **Minimize string processing**, and instead compute with arrays of ID ints.
 
-For the current list of lexical features, see `Lexical Features`_.
-
-.. _lexical features: features.html
-
 Tokenization done right
 =======================
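The retained bullet, "compute with arrays of ID ints", is what the ``to_array`` call in the new example is for. A sketch of what that enables, assuming only the ``EN.tokenize`` / ``to_array`` interface shown in this diff (the counting code itself is illustrative, not spaCy API)::

    import numpy

    from spacy.en import EN
    from spacy.en import feature_names as fn

    tokens = EN.tokenize("Some string of language. Some other string of language.")
    # One row per token, one column per requested feature, every cell an integer ID.
    feats = tokens.to_array((fn.WORD, fn.POS))

    # Count word frequencies without ever touching a Python string.
    word_ids, counts = numpy.unique(feats[:, 0], return_counts=True)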
@@ -123,13 +109,6 @@ known emoticons correctly --- doing so would interfere with the way they
 process other punctuation. This isn't a problem for spaCy: we just add them
 all to the special tokenization rules.
 
-spaCy's tokenizer is also incredibly efficient:
-
-spaCy can create an inverted index of the 1.8 billion word Gigaword corpus,
-in under half an hour --- on a Macbook Air. See the `inverted
-index tutorial`_.
-
-.. _inverted index tutorial: index_tutorial.html
-
 Comparison with NLTK
 ====================
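The removed paragraph above pointed to an inverted-index tutorial that is no longer linked. As a minimal sketch of what such an index looks like, built on the ``EN.tokenize`` / ``to_array`` interface from this diff (the document list and index structure are illustrative, not the tutorial's code)::

    from collections import defaultdict

    from spacy.en import EN
    from spacy.en import feature_names as fn

    def build_inverted_index(documents):
        # Map each word's integer ID to the set of documents it occurs in.
        index = defaultdict(set)
        for doc_id, text in enumerate(documents):
            word_ids = EN.tokenize(text).to_array((fn.WORD,))[:, 0]
            for word_id in word_ids:
                index[int(word_id)].add(doc_id)
        return index

    index = build_inverted_index(["Some string of language.",
                                  "Some other string."])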