mirror of https://github.com/explosion/spaCy.git
* More thoughts on intro
This commit is contained in:
parent 792802b2b9
commit 77dd7a212a
@@ -4,59 +4,49 @@
    contain the root `toctree` directive.
 
 ================================
-spaCy NLP Tokenizer and Lexicon
+spaCy: Industrial-strength NLP
 ================================
 
-spaCy is a library for industrial-strength NLP in Python and Cython. spaCy's
-take on NLP is that it's mostly about feature extraction --- that's the part
-that's specific to NLP, so that's what an NLP library should focus on.
+spaCy is a library for industrial-strength text processing in Python and Cython.
+It features extremely efficient, up-to-date algorithms, and a rethink of how those
+algorithms should be accessed.
 
-spaCy also believes that for NLP, **efficiency is critical**. If you're
-running batch jobs, you probably have an enormous amount of data; if you're
-serving requests one-by-one, you want lower latency and fewer servers. Even if
-you're doing exploratory research on relatively small samples, you should still
-value efficiency, because it means you can run more experiments.
-
-Depending on the task, spaCy is between 10 and 200 times faster than NLTK,
-often with much better accuracy. See Benchmarks for details, and
-Why is spaCy so fast? for a discussion of the algorithms and implementation
-that makes this possible.
+Most text-processing libraries give you APIs that look like this:
+
+    >>> import nltk
+    >>> nltk.pos_tag(nltk.word_tokenize('''Some string of language.'''))
+    [('Some', 'DT'), ('string', 'VBG'), ('of', 'IN'), ('language', 'NN'), ('.', '.')]
 
-+---------+----------+-------------+----------+
-| System  | Tokenize | --> Counts  | --> Stem |
-+---------+----------+-------------+----------+
-| spaCy   | 1m42s    | 1m59s       | 1m59s    |
-+---------+----------+-------------+----------+
-| NLTK    | 20m2s    | 28m24s      | 52m28    |
-+---------+----------+-------------+----------+
-
-Times for 100m words of text.
+A list of strings is good for poking around, or for printing the annotation to
+evaluate it. But to actually *use* the output, you have to jump through some
+hoops. If you're doing some machine learning, all the strings have to be
+mapped to integers, and you have to save and load the mapping at training and
+runtime. If you want to display mark-up based on the annotation, you have to
+realign the tokens to your original string.
+
+With spaCy, you should never have to do any string processing at all:
 
-Unique Lexicon-centric design
-=============================
-
-spaCy helps you build models that generalise better, by making it easy to use
-more robust features. Instead of a list of strings, the tokenizer returns
-references to rich lexical types. Features which ask about the word's Brown cluster,
-its typical part-of-speech tag, how it's usually cased etc require no extra effort:
 
     >>> from spacy.en import EN
-    >>> from spacy.feature_names import *
-    >>> feats = (
-        SIC,      # ID of the original word form
-        STEM,     # ID of the stemmed word form
-        CLUSTER,  # ID of the word's Brown cluster
-        IS_TITLE, # Was the word title-cased?
-        POS_TYPE  # A cluster ID describing what POS tags the word is usually assigned
-    )
-    >>> tokens = EN.tokenize(u'Split words, punctuation, emoticons etc.! ^_^')
-    >>> tokens.to_array(feats)[:5]
-    array([[ 1, 2, 3, 4],
-           [...],
-           [...],
-           [...]])
+    >>> from spacy.en import feature_names as fn
+    >>> tokens = EN.tokenize('''Some string of language.''')
+    >>> tokens.to_array((fn.WORD, fn.SUFFIX, fn.CLUSTER, fn.POS, fn.LEMMA))
 
+
+A range of excellent features are pre-computed for you, and by default the
+words are part-of-speech tagged and lemmatized. We do this by default because
+even with these extra processes, spaCy is still several times faster than
+most tokenizers:
+
++----------+----------+---------------+----------+
+| System   | Tokenize | POS Tag       |          |
++----------+----------+---------------+----------+
+| spaCy    | 37s      | 98s           |          |
++----------+----------+---------------+----------+
+| NLTK     | 626s     | 44,310s (12h) |          |
++----------+----------+---------------+----------+
+| CoreNLP  | 420s     | 1,300s (22m)  |          |
++----------+----------+---------------+----------+
+| ZPar     |          | ~1,500s       |          |
++----------+----------+---------------+----------+
 
 spaCy is designed to **make the right thing easy**, where the right thing is to:
 
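The hunk above argues that with a list-of-strings API, every string has to be mapped to an integer, and that mapping has to be saved at training time and reloaded at runtime. A minimal sketch of that hoop-jumping, using a hypothetical `Vocab` helper (not spaCy's API):

```python
import json

class Vocab:
    """Hand-rolled string-to-integer mapping of the kind the intro says
    you end up writing around a list-of-strings API (hypothetical helper,
    not part of spaCy)."""

    def __init__(self):
        self.string_to_id = {}

    def id_of(self, string):
        # Assign the next free integer ID the first time a string is seen.
        if string not in self.string_to_id:
            self.string_to_id[string] = len(self.string_to_id)
        return self.string_to_id[string]

    def save(self, path):
        # The mapping must be persisted at training time...
        with open(path, "w") as f:
            json.dump(self.string_to_id, f)

    @classmethod
    def load(cls, path):
        # ...and restored at runtime, or the IDs will not line up.
        vocab = cls()
        with open(path) as f:
            vocab.string_to_id = json.load(f)
        return vocab

vocab = Vocab()
ids = [vocab.id_of(w) for w in "Some string of language .".split()]
print(ids)  # → [0, 1, 2, 3, 4]
```

spaCy's answer, per this commit, is to make integer IDs the default representation so none of this bookkeeping is the user's problem.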
@@ -68,10 +58,6 @@ spaCy is designed to **make the right thing easy**, where the right thing is to:
 
 * **Minimize string processing**, and instead compute with arrays of ID ints.
 
-For the current list of lexical features, see `Lexical Features`_.
-
-.. _lexical features: features.html
-
 Tokenization done right
 =======================
 
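The bullet kept in the hunk above, computing with arrays of ID ints rather than strings, looks something like this in practice (a sketch with made-up token IDs, not spaCy code):

```python
from collections import Counter

# Made-up token IDs for illustration: each token has already been mapped
# to an integer, so downstream work never touches the strings themselves.
token_ids = [7, 3, 7, 12, 3, 7]

# Frequency counts computed purely over integers; the string table is only
# consulted if and when something needs to be displayed.
counts = Counter(token_ids)
print(counts.most_common(1))  # → [(7, 3)]
```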
@@ -123,13 +109,6 @@ known emoticons correctly --- doing so would interfere with the way they
 process other punctuation. This isn't a problem for spaCy: we just add them
 all to the special tokenization rules.
 
-spaCy's tokenizer is also incredibly efficient:
-
-spaCy can create an inverted index of the 1.8 billion word Gigaword corpus,
-in under half an hour --- on a Macbook Air. See the `inverted
-index tutorial`_.
-
-.. _inverted index tutorial: index_tutorial.html
 
 Comparison with NLTK
 ====================
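The special tokenization rules kept in the hunk above can be illustrated with a toy tokenizer (hypothetical code, not spaCy's implementation): known emoticons are looked up whole before any punctuation splitting happens.

```python
import re

# Toy special-case table: strings that must survive tokenization intact,
# in the spirit of spaCy's special tokenization rules (hypothetical data).
SPECIAL_CASES = {"^_^": ["^_^"], ":-)": [":-)"]}

def tokenize(text):
    tokens = []
    for chunk in text.split():
        if chunk in SPECIAL_CASES:
            # Emoticons etc. are matched whole, before punctuation handling.
            tokens.extend(SPECIAL_CASES[chunk])
        else:
            # Naive fallback: separate word characters from punctuation.
            tokens.extend(re.findall(r"\w+|[^\w\s]", chunk))
    return tokens

print(tokenize("Split words, punctuation, emoticons etc.! ^_^"))
# → ['Split', 'words', ',', 'punctuation', ',', 'emoticons', 'etc', '.', '!', '^_^']
```

Without the special-case table, the fallback regex would shred "^_^" into three tokens, since the underscore counts as a word character.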