* Improve index docs

Matthew Honnibal 2015-01-16 07:08:35 +11:00
parent e8dbac8a0c
commit e28b224b80
1 changed file with 158 additions and 61 deletions


@@ -3,87 +3,184 @@
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.

==============================
spaCy: Industrial-strength NLP
==============================
spaCy is a library for industrial-strength text processing in Python and Cython.
It is commercial open source software, with a dual (AGPL or commercial)
license.

If you're a small company doing NLP, spaCy might seem like a minor miracle.
It's by far the fastest NLP software available. The full processing pipeline
completes in 7ms, including state-of-the-art part-of-speech tagging and
dependency parsing. All strings are mapped to integer IDs, tokens are linked
to word vectors and other lexical resources, and a range of useful features
are pre-calculated and cached.

If none of that made any sense to you, here's the gist of it. Computers don't
understand text. This is unfortunate, because that's what the web almost
entirely consists of. We want to recommend people text based on other text
they liked. We want to shorten text to display it on a mobile screen. We want
to aggregate it, link it, filter it, categorise it, generate it and correct it.

spaCy provides a set of utility functions that help programmers build such
products. It's an NLP engine, analogous to the 3d engines commonly licensed
for game development.
Example functionality
---------------------

Let's say you're developing a proofreading tool, or possibly an IDE for
writers. You're convinced by Stephen King's advice that `adverbs are not your
friend <http://www.brainpickings.org/2013/03/13/stephen-king-on-adverbs/>`_, so
you want to **mark adverbs in red**. We'll use one of the examples he finds
particularly egregious:

>>> import spacy.en
>>> from spacy.enums import ADVERB
>>> # Load the pipeline, and call it with some text.
>>> nlp = spacy.en.English()
>>> tokens = nlp("Give it back, he pleaded abjectly, its mine.",
...              tag=True, parse=True)
>>> output = ''
>>> for tok in tokens:
...     # Token.string preserves whitespace, making it easy to
...     # reconstruct the original string.
...     output += tok.string.upper() if tok.is_pos(ADVERB) else tok.string
>>> print(output)
Give it BACK, he pleaded ABJECTLY, its mine.

Easy enough --- but the problem is that we've also highlighted "back", when
probably we only wanted to highlight "abjectly". This is undoubtedly an
adverb, but it's not the sort of adverb King is talking about. This is a
persistent problem when dealing with linguistic categories: the prototypical
examples, the ones which spring to your mind, are often not the most common
cases.

There are lots of ways we might refine our logic, depending on just what words
we want to flag. The simplest way to filter out adverbs like "back" and "not"
is by word frequency: these words are much more common than the manner adverbs
the style guides are worried about.

The ``prob`` attribute of a ``Lexeme`` or ``Token`` object gives a log
probability estimate of the word, based on smoothed counts from a 3bn word
corpus:

>>> nlp.vocab[u'back'].prob
-7.403977394104004
>>> nlp.vocab[u'not'].prob
-5.407193660736084
>>> nlp.vocab[u'quietly'].prob
-11.07155704498291
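
To put those numbers on a human scale: the difference between two log
probabilities is the log of the frequency ratio, so per these estimates "back"
is roughly 39 times more common than "quietly". A quick arithmetic check,
reusing the ``nlp`` pipeline and the figures above:

>>> import math
>>> int(round(math.exp(nlp.vocab[u'back'].prob - nlp.vocab[u'quietly'].prob)))
39
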
So we can easily exclude the N most frequent words in English from our adverb
marker. Let's try N=1000 for now:

>>> import spacy.en
>>> from spacy.enums import ADVERB
>>> nlp = spacy.en.English()
>>> # Find the log probability of the Nth most frequent word
>>> probs = [lex.prob for lex in nlp.vocab]
>>> probs.sort()
>>> is_adverb = lambda tok: tok.is_pos(ADVERB) and tok.prob < probs[-1000]
>>> tokens = nlp("Give it back, he pleaded abjectly, its mine.",
...              tag=True, parse=True)
>>> print(''.join(tok.string.upper() if is_adverb(tok) else tok.string
...               for tok in tokens))
Give it back, he pleaded ABJECTLY, its mine.

There are lots of ways to refine the logic, depending on just what words we
want to flag. Let's define this narrowly, and only flag adverbs applied to
verbs of communication or perception:

>>> from spacy.enums import VERB, WN_V_COMMUNICATION, WN_V_COGNITION
>>> def is_say_verb(tok):
...     return tok.is_pos(VERB) and (tok.check_flag(WN_V_COMMUNICATION) or
...                                  tok.check_flag(WN_V_COGNITION))
>>> print(''.join(tok.string.upper() if is_adverb(tok) and is_say_verb(tok.head)
...               else tok.string for tok in tokens))
Give it back, he pleaded ABJECTLY, its mine.

The two flags refer to the 45 top-level categories in the WordNet ontology.
spaCy stores membership in these categories as a bit set, because words can
have multiple senses. We only need one 64-bit flag variable per word in the
vocabulary, so this useful data requires only 2.4mb of memory (8 bytes for
each of roughly 300,000 lexicon entries).
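
To make the bit-set representation concrete, here is a minimal sketch of the
same technique in plain Python. The bit positions and names below are
illustrative assumptions, not spaCy's actual internal layout:

>>> # One bit per WordNet supersense; a word's senses share one 64-bit int.
>>> COMMUNICATION_BIT, COGNITION_BIT = 32, 31   # illustrative positions
>>> flags = (1 << COMMUNICATION_BIT) | (1 << COGNITION_BIT)
>>> def check_flag(flags, bit):
...     # Membership is a single bitwise AND --- no set or dict lookup.
...     return bool(flags & (1 << bit))
>>> check_flag(flags, COMMUNICATION_BIT)
True
>>> check_flag(flags, 0)   # not flagged for any other category
False
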
spaCy packs all sorts of other goodies into its lexicon. Words are mapped to
one of these rich lexical types immediately, during tokenization --- and
spaCy's tokenizer is *fast*.

Efficiency
----------

.. table:: Efficiency comparison. See `Benchmarks`_ for details.

    +--------------+---------------------------+--------------------------------+
    |              | Absolute (ms per doc)     | Relative (to spaCy)            |
    +--------------+----------+--------+-------+----------+---------+-----------+
    | System       | Tokenize | Tag    | Parse | Tokenize | Tag     | Parse     |
    +--------------+----------+--------+-------+----------+---------+-----------+
    | spaCy        | 0.2ms    | 1ms    | 7ms   | 1x       | 1x      | 1x        |
    +--------------+----------+--------+-------+----------+---------+-----------+
    | CoreNLP      | 2ms      | 10ms   | 49ms  | 10x      | 10x     | 7x        |
    +--------------+----------+--------+-------+----------+---------+-----------+
    | ZPar         | 1ms      | 8ms    | 850ms | 5x       | 8x      | 121x      |
    +--------------+----------+--------+-------+----------+---------+-----------+
    | NLTK         | 4ms      | 443ms  | n/a   | 20x      | 443x    | n/a       |
    +--------------+----------+--------+-------+----------+---------+-----------+

Efficiency is a major concern for NLP applications. It is very common to hear
people say that they cannot afford more detailed processing, because their
datasets are too large. This is a bad position to be in. If you can't apply
detailed processing, you generally have to cobble together various heuristics.
This normally takes a few iterations, and what you come up with will usually be
brittle and difficult to reason about.

spaCy's parser is faster than most taggers, and its tokenizer is fast enough
for truly web-scale processing. And the tokenizer doesn't just give you a list
of strings. A spaCy token is a pointer to a Lexeme struct, from which you can
access a wide range of pre-computed features.
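
As a tiny illustration of what "pre-computed" means, reusing the ``nlp``
pipeline and the probability figures quoted above: comparing two words'
frequency estimates touches only cached lexicon data --- no text processing
happens at all:

>>> # Both values were computed ahead of time and stored in the lexicon.
>>> nlp.vocab[u'quietly'].prob < nlp.vocab[u'back'].prob
True
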
.. I wrote spaCy because I think existing commercial NLP engines are crap.
   Alchemy API are a typical example. Check out this part of their terms of
   service:

       publish or perform any benchmark or performance tests or analysis relating to
       the Service or the use thereof without express authorization from AlchemyAPI;

.. Did you get that? You're not allowed to evaluate how well their system works,
   unless you're granted a special exception. Their system must be pretty
   terrible to motivate such an embarrassing restriction.
   They must know this makes them look bad, but they apparently believe allowing
   you to evaluate their product would make them look even worse!

.. spaCy is based on science, not alchemy. It's open source, and I am happy to
   clarify any detail of the algorithms I've implemented.
   It's evaluated against the current best published systems, following the
   standard methodologies. These evaluations show that it performs extremely well.

Accuracy
--------

.. table:: Accuracy comparison, on the standard benchmark data from the Wall Street Journal. See `Benchmarks`_ for details.

    +--------------+--------------+----------------+
    | System       | POS acc. (%) | Parse acc. (%) |
    +--------------+--------------+----------------+
    | spaCy        | 97.2         | 92.4           |
    +--------------+--------------+----------------+
    | CoreNLP      | 96.9         | 92.2           |
    +--------------+--------------+----------------+
    | ZPar         | 97.3         | 92.9           |
    +--------------+--------------+----------------+
    | NLTK         | 94.3         | n/a            |
    +--------------+--------------+----------------+

spaCy is dual-licensed: you can either use it under the AGPL, or pay a
one-time fee of $5000 for a commercial license. I think this is excellent
value: you'll find "free" tools like NLTK much more expensive in practice,
because what you save on the license cost, you'll lose many times over in
lost productivity. $5000 does not buy you much developer time.

.. toctree::
    :hidden:
    :maxdepth: 3

    license.rst
    quickstart.rst
    features.rst
    license_stories.rst
    api.rst