From 69e3a07fa1265159e82dfb00a2e1e113ded62a95 Mon Sep 17 00:00:00 2001
From: Matthew Honnibal
Date: Sun, 21 Dec 2014 17:40:12 +1100
Subject: [PATCH] * More index.rst fiddling

---
 docs/source/index.rst | 79 +++++++++++++++++--------------------------
 1 file changed, 31 insertions(+), 48 deletions(-)

diff --git a/docs/source/index.rst b/docs/source/index.rst
index 47d728956..e1a0b0112 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -8,60 +8,35 @@ spaCy: Industrial-strength NLP
 ================================
 
 spaCy is a library for industrial-strength text processing in Python and Cython.
-It features extremely efficient, up-to-date algorithms, and a rethink of how those
-algorithms should be accessed.
+Its core values are efficiency, accuracy and minimalism: you get a fast pipeline of
+state-of-the-art components, a nice API, and no clutter.
 
-A typical text-processing API looks something like this:
+spaCy is particularly good for feature extraction, because it pre-loads lexical
+resources, maps strings to integer IDs, and supports output of numpy arrays:
 
-    >>> import nltk
-    >>> nltk.pos_tag(nltk.word_tokenize('''Some string of language.'''))
-    [('Some', 'DT'), ('string', 'VBG'), ('of', 'IN'), ('language', 'NN'), ('.', '.')]
+    >>> from spacy.en import English
+    >>> from spacy.en import attrs
+    >>> nlp = English()
+    >>> tokens = nlp(u'An example sentence', pos_tag=True, parse=True)
+    >>> tokens.to_array((attrs.LEMMA, attrs.POS, attrs.SHAPE, attrs.CLUSTER))
 
-This API often leaves you with a lot of busy-work. If you're doing some machine
-learning or information extraction, all the strings have to be mapped to integers,
-and you have to save and load the mapping at training and runtime. If you want
-to display mark-up based on the annotation, you have to realign the tokens to your
-original string.
+spaCy also makes it easy to add in-line mark-up. Let's say you want to mark all
+adverbs in red:
 
-I've been writing NLP systems for almost ten years now, so I've done these
-things dozens of times. When designing spaCy, I thought carefully about how to
-make the right thing easy.
+    >>> from spacy.defs import ADVERB
+    >>> color = lambda t: u'\033[91m%s\033[0m' % unicode(t) if t.pos == ADVERB else unicode(t)
+    >>> print u''.join(color(t) for t in tokens)
 
-We begin by initializing a global vocabulary store:
+Tokens.__iter__ produces a sequence of Token objects. The Token.__unicode__
+method --- invoked by unicode(t) --- pads each token with any whitespace that
+followed it. So, u''.join(unicode(t) for t in tokens) is guaranteed to restore
+the original string.
 
-    >>> from spacy.en import EN
-    >>> EN.load()
+spaCy is also very efficient --- much more efficient than other language
+processing tools available. The table below compares the time to tokenize, POS
+tag and parse 100m words of text; it also shows accuracy on the standard
+Wall Street Journal evaluation:
 
-The vocabulary reads in a data file with all sorts of pre-computed lexical
-features. You can load anything you like here, but by default I give you:
-
-* String IDs for the word's string, its prefix, suffix and "shape";
-* Length (in unicode code-points)
-* A cluster ID, representing distributional similarity;
-* A cluster ID, representing its typical POS tag distribution;
-* Good-turing smoothed unigram probability;
-* 64 boolean features, for assorted orthographic and distributional features.
-
-With so many features pre-computed, you usually don't have to do any string
-processing at all. You give spaCy your string, and tell it to give you either
-a numpy array, or a counts dictionary:
-
-    >>> from spacy.en import feature_names as fn
-    >>> tokens = EN.tokenize(u'''Some string of language.''')
-    >>> tokens.to_array((fn.WORD, fn.SUFFIX, fn.CLUSTER))
-    ...
-    >>> tokens.count_by(fn.WORD)
-
-If you do need strings, you can simply iterate over the Tokens object:
-
-    >>> for token in tokens:
-    ...
-
-I mostly use this for debugging and testing.
-
-spaCy returns these rich Tokens objects much faster than most other tokenizers
-can give you a list of strings --- in fact, spaCy's POS tagger is *4 times
-faster* than CoreNLP's tokenizer:
 
 +----------+----------+---------------+----------+
 | System   | Tokenize | POS Tag       |          |
@@ -75,8 +50,16 @@ faster* than CoreNLP's tokenizer:
 | ZPar     |          | ~1,500s       |          |
 +----------+----------+---------------+----------+
 
+spaCy completes its whole pipeline faster than some of the other libraries can
+tokenize the text. Its POS tagging accuracy is as good as any system available.
+For parsing, I chose an algorithm that sacrifices some accuracy in favour of
+efficiency.
 
- 
+I wrote spaCy so that startups and other small companies could take advantage
+of the enormous progress being made by NLP academics. Academia is competitive,
+and what you're competing to do is write papers --- so it's very hard to write
+software useful to non-academics. Seeing this gap, I resigned from my post-doc,
+and wrote spaCy.
 
 .. toctree::
     :hidden:
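
To pin down the round-trip guarantee described in the new text, here is a
minimal sketch, assuming the English() API exactly as the added lines show it:
the nlp callable and the pos_tag/parse keyword arguments come from the patch,
while the example sentence is arbitrary. Python 2, matching the doc's examples:

    >>> from spacy.en import English
    >>> nlp = English()
    >>> text = u'Give it back, he pleaded.'
    >>> tokens = nlp(text, pos_tag=True, parse=True)
    >>> # Each token's unicode form carries its trailing whitespace, so a
    >>> # plain join reconstructs the input exactly.
    >>> u''.join(unicode(t) for t in tokens) == text
    True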
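
The to_array example can be made concrete in the same way. In this hedged
sketch, the attrs constants and the to_array call are copied from the added
lines; the shape of the result (one row per token, one integer-ID column per
attribute) is an assumption based on the surrounding prose:

    >>> from spacy.en import English
    >>> from spacy.en import attrs
    >>> nlp = English()
    >>> tokens = nlp(u'An example sentence', pos_tag=True, parse=True)
    >>> feats = tokens.to_array((attrs.LEMMA, attrs.POS, attrs.SHAPE, attrs.CLUSTER))
    >>> feats.shape    # assumed: 3 tokens x 4 attributes
    (3, 4)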