* Improve index docs

Matthew Honnibal 2015-01-16 07:08:35 +11:00
parent e8dbac8a0c
commit e28b224b80
1 changed file with 158 additions and 61 deletions


@@ -3,87 +3,184 @@
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.

==============================
spaCy: Industrial-strength NLP
==============================
spaCy is a library for industrial-strength text processing in Python and Cython.
It is commercial open source software, with a dual (AGPL or commercial)
license.

If you're a small company doing NLP, spaCy might seem like a minor miracle.
It's by far the fastest NLP software available. The full processing pipeline
completes in 7ms, including state-of-the-art part-of-speech tagging and
dependency parsing. All strings are mapped to integer IDs, tokens are linked
to word vectors and other lexical resources, and a range of useful features
are pre-calculated and cached.

If none of that made any sense to you, here's the gist of it. Computers don't
understand text. This is unfortunate, because that's what the web almost
entirely consists of. We want to recommend people text based on other text
they liked. We want to shorten text to display it on a mobile screen. We want
to aggregate it, link it, filter it, categorise it, generate it and correct it.

spaCy provides a set of utility functions that help programmers build such
products. It's an NLP engine, analogous to the 3d engines commonly licensed
for game development.
Example functionality
---------------------

Let's say you're developing a proofreading tool, or possibly an IDE for
writers. You're convinced by Stephen King's advice that `adverbs are not your
friend <http://www.brainpickings.org/2013/03/13/stephen-king-on-adverbs/>`_, so
you want to **mark adverbs in red**. We'll use one of the examples he finds
particularly egregious:

>>> import spacy.en
>>> from spacy.enums import ADVERB
>>> # Load the pipeline, and call it with some text.
>>> nlp = spacy.en.English()
>>> tokens = nlp("Give it back, he pleaded abjectly, its mine.",
...              tag=True, parse=True)
>>> output = ''
>>> for tok in tokens:
...     # Token.string preserves whitespace, making it easy to
...     # reconstruct the original string.
...     output += tok.string.upper() if tok.is_pos(ADVERB) else tok.string
>>> print(output)
Give it BACK, he pleaded ABJECTLY, its mine.

Easy enough --- but the problem is that we've also highlighted "back", when
probably we only wanted to highlight "abjectly". This is undoubtedly an
adverb, but it's not the sort of adverb King is talking about. This is a
persistent problem when dealing with linguistic categories: the prototypical
examples, the ones which spring to your mind, are often not the most common
cases.

There are lots of ways we might refine our logic, depending on just what words
we want to flag. The simplest way to filter out adverbs like "back" and "not"
is by word frequency: these words are much more common than the manner adverbs
the style guides are worried about.

The ``prob`` attribute of a ``Lexeme`` or ``Token`` object gives a log
probability estimate of the word, based on smoothed counts from a 3bn word
corpus:

>>> nlp.vocab[u'back'].prob
-7.403977394104004
>>> nlp.vocab[u'not'].prob
-5.407193660736084
>>> nlp.vocab[u'quietly'].prob
-11.07155704498291
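
To put those numbers on a human scale: the difference between two log
probabilities is the log of the frequency ratio, so per these estimates "back"
is roughly 39 times more common than "quietly". A quick arithmetic check,
reusing the ``nlp`` pipeline and the figures above:

>>> import math
>>> int(round(math.exp(nlp.vocab[u'back'].prob - nlp.vocab[u'quietly'].prob)))
39
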
So we can easily exclude the N most frequent words in English from our adverb
marker. Let's try N=1000 for now:

>>> import spacy.en
>>> from spacy.enums import ADVERB
>>> nlp = spacy.en.English()
>>> # Find the log probability of the Nth most frequent word
>>> probs = [lex.prob for lex in nlp.vocab]
>>> probs.sort()
>>> is_adverb = lambda tok: tok.is_pos(ADVERB) and tok.prob < probs[-1000]
>>> tokens = nlp("Give it back, he pleaded abjectly, its mine.",
...              tag=True, parse=True)
>>> print(''.join(tok.string.upper() if is_adverb(tok) else tok.string
...               for tok in tokens))
Give it back, he pleaded ABJECTLY, its mine.

There are lots of ways to refine the logic, depending on just what words we
want to flag. Let's define this narrowly, and only flag adverbs applied to
verbs of communication or perception:

>>> from spacy.enums import VERB, WN_V_COMMUNICATION, WN_V_COGNITION
>>> def is_say_verb(tok):
...     return tok.is_pos(VERB) and (tok.check_flag(WN_V_COMMUNICATION) or
...                                  tok.check_flag(WN_V_COGNITION))
>>> print(''.join(tok.string.upper() if is_adverb(tok) and is_say_verb(tok.head)
...               else tok.string for tok in tokens))
Give it back, he pleaded ABJECTLY, its mine.

The two flags refer to the 45 top-level categories in the WordNet ontology.
spaCy stores membership in these categories as a bit set, because words can
have multiple senses. We only need one 64-bit flag variable per word in the
vocabulary, so this useful data requires only 2.4mb of memory (8 bytes for
each of roughly 300,000 lexicon entries).
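
To make the bit-set representation concrete, here is a minimal sketch of the
same technique in plain Python. The bit positions and names below are
illustrative assumptions, not spaCy's actual internal layout:

>>> # One bit per WordNet supersense; a word's senses share one 64-bit int.
>>> COMMUNICATION_BIT, COGNITION_BIT = 32, 31   # illustrative positions
>>> flags = (1 << COMMUNICATION_BIT) | (1 << COGNITION_BIT)
>>> def check_flag(flags, bit):
...     # Membership is a single bitwise AND --- no set or dict lookup.
...     return bool(flags & (1 << bit))
>>> check_flag(flags, COMMUNICATION_BIT)
True
>>> check_flag(flags, 0)   # not flagged for any other category
False
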
spaCy packs all sorts of other goodies into its lexicon. Words are mapped to
one of these rich lexical types immediately, during tokenization --- and
spaCy's tokenizer is *fast*.

Efficiency
----------

.. table:: Efficiency comparison. See `Benchmarks`_ for details.

    +--------------+---------------------------+--------------------------------+
    |              | Absolute (ms per doc)     | Relative (to spaCy)            |
    +--------------+----------+--------+-------+----------+---------+-----------+
    | System       | Tokenize | Tag    | Parse | Tokenize | Tag     | Parse     |
    +--------------+----------+--------+-------+----------+---------+-----------+
    | spaCy        | 0.2ms    | 1ms    | 7ms   | 1x       | 1x      | 1x        |
    +--------------+----------+--------+-------+----------+---------+-----------+
    | CoreNLP      | 2ms      | 10ms   | 49ms  | 10x      | 10x     | 7x        |
    +--------------+----------+--------+-------+----------+---------+-----------+
    | ZPar         | 1ms      | 8ms    | 850ms | 5x       | 8x      | 121x      |
    +--------------+----------+--------+-------+----------+---------+-----------+
    | NLTK         | 4ms      | 443ms  | n/a   | 20x      | 443x    | n/a       |
    +--------------+----------+--------+-------+----------+---------+-----------+

Efficiency is a major concern for NLP applications. It is very common to hear
people say that they cannot afford more detailed processing, because their
datasets are too large. This is a bad position to be in. If you can't apply
detailed processing, you generally have to cobble together various heuristics.
This normally takes a few iterations, and what you come up with will usually be
brittle and difficult to reason about.

spaCy's parser is faster than most taggers, and its tokenizer is fast enough
for truly web-scale processing. And the tokenizer doesn't just give you a list
of strings. A spaCy token is a pointer to a Lexeme struct, from which you can
access a wide range of pre-computed features.
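
As a tiny illustration of what "pre-computed" means, reusing the ``nlp``
pipeline and the probability figures quoted above: comparing two words'
frequency estimates touches only cached lexicon data --- no text processing
happens at all:

>>> # Both values were computed ahead of time and stored in the lexicon.
>>> nlp.vocab[u'quietly'].prob < nlp.vocab[u'back'].prob
True
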
.. I wrote spaCy because I think existing commercial NLP engines are crap.
   Alchemy API are a typical example. Check out this part of their terms of
   service:

       publish or perform any benchmark or performance tests or analysis relating to
       the Service or the use thereof without express authorization from AlchemyAPI;

.. Did you get that? You're not allowed to evaluate how well their system works,
   unless you're granted a special exception. Their system must be pretty
   terrible to motivate such an embarrassing restriction.
   They must know this makes them look bad, but they apparently believe allowing
   you to evaluate their product would make them look even worse!

.. spaCy is based on science, not alchemy. It's open source, and I am happy to
   clarify any detail of the algorithms I've implemented.
   It's evaluated against the current best published systems, following the
   standard methodologies. These evaluations show that it performs extremely well.

Accuracy
--------

.. table:: Accuracy comparison, on the standard benchmark data from the Wall Street Journal. See `Benchmarks`_ for details.

    +--------------+--------------+----------------+
    | System       | POS acc. (%) | Parse acc. (%) |
    +--------------+--------------+----------------+
    | spaCy        | 97.2         | 92.4           |
    +--------------+--------------+----------------+
    | CoreNLP      | 96.9         | 92.2           |
    +--------------+--------------+----------------+
    | ZPar         | 97.3         | 92.9           |
    +--------------+--------------+----------------+
    | NLTK         | 94.3         | n/a            |
    +--------------+--------------+----------------+

spaCy is dual-licensed: you can either use it under the AGPL, or pay a
one-time fee of $5000 for a commercial license. I think this is excellent
value: you'll find "free" tools like NLTK much more expensive in practice,
because what you save on the license cost, you'll lose many times over in
lost productivity. $5000 does not buy you much developer time.

.. toctree::
    :hidden:
    :maxdepth: 3

    license.rst
    quickstart.rst
    features.rst
    license_stories.rst
    api.rst