* Revise intro copy. Add NLTK comparison

2014-12-01 22:55:13 +11:00 · 2014-12-01 22:55:13 +11:00 · 3430d5f629
parent 33dfb4933c
commit 3430d5f629
1 changed files with 140 additions and 27 deletions
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -3,45 +3,158 @@
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

+================================
 spaCy NLP Tokenizer and Lexicon
 ================================

-spaCy is a library for industrial strength NLP in Python.  Its core
-values are:
+spaCy is a library for industrial-strength NLP in Python and Cython.  It
+assumes that NLP is mostly about solving machine learning problems, and that
+solving these problems is mostly about feature extraction.  So, spaCy helps you
+do feature extraction --- it helps you represent a linguistic context as
+a vector of numbers.  It's also a great way to create an inverted index,
+particularly if you want to index documents on fancier properties.

-* **Efficiency**: You won't find faster NLP tools. For shallow analysis, it's 10x
-  faster than Stanford Core NLP, and over 200x faster than NLTK.  Its parser is
-  over 100x faster than Stanford's.
+For commercial users, a trial license costs $0, with a one-time license fee of
+$1,000 to use spaCy in production.  For non-commercial users, a GPL license is
+available.  To quickly get the gist of the license terms, check out the license
+user stories.

-* **Accuracy**:  All spaCy tools are within 0.5% of the current published
-  state-of-the-art, on both news and web text. NLP moves fast, so always check
-  the numbers --- and don't settle for tools that aren't backed by
-  rigorous recent evaluation.

-* **Minimalism**:  This isn't a library that covers 43 known algorithms to do X. You
-  get 1 --- the best one --- with a simple, low-level interface. This keeps the
-  code-base small and concrete.  Our Python APIs use lists and
-  dictionaries, and our C/Cython APIs use arrays and simple structs.
+Unique Lexicon-centric design
+=============================
+
+spaCy takes care of all string processing, efficiently and accurately.  This
+makes a night-and-day difference to your feature extraction code.
+Instead of a list of strings, spaCy's tokenizer gives you references to feature-rich
+lexeme objects:
+
+    >>> from spacy.en import EN
+    >>> from spacy.feature_names import SIC, NORM, SHAPE, ASCIIED, PREFIX, SUFFIX, \
+    ...     LENGTH, CLUSTER, POS_TYPE, SENSE_TYPE, \
+    ...     IS_ALPHA, IS_ASCII, IS_DIGIT, IS_PUNCT, IS_SPACE, IS_TITLE, IS_UPPER, \
+    ...     LIKE_URL, LIKE_NUMBER
+    >>> feats = (
+    ...     SIC,       # ID of the original word form
+    ...     NORM,      # ID of the normalized word form
+    ...     CLUSTER,   # ID of the word's Brown cluster
+    ...     IS_TITLE,  # Was the word title-cased?
+    ...     POS_TYPE,  # A cluster ID describing what POS tags the word is usually assigned
+    ... )
+    >>> tokens = EN.tokenize(u'Split words, punctuation, emoticons etc.! ^_^')
+    >>> tokens.to_strings()
+    [u'Split', u'words', u',', u'punctuation', u',', u'emoticons', u'etc.', u'!', u'^_^']
+    >>> tokens.to_array(feats)[:5]
+        array([[    1,  2,  3,  4],
+               [...],
+               [...],
+               [...]])
+
+
+spaCy is designed to **make the right thing easy**, where the right thing is to:
+
+* **Use rich distributional and orthographic features**. Without these, your model
+  will be very brittle and domain dependent.
+
+* **Compute features per type, not per token**. Because of Zipf's law, a small
+  number of word types accounts for most tokens, so you can expect this to be
+  far more efficient.
+
+* **Minimize string processing**, and instead compute with arrays of ID ints.
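The three points above can be sketched in plain Python. This is a minimal illustration, not spaCy's implementation: ``StringStore``, ``Lexicon``, ``word_shape`` and ``to_array`` are hypothetical names invented for this example, and the features chosen (shape, title-case flag, length) are only examples. Features are computed once per word type and then looked up per token, so the token stream becomes rows of integers.

```python
def word_shape(word):
    """Orthographic shape: uppercase -> X, lowercase -> x, digit -> d."""
    out = []
    for ch in word:
        if ch.isupper():
            out.append("X")
        elif ch.islower():
            out.append("x")
        elif ch.isdigit():
            out.append("d")
        else:
            out.append(ch)
    return "".join(out)


class StringStore:
    """Hypothetical string table: maps each distinct string to an integer ID."""

    def __init__(self):
        self._ids = {}

    def __getitem__(self, string):
        # Assign the next free ID the first time a string is seen.
        return self._ids.setdefault(string, len(self._ids))


class Lexicon:
    """Hypothetical lexicon: one cached row of integer features per word type."""

    def __init__(self):
        self.strings = StringStore()
        self.rows = {}      # word ID -> tuple of integer features
        self.computed = 0   # how many times features were actually computed

    def lookup(self, word):
        word_id = self.strings[word]
        row = self.rows.get(word_id)
        if row is None:
            # String processing happens here, once per type.
            self.computed += 1
            shape_id = self.strings[word_shape(word)]
            row = (word_id, shape_id, int(word.istitle()), len(word))
            self.rows[word_id] = row
        return row


def to_array(lexicon, words):
    """Turn a token sequence into rows of integer features."""
    return [lexicon.lookup(word) for word in words]


words = "the cat saw the dog and the dog saw the cat".split()
lexicon = Lexicon()
array = to_array(lexicon, words)
print(len(words), "tokens,", lexicon.computed, "feature computations")
```

In this toy sentence there are 11 tokens but only 5 types, so the string work is done 5 times. On real text the gap is much wider, because frequent words repeat far more often.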
  

-Comparison
----------
+Comparison with NLTK
+====================

-+----------------+-------------+--------+---------------+--------------+
-| Tokenize & Tag | Speed (w/s) | Memory | % Acc. (news) | % Acc. (web) |
-+----------------+-------------+--------+---------------+--------------+
-| spaCy          | 107,000     |  1.3gb | 96.7          |              |
-+----------------+-------------+--------+---------------+--------------+
-| Stanford       | 8,000       |  1.5gb | 96.7          |              |
-+----------------+-------------+--------+---------------+--------------+
-| NLTK           | 543         |  61mb  | 94.0          |              |
-+----------------+-------------+--------+---------------+--------------+
+`NLTK <http://nltk.org>`_ provides interfaces to a wide variety of NLP
+tools and resources, and its own implementations of a few algorithms.  It comes
+with comprehensive documentation, and a book introducing concepts in NLP.  For
+these reasons, it's very widely known.  However, if you're trying to make money
+or do cutting-edge research, NLTK is not a good choice.
+
+The `list of stuff in NLTK <http://www.nltk.org/py-modindex.html>`_ looks impressive,
+but almost none of it is useful for real work.  You're not going to make any money,
+or do top research, by using the NLTK chat bots, theorem provers, toy CCG implementation,
+etc.  Most of NLTK is there to assist in the explanation of ideas in computational
+linguistics, at roughly an undergraduate level.
+But it also claims to support serious work, by wrapping external tools.
+
+In a well-known essay, Joel Spolsky discusses the pain of dealing with
+`leaky abstractions <http://www.joelonsoftware.com/articles/LeakyAbstractions.html>`_.
+An abstraction tells you to not care about implementation
+details, but sometimes the implementation matters after all. When it
+does, you have to waste time revising your assumptions.
+
+NLTK's wrappers call external tools via subprocesses, and wrap this up so
+that it looks like a native API.  This abstraction leaks *a lot*.  The system
+calls impose far more overhead than a normal Python function call, which makes
+the most natural way to program against the API infeasible. 
+
+
+Case study: POS tagging
+-----------------------
+
+Here's a quick comparison of the following POS taggers:
+
+* **Stanford (CLI)**: The Stanford POS tagger, invoked once as a batch process
+  from the command-line;
+* **nltk.tag.stanford**: The Stanford tagger, invoked document-by-document via
+  NLTK's wrapper;
+* **nltk.pos_tag**: NLTK's own POS tagger, invoked document-by-document;
+* **spacy.en.pos_tag**: spaCy's POS tagger, invoked document-by-document.
+
+
++-------------------+-------------+--------+
+| System            | Speed (w/s) | % Acc. |
++-------------------+-------------+--------+
+| spaCy             | 107,000     | 96.7   |
++-------------------+-------------+--------+
+| Stanford (CLI)    | 8,000       | 96.7   |
++-------------------+-------------+--------+
+| nltk.pos_tag      | 543         | 94.0   |
++-------------------+-------------+--------+
+| nltk.tag.stanford | 209         | 96.7   |
++-------------------+-------------+--------+
+
+Experimental details here.  Three things are apparent from this comparison:
+
+1. The native NLTK tagger, nltk.pos_tag, is both slow and inaccurate;
+
+2. Calling the Stanford tagger document-by-document via NLTK is almost **40x**
+   slower than invoking the model once as a batch process, via the command-line;
+
+3. spaCy is over 10x faster than the Stanford tagger, even when called
+   **sentence-by-sentence**.
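The ratios quoted in points 2 and 3 follow directly from the speed column of the table. A short check, using only the words-per-second figures listed above (the variable names are ours):

```python
# Words per second, copied from the table above.
speeds = {
    "spaCy": 107000,
    "Stanford (CLI)": 8000,
    "nltk.pos_tag": 543,
    "nltk.tag.stanford": 209,
}

# Point 2: the same Stanford model, batch CLI versus the NLTK wrapper.
wrapper_slowdown = speeds["Stanford (CLI)"] / speeds["nltk.tag.stanford"]

# Point 3: spaCy versus the Stanford tagger run as a batch process.
spacy_speedup = speeds["spaCy"] / speeds["Stanford (CLI)"]

print(round(wrapper_slowdown, 1), round(spacy_speedup, 1))
```

The first ratio comes to roughly 38x and the second to roughly 13x.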
+
+The problem is that NLTK simply wraps the command-line
+interfaces of these tools, so communication is via a subprocess.  NLTK does not
+even hold open a pipe for you --- the model is reloaded, again and again.
+
+To use the wrapper effectively, you should batch up your text as much as possible.
+This probably isn't how you would like to structure your pipeline, and you
+might not be able to batch up much text at all, e.g. if serving a single
+request means processing a single document.
+Technically, NLTK does give you Python functions to access lots of different
+systems --- but you can't use them as you would expect to use a normal Python
+function.  The abstraction leaks.
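The difference between the two calling patterns can be sketched with the standard library alone. This is an illustration under stated assumptions, not NLTK's wrapper code: the "tool" below is a stand-in child Python process that upper-cases each line it reads, and ``run_tool``, ``tag_one_by_one``, ``tag_batched`` and ``stats`` are hypothetical names. Each document is assumed to be a single non-empty line.

```python
import subprocess
import sys

# Stand-in for an external tool: reads lines on stdin, writes them upper-cased.
TOOL = [
    sys.executable,
    "-c",
    "import sys\nfor line in sys.stdin: sys.stdout.write(line.upper())",
]

# Counts how many times a new process is launched.
stats = {"launches": 0}


def run_tool(lines):
    """Launch the tool once and send it every line in `lines`."""
    stats["launches"] += 1
    result = subprocess.run(
        TOOL,
        input="\n".join(lines) + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


def tag_one_by_one(docs):
    """The natural API usage: one call, and so one process, per document."""
    return [run_tool([doc])[0] for doc in docs]


def tag_batched(docs):
    """The workaround: collect all documents and launch the process once."""
    return run_tool(docs)


docs = ["first document", "second document", "third document"]
assert tag_one_by_one(docs) == tag_batched(docs)
print(stats["launches"], "process launches in total")
```

Both functions return the same output, but the first pays the process start-up cost once per document. With a real tagger that cost also includes loading the model, which is why the per-document pattern is so slow.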
+
+Here's the bottom line: the Stanford tools are written in Java, so using them
+from Python sucks.  You shouldn't settle for this.  It's a problem that springs
+purely from the tooling, rather than the domain.
+
+Summary
+-------
+
+NLTK is a well-known Python library for NLP, but for the important bits, you
+don't get actual Python modules.  You get wrappers that hand off to external
+tools, via subprocesses.  This is not at all the same thing.
+
+spaCy is implemented in Cython, just like numpy, scikit-learn, lxml and other
+high-performance Python libraries.  So you get a native Python API, but the
+performance you expect from a program written in C.


 .. toctree::
    :hidden:
    :maxdepth: 3
+
+    features.rst
    
-    what/index.rst
-    why/index.rst
-    how/index.rst