diff --git a/docs/source/index.rst b/docs/source/index.rst
index 62987ae03..47d728956 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -11,30 +11,57 @@ spaCy is a library for industrial-strength text processing in Python and Cython.
It features extremely efficient, up-to-date algorithms, and a rethink of how those
algorithms should be accessed.
-Most text-processing libraries give you APIs that look like this:
+A typical text-processing API looks something like this:
>>> import nltk
>>> nltk.pos_tag(nltk.word_tokenize('''Some string of language.'''))
[('Some', 'DT'), ('string', 'VBG'), ('of', 'IN'), ('language', 'NN'), ('.', '.')]
-A list of strings is good for poking around, or for printing the annotation to
-evaluate it. But to actually *use* the output, you have to jump through some
-hoops. If you're doing some machine learning, all the strings have to be
-mapped to integers, and you have to save and load the mapping at training and
-runtime. If you want to display mark-up based on the annotation, you have to
-realign the tokens to your original string.
+This API often leaves you with a lot of busy-work. If you're doing some machine
+learning or information extraction, all the strings have to be mapped to integers,
+and you have to save and load the mapping at training and runtime. If you want
+to display mark-up based on the annotation, you have to realign the tokens to your
+original string.
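+
+For example, the string-to-integer bookkeeping alone usually ends up looking
+something like this --- a generic sketch of the busy-work, not any particular
+library's API:
+
+.. code-block:: python
+
+    import json
+
+    class StringMapper:
+        '''Map annotation strings to integer IDs, and remember the mapping.'''
+        def __init__(self):
+            self.ids = {}
+
+        def __call__(self, string):
+            # Assign IDs first-come, first-served while training.
+            if string not in self.ids:
+                self.ids[string] = len(self.ids)
+            return self.ids[string]
+
+        def save(self, path):
+            # You have to persist the mapping at training time...
+            with open(path, 'w') as file_:
+                json.dump(self.ids, file_)
+
+        def load(self, path):
+            # ...and remember to load exactly the same mapping at runtime.
+            with open(path) as file_:
+                self.ids = json.load(file_)
+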
-With spaCy, you should never have to do any string processing at all:
+I've been writing NLP systems for almost ten years now, so I've done these
+things dozens of times. When designing spaCy, I thought carefully about how to
+make the right thing easy.
+
+We begin by initializing a global vocabulary store:
>>> from spacy.en import EN
- >>> from spacy.en import feature_names as fn
- >>> tokens = EN.tokenize('''Some string of language.''')
- >>> tokens.to_array((fn.WORD, fn.SUFFIX, fn.CLUSTER, fn.POS, fn.LEMMA))
+ >>> EN.load()
-A range of excellent features are pre-computed for you, and by default the
-words are part-of-speech tagged and lemmatized. We do this by default because
-even with these extra processes, spaCy is still several times faster than
-most tokenizers:
+The vocabulary reads in a data file with all sorts of pre-computed lexical
+features. You can load anything you like here, but by default I give you:
+
+* String IDs for the word's string, its prefix, suffix and "shape";
+* Length (in Unicode code points);
+* A cluster ID, representing distributional similarity;
+* A cluster ID, representing its typical POS tag distribution;
+* A Good-Turing smoothed unigram probability;
+* 64 boolean flags, for assorted orthographic and distributional features.
+
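+To give a feel for what "pre-computed lexical features" means, here's a toy
+sketch of a lexicon that computes a few of the features above (prefix, suffix,
+shape and length) once per word *type* and then re-uses them. The affix
+lengths and shape rules here are my own placeholders, not spaCy's:
+
+.. code-block:: python
+
+    def word_shape(string):
+        '''Collapse a word into a coarse shape, e.g. "Apple" --> "Xxxx".'''
+        shape = []
+        for char in string:
+            if char.isdigit():
+                code = 'd'
+            elif char.isupper():
+                code = 'X'
+            elif char.isalpha():
+                code = 'x'
+            else:
+                code = char
+            # Once the shape is four characters long, only append when the
+            # character class changes, so long words collapse to short shapes.
+            if len(shape) < 4 or code != shape[-1]:
+                shape.append(code)
+        return ''.join(shape)
+
+    class Lexicon:
+        '''Compute lexical features once per word type, then re-use them.'''
+        def __init__(self):
+            self._cache = {}
+
+        def lookup(self, string):
+            if string not in self._cache:
+                self._cache[string] = {
+                    'word': string,
+                    'prefix': string[:1],
+                    'suffix': string[-3:],
+                    'shape': word_shape(string),
+                    'length': len(string),
+                }
+            return self._cache[string]
+
+Because a small number of frequent word types accounts for most of the tokens
+you'll ever see, almost every lookup is a cache hit, which is why paying the
+feature-extraction cost per type rather than per token is such a good deal.
+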
+With so many features pre-computed, you usually don't have to do any string
+processing at all. You give spaCy your string, and tell it to give you either
+a numpy array, or a counts dictionary:
+
+ >>> from spacy.en import feature_names as fn
+ >>> tokens = EN.tokenize(u'''Some string of language.''')
+ >>> tokens.to_array((fn.WORD, fn.SUFFIX, fn.CLUSTER))
+ ...
+ >>> tokens.count_by(fn.WORD)
+
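+Once you have the ID array or the counts dictionary, a downstream model never
+needs to touch a string. For instance, you could hash the IDs straight into a
+fixed-width feature vector --- the ``bag_of_ids`` helper below is my own
+illustration, not part of spaCy:
+
+.. code-block:: python
+
+    import numpy
+
+    def bag_of_ids(counts, n_buckets=2 ** 18):
+        '''Hash integer feature IDs into a fixed-width float vector.'''
+        vector = numpy.zeros(n_buckets, dtype=numpy.float32)
+        for feat_id, freq in counts.items():
+            vector[feat_id % n_buckets] += freq
+        return vector
+
+    # e.g. word_features = bag_of_ids(tokens.count_by(fn.WORD))
+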
+If you do need strings, you can simply iterate over the Tokens object:
+
+ >>> for token in tokens:
+ ...
+
+I mostly use this for debugging and testing.
+
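+Here's the sort of throwaway check I might write in a test. The ``str()``
+call is a placeholder assumption --- substitute whichever attribute actually
+holds the token's text:
+
+.. code-block:: python
+
+    # Assuming str() of a token gives back its original text (placeholder).
+    words = [str(token) for token in tokens]
+    assert words == ['Some', 'string', 'of', 'language', '.']
+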
+spaCy returns these rich Tokens objects much faster than most other tokenizers
+can give you a list of strings --- in fact, spaCy's POS tagger is *4 times
+faster* than CoreNLP's tokenizer:
+----------+----------+---------------+----------+
| System   | Tokenize | POS Tag       |          |
@@ -48,157 +75,7 @@ most tokenizers:
| ZPar     |          | ~1,500s       |          |
+----------+----------+---------------+----------+
-spaCy is designed to **make the right thing easy**, where the right thing is to:
-* **Use rich distributional and orthographic features**. Without these, your model
- will be very brittle and domain dependent.
-
-* **Compute features per type, not per token**. Because of Zipf's law, you can
- expect this to be exponentially more efficient.
-
-* **Minimize string processing**, and instead compute with arrays of ID ints.
-
-Tokenization done right
-=======================
-
-Most tokenizers rely on complicated regular expressions. Often, they leave you
-with no way to align the tokens back to the original string --- a vital feature
-if you want to display some mark-up, such as spelling correction. The regular
-expressions also interact, making it hard to accommodate special cases.
-
-spaCy introduces a **novel tokenization algorithm** that's much faster and much
-more flexible:
-
-.. code-block:: python
-
- def tokenize(string, prefixes={}, suffixes={}, specials={}):
- '''Sketch of spaCy's tokenization algorithm.'''
- tokens = []
- cache = {}
- for chunk in string.split():
- # Because of Zipf's law, the cache serves the majority of "chunks".
- if chunk in cache:
-                tokens.extend(cache[chunk])
- continue
- key = chunk
-
- subtokens = []
- # Process a chunk by splitting off prefixes e.g. ( " { and suffixes e.g. , . :
- # If we split one off, check whether we're left with a special-case,
- # e.g. contractions (can't, won't, etc), emoticons, abbreviations, etc.
- # This makes the tokenization easy to update and customize.
- while chunk:
- prefix, chunk = _consume_prefix(chunk, prefixes)
- if prefix:
- subtokens.append(prefix)
- if chunk in specials:
- subtokens.extend(specials[chunk])
- break
- suffix, chunk = _consume_suffix(chunk, suffixes)
- if suffix:
- subtokens.append(suffix)
- if chunk in specials:
- subtokens.extend(specials[chunk])
- break
-            cache[key] = subtokens
-            tokens.extend(subtokens)
-        return tokens
-
-Your data is going to have its own quirks, so it's really useful to have
-a tokenizer you can easily control. To see the limitations of the standard
-regex-based approach, check out `CMU's recent work on tokenizing tweets `_. Despite a lot of careful attention, they can't handle all of their
-known emoticons correctly --- doing so would interfere with the way they
-process other punctuation. This isn't a problem for spaCy: we just add them
-all to the special tokenization rules.
-
-
-Comparison with NLTK
-====================
-
-`NLTK `_ provides interfaces to a wide-variety of NLP
-tools and resources, and its own implementations of a few algorithms. It comes
-with comprehensive documentation, and a book introducing concepts in NLP. For
-these reasons, it's very widely known. However, if you're trying to make money
-or do cutting-edge research, NLTK is not a good choice.
-
-The `list of stuff in NLTK `_ looks impressive,
-but almost none of it is useful for real work. You're not going to make any money,
-or do top research, by using the NLTK chat bots, theorem provers, toy CCG implementation,
-etc. Most of NLTK is there to assist in the explanation of ideas in computational
-linguistics, at roughly an undergraduate level.
-But it also claims to support serious work, by wrapping external tools.
-
-In a pretty well known essay, Joel Spolsky discusses the pain of dealing with
-`leaky abstractions `_.
-An abstraction tells you to not care about implementation
-details, but sometimes the implementation matters after all. When it
-does, you have to waste time revising your assumptions.
-
-NLTK's wrappers call external tools via subprocesses, and wrap this up so
-that it looks like a native API. This abstraction leaks *a lot*. The system
-calls impose far more overhead than a normal Python function call, which makes
-the most natural way to program against the API infeasible.
-
-
-Case study: POS tagging
------------------------
-
-Here's a quick comparison of the following POS taggers:
-
-* **Stanford (CLI)**: The Stanford POS tagger, invoked once as a batch process
- from the command-line;
-* **nltk.tag.stanford**: The Stanford tagger, invoked document-by-document via
- NLTK's wrapper;
-* **nltk.pos_tag**: NLTK's own POS tagger, invoked document-by-document.
-* **spacy.en.pos_tag**: spaCy's POS tagger, invoked document-by-document.
-
-
-+-------------------+-------------+--------+
-| System | Speed (w/s) | % Acc. |
-+-------------------+-------------+--------+
-| spaCy | 107,000 | 96.7 |
-+-------------------+-------------+--------+
-| Stanford (CLI) | 8,000 | 96.7 |
-+-------------------+-------------+--------+
-| nltk.pos_tag | 543 | 94.0 |
-+-------------------+-------------+--------+
-| nltk.tag.stanford | 209 | 96.7 |
-+-------------------+-------------+--------+
-
-Experimental details TODO. Three things are apparent from this comparison:
-
-1. The native NLTK tagger, nltk.pos_tag, is both slow and inaccurate;
-
-2. Calling the Stanford tagger document-by-document via NLTK is **40x** slower
- than invoking the model once as a batch process, via the command-line;
-
-3. spaCy is over 10x faster than the Stanford tagger, even when called
- **sentence-by-sentence**.
-
-The problem is that NLTK simply wraps the command-line
-interfaces of these tools, so communication is via a subprocess. NLTK does not
-even hold open a pipe for you --- the model is reloaded, again and again.
-
-To use the wrapper effectively, you should batch up your text as much as possible.
-This probably isn't how you would like to structure your pipeline, and you
-might not be able to batch up much text at all, e.g. if serving a single
-request means processing a single document.
-Technically, NLTK does give you Python functions to access lots of different
-systems --- but, you can't use them as you would expect to use a normal Python
-function. The abstraction leaks.
-
-Here's the bottom-line: the Stanford tools are written in Java, so using them
-from Python sucks. You shouldn't settle for this. It's a problem that springs
-purely from the tooling, rather than the domain.
-
-Summary
--------
-
-NLTK is a well-known Python library for NLP, but for the important bits, you
-don't get actual Python modules. You get wrappers which throw to external
-tools, via subprocesses. This is not at all the same thing.
-
-spaCy is implemented in Cython, just like numpy, scikit-learn, lxml and other
-high-performance Python libraries. So you get a native Python API, but the
-performance you expect from a program written in C.
.. toctree::