spaCy/docs/source/index.rst

.. spaCy documentation master file, created by
   sphinx-quickstart on Tue Aug 19 16:27:38 2014.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

===================================
spaCy: Text-processing for products
===================================

spaCy is a library for industrial-strength text processing in Python and Cython.
Its core values are efficiency, accuracy and minimalism: you get a fast pipeline of
state-of-the-art components, a nice API, and no clutter.

spaCy is particularly good for feature extraction, because it pre-loads lexical
resources, maps strings to integer IDs, and supports output of numpy arrays:

    >>> from spacy.en import English
    >>> nlp = English()
    >>> tokens = nlp(u'An example sentence', tag=True, parse=True)
    >>> from spacy.en import attrs
    >>> feats = tokens.to_array((attrs.LEMMA, attrs.POS, attrs.SHAPE, attrs.CLUSTER))
    >>> for lemma, pos, shape, cluster in feats:
    ...   print nlp.strings[lemma], nlp.tagger.tags[pos], nlp.strings[shape], cluster

spaCy also makes it easy to add in-line mark up. Let's say you want to mark all
adverbs in red:

    >>> from spacy.defs import ADVERB
    >>> color = lambda t: u'\033[91m' % t if t.pos == ADVERB else u'%s'
    >>> print u''.join(color(token) + unicode(token) for t in tokens)

Easy.  The trick here is that the Token objects know to pad themselves with
whitespace when you ask for their unicode representation, so you can always get
back the original string. 

spaCy is also very efficient --- much more efficient than any other language
processing tools available.  The table below compares the time to tokenize, POS
tag and parse 100m words of text; it also shows accuracy on the standard
evaluation, from the Wall Street Journal:


+----------+----------+---------------+----------+
| System   | Tokenize | POS Tag       |          |
+----------+----------+---------------+----------+
| spaCy    | 37s      | 98s           |          |
+----------+----------+---------------+----------+
| NLTK     | 626s     | 44,310s (12h) |          |
+----------+----------+---------------+----------+
| CoreNLP  | 420s     | 1,300s (22m)  |          |
+----------+----------+---------------+----------+
| ZPar     |          | ~1,500s       |          |
+----------+----------+---------------+----------+

spaCy completes its whole pipeline faster than some of the other libraries can
tokenize the text.  Its POS tag accuracy is as good as any system available.
For parsing, I chose an algorithm that sacrificed some accuracy, in favour of
efficiency.

I wrote spaCy so that startups and other small companies could take advantage
of the enormous progress being made by NLP academics.  Academia is competitive,
and what you're competing to do is write papers --- so it's very hard to write
software useful to non-academics. Seeing this gap, I resigned from my post-doc,
and wrote spaCy.

spaCy is dual-licensed: you can either use it under the GPL, or pay a one-time
fee of $5000 for a commercial license.  I think this is excellent value:
you'll find NLTK etc much more expensive, because what you save on license
cost, you'll lose many times over in lost productivity. $5000 does not buy you
much developer time.

.. toctree::
    :hidden:
    :maxdepth: 3

    features.rst
    license_stories.rst
* Re-add docs, sorting out mess from gh-pages 2014-09-25 16:42:20 +00:00			`.. spaCy documentation master file, created by`
			`sphinx-quickstart on Tue Aug 19 16:27:38 2014.`
			`You can adapt this file completely to your liking, but it should at least`
			contain the root `toctree` directive.

* Play with examples in index.rst 2014-12-23 04:17:56 +00:00			`===================================`
			`spaCy: Text-processing for products`
			`===================================`
* Re-add docs, sorting out mess from gh-pages 2014-09-25 16:42:20 +00:00
* More thoughts on intro 2014-12-14 22:19:29 +00:00			`spaCy is a library for industrial-strength text processing in Python and Cython.`
* More index.rst fiddling 2014-12-21 06:40:12 +00:00			`Its core values are efficiency, accuracy and minimalism: you get a fast pipeline of`
			`state-of-the-art components, a nice API, and no clutter.`
* Revise intro copy. Add NLTK comparison 2014-12-01 11:55:13 +00:00
* More index.rst fiddling 2014-12-21 06:40:12 +00:00			`spaCy is particularly good for feature extraction, because it pre-loads lexical`
			`resources, maps strings to integer IDs, and supports output of numpy arrays:`
* Upd docs 2014-12-09 05:08:01 +00:00
* More index.rst fiddling 2014-12-21 06:40:12 +00:00			`>>> from spacy.en import English`
			`>>> nlp = English()`
* Play with examples in index.rst 2014-12-23 04:17:56 +00:00			`>>> tokens = nlp(u'An example sentence', tag=True, parse=True)`
			`>>> from spacy.en import attrs`
			`>>> feats = tokens.to_array((attrs.LEMMA, attrs.POS, attrs.SHAPE, attrs.CLUSTER))`
			`>>> for lemma, pos, shape, cluster in feats:`
			`... print nlp.strings[lemma], nlp.tagger.tags[pos], nlp.strings[shape], cluster`
* Upd docs 2014-12-09 05:08:01 +00:00
* More index.rst fiddling 2014-12-21 06:40:12 +00:00			`spaCy also makes it easy to add in-line mark up. Let's say you want to mark all`
			`adverbs in red:`
* Upd docs 2014-12-09 05:08:01 +00:00
* More index.rst fiddling 2014-12-21 06:40:12 +00:00			`>>> from spacy.defs import ADVERB`
			`>>> color = lambda t: u'\033[91m' % t if t.pos == ADVERB else u'%s'`
* Play with examples in index.rst 2014-12-23 04:17:56 +00:00			`>>> print u''.join(color(token) + unicode(token) for t in tokens)`
* Another redraft of index.rst 2014-12-15 05:32:03 +00:00
* Play with examples in index.rst 2014-12-23 04:17:56 +00:00			`Easy. The trick here is that the Token objects know to pad themselves with`
			`whitespace when you ask for their unicode representation, so you can always get`
			`back the original string.`
* Make intro chattier, explain philosophy better 2014-12-02 04:20:18 +00:00
* More index.rst fiddling 2014-12-21 06:40:12 +00:00			`spaCy is also very efficient --- much more efficient than any other language`
			`processing tools available. The table below compares the time to tokenize, POS`
			`tag and parse 100m words of text; it also shows accuracy on the standard`
			`evaluation, from the Wall Street Journal:`
* Another redraft of index.rst 2014-12-15 05:32:03 +00:00
* More thoughts on intro 2014-12-14 22:19:29 +00:00
			`+----------+----------+---------------+----------+`
			`\| System \| Tokenize \| POS Tag \| \|`
			`+----------+----------+---------------+----------+`
			`\| spaCy \| 37s \| 98s \| \|`
			`+----------+----------+---------------+----------+`
			`\| NLTK \| 626s \| 44,310s (12h) \| \|`
			`+----------+----------+---------------+----------+`
			`\| CoreNLP \| 420s \| 1,300s (22m) \| \|`
			`+----------+----------+---------------+----------+`
			`\| ZPar \| \| ~1,500s \| \|`
			`+----------+----------+---------------+----------+`
* Make intro chattier, explain philosophy better 2014-12-02 04:20:18 +00:00
* More index.rst fiddling 2014-12-21 06:40:12 +00:00			`spaCy completes its whole pipeline faster than some of the other libraries can`
			`tokenize the text. Its POS tag accuracy is as good as any system available.`
			`For parsing, I chose an algorithm that sacrificed some accuracy, in favour of`
			`efficiency.`
* Another redraft of index.rst 2014-12-15 05:32:03 +00:00
* More index.rst fiddling 2014-12-21 06:40:12 +00:00			`I wrote spaCy so that startups and other small companies could take advantage`
			`of the enormous progress being made by NLP academics. Academia is competitive,`
			`and what you're competing to do is write papers --- so it's very hard to write`
			`software useful to non-academics. Seeing this gap, I resigned from my post-doc,`
			`and wrote spaCy.`
* Upd docs 2014-09-26 16:40:18 +00:00
* Play with examples in index.rst 2014-12-23 04:17:56 +00:00			`spaCy is dual-licensed: you can either use it under the GPL, or pay a one-time`
			`fee of $5000 for a commercial license. I think this is excellent value:`
			`you'll find NLTK etc much more expensive, because what you save on license`
			`cost, you'll lose many times over in lost productivity. $5000 does not buy you`
			`much developer time.`

* Re-add docs, sorting out mess from gh-pages 2014-09-25 16:42:20 +00:00			`.. toctree::`
* Update docs 2014-10-15 10:50:34 +00:00			`:hidden:`
* Re-add docs, sorting out mess from gh-pages 2014-09-25 16:42:20 +00:00			`:maxdepth: 3`
* Revise intro copy. Add NLTK comparison 2014-12-01 11:55:13 +00:00
			`features.rst`
* Make intro chattier, explain philosophy better 2014-12-02 04:20:18 +00:00			`license_stories.rst`