Overview
========
What and Why
------------
spaCy is a lightning-fast, full-cream NLP tokenizer and lexicon.
Most tokenizers give you a sequence of strings. That's barbaric.
Giving you strings invites you to compute on every *token*, when what
you should be doing is computing on every *type*. Remember
`Zipf's law <http://en.wikipedia.org/wiki/Zipf's_law>`_: you'll
see exponentially fewer types than tokens.
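
The point is easy to see on even a toy sentence. Here is a minimal sketch, in plain Python, of counting types versus tokens:

```python
from collections import Counter

text = "the cat sat on the mat because the mat was warm"
tokens = text.split()
types = Counter(tokens)

# Far fewer types than tokens: compute a feature once per type,
# then look it up for every token that shares it.
print(len(tokens))  # 11 tokens
print(len(types))   # 8 distinct types
```

On realistic corpora the gap is far larger: a few hundred thousand types can cover billions of tokens.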
Instead of strings, spaCy gives you references to Lexeme objects, from which you
can access an excellent set of pre-computed orthographic and distributional features:
::

    >>> from spacy import en
    >>> apples, are, nt, oranges, dots = en.EN.tokenize(u"Apples aren't oranges...")
    >>> are.prob >= oranges.prob
    True
    >>> apples.check_flag(en.IS_TITLE)
    True
    >>> apples.check_flag(en.OFT_TITLE)
    False
    >>> are.check_flag(en.CAN_NOUN)
    False
spaCy makes it easy to write efficient NLP applications, because your feature
functions have to do almost no work: almost every lexical property you'll want
is pre-computed for you. See the tutorial for an example POS tagger.
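The idea behind per-type pre-computation can be sketched in plain Python. This is an illustrative sketch, not spaCy's actual API: ``is_title_like`` stands in for any expensive lexical computation.

```python
# Illustrative sketch (not spaCy's API): compute a costly property once
# per type, so per-token feature extraction is just a dict lookup.
def is_title_like(word):
    # Stand-in for an expensive orthographic computation.
    return word[:1].isupper() and not word.isupper()

tokens = "Apples are not Oranges are not Apples".split()

# Build the lexicon: one computation per *type*...
lexicon = {w: is_title_like(w) for w in set(tokens)}

# ...then features per *token* are just lookups.
features = [lexicon[w] for w in tokens]
```

spaCy does this work for you up front, so your feature functions reduce to attribute accesses on Lexeme objects.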
Benchmark
---------
The tokenizer itself is also efficient:

+--------+-------+--------------+--------------+
| System | Time  | Words/second | Speed Factor |
+========+=======+==============+==============+
| NLTK   | 6m4s  | 89,000       | 1.00         |
+--------+-------+--------------+--------------+
| spaCy  | 9.5s  | 3,093,000    | 38.30        |
+--------+-------+--------------+--------------+

The comparison refers to 30 million words from the English Gigaword, on
a MacBook Air. For context, calling string.split() on the data completes in
about 5s.
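If you want to reproduce the string.split() baseline yourself, a minimal timing sketch (on synthetic text, so the absolute numbers will differ from the Gigaword figures above) looks like this:

```python
import time

# Synthetic corpus of exactly one million words.
text = "the quick brown fox " * 250_000

start = time.perf_counter()
words = text.split()
elapsed = time.perf_counter() - start

print(f"{len(words):,} words in {elapsed:.3f}s "
      f"({len(words) / elapsed:,.0f} words/second)")
```

This measures only the raw split, with no lexicon lookup; it is the floor any tokenizer's throughput should be judged against.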
Pros and Cons
-------------

Pros:

- All tokens come with indices into the original string
- Full unicode support
- Extendable to other languages
- Batch operations computed efficiently in Cython
- Cython API
- numpy interoperability

Cons:

- It's new (released September 2014)
- Security concerns, from memory management
- Higher memory usage (up to 1GB)
- More conceptually complicated
- Tokenization rules expressed in code, not as data