Overview
========
What and Why
------------
spaCy is a lightning-fast, full-cream NLP tokenizer and lexicon.
Most tokenizers give you a sequence of strings. That's barbaric.
Giving you strings invites you to compute on every *token*, when what
you should be doing is computing on every *type*. Remember
`Zipf's law <http://en.wikipedia.org/wiki/Zipf's_law>`_: you'll
see exponentially fewer types than tokens.
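
The point is easy to see on even a toy sentence. Here is a minimal sketch, in plain Python, of counting types versus tokens:

```python
from collections import Counter

text = "the cat sat on the mat because the mat was warm"
tokens = text.split()
types = Counter(tokens)

# Far fewer types than tokens: compute a feature once per type,
# then look it up for every token that shares it.
print(len(tokens))  # 11 tokens
print(len(types))   # 8 distinct types
```

On realistic corpora the gap is far larger: a few hundred thousand types can cover billions of tokens.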
Instead of strings, spaCy gives you references to Lexeme objects, from which you
can access an excellent set of pre-computed orthographic and distributional features:
::

    >>> from spacy import en
    >>> apples, are, nt, oranges, dots = en.EN.tokenize(u"Apples aren't oranges...")
    >>> are.prob >= oranges.prob
    True
    >>> apples.check_flag(en.IS_TITLE)
    True
    >>> apples.check_flag(en.OFT_TITLE)
    False
    >>> are.check_flag(en.CAN_NOUN)
    False
spaCy makes it easy to write efficient NLP applications, because your feature
functions have to do almost no work: almost every lexical property you'll want
is pre-computed for you. See the tutorial for an example POS tagger.
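The idea behind per-type pre-computation can be sketched in plain Python. This is an illustrative sketch, not spaCy's actual API: ``is_title_like`` stands in for any expensive lexical computation.

```python
# Illustrative sketch (not spaCy's API): compute a costly property once
# per type, so per-token feature extraction is just a dict lookup.
def is_title_like(word):
    # Stand-in for an expensive orthographic computation.
    return word[:1].isupper() and not word.isupper()

tokens = "Apples are not Oranges are not Apples".split()

# Build the lexicon: one computation per *type*...
lexicon = {w: is_title_like(w) for w in set(tokens)}

# ...then features per *token* are just lookups.
features = [lexicon[w] for w in tokens]
```

spaCy does this work for you up front, so your feature functions reduce to attribute accesses on Lexeme objects.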
Benchmark
---------
The tokenizer itself is also efficient:

+--------+-------+--------------+--------------+
| System | Time  | Words/second | Speed Factor |
+========+=======+==============+==============+
| NLTK   | 6m4s  | 89,000       | 1.00         |
+--------+-------+--------------+--------------+
| spaCy  | 9.5s  | 3,093,000    | 38.30        |
+--------+-------+--------------+--------------+

The comparison refers to 30 million words from the English Gigaword, on
a MacBook Air. For context, calling string.split() on the data completes in
about 5s.
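If you want to reproduce the string.split() baseline yourself, a minimal timing sketch (on synthetic text, so the absolute numbers will differ from the Gigaword figures above) looks like this:

```python
import time

# Synthetic corpus of exactly one million words.
text = "the quick brown fox " * 250_000

start = time.perf_counter()
words = text.split()
elapsed = time.perf_counter() - start

print(f"{len(words):,} words in {elapsed:.3f}s "
      f"({len(words) / elapsed:,.0f} words/second)")
```

This measures only the raw split, with no lexicon lookup; it is the floor any tokenizer's throughput should be judged against.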
Pros and Cons
-------------

Pros:

- All tokens come with indices into the original string
- Full unicode support
- Extendable to other languages
- Batch operations computed efficiently in Cython
- Cython API
- numpy interoperability

Cons:

- It's new (released September 2014)
- Security concerns, from memory management
- Higher memory usage (up to 1GB)
- More conceptually complicated
- Tokenization rules expressed in code, not as data