.. spaCy documentation master file, created by
   sphinx-quickstart on Tue Aug 19 16:27:38 2014.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

spaCy NLP Tokenizer and Lexicon
================================

spaCy is an industrial-strength multi-language tokenizer, bristling with features
you never knew you wanted. You do want these features, though: your current
tokenizer has been doing it wrong.

Where other tokenizers give you a list of strings, spaCy gives you references
to rich lexical types, for easy, excellent and efficient feature extraction.

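To make the contrast concrete, here is a minimal sketch of the general idea: each distinct word is analysed once, and every token is just a reference to the shared result. This is not spaCy's implementation. The ``Lexeme``, ``Lexicon``, ``word_shape`` and ``tokenize`` names are hypothetical, and whitespace splitting stands in for a real tokenizer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lexeme:
    """Hypothetical lexical type: features are computed once per word type."""
    text: str
    shape: str       # orthographic shape, e.g. "Xxx" for "Dog"
    is_title: bool
    length: int


def word_shape(text):
    """Map uppercase to 'X', lowercase to 'x', digits to 'd'; keep the rest."""
    out = []
    for ch in text:
        if ch.isupper():
            out.append("X")
        elif ch.islower():
            out.append("x")
        elif ch.isdigit():
            out.append("d")
        else:
            out.append(ch)
    return "".join(out)


class Lexicon:
    """Hypothetical store that hands out one shared Lexeme per distinct string."""

    def __init__(self):
        self._entries = {}

    def lookup(self, text):
        lex = self._entries.get(text)
        if lex is None:
            # Features are computed only the first time a word type is seen.
            lex = Lexeme(text, word_shape(text), text.istitle(), len(text))
            self._entries[text] = lex
        return lex


def tokenize(lexicon, string):
    # Assumption: whitespace splitting stands in for real tokenization rules.
    return [lexicon.lookup(piece) for piece in string.split()]


lexicon = Lexicon()
tokens = tokenize(lexicon, "the cat saw the Dog")
```

Both occurrences of ``the`` resolve to the same object, so reading a feature such as ``tokens[4].shape`` is an attribute lookup and no string processing happens at that point.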
* **Easy**: The tokenizer returns a sequence of rich lexical types, with features
  pre-computed:

  >>> from spacy.en import EN
  >>> string = u'Hello, world!'
  >>> for w in EN.tokenize(string):
  ...     print(w.sic, w.shape, w.cluster, w.oft_title, w.can_verb)

  Check out the tutorial and API docs.

* **Excellent**: Distributional and orthographic features are crucial to robust
  NLP. Without them, models can only learn from tiny annotated training
  corpora. Read more.

* **Efficient**: spaCy serves you rich lexical objects faster than most
  tokenizers can give you a list of strings.

  +--------+-------+--------------+--------------+
  | System | Time  | Words/second | Speed Factor |
  +--------+-------+--------------+--------------+
  | NLTK   | 6m4s  | 89,000       | 1.00         |
  +--------+-------+--------------+--------------+
  | spaCy  | 9.5s  | 3,093,000    | 38.30        |
  +--------+-------+--------------+--------------+

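The Speed Factor column looks like the ratio of the two wall-clock times, with NLTK as the baseline. That reading is an assumption, since the table does not say how the column was derived. The sketch below reproduces the arithmetic under that assumption; ``parse_duration`` is a hypothetical helper written for this example.

```python
def parse_duration(text):
    """Convert a duration such as '6m4s' or '9.5s' to seconds."""
    minutes, _, rest = text.rpartition("m")
    seconds = float(rest.rstrip("s"))
    return (float(minutes) if minutes else 0.0) * 60 + seconds


# Times copied from the table above.
nltk_seconds = parse_duration("6m4s")
spacy_seconds = parse_duration("9.5s")

# Assumed definition: baseline time divided by the system's time.
speed_factor = nltk_seconds / spacy_seconds
```

With these inputs, ``nltk_seconds`` is 364 and ``speed_factor`` comes out at roughly 38.3, which agrees with the table to one decimal place.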
.. toctree::
    :hidden:
    :maxdepth: 3

    what/index.rst
    why/index.rst
    how/index.rst