.. spaCy documentation master file, created by
   sphinx-quickstart on Tue Aug 19 16:27:38 2014.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

spaCy NLP Tokenizer and Lexicon
================================

spaCy splits a string of natural language into a list of references to lexical types:

    >>> from spacy.en import EN
    >>> tokens = EN.tokenize(u"Examples aren't easy, are they?")
    >>> type(tokens[0])
    spacy.word.Lexeme
    >>> tokens[1] is tokens[5]
    True

Other tokenizers return lists of strings, which is
`downright barbaric <guide/overview.html>`__. If you get a list of strings,
you have to write all the features yourself, and you'll probably compute them
on a per-token basis, instead of a per-type basis. At scale, that's very
inefficient, because the same features are recomputed for every occurrence of
a word.

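The per-type idea can be sketched in plain Python. This is a minimal
illustration and not spaCy's implementation: ``LexicalType``, ``Vocabulary``
and the whitespace split are hypothetical stand-ins, chosen only to show
features being computed once per distinct string rather than once per token.

.. code-block:: python

    class LexicalType(object):
        """Hypothetical lexical type: features are computed once, here."""

        def __init__(self, string):
            self.string = string
            self.is_alpha = string.isalpha()
            self.is_title = string.istitle()
            self.lower = string.lower()


    class Vocabulary(object):
        """Hypothetical cache mapping each distinct string to one object."""

        def __init__(self):
            self._types = {}
            self.n_computed = 0  # how many times features were computed

        def lookup(self, string):
            lex = self._types.get(string)
            if lex is None:
                lex = LexicalType(string)
                self._types[string] = lex
                self.n_computed += 1
            return lex

        def tokenize(self, text):
            # Naive whitespace split, for illustration only.
            return [self.lookup(word) for word in text.split()]


    vocab = Vocabulary()
    tokens = vocab.tokenize(u"the cat saw the dog chase the bird")

Here eight tokens share six types, so the features are computed six times, and
``tokens[0] is tokens[3]`` holds because both positions refer to the same
object.
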
spaCy's tokens come with the following orthographic and distributional features
pre-computed:

* Orthographic flags, such as is_alpha, is_digit, is_punct, is_title etc;

* Useful string transforms, such as canonical casing, word shape, ASCIIfied,
  etc;

* Unigram log probability;

* Brown cluster;

* can_noun, can_verb etc tag-dictionary;

* oft_upper, oft_title etc case-behaviour flags.

The features are up-to-date with current NLP research, but you can replace or
augment them if you need to.

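To make two of the listed features concrete, here is a rough sketch of a word
shape transform and a unigram log probability table. Both functions are
assumptions written for illustration; spaCy's own definitions may differ (some
shape schemes, for example, truncate long runs of the same symbol).

.. code-block:: python

    import math
    from collections import Counter


    def word_shape(string):
        """Map characters to classes: X upper, x lower, d digit."""
        shape = []
        for char in string:
            if char.isdigit():
                shape.append("d")
            elif char.isalpha():
                shape.append("X" if char.isupper() else "x")
            else:
                # Punctuation and other symbols are kept as they are.
                shape.append(char)
        return "".join(shape)


    def unigram_log_probs(words):
        """Natural log of each word's relative frequency in ``words``."""
        counts = Counter(words)
        total = float(sum(counts.values()))
        return dict((w, math.log(c / total)) for w, c in counts.items())


    shapes = [word_shape(w) for w in (u"Hello", u"C3PO", u"isn't")]
    log_probs = unigram_log_probs([u"a", u"b", u"a", u"a"])

With these definitions, ``shapes`` is ``['Xxxxx', 'XdXX', "xxx'x"]``, and
``log_probs`` assigns ``u"a"`` the value ``math.log(0.75)``.
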
.. toctree::
    :maxdepth: 3

    guide/overview.rst
    guide/install.rst

    api/index.rst

    modules/index.rst

License
=======

+------------------+------+
| Non-commercial   | $0   |
+------------------+------+
| Trial commercial | $0   |
+------------------+------+
| Full commercial  | $500 |
+------------------+------+

spaCy is non-free software. Its source is published, but the copyright is
retained by the author (Matthew Honnibal). Licenses are currently under preparation.

There is currently a gap between the output of academic NLP researchers and
the needs of small software companies. I left academia to try to correct this.
My idea is that non-commercial and trial commercial use should "feel" just like
free software. But if you do use the code in a commercial product, a small
fixed license fee will apply, in order to fund development.