spaCy/docs/index.rst

84 lines
1.7 KiB
ReStructuredText

.. spaCy documentation master file, created by
sphinx-quickstart on Tue Aug 19 16:27:38 2014.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
spaCy API Reference
=================================
.. toctree::
:maxdepth: 2
api/python
api/cython
api/extending
Overview
--------
spaCy is a tokenizer for natural languages, tightly coupled to a global
vocabulary store.
Instead of a list of strings, spaCy returns references to lexical types. All
of the string-based features you might need are pre-computed for you:
::
>>> from spacy import en
>>> example = u"Apples aren't oranges..."
>>> apples, are, nt, oranges, ellipses = en.tokenize(example)
>>> en.is_punct(ellipses)
True
>>> en.get_string(en.word_shape(apples))
'Xxxx'
You also get lots of distributional features, calculated from a large
sample of text:
::
>>> en.prob_of(are) > en.prob_of(oranges)
True
>>> en.can_noun(are)
False
>>> en.is_oft_title(apples)
False
Pros and Cons
-------------
Pros:
- All tokens come with indices into the original string
- Full unicode support
- Extensible to other languages
- Batch operations computed efficiently in Cython
- Cython API
- numpy interoperability
Cons:
- It's new (released September 2014)
- Higher memory usage (up to 1gb)
- More conceptually complicated
- Tokenization rules expressed in code, not as data
Installation
------------
Installation via pip:
pip install spacy
From source, using virtualenv:
::
$ git clone http://github.com/honnibal/spaCy.git
$ cd spaCy
$ virtualenv .env
$ source .env/bin/activate
$ pip install -r requirements.txt
$ fab make
$ fab test