
.. spaCy documentation master file, created by
   sphinx-quickstart on Tue Aug 19 16:27:38 2014.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

spaCy API Reference
===================

.. toctree::
    :maxdepth: 2

    api/python
    api/cython
    api/extending

Overview
--------

spaCy is a tokenizer for natural languages, tightly coupled to a global
vocabulary store.

Instead of a list of strings, spaCy returns references to lexical types. All
of the string-based features you might need are pre-computed for you:

::

    >>> from spacy import en
    >>> example = u"Apples aren't oranges..."
    >>> apples, are, nt, oranges, ellipses = en.tokenize(example)
    >>> en.is_punct(ellipses)
    True
    >>> en.get_string(en.word_shape(apples))
    'Xxxx'
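
To make the idea of "references to lexical types" concrete, here is a minimal pure-Python sketch of a vocabulary store that computes string features once per type. It is not spaCy's implementation: ``Vocab``, ``Lexeme`` and this ``word_shape`` are hypothetical names, and the run cap of three is an assumption chosen so that ``Apples`` maps to ``Xxxx`` as in the example above.

```python
import string


def word_shape(text, max_run=3):
    """Map a string to a coarse shape, capping runs of one symbol at max_run."""
    shape = []
    last = None
    run = 0
    for char in text:
        if char.isalpha():
            symbol = "X" if char.isupper() else "x"
        elif char.isdigit():
            symbol = "d"
        else:
            symbol = char
        # Track how long the current run of identical symbols is.
        run = run + 1 if symbol == last else 1
        last = symbol
        if run <= max_run:
            shape.append(symbol)
    return "".join(shape)


class Lexeme(object):
    """Hypothetical record for one lexical type; features are computed once."""

    def __init__(self, lex_id, text):
        self.id = lex_id
        self.text = text
        self.shape = word_shape(text)
        self.is_punct = bool(text) and all(c in string.punctuation for c in text)


class Vocab(object):
    """Hypothetical store that hands out one shared Lexeme per distinct string."""

    def __init__(self):
        self._by_text = {}

    def lookup(self, text):
        lexeme = self._by_text.get(text)
        if lexeme is None:
            # First sighting: assign the next ID and pre-compute the features.
            lexeme = Lexeme(len(self._by_text), text)
            self._by_text[text] = lexeme
        return lexeme


vocab = Vocab()
first = vocab.lookup(u"Apples")
second = vocab.lookup(u"Apples")
dots = vocab.lookup(u"...")
```

Both lookups of ``Apples`` return the same object, so a feature such as the shape is computed once and shared by every occurrence of the word.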

You also get lots of distributional features, calculated from a large
sample of text:

::

    >>> en.prob_of(are) > en.prob_of(oranges)
    True
    >>> en.can_noun(are)
    False
    >>> en.is_oft_title(apples)
    False
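
As a rough illustration of what "distributional" means here, the sketch below derives two such features from a toy sample: a unigram probability and the fraction of occurrences that are title-cased. The sample text, the function name ``distributional_features`` and the reading of ``is_oft_title`` as "often title-cased" are assumptions for illustration; spaCy's own features come from a much larger corpus and may be computed differently.

```python
from collections import Counter


def distributional_features(tokens):
    """Return {lowercased word: (probability, title_case_rate)} for a sample."""
    totals = Counter()
    titled = Counter()
    for token in tokens:
        key = token.lower()
        totals[key] += 1
        if token.istitle():
            titled[key] += 1
    n = float(sum(totals.values()))
    return {
        word: (count / n, titled[word] / float(count))
        for word, count in totals.items()
    }


# A made-up, pre-tokenized sample; a real sample would be far larger.
sample = "Apples are red . Oranges are not apples . They are oranges".split()
features = distributional_features(sample)
prob_are, title_rate_are = features["are"]
prob_oranges, title_rate_oranges = features["oranges"]
```

In this sample ``are`` is more probable than ``oranges``, which mirrors the comparison made with ``en.prob_of`` above.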

Pros and Cons
-------------

Pros:

- All tokens come with indices into the original string
- Full unicode support
- Extensible to other languages
- Batch operations computed efficiently in Cython
- Cython API
- numpy interoperability

Cons:

- It's new (released September 2014)
- Higher memory usage (up to 1 GB)
- More conceptually complicated
- Tokenization rules expressed in code, not as data
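
The first point in the Pros list, tokens that carry indices into the original string, can be sketched with the standard library alone. This is a simplified stand-in and not spaCy's tokenizer: the pattern and the function name ``tokenize_with_offsets`` are assumptions, and this regular expression does not reproduce the ``are`` / ``n't`` split shown in the overview.

```python
import re

# Hypothetical rule: a run of word characters, or a run of punctuation.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")


def tokenize_with_offsets(text):
    """Return (start_index, token_string) pairs for each match in text."""
    return [(m.start(), m.group()) for m in TOKEN_PATTERN.finditer(text)]


text = u"Apples aren't oranges..."
tokens = tokenize_with_offsets(text)

# Each token can be recovered by slicing the original string at its index.
recovered = [text[start:start + len(token)] for start, token in tokens]
```

Because every token records where it starts, annotations can be mapped back onto the untouched input. The pattern is written in code, which is the trade-off named in the last point of the Cons list.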

Installation
------------

Installation via pip:

::

    $ pip install spacy

From source, using virtualenv:

::

    $ git clone http://github.com/honnibal/spaCy.git
    $ cd spaCy
    $ virtualenv .env
    $ source .env/bin/activate
    $ pip install -r requirements.txt
    $ fab make
    $ fab test