mirror of https://github.com/explosion/spaCy.git
84 lines
1.7 KiB
ReStructuredText
84 lines
1.7 KiB
ReStructuredText
.. spaCy documentation master file, created by
|
|
sphinx-quickstart on Tue Aug 19 16:27:38 2014.
|
|
You can adapt this file completely to your liking, but it should at least
|
|
contain the root `toctree` directive.
|
|
|
|
spaCy API Reference
|
|
=================================
|
|
|
|
.. toctree::
|
|
:maxdepth: 2
|
|
|
|
api/python
|
|
api/cython
|
|
api/extending
|
|
|
|
Overview
|
|
--------
|
|
|
|
spaCy is a tokenizer for natural languages, tightly coupled to a global
|
|
vocabulary store.
|
|
|
|
Instead of a list of strings, spaCy returns references to lexical types. All
|
|
of the string-based features you might need are pre-computed for you:
|
|
|
|
::
|
|
|
|
>>> from spacy import en
|
|
>>> example = u"Apples aren't oranges..."
|
|
>>> apples, are, nt, oranges, ellipses = en.tokenize(example)
|
|
>>> en.is_punct(ellipses)
|
|
True
|
|
>>> en.get_string(en.word_shape(apples))
|
|
'Xxxx'
|
|
|
|
You also get lots of distributional features, calculated from a large
|
|
sample of text:
|
|
|
|
::
|
|
|
|
>>> en.prob_of(are) > en.prob_of(oranges)
|
|
True
|
|
>>> en.can_noun(are)
|
|
False
|
|
>>> en.is_oft_title(apples)
|
|
False
|
|
|
|
Pros and Cons
|
|
-------------
|
|
|
|
Pros:
|
|
|
|
- All tokens come with indices into the original string
|
|
- Full unicode support
|
|
- Extensible to other languages
|
|
- Batch operations computed efficiently in Cython
|
|
- Cython API
|
|
- numpy interoperability
|
|
|
|
Cons:
|
|
|
|
- It's new (released September 2014)
|
|
- Higher memory usage (up to 1gb)
|
|
- More conceptually complicated
|
|
- Tokenization rules expressed in code, not as data
|
|
|
|
Installation
|
|
------------
|
|
|
|
Installation via pip:
|
|
|
|
pip install spacy
|
|
|
|
From source, using virtualenv:
|
|
|
|
::
|
|
|
|
$ git clone http://github.com/honnibal/spaCy.git
|
|
$ cd spaCy
|
|
$ virtualenv .env
|
|
$ source .env/bin/activate
|
|
$ pip install -r requirements.txt
|
|
$ fab make
|
|
$ fab test
|