mirror of https://github.com/explosion/spaCy.git
* Work on quickstart
This commit is contained in:
parent fb6f079092
commit 75feb52c5d

Quick Start
===========

Install
-------

.. code:: bash

    $ pip install spacy
    $ python -m spacy.en.download

The download command fetches and installs the parser model and word
representations, which are too big to host on PyPI (about 100mb each). The
data is installed within the spacy.en package directory.

Usage
-----

The main entry-point is :py:meth:`spacy.en.English.__call__`, which accepts a
unicode string as an argument, and returns a :py:class:`spacy.tokens.Tokens`
object:

    >>> from spacy.en import English
    >>> nlp = English()
    >>> tokens = nlp(u'A fine, very fine, example sentence', tag=True,
    ...              parse=True)

Calls to :py:meth:`English.__call__` have a side-effect: when a new word is
seen, it is added to the string-to-ID mapping table in
:py:class:`English.vocab.strings`. Because of this, you will usually only want
to create one instance of the pipeline. If you create two instances and use
them to process different text, you'll probably get different string-to-ID
mappings. You might choose to wrap the English class as a singleton to ensure
only one instance is created, but I've left that up to you. I prefer to pass
the instance around as an explicit argument.
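
For instance, here is a minimal sketch of that explicit-argument style (the
helper name ``get_lemmas`` is illustrative, not part of spaCy):

.. code:: python

    from spacy.en import English

    def get_lemmas(nlp, text):
        # Reuse the pipeline instance the caller passes in, so every text
        # shares the same string-to-ID mapping table.
        tokens = nlp(text, tag=True, parse=False)
        return [token.lemma for token in tokens]

    nlp = English()  # created once, then passed around explicitly
    print get_lemmas(nlp, u'A fine, very fine, example sentence')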

You shouldn't need to batch up your text or prepare it in any way. Processing
times are linear in the length of the string, with minimal per-call overhead
(apart from the first call, when the tagger and parser models are lazy-loaded;
this takes a few seconds on my machine).

:py:meth:`English.__call__` returns a :py:class:`Tokens` object, through which
you'll access the processed text. You can access the text in three ways (a
short example follows the list):

Iteration
    :py:meth:`Tokens.__iter__` and :py:meth:`Tokens.__getitem__`

    - Most "Pythonic"
    - `spacy.tokens.Token` object, attribute access
    - Inefficient: New Token object created each time.

Export
    :py:meth:`Tokens.count_by` and :py:meth:`Tokens.to_array`

    - `count_by`: Efficient dictionary of counts, for bag-of-words model.
    - `to_array`: Export to numpy array. One row per word, one column per
      attribute.
    - Specify attributes with constants from `spacy.en.attrs`.

Cython
    :py:attr:`TokenC* Tokens.data`

    - Raw data is stored in contiguous array of structs
    - Good syntax, C speed
    - Documentation coming soon. In the meantime, see
      spacy/syntax/_parser.features.pyx or spacy/en/pos.pyx
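
For example, adapting this document's earlier draft examples to the
`spacy.en.attrs` constants named above (a sketch; it assumes the attribute
names ``LEMMA`` and ``SIC`` carry over unchanged from the old ``enums``
module):

    >>> from spacy.en import attrs
    >>> first = tokens[0]                    # __getitem__ --> Token
    >>> for token in tokens:                 # __iter__ --> one Token per word
    ...     print token.sic, token.pos
    >>> counts = tokens.count_by(attrs.LEMMA)             # bag-of-words counts
    >>> arr = tokens.to_array([attrs.LEMMA, attrs.SIC])   # one row per word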

(Most of the) API at a glance
-----------------------------

.. py:method:: __call__(self, text: unicode, tag=True, parse=False) --> Tokens

.. py:method:: vocab.__getitem__(self, text: unicode) --> Lexeme

.. py:class:: spacy.tokens.Tokens via English.__call__

.. py:method:: __getitem__(self, i) --> Token
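
For instance, assuming ``vocab`` hangs off the ``English`` instance as the
signature above suggests, a vocabulary lookup is just (a sketch; output not
shown):

    >>> lexeme = nlp.vocab[u'sentence']   # unicode in, Lexeme out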