* Edits to quickstart

This commit is contained in:
Matthew Honnibal 2015-01-24 17:47:51 +11:00
parent a97bed9359
commit 32f58b19d1
1 changed file with 11 additions and 18 deletions

@@ -12,9 +12,8 @@ Install
 $ pip install spacy
 $ python -m spacy.en.download

-The download command fetches and installs the parser model and word representations,
-which are too big to host on PyPi (about 100mb each). The data is installed within
-the spacy.en package directory.
+The download command fetches and installs about 200mb of data, which it installs
+within the spacy.en package directory.

 Usage
 -----
@@ -37,18 +36,14 @@ e.g. `pizza.orth_` and `pizza.orth` provide the integer ID and the string of
 the original orthographic form of the word, with no string normalizations
 applied.

-.. note::
-
-    en.English.__call__ is stateful --- it has an important **side-effect**:
-    spaCy maps strings to sequential integers, so when it processes a new
-    word, the mapping table is updated.
-
-Future releases will feature a way to reconcile :py:class:`strings.StringStore`
-mappings, but for now, you should only work with one instance of the pipeline
-at a time.
-
-This issue only affects rare words. spaCy's pre-compiled lexicon has 260,000
-words; the string IDs for these words will always be consistent.
+.. note:: en.English.__call__ is stateful --- it has an important **side-effect**.
+
+    When it processes a previously unseen word, it increments the ID counter,
+    assigns the ID to the string, and writes the mapping in
+    :py:data:`English.vocab.strings` (instance of
+    :py:class:`strings.StringStore`).
+
+    Future releases will feature a way to reconcile mappings, but for now, you
+    should only work with one instance of the pipeline at a time.

 (Most of the) API at a glance
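
The rewritten note describes how processing text assigns string IDs as a side-effect. A minimal sketch of what that looks like, assuming the 2015-era `spacy.en` API used throughout this quickstart (variable names and example text are illustrative)::

    from spacy.en import English

    nlp = English()

    tokens = nlp(u'I like pizza')
    pizza = tokens[2]

    # .orth_ is the original string, .orth is its integer ID.
    print(pizza.orth_, pizza.orth)

    # Processing a previously unseen word assigns it the next free ID and
    # records the mapping in English.vocab.strings, so IDs for rare words are
    # only consistent within a single pipeline instance.
    more = nlp(u'I like zzzzzz')
    print(more[2].orth_, more[2].orth)
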
@@ -76,7 +71,7 @@ applied.
 **Get dict or numpy array:**

-.. py:method:: tokens.Tokens.to_array(self, attr_ids: List[int]) --> numpy.ndarray[ndim=2, dtype=int32]
+.. py:method:: tokens.Tokens.to_array(self, attr_ids: List[int]) --> ndarray[ndim=2, dtype=long]

 .. py:method:: tokens.Tokens.count_by(self, attr_id: int) --> Dict[int, int]
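
A hedged usage sketch for the two methods above; it assumes the attribute-ID constants (e.g. ``ORTH``, ``LEMMA``) can be imported from ``spacy.en.attrs``, which may differ between releases::

    from spacy.en import English
    from spacy.en.attrs import ORTH, LEMMA  # assumed location of the attribute IDs

    nlp = English()
    tokens = nlp(u'An example sentence about an example sentence')

    # to_array: one row per token, one column per requested attribute ID.
    arr = tokens.to_array([ORTH, LEMMA])

    # count_by: frequency of each distinct value of one attribute, keyed by ID.
    counts = tokens.count_by(ORTH)
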
@@ -93,7 +88,7 @@ applied.
 .. py:attribute:: lexeme.Lexeme.repvec

-**Navigate dependency parse**
+**Navigate to tree- or string-neighbor tokens**

 .. py:method:: nbor(self, i=1) --> Token
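
A small sketch of ``nbor``, which steps to a token's string-neighbor by offset, assuming the same ``spacy.en`` pipeline as in the earlier snippets::

    from spacy.en import English

    nlp = English()
    tokens = nlp(u'Give it back')

    give = tokens[0]
    it = give.nbor()     # next token in string order (i defaults to 1)
    back = give.nbor(2)  # token two positions to the right
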
@@ -115,8 +110,6 @@ applied.
     Length, in unicode code-points. Equal to len(self.orth_).

-    self.string[self.length:] gets whitespace.
-
 .. py:attribute:: idx: int

     Starting offset of word in the original string.
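
A short sketch tying ``idx`` to the surface string, assuming ``idx`` is a character offset into the text passed to the pipeline, as the description above states::

    from spacy.en import English

    nlp = English()
    text = u'Hello beautiful world'
    tokens = nlp(text)

    world = tokens[2]
    # idx is the starting character offset, and the token's length equals
    # len(world.orth_), so this slice recovers the original word.
    assert text[world.idx : world.idx + len(world.orth_)] == world.orth_
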