* Work on quickstart

This commit is contained in:
Matthew Honnibal 2015-01-24 02:53:55 +11:00
parent fb6f079092
commit 75feb52c5d
1 changed files with 45 additions and 24 deletions


@@ -5,53 +5,70 @@ Quick Start
Install
-------

.. code:: bash

    $ pip install spacy
    $ python -m spacy.en.download

The download command fetches and installs the parser model and word
representations, which are too big to host on PyPi (about 100mb each). The
data is installed within the spacy.en package directory.
Usage
-----

The main entry-point is :py:meth:`spacy.en.English.__call__`, which accepts a
unicode string as an argument, and returns a :py:class:`spacy.tokens.Tokens`
object:

    >>> from spacy.en import English
    >>> nlp = English()
    >>> tokens = nlp(u'A fine, very fine, example sentence', tag=True,
    ...              parse=True)
Calls to :py:meth:`English.__call__` have a side-effect: when a new
word is seen, it is added to the string-to-ID mapping table in
:py:class:`English.vocab.strings`. Because of this, you will usually only want
to create one instance of the pipeline. If you create two instances and use
them to process different text, you'll probably get different string-to-ID
mappings. You might choose to wrap the English class as a singleton to ensure
only one instance is created, but I've left that up to you. I prefer to pass
the instance around as an explicit argument.
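
For instance, here's a minimal sketch of the explicit-argument style (the
`tag_text` helper is hypothetical, not part of spaCy):

.. code:: python

    from spacy.en import English

    def tag_text(nlp, text):
        # The pipeline is passed in explicitly, so every caller shares
        # one English instance and one string-to-ID mapping table.
        return nlp(text, tag=True, parse=False)

    nlp = English()  # create once, at the top level
    tokens = tag_text(nlp, u'A fine, very fine, example sentence')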
You shouldn't need to batch up your text or prepare it in any way.
Processing times are linear in the length of the string, with minimal per-call
overhead (apart from the first call, when the tagger and parser models are
lazy-loaded; this takes a few seconds on my machine).
:py:meth:`English.__call__` returns a :py:class:`Tokens` object, through which
you'll access the processed text. You can access the text in three ways:
Iteration
    :py:meth:`Tokens.__iter__` and :py:meth:`Tokens.__getitem__`

    - Most "Pythonic"
    - `spacy.tokens.Token` object, attribute access
    - Inefficient: New Token object created each time.
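
For example (the `sic`, `lemma`, and `pos` attributes here follow the usage
shown elsewhere on this page):

    >>> print tokens[0].lemma
    'a'
    >>> for token in tokens:
    ...     print token.sic, token.pos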
Export
    :py:meth:`Tokens.count_by` and :py:meth:`Tokens.to_array`

    - `count_by`: Efficient dictionary of counts, for bag-of-words model.
    - `to_array`: Export to numpy array. One row per word, one column per
      attribute.
    - Specify attributes with constants from `spacy.en.attrs`.
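
A sketch of both export calls (assuming the attribute constants can be
imported as `from spacy.en import attrs`):

    >>> from spacy.en import attrs
    >>> arr = tokens.to_array([attrs.LEMMA, attrs.SIC])
    >>> counts = tokens.count_by(attrs.LEMMA)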
Cython
    :py:attr:`TokenC* Tokens.data`

    - Raw data is stored in contiguous array of structs
    - Good syntax, C speed
    - Documentation coming soon. In the meantime, see
      spacy/syntax/_parser.features.pyx or spacy/en/pos.pyx
(Most of the) API at a glance
@@ -61,6 +78,10 @@ Create a bag-of-words representation:
.. py:method:: __call__(self, text: unicode, tag=True, parse=False) --> Tokens

.. py:method:: vocab.__getitem__(self, text: unicode) --> Lexeme

.. py:class:: spacy.tokens.Tokens via English.__call__

.. py:method:: __getitem__(self, i) --> Token
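
A quick sketch of how these calls fit together (values illustrative):

    >>> from spacy.en import English
    >>> nlp = English()
    >>> tokens = nlp(u'A fine example', tag=True, parse=False)
    >>> token = tokens[0]
    >>> lexeme = nlp.vocab[u'fine']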