* Edits to docs

Matthew Honnibal 2015-01-25 22:57:37 +11:00
parent 419fef7627
commit 140cd47e26
2 changed files with 41 additions and 29 deletions


@ -7,10 +7,12 @@
spaCy: Industrial-strength NLP
==============================
`spaCy`_ is a new library for text processing in Python and Cython.
I wrote it because I think small companies are terrible at NLP. Or rather:
small companies are using terrible NLP technology.
.. _spaCy: https://github.com/honnibal/spaCy/
To do great NLP, you have to know a little about linguistics, a lot
about machine learning, and almost everything about the latest research.
The people who fit this description seldom join small companies.
@ -68,24 +70,22 @@ you want to **highlight all adverbs**. We'll use one of the examples he finds
particularly egregious:
>>> import spacy.en
>>> from spacy.parts_of_speech import ADV
>>> # Load the pipeline, and call it with some text.
>>> nlp = spacy.en.English()
>>> tokens = nlp("Give it back, he pleaded abjectly, its mine.",
tag=True, parse=False)
>>> output = ''
>>> for tok in tokens:
... output += tok.string.upper() if tok.pos == ADVERB else tok.string
... output += tok.whitespace
>>> print(output)
>>> print(''.join(tok.string.upper() if tok.pos == ADV else tok.string) for t in tokens)
Give it BACK, he pleaded ABJECTLY, its mine.
Easy enough --- but the problem is that we've also highlighted "back".
While "back" is undoubtedly an adverb, we probably don't want to highlight it.
If what we're trying to do is flag dubious stylistic choices, we'll need to
refine our logic. It turns out only a certain type of adverb is of interest to
us.
There are lots of ways we might do this, depending on just what words
we want to flag. The simplest way to exclude adverbs like "back" and "not"
is by word frequency: these words are much more common than the prototypical
manner adverbs that the style guides are worried about.
@ -93,15 +93,15 @@ manner adverbs that the style guides are worried about.
The :py:attr:`Lexeme.prob` and :py:attr:`Token.prob` attributes give a
log probability estimate of the word:
>>> nlp.vocab['back'].prob
-7.403977394104004
>>> nlp.vocab['not'].prob
-5.407193660736084
>>> nlp.vocab['quietly'].prob
-11.07155704498291
(The probability estimate is based on counts from a 3 billion word corpus,
smoothed using the `Simple Good-Turing`_ method.)
.. _`Simple Good-Turing`: http://www.d.umn.edu/~tpederse/Courses/CS8761-FALL02/Code/sgt-gale.pdf
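For intuition, the core Good-Turing step re-estimates the count of a word
observed ``r`` times as ``(r + 1) * N[r + 1] / N[r]``, where ``N[r]`` is the
number of distinct word types observed exactly ``r`` times. Below is a minimal
sketch of just that step; the full Simple Good-Turing procedure also smooths
the ``N[r]`` counts, which this toy version omits, and the counts here are
made up for illustration:

.. code:: python

    from collections import Counter

    def good_turing_count(r, freq_of_freqs):
        # r* = (r + 1) * N_{r+1} / N_r, where N_r is the number of word
        # types seen exactly r times in the corpus.
        return (r + 1) * freq_of_freqs[r + 1] / float(freq_of_freqs[r])

    freq_of_freqs = Counter({1: 268, 2: 112, 3: 70})  # hypothetical N_r counts
    adjusted = good_turing_count(1, freq_of_freqs)    # adjusted count for words seen once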
@ -109,26 +109,28 @@ So we can easily exclude the N most frequent words in English from our adverb
marker. Let's try N=1000 for now:
>>> import spacy.en
>>> from spacy.parts_of_speech import ADV
>>> nlp = spacy.en.English()
>>> # Find log probability of Nth most frequent word
>>> probs = [lex.prob for lex in nlp.vocab]
>>> probs.sort()
>>> is_adverb = lambda tok: tok.pos == ADV and tok.prob < probs[-1000]
>>> tokens = nlp("Give it back, he pleaded abjectly, its mine.")
>>> print(''.join(tok.string.upper() if is_adverb(tok) else tok.string for tok in tokens))
Give it back, he pleaded ABJECTLY, its mine.
There are lots of other ways we could refine the logic, depending on just what
words we want to flag. Let's say we wanted to only flag adverbs that modified words
similar to "pleaded". This is easy to do, as spaCy loads a vector-space
representation for every word (by default, the vectors produced by
`Levy and Goldberg (2014)`_). Naturally, the vector is provided as a numpy
array:
>>> pleaded = tokens[5]   # the Token for "pleaded"
>>> pleaded.repvec.shape
(300,)
>>> pleaded.repvec[:5]
array([ 0.04229792, 0.07459262, 0.00820188, -0.02181299, 0.07519238], dtype=float32)
.. _Levy and Goldberg (2014): https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/
@ -139,18 +141,18 @@ cosine metric:
>>> from numpy import dot
>>> from numpy.linalg import norm
>>> cosine = lambda v1, v2: dot(v1, v2) / (norm(v1) * norm(v2))
>>> words = [w for w in nlp.vocab if w.is_lower]
>>> words.sort(key=lambda w: cosine(w.repvec, pleaded.repvec))
>>> words.reverse()
>>> print('1-20', ', '.join(w.orth_ for w in words[0:20]))
1-20 pleaded, pled, plead, confessed, interceded, pleads, testified, conspired, motioned, demurred, countersued, remonstrated, begged, apologised, consented, acquiesced, petitioned, quarreled, appealed, pleading
>>> print('50-60', ', '.join(w.orth_ for w in words[50:60]))
50-60 counselled, bragged, backtracked, caucused, refiled, dueled, mused, dissented, yearned, confesses
>>> print('100-110', ', '.join(w.orth_ for w in words[100:110]))
100-110 cabled, ducked, sentenced, perjured, absconded, bargained, overstayed, clerked, confided, sympathizes
>>> print('1000-1010', ', '.join(w.orth_ for w in words[1000:1010]))
1000-1010 scorned, baled, righted, requested, swindled, posited, firebombed, slimed, deferred, sagged
>>> print(', '.join(w.orth_ for w in words[50000:50010]))
fb, ford, systems, puck, anglers, ik, tabloid, dirty, rims, artists
As you can see, the similarity model that these vectors give us is excellent
@ -169,10 +171,10 @@ as our target:
>>> # Hypothetical reconstruction of this example's set-up: build the target
>>> # vector by averaging the vectors of a few "say" verbs.
>>> import numpy
>>> say_verbs = ['pleaded', 'confessed', 'begged']
>>> say_vector = numpy.zeros(shape=(300,))
>>> for verb in say_verbs:
...     say_vector += nlp.vocab[verb].repvec
>>> words.sort(key=lambda w: cosine(w.repvec, say_vector))
>>> words.reverse()
>>> print('1-20', ', '.join(w.orth_ for w in words[0:20]))
1-20 bragged, remonstrated, enquired, demurred, sighed, mused, intimated, retorted, entreated, motioned, ranted, confided, countersued, gestured, implored, interceded, muttered, marvelled, bickered, despaired
>>> print('50-60', ', '.join(w.orth_ for w in words[50:60]))
50-60 flaunted, quarrelled, ingratiated, vouched, agonized, apologised, lunched, joked, chafed, schemed
>>> print('1000-1010', ', '.join(w.orth_ for w in words[1000:1010]))
1000-1010 hoarded, waded, ensnared, clamoring, abided, deploring, shriveled, endeared, rethought, berate
These definitely look like words that King might scold a writer for attaching
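Putting the pieces together, here is one way the refined highlighter might
look, reusing ``nlp``, ``cosine``, and ``say_vector`` from the examples above.
This is a sketch only: it assumes the ``Token.head`` attribute (a token's
syntactic parent) and picks an arbitrary similarity threshold:

.. code:: python

    from spacy.parts_of_speech import ADV

    def is_say_adverb(tok, say_vector, threshold=0.6):
        # Flag adverbs whose head word's vector is close to the "say"
        # vector; the 0.6 cut-off is arbitrary, chosen for illustration.
        return tok.pos == ADV and cosine(tok.head.repvec, say_vector) >= threshold

    tokens = nlp("Give it back, he pleaded abjectly, its mine.")
    print(''.join(tok.string.upper() if is_say_adverb(tok, say_vector) else tok.string
                  for tok in tokens))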


@ -12,6 +12,16 @@ Install
$ pip install spacy
$ python -m spacy.en.download
To compile from source:
.. code:: bash
$ git clone https://github.com/honnibal/spaCy.git
$ virtualenv .env && source .env/bin/activate
$ pip install -r requirements.txt
$ python -m spacy.en.download
$ fab make test
The download command fetches about 300 MB of data for the `parser model`_
and `word vectors`_, installing them within the spacy.en package directory.
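After the download completes, a quick way to check that everything is in
place (a minimal sketch, using the same pipeline API as above):

.. code:: python

    import spacy.en

    # The first call loads the downloaded model and word vectors from disk.
    nlp = spacy.en.English()
    tokens = nlp(u'Hello, world. Here are two sentences.')
    print([tok.orth_ for tok in tokens])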