* Edits to docs

Matthew Honnibal 2015-01-25 22:57:37 +11:00
parent 419fef7627
commit 140cd47e26
2 changed files with 41 additions and 29 deletions

@ -7,10 +7,12 @@
spaCy: Industrial-strength NLP spaCy: Industrial-strength NLP
============================== ==============================
spaCy is a new library for text processing in Python and Cython. `spaCy`_ is a new library for text processing in Python and Cython.
I wrote it because I think small companies are terrible at NLP. Or rather: I wrote it because I think small companies are terrible at NLP. Or rather:
small companies are using terrible NLP technology. small companies are using terrible NLP technology.
.. _spaCy: https://github.com/honnibal/spaCy/
To do great NLP, you have to know a little about linguistics, a lot To do great NLP, you have to know a little about linguistics, a lot
about machine learning, and almost everything about the latest research. about machine learning, and almost everything about the latest research.
The people who fit this description seldom join small companies. The people who fit this description seldom join small companies.
@ -68,24 +70,22 @@ you want to **highlight all adverbs**. We'll use one of the examples he finds
particularly egregious: particularly egregious:
>>> import spacy.en >>> import spacy.en
>>> from spacy.postags import ADVERB >>> from spacy.parts_of_speech import ADV
>>> # Load the pipeline, and call it with some text. >>> # Load the pipeline, and call it with some text.
>>> nlp = spacy.en.English() >>> nlp = spacy.en.English()
>>> tokens = nlp("Give it back, he pleaded abjectly, its mine.", >>> tokens = nlp("Give it back, he pleaded abjectly, its mine.",
tag=True, parse=False) tag=True, parse=False)
>>> output = '' >>> print(''.join(tok.string.upper() if tok.pos == ADV else tok.string for tok in tokens))
>>> for tok in tokens:
... output += tok.string.upper() if tok.pos == ADVERB else tok.string
... output += tok.whitespace
>>> print(output)
Give it BACK, he pleaded ABJECTLY, its mine. Give it BACK, he pleaded ABJECTLY, its mine.
Easy enough --- but the problem is that we've also highlighted "back", when probably Easy enough --- but the problem is that we've also highlighted "back".
we only wanted to highlight "abjectly". While "back" is undoubtedly an adverb, While "back" is undoubtedly an adverb, we probably don't want to highlight it.
we probably don't want to highlight it. If what we're trying to do is flag dubious stylistic choices, we'll need to
refine our logic. It turns out only a certain type of adverb is of interest to
us.
There are lots of ways we might refine our logic, depending on just what words There are lots of ways we might do this, depending on just what words
we want to flag. The simplest way to exclude adverbs like "back" and "not" we want to flag. The simplest way to exclude adverbs like "back" and "not"
is by word frequency: these words are much more common than the prototypical is by word frequency: these words are much more common than the prototypical
manner adverbs that the style guides are worried about. manner adverbs that the style guides are worried about.
@ -93,15 +93,15 @@ manner adverbs that the style guides are worried about.
The :py:attr:`Lexeme.prob` and :py:attr:`Token.prob` attributes give a The :py:attr:`Lexeme.prob` and :py:attr:`Token.prob` attributes give a
log probability estimate of the word: log probability estimate of the word:
>>> nlp.vocab[u'back'].prob >>> nlp.vocab['back'].prob
-7.403977394104004 -7.403977394104004
>>> nlp.vocab[u'not'].prob >>> nlp.vocab['not'].prob
-5.407193660736084 -5.407193660736084
>>> nlp.vocab[u'quietly'].prob >>> nlp.vocab['quietly'].prob
-11.07155704498291 -11.07155704498291
(The probability estimate is based on counts from a 3 billion word corpus, (The probability estimate is based on counts from a 3 billion word corpus,
smoothed using the Gale (2002) `Simple Good-Turing`_ method.) smoothed using the `Simple Good-Turing`_ method.)
.. _`Simple Good-Turing`: http://www.d.umn.edu/~tpederse/Courses/CS8761-FALL02/Code/sgt-gale.pdf .. _`Simple Good-Turing`: http://www.d.umn.edu/~tpederse/Courses/CS8761-FALL02/Code/sgt-gale.pdf
@ -109,26 +109,28 @@ So we can easily exclude the N most frequent words in English from our adverb
marker. Let's try N=1000 for now: marker. Let's try N=1000 for now:
>>> import spacy.en >>> import spacy.en
>>> from spacy.postags import ADVERB >>> from spacy.parts_of_speech import ADV
>>> nlp = spacy.en.English() >>> nlp = spacy.en.English()
>>> # Find log probability of Nth most frequent word >>> # Find log probability of Nth most frequent word
>>> probs = [lex.prob for lex in nlp.vocab] >>> probs = [lex.prob for lex in nlp.vocab]
>>> is_adverb = lambda tok: tok.pos == ADVERB and tok.prob < probs[-1000] >>> probs.sort()
>>> tokens = nlp("Give it back, he pleaded abjectly, its mine.", >>> is_adverb = lambda tok: tok.pos == ADV and tok.prob < probs[-1000]
tag=True, parse=True) >>> tokens = nlp("Give it back, he pleaded abjectly, its mine.")
>>> print(''.join(tok.string.upper() if is_adverb(tok) else tok.string)) >>> print(''.join(tok.string.upper() if is_adverb(tok) else tok.string for tok in tokens))
Give it back, he pleaded ABJECTLY, its mine. Give it back, he pleaded ABJECTLY, its mine.
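The frequency cutoff can be tried out without loading any models. What follows is a minimal sketch, assuming invented log probabilities and hand-assigned adverb tags; ``TOY_LOGPROB``, ``nth_most_frequent_cutoff`` and ``highlight`` are hypothetical names used for illustration and are not part of spaCy.

```python
# Toy stand-ins for the vocabulary. These log probabilities are invented
# for illustration and are NOT values from spaCy's model.
TOY_LOGPROB = {
    "give": -6.0, "it": -4.5, "back": -7.4, ",": -3.0, "he": -5.0,
    "pleaded": -12.5, "abjectly": -15.0, "its": -7.0, "mine": -9.5, ".": -3.2,
}


def nth_most_frequent_cutoff(logprobs, n):
    """Return the log probability of the n-th most frequent entry."""
    ordered = sorted(logprobs)  # ascending, so the rarest entries come first
    return ordered[-n]          # n-th from the end is the n-th most frequent


def highlight(tokens, cutoff):
    """Upper-case adverbs whose log probability is below the cutoff.

    `tokens` is a list of (text_with_trailing_whitespace, is_adverb) pairs.
    The tags are assigned by hand here rather than by a tagger.
    """
    out = []
    for text, is_adv in tokens:
        rare = TOY_LOGPROB[text.strip().lower()] < cutoff
        out.append(text.upper() if is_adv and rare else text)
    return "".join(out)


tokens = [("Give ", False), ("it ", False), ("back", True), (", ", False),
          ("he ", False), ("pleaded ", False), ("abjectly", True),
          (", ", False), ("its ", False), ("mine", False), (".", False)]

# With only ten toy entries, use N=8 instead of N=1000.
cutoff = nth_most_frequent_cutoff(TOY_LOGPROB.values(), 8)
print(highlight(tokens, cutoff))
```

Sorting before indexing matters: taking the N-th item from the end of an unsorted list would give an arbitrary threshold.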
There are lots of other ways we could refine the logic, depending on just what There are lots of other ways we could refine the logic, depending on just what
words we want to flag. Let's say we wanted to only flag adverbs that modified words words we want to flag. Let's say we wanted to only flag adverbs that modified words
similar to "pleaded". This is easy to do, as spaCy loads a vector-space similar to "pleaded". This is easy to do, as spaCy loads a vector-space
representation for every word (by default, the vectors produced by representation for every word (by default, the vectors produced by
`Levy and Goldberg (2014)`_. Naturally, the vector is provided as a numpy `Levy and Goldberg (2014)`_). Naturally, the vector is provided as a numpy
array: array:
>>> pleaded = tokens[8] >>> pleaded = tokens[8]
>>> pleaded.repvec.shape >>> pleaded.repvec.shape
(300,) (300,)
>>> pleaded.repvec[:5]
array([ 0.04229792, 0.07459262, 0.00820188, -0.02181299, 0.07519238], dtype=float32)
.. _Levy and Goldberg (2014): https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/ .. _Levy and Goldberg (2014): https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/
@ -139,18 +141,18 @@ cosine metric:
>>> from numpy import dot >>> from numpy import dot
>>> from numpy.linalg import norm >>> from numpy.linalg import norm
>>> cosine = lambda v1, v2: dot(v1, v2) / (norm(v1) * norm(v2)) >>> cosine = lambda v1, v2: dot(v1, v2) / (norm(v1) * norm(v2))
>>> words = [w for w in nlp.vocab if w.is_lower and w.has_repvec] >>> words = [w for w in nlp.vocab if w.is_lower]
>>> words.sort(key=lambda w: cosine(w.repvec, pleaded.repvec)) >>> words.sort(key=lambda w: cosine(w.repvec, pleaded.repvec))
>>> words.reverse() >>> words.reverse()
>>> print '1-20', ', '.join(w.orth_ for w in words[0:20]) >>> print('1-20', ', '.join(w.orth_ for w in words[0:20]))
1-20 pleaded, pled, plead, confessed, interceded, pleads, testified, conspired, motioned, demurred, countersued, remonstrated, begged, apologised, consented, acquiesced, petitioned, quarreled, appealed, pleading 1-20 pleaded, pled, plead, confessed, interceded, pleads, testified, conspired, motioned, demurred, countersued, remonstrated, begged, apologised, consented, acquiesced, petitioned, quarreled, appealed, pleading
>>> print '50-60', ', '.join(w.orth_ for w in words[50:60]) >>> print('50-60', ', '.join(w.orth_ for w in words[50:60]))
50-60 counselled, bragged, backtracked, caucused, refiled, dueled, mused, dissented, yearned, confesses 50-60 counselled, bragged, backtracked, caucused, refiled, dueled, mused, dissented, yearned, confesses
>>> print '100-110', ', '.join(w.orth_ for w in words[100:110]) >>> print('100-110', ', '.join(w.orth_ for w in words[100:110]))
100-110 cabled, ducked, sentenced, perjured, absconded, bargained, overstayed, clerked, confided, sympathizes 100-110 cabled, ducked, sentenced, perjured, absconded, bargained, overstayed, clerked, confided, sympathizes
>>> print '1000-1010', ', '.join(w.orth_ for w in words[1000:1010]) >>> print('1000-1010', ', '.join(w.orth_ for w in words[1000:1010]))
1000-1010 scorned, baled, righted, requested, swindled, posited, firebombed, slimed, deferred, sagged 1000-1010 scorned, baled, righted, requested, swindled, posited, firebombed, slimed, deferred, sagged
>>> print ', '.join(w.orth_ for w in words[50000:50010]) >>> print(', '.join(w.orth_ for w in words[50000:50010]))
fb, ford, systems, puck, anglers, ik, tabloid, dirty, rims, artists fb, ford, systems, puck, anglers, ik, tabloid, dirty, rims, artists
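The cosine metric divides the dot product of two vectors by the product of their norms, so it measures direction and ignores length. The sketch below shows the ranking step on invented three-dimensional vectors. It is written in plain Python so it runs without numpy or the spaCy data, and ``toy_vectors`` is a hypothetical stand-in for the vocabulary.

```python
from math import sqrt


def cosine(v1, v2):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = sqrt(sum(a * a for a in v1))
    norm2 = sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)


# Invented 3-dimensional vectors, for illustration only. The vectors
# discussed in the text have 300 dimensions.
toy_vectors = {
    "pleaded": [0.9, 0.1, 0.0],
    "begged": [0.8, 0.2, 0.1],
    "ford": [0.0, 0.1, 0.9],
}

target = toy_vectors["pleaded"]

# Sort words from most to least similar to the target vector.
ranked = sorted(toy_vectors,
                key=lambda w: cosine(toy_vectors[w], target),
                reverse=True)
print(ranked)
```

Passing ``reverse=True`` to ``sorted`` does the same job as sorting and then calling ``reverse()``.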
As you can see, the similarity model that these vectors give us is excellent As you can see, the similarity model that these vectors give us is excellent
@ -169,10 +171,10 @@ as our target:
... say_vector += nlp.vocab[verb].repvec ... say_vector += nlp.vocab[verb].repvec
>>> words.sort(key=lambda w: cosine(w.repvec, say_vector)) >>> words.sort(key=lambda w: cosine(w.repvec, say_vector))
>>> words.reverse() >>> words.reverse()
>>> print '1-20', ', '.join(w.orth_ for w in words[0:20]) >>> print('1-20', ', '.join(w.orth_ for w in words[0:20]))
1-20 bragged, remonstrated, enquired, demurred, sighed, mused, intimated, retorted, entreated, motioned, ranted, confided, countersued, gestured, implored, interceded, muttered, marvelled, bickered, despaired 1-20 bragged, remonstrated, enquired, demurred, sighed, mused, intimated, retorted, entreated, motioned, ranted, confided, countersued, gestured, implored, interceded, muttered, marvelled, bickered, despaired
50-60 flaunted, quarrelled, ingratiated, vouched, agonized, apologised, lunched, joked, chafed, schemed 50-60 flaunted, quarrelled, ingratiated, vouched, agonized, apologised, lunched, joked, chafed, schemed
>>> print '1000-1010', ', '.join(w.orth_ for w in words[1000:1010]) >>> print('1000-1010', ', '.join(w.orth_ for w in words[1000:1010]))
1000-1010 hoarded, waded, ensnared, clamoring, abided, deploring, shriveled, endeared, rethought, berate 1000-1010 hoarded, waded, ensnared, clamoring, abided, deploring, shriveled, endeared, rethought, berate
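Summing several word vectors and ranking against the sum can also be sketched on toy data. The vectors and the names ``add_vectors`` and ``rank`` below are invented for illustration. Because cosine ignores vector length, ranking against the sum gives the same order as ranking against the mean, so no normalisation step is needed.

```python
from math import sqrt


def cosine(v1, v2):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(v1, v2))
    return dot / (sqrt(sum(a * a for a in v1)) * sqrt(sum(b * b for b in v2)))


def add_vectors(vectors):
    """Element-wise sum of equal-length vectors."""
    return [sum(components) for components in zip(*vectors)]


# Invented 3-dimensional vectors, for illustration only.
toy_vectors = {
    "said": [0.9, 0.2, 0.0],
    "pleaded": [0.7, 0.5, 0.1],
    "muttered": [0.8, 0.3, 0.1],
    "rims": [0.0, 0.1, 0.9],
}

seed_verbs = ["said", "pleaded"]
total = add_vectors([toy_vectors[v] for v in seed_verbs])
mean = [x / len(seed_verbs) for x in total]


def rank(target):
    """Words ordered from most to least similar to `target`."""
    return sorted(toy_vectors,
                  key=lambda w: cosine(toy_vectors[w], target),
                  reverse=True)


print(rank(total))
print(rank(total) == rank(mean))
```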
These definitely look like words that King might scold a writer for attaching These definitely look like words that King might scold a writer for attaching

@ -12,6 +12,16 @@ Install
$ pip install spacy $ pip install spacy
$ python -m spacy.en.download $ python -m spacy.en.download
To compile from source:
.. code:: bash
$ git clone https://github.com/honnibal/spaCy.git
$ virtualenv .env && source .env/bin/activate
$ pip install -r requirements.txt
$ python -m spacy.en.download
$ fab make test
The download command fetches and installs about 300MB of data, for the `parser model`_ The download command fetches and installs about 300MB of data, for the `parser model`_
and `word vectors`_, which it installs within the spacy.en package directory. and `word vectors`_, which it installs within the spacy.en package directory.