doctype html
html(lang='en')
  head
    meta(charset='utf-8')
    title spaCy Blog
    meta(name='description', content='')
    meta(name='author', content='Matthew Honnibal')
    link(rel='stylesheet', href='css/style.css')
    //if lt IE 9
      script(src='http://html5shiv.googlecode.com/svn/trunk/html5.js')

  body#blog
    header(role='banner')
      h1.logo spaCy Blog
      .slogan Blog

    main#content(role='main')
      article.post
        :markdown-it
          # Adverbs

          Let's say you're developing a proofreading tool, or possibly an IDE for
          writers. You're convinced by Stephen King's advice that [adverbs are
          not your friend](http://www.brainpickings.org/2013/03/13/stephen-king-on-adverbs/),
          so you want to **highlight all adverbs**. We'll use one of the examples
          he finds particularly egregious:

        pre.language-python
          code
            | >>> import spacy.en
            | >>> from spacy.parts_of_speech import ADV
            | >>> # Load the pipeline, and call it with some text.
            | >>> nlp = spacy.en.English()
            | >>> tokens = nlp(u"‘Give it back,’ he pleaded abjectly, ‘it’s mine.’", tag=True, parse=False)
            | >>> print u''.join(tok.string.upper() if tok.pos == ADV else tok.string for tok in tokens)
            | ‘Give it BACK,’ he pleaded ABJECTLY, ‘it’s mine.’

        :markdown-it
          Easy enough --- but the problem is that we've also highlighted "back".
          While "back" is undoubtedly an adverb, we probably don't want to highlight
          it. If what we're trying to do is flag dubious stylistic choices, we'll
          need to refine our logic. It turns out only a certain type of adverb
          is of interest to us.

        :markdown-it
          There are lots of ways we might do this, depending on just what words
          we want to flag. The simplest way to exclude adverbs like "back" and
          "not" is by word frequency: these words are much more common than the
          prototypical manner adverbs that the style guides are worried about.

        :markdown-it
          The `Lexeme.prob` and `Token.prob` attributes give a
          log probability estimate of the word:

        pre.language-python
          code
            | >>> nlp.vocab[u'back'].prob
            | -7.403977394104004
            | >>> nlp.vocab[u'not'].prob
            | -5.407193660736084
            | >>> nlp.vocab[u'quietly'].prob
            | -11.07155704498291

        :markdown-it
          (The probability estimate is based on counts from a 3 billion word corpus,
          smoothed using the *Simple Good-Turing* method.)
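
        :markdown-it
          To get a sense of the scale: these are log probabilities (natural
          logs, we'll assume here), so the difference between two of them is
          the log of a frequency ratio. A minimal sketch:

        pre.language-python
          code
            | >>> import math
            | >>> ratio = math.exp(nlp.vocab[u'back'].prob - nlp.vocab[u'quietly'].prob)
            | >>> # ratio comes out around 39, i.e. "back" is estimated to be
            | >>> # roughly 39 times more frequent than "quietly"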

        :markdown-it
          So we can easily exclude the N most frequent words in English from our
          adverb marker. Let's try N=1000 for now:

        pre.language-python
          code
            | >>> import spacy.en
            | >>> from spacy.parts_of_speech import ADV
            | >>> nlp = spacy.en.English()
            | >>> # Find log probability of Nth most frequent word
            | >>> probs = [lex.prob for lex in nlp.vocab]
            | >>> probs.sort()
            | >>> is_adverb = lambda tok: tok.pos == ADV and tok.prob < probs[-1000]
            | >>> tokens = nlp(u"‘Give it back,’ he pleaded abjectly, ‘it’s mine.’")
            | >>> print u''.join(tok.string.upper() if is_adverb(tok) else tok.string for tok in tokens)
            | ‘Give it back,’ he pleaded ABJECTLY, ‘it’s mine.’

        :markdown-it
          There are lots of other ways we could refine the logic, depending on
          just what words we want to flag. Let's say we wanted to only flag
          adverbs that modified words similar to "pleaded". This is easy to do,
          as spaCy loads a vector-space representation for every word (by default,
          the vectors produced by *Levy and Goldberg (2014)*). Naturally, the
          vector is provided as a numpy array:

        pre.language-python
          code
            | >>> pleaded = tokens[7]
            | >>> pleaded.repvec.shape
            | (300,)
            | >>> pleaded.repvec[:5]
            | array([ 0.04229792, 0.07459262, 0.00820188, -0.02181299, 0.07519238], dtype=float32)

        :markdown-it
          We want to sort the words in our vocabulary by their similarity to
          "pleaded". There are lots of ways to measure the similarity of two
          vectors. We'll use the cosine metric:

        pre.language-python
          code
            | >>> from numpy import dot
            | >>> from numpy.linalg import norm
            |
            | >>> cosine = lambda v1, v2: dot(v1, v2) / (norm(v1) * norm(v2))
            | >>> words = [w for w in nlp.vocab if w.has_repvec]
            | >>> words.sort(key=lambda w: cosine(w.repvec, pleaded.repvec))
            | >>> words.reverse()
            | >>> print('1-20', ', '.join(w.orth_ for w in words[0:20]))
            | 1-20 pleaded, pled, plead, confessed, interceded, pleads, testified, conspired, motioned, demurred, countersued, remonstrated, begged, apologised, consented, acquiesced, petitioned, quarreled, appealed, pleading
            | >>> print('50-60', ', '.join(w.orth_ for w in words[50:60]))
            | 50-60 counselled, bragged, backtracked, caucused, refiled, dueled, mused, dissented, yearned, confesses
            | >>> print('100-110', ', '.join(w.orth_ for w in words[100:110]))
            | 100-110 cabled, ducked, sentenced, perjured, absconded, bargained, overstayed, clerked, confided, sympathizes
            | >>> print('1000-1010', ', '.join(w.orth_ for w in words[1000:1010]))
            | 1000-1010 scorned, baled, righted, requested, swindled, posited, firebombed, slimed, deferred, sagged
            | >>> print('50000-50010', ', '.join(w.orth_ for w in words[50000:50010]))
            | 50000-50010 fb, ford, systems, puck, anglers, ik, tabloid, dirty, rims, artists

        :markdown-it
          As you can see, the similarity model that these vectors give us is excellent
          --- we're still getting meaningful results at 1000 words, off a single
          prototype! The only problem is that the list really contains two clusters of
          words: one associated with the legal meaning of "pleaded", and one for the more
          general sense. Sorting out these clusters is an area of active research.

          A simple work-around is to average the vectors of several words, and use that
          as our target:

        pre.language-python
          code
            | >>> say_verbs = ['pleaded', 'confessed', 'remonstrated', 'begged', 'bragged', 'confided', 'requested']
            | >>> say_vector = sum(nlp.vocab[verb].repvec for verb in say_verbs) / len(say_verbs)
            | >>> words.sort(key=lambda w: cosine(w.repvec, say_vector))
            | >>> words.reverse()
            | >>> print('1-20', ', '.join(w.orth_ for w in words[0:20]))
            | 1-20 bragged, remonstrated, enquired, demurred, sighed, mused, intimated, retorted, entreated, motioned, ranted, confided, countersued, gestured, implored, interceded, muttered, marvelled, bickered, despaired
            | >>> print('50-60', ', '.join(w.orth_ for w in words[50:60]))
            | 50-60 flaunted, quarrelled, ingratiated, vouched, agonized, apologised, lunched, joked, chafed, schemed
            | >>> print('1000-1010', ', '.join(w.orth_ for w in words[1000:1010]))
            | 1000-1010 hoarded, waded, ensnared, clamoring, abided, deploring, shriveled, endeared, rethought, berate

        :markdown-it
          These definitely look like words that King might scold a writer for attaching
          adverbs to. Recall that our original adverb highlighting function looked like
          this:

        pre.language-python
          code
            | >>> import spacy.en
            | >>> from spacy.parts_of_speech import ADV
            | >>> # Load the pipeline, and call it with some text.
            | >>> nlp = spacy.en.English()
            | >>> tokens = nlp(u"‘Give it back,’ he pleaded abjectly, ‘it’s mine.’",
            |                  tag=True, parse=False)
            | >>> print(u''.join(tok.string.upper() if tok.pos == ADV else tok.string for tok in tokens))
            | ‘Give it BACK,’ he pleaded ABJECTLY, ‘it’s mine.’

        :markdown-it
          We wanted to refine the logic so that only adverbs modifying evocative
          verbs of communication, like "pleaded", were highlighted. We've now
          built a vector that represents that type of word, so we can highlight
          adverbs based on subtler logic, homing in on the adverbs that seem most
          stylistically problematic, given our starting assumptions:

        pre.language-python
          code
            | >>> import numpy
            | >>> from numpy import dot
            | >>> from numpy.linalg import norm
            | >>> import spacy.en
            | >>> from spacy.parts_of_speech import ADV, VERB
            | >>> cosine = lambda v1, v2: dot(v1, v2) / (norm(v1) * norm(v2))
            | >>> def is_bad_adverb(token, target_verb, tol):
            | ...     if token.pos != ADV:
            | ...         return False
            | ...     elif token.head.pos != VERB:
            | ...         return False
            | ...     elif cosine(token.head.repvec, target_verb) < tol:
            | ...         return False
            | ...     else:
            | ...         return True
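
        :markdown-it
          To put the function to work, we can pass it the `say_vector` we built
          above as the target, with a similarity cut-off. The tolerance of 0.3
          below is just an illustrative guess, not a tuned value; note that we
          also need the dependency parse for `token.head`, so we don't pass
          `parse=False` this time:

        pre.language-python
          code
            | >>> tokens = nlp(u"‘Give it back,’ he pleaded abjectly, ‘it’s mine.’")
            | >>> print u''.join(tok.string.upper() if is_bad_adverb(tok, say_vector, tol=0.3) else tok.string for tok in tokens)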

        :markdown-it
          This example was somewhat contrived --- and, truth be told, I've never
          really bought the idea that adverbs were a grave stylistic sin. But
          hopefully it got the message across: state-of-the-art NLP technologies
          are very powerful. spaCy gives you easy and efficient access to them,
          which lets you build all sorts of useful products and features that
          were previously impossible.

    footer(role='contentinfo')

    script(src='js/prism.js')