doctype html
html(lang='en')
  head
    meta(charset='utf-8')
    title spaCy Blog
    meta(name='description', content='')
    meta(name='author', content='Matthew Honnibal')
    link(rel='stylesheet', href='css/style.css')
    //if lt IE 9
      script(src='http://html5shiv.googlecode.com/svn/trunk/html5.js')

  body#blog
    header(role='banner')
      h1.logo spaCy Blog
      .slogan Blog

    main#content(role='main')
      article.post
        :markdown-it
          # Adverbs

          Let's say you're developing a proofreading tool, or possibly an IDE for
          writers. You're convinced by Stephen King's advice that [adverbs are
          not your friend](http://www.brainpickings.org/2013/03/13/stephen-king-on-adverbs/),
          so you want to **highlight all adverbs**. We'll use one of the examples
          he finds particularly egregious:

        pre.language-python
          code
            | >>> import spacy.en
            | >>> from spacy.parts_of_speech import ADV
            | >>> # Load the pipeline, and call it with some text.
            | >>> nlp = spacy.en.English()
            | >>> tokens = nlp(u"‘Give it back,’ he pleaded abjectly, ‘it’s mine.’", tag=True, parse=False)
            | >>> print u''.join(tok.string.upper() if tok.pos == ADV else tok.string for tok in tokens)
            | ‘Give it BACK,’ he pleaded ABJECTLY, ‘it’s mine.’

        :markdown-it
          Easy enough --- but the problem is that we've also highlighted "back".
          While "back" is undoubtedly an adverb, we probably don't want to highlight
          it. If what we're trying to do is flag dubious stylistic choices, we'll
          need to refine our logic. It turns out only a certain type of adverb
          is of interest to us.

        :markdown-it
          There are lots of ways we might do this, depending on just what words
          we want to flag. The simplest way to exclude adverbs like "back" and
          "not" is by word frequency: these words are much more common than the
          prototypical manner adverbs that the style guides are worried about.

        :markdown-it
          The `Lexeme.prob` and `Token.prob` attributes give a
          log probability estimate of the word:

        pre.language-python
          code
            | >>> nlp.vocab[u'back'].prob
            | -7.403977394104004
            | >>> nlp.vocab[u'not'].prob
            | -5.407193660736084
            | >>> nlp.vocab[u'quietly'].prob
            | -11.07155704498291

        :markdown-it
          (The probability estimate is based on counts from a 3 billion word corpus,
          smoothed using the *Simple Good-Turing* method.)
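
        :markdown-it
          To get a sense of the scale: these are log probabilities (natural
          logs, we'll assume here), so the difference between two of them is
          the log of a frequency ratio. A minimal sketch:

        pre.language-python
          code
            | >>> import math
            | >>> ratio = math.exp(nlp.vocab[u'back'].prob - nlp.vocab[u'quietly'].prob)
            | >>> # ratio comes out around 39, i.e. "back" is estimated to be
            | >>> # roughly 39 times more frequent than "quietly"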

        :markdown-it
          So we can easily exclude the N most frequent words in English from our
          adverb marker. Let's try N=1000 for now:

        pre.language-python
          code
            | >>> import spacy.en
            | >>> from spacy.parts_of_speech import ADV
            | >>> nlp = spacy.en.English()
            | >>> # Find log probability of Nth most frequent word
            | >>> probs = [lex.prob for lex in nlp.vocab]
            | >>> probs.sort()
            | >>> is_adverb = lambda tok: tok.pos == ADV and tok.prob < probs[-1000]
            | >>> tokens = nlp(u"‘Give it back,’ he pleaded abjectly, ‘it’s mine.’")
            | >>> print u''.join(tok.string.upper() if is_adverb(tok) else tok.string for tok in tokens)
            | ‘Give it back,’ he pleaded ABJECTLY, ‘it’s mine.’

        :markdown-it
          There are lots of other ways we could refine the logic, depending on
          just what words we want to flag. Let's say we wanted to only flag
          adverbs that modified words similar to "pleaded". This is easy to do,
          as spaCy loads a vector-space representation for every word (by default,
          the vectors produced by *Levy and Goldberg (2014)*). Naturally, the
          vector is provided as a numpy array:

        pre.language-python
          code
            | >>> pleaded = tokens[7]
            | >>> pleaded.repvec.shape
            | (300,)
            | >>> pleaded.repvec[:5]
            | array([ 0.04229792, 0.07459262, 0.00820188, -0.02181299, 0.07519238], dtype=float32)

        :markdown-it
          We want to sort the words in our vocabulary by their similarity to
          "pleaded". There are lots of ways to measure the similarity of two
          vectors. We'll use the cosine metric:

        pre.language-python
          code
            | >>> from numpy import dot
            | >>> from numpy.linalg import norm
            |
            | >>> cosine = lambda v1, v2: dot(v1, v2) / (norm(v1) * norm(v2))
            | >>> words = [w for w in nlp.vocab if w.has_repvec]
            | >>> words.sort(key=lambda w: cosine(w.repvec, pleaded.repvec))
            | >>> words.reverse()
            | >>> print('1-20', ', '.join(w.orth_ for w in words[0:20]))
            | 1-20 pleaded, pled, plead, confessed, interceded, pleads, testified, conspired, motioned, demurred, countersued, remonstrated, begged, apologised, consented, acquiesced, petitioned, quarreled, appealed, pleading
            | >>> print('50-60', ', '.join(w.orth_ for w in words[50:60]))
            | 50-60 counselled, bragged, backtracked, caucused, refiled, dueled, mused, dissented, yearned, confesses
            | >>> print('100-110', ', '.join(w.orth_ for w in words[100:110]))
            | 100-110 cabled, ducked, sentenced, perjured, absconded, bargained, overstayed, clerked, confided, sympathizes
            | >>> print('1000-1010', ', '.join(w.orth_ for w in words[1000:1010]))
            | 1000-1010 scorned, baled, righted, requested, swindled, posited, firebombed, slimed, deferred, sagged
            | >>> print('50000-50010', ', '.join(w.orth_ for w in words[50000:50010]))
            | 50000-50010 fb, ford, systems, puck, anglers, ik, tabloid, dirty, rims, artists

        :markdown-it
          As you can see, the similarity model that these vectors give us is excellent
          --- we're still getting meaningful results at 1000 words, off a single
          prototype! The only problem is that the list really contains two clusters of
          words: one associated with the legal meaning of "pleaded", and one for the more
          general sense. Sorting out these clusters is an area of active research.

          A simple work-around is to average the vectors of several words, and use that
          as our target:

        pre.language-python
          code
            | >>> say_verbs = ['pleaded', 'confessed', 'remonstrated', 'begged', 'bragged', 'confided', 'requested']
            | >>> say_vector = sum(nlp.vocab[verb].repvec for verb in say_verbs) / len(say_verbs)
            | >>> words.sort(key=lambda w: cosine(w.repvec, say_vector))
            | >>> words.reverse()
            | >>> print('1-20', ', '.join(w.orth_ for w in words[0:20]))
            | 1-20 bragged, remonstrated, enquired, demurred, sighed, mused, intimated, retorted, entreated, motioned, ranted, confided, countersued, gestured, implored, interceded, muttered, marvelled, bickered, despaired
            | >>> print('50-60', ', '.join(w.orth_ for w in words[50:60]))
            | 50-60 flaunted, quarrelled, ingratiated, vouched, agonized, apologised, lunched, joked, chafed, schemed
            | >>> print('1000-1010', ', '.join(w.orth_ for w in words[1000:1010]))
            | 1000-1010 hoarded, waded, ensnared, clamoring, abided, deploring, shriveled, endeared, rethought, berate

        :markdown-it
          These definitely look like words that King might scold a writer for attaching
          adverbs to. Recall that our original adverb highlighting function looked like
          this:

        pre.language-python
          code
            | >>> import spacy.en
            | >>> from spacy.parts_of_speech import ADV
            | >>> # Load the pipeline, and call it with some text.
            | >>> nlp = spacy.en.English()
            | >>> tokens = nlp(u"‘Give it back,’ he pleaded abjectly, ‘it’s mine.’",
            |                  tag=True, parse=False)
            | >>> print(u''.join(tok.string.upper() if tok.pos == ADV else tok.string for tok in tokens))
            | ‘Give it BACK,’ he pleaded ABJECTLY, ‘it’s mine.’

        :markdown-it
          We wanted to refine the logic so that only adverbs modifying evocative
          verbs of communication, like "pleaded", were highlighted. We've now
          built a vector that represents that type of word, so we can highlight
          adverbs based on subtler logic, homing in on the adverbs that seem most
          stylistically problematic, given our starting assumptions:

        pre.language-python
          code
            | >>> import numpy
            | >>> from numpy import dot
            | >>> from numpy.linalg import norm
            | >>> import spacy.en
            | >>> from spacy.parts_of_speech import ADV, VERB
            | >>> cosine = lambda v1, v2: dot(v1, v2) / (norm(v1) * norm(v2))
            | >>> def is_bad_adverb(token, target_verb, tol):
            | ...     if token.pos != ADV:
            | ...         return False
            | ...     elif token.head.pos != VERB:
            | ...         return False
            | ...     elif cosine(token.head.repvec, target_verb) < tol:
            | ...         return False
            | ...     else:
            | ...         return True
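
        :markdown-it
          To put the function to work, we can pass it the `say_vector` we built
          above as the target, with a similarity cut-off. The tolerance of 0.3
          below is just an illustrative guess, not a tuned value; note that we
          also need the dependency parse for `token.head`, so we don't pass
          `parse=False` this time:

        pre.language-python
          code
            | >>> tokens = nlp(u"‘Give it back,’ he pleaded abjectly, ‘it’s mine.’")
            | >>> print u''.join(tok.string.upper() if is_bad_adverb(tok, say_vector, tol=0.3) else tok.string for tok in tokens)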

        :markdown-it
          This example was somewhat contrived --- and, truth be told, I've never
          really bought the idea that adverbs were a grave stylistic sin. But
          hopefully it got the message across: state-of-the-art NLP technologies
          are very powerful. spaCy gives you easy and efficient access to them,
          which lets you build all sorts of useful products and features that
          were previously impossible.

    footer(role='contentinfo')

    script(src='js/prism.js')