var urls = {
|
||
'pos_post': 'https://honnibal.wordpress.com/2013/09/11/a-good-part-of-speechpos-tagger-in-about-200-lines-of-python/',
|
||
'google_ngrams': "http://googleresearch.blogspot.com.au/2013/05/syntactic-ngrams-over-time.html",
|
||
'implementation': 'https://gist.github.com/syllog1sm/10343947',
|
||
'redshift': 'http://github.com/syllog1sm/redshift',
|
||
'tasker': 'https://play.google.com/store/apps/details?id=net.dinglisch.android.taskerm',
|
||
'acl_anthology': 'http://aclweb.org/anthology/',
|
||
'share_twitter': 'http://twitter.com/share?text=[ARTICLE HEADLINE]&url=[ARTICLE LINK]&via=honnibal'
|
||
}
|
||
|
||
|
||
doctype html
|
||
html(lang='en')
|
||
head
|
||
meta(charset='utf-8')
|
||
title spaCy Blog
|
||
meta(name='description', content='')
|
||
meta(name='author', content='Matthew Honnibal')
|
||
link(rel='stylesheet', href='css/style.css')
|
||
//if lt IE 9
|
||
script(src='http://html5shiv.googlecode.com/svn/trunk/html5.js')
|
||
body#blog
|
||
header(role='banner')
|
||
h1.logo spaCy Blog
|
||
.slogan Blog
|
||
main#content(role='main')
|
||
article.post
|
||
header
|
||
h2 Parsing English with 500 lines of Python
|
||
.subhead
|
||
| by
|
||
a(href='#', rel='author') Matthew Honnibal
|
||
| on
|
||
time(datetime='2013-12-18') December 18, 2013
|
||
p
|
||
| A
|
||
a(href=urls.google_ngrams) syntactic parser
|
||
| describes a sentence’s grammatical structure, to help another
|
||
| application reason about it. Natural languages introduce many unexpected
|
||
| ambiguities, which our world-knowledge immediately filters out. A
|
||
| favourite example:
|
||
|
||
p.example They ate the pizza with anchovies
|
||
|
||
p
|
||
img(src='img/blog01.png', alt='Eat-with pizza-with ambiguity')
|
||
p
|
||
| A correct parse links “with” to “pizza”, while an incorrect parse
|
||
| links “with” to “eat”:
|
||
|
||
.displacy
|
||
iframe(src='displacy/anchovies_bad.html', height='275')
|
||
|
||
.displacy
|
||
iframe.displacy(src='displacy/anchovies_good.html', height='275')
|
||
a.view-displacy(href='#') View on displaCy
|
||
|
||
|
||
p
|
||
| The Natural Language Processing (NLP) community has made big progress
|
||
| in syntactic parsing over the last few years. It’s now possible for
|
||
| a tiny Python implementation to perform better than the widely-used
|
||
| Stanford PCFG parser.
|
||
|
||
p
|
||
strong Update!
|
||
| The Stanford CoreNLP library now includes a greedy transition-based
|
||
| dependency parser, similar to the one described in this post, but with
|
||
| an improved learning strategy. It is much faster and more accurate
|
||
| than this simple Python implementation.
|
||
|
||
table
|
||
thead
|
||
tr
|
||
th Parser
|
||
th Accuracy
|
||
th Speed (words/sec)
|
||
th Language
|
||
th LOC
|
||
tbody
|
||
tr
|
||
td Stanford
|
||
td 89.6%
|
||
td 19
|
||
td Java
|
||
td
|
||
| > 50,000
|
||
sup
|
||
a(href='#note-1') [1]
|
||
tr
|
||
td
|
||
strong parser.py
|
||
td 89.8%
|
||
td 2,020
|
||
td Python
|
||
td
|
||
strong ~500
|
||
tr
|
||
td Redshift
|
||
td
|
||
strong 93.6%
|
||
td
|
||
strong 2,580
|
||
td Cython
|
||
td ~4,000
|
||
p
|
||
| The rest of the post sets up the problem, and then takes you through
|
||
a(href=urls.implementation) a concise implementation
|
||
| , prepared for this post. The first 200 lines of parser.py, the
|
||
| part-of-speech tagger and learner, are described
|
||
a(href=urls.pos_post) here
| . You should probably at least skim that
|
||
| post before reading this one, unless you’re very familiar with NLP
|
||
| research.
|
||
p
|
||
| The Cython system, Redshift, was written for my current research. I
|
||
| plan to improve it for general use in June, after my contract ends
|
||
| at Macquarie University. The current version is
|
||
a(href=urls.redshift) hosted on GitHub
|
||
| .
|
||
h3 Problem Description
|
||
|
||
p It’d be nice to type an instruction like this into your phone:
|
||
|
||
p.example
|
||
| Set volume to zero when I’m in a meeting, unless John’s school calls.
|
||
p
|
||
| And have it set the appropriate policy. On Android you can do this
|
||
| sort of thing with
|
||
a(href=urls.tasker) Tasker
|
||
| , but an NL interface would be much better. It’d be especially nice
|
||
| to receive a meaning representation you could edit, so you could see
|
||
| what it thinks you said, and correct it.
|
||
p
|
||
| There are lots of problems to solve to make that work, but some sort
|
||
| of syntactic representation is definitely necessary. We need to know that:
|
||
|
||
p.example
|
||
| Unless John’s school calls, when I’m in a meeting, set volume to zero
|
||
|
||
p is another way of phrasing the first instruction, while:
|
||
|
||
p.example
|
||
| Unless John’s school, call when I’m in a meeting
|
||
|
||
p means something completely different.
|
||
|
||
p
|
||
| A dependency parser returns a graph of word-word relationships,
|
||
| intended to make such reasoning easier. Our graphs will be trees –
|
||
| edges will be directed, and every node (word) will have exactly one
|
||
| incoming arc (one dependency, with its head), except for the root.
|
||
|
||
h4 Example usage
|
||
|
||
pre.language-python.
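p.
Roughly, the parser described in the rest of this post is driven like this.
(A sketch only: the constructor details and the printed head indices are
illustrative, not real output from parser.py.)

pre.language-python
code
| # Assumes the Parser class defined below, with a trained model and tagger.
| parser = Parser()
| words = 'They ate the pizza with anchovies'.split()
| tags, parse = parser.parse(words)
| print parse.heads    # one head index per word, e.g. the head of 'They' is 'ate'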
|
||
|
||
p.
|
||
The idea is that it should be slightly easier to reason from the parse,
|
||
than it was from the string. The parse-to-meaning mapping is hopefully
|
||
simpler than the string-to-meaning mapping.
|
||
|
||
p.
|
||
The most confusing thing about this problem area is that “correctness”
|
||
is defined by convention — by annotation guidelines. If you haven’t
|
||
read the guidelines and you’re not a linguist, you can’t tell whether
|
||
the parse is “wrong” or “right”, which makes the whole task feel weird
|
||
and artificial.
|
||
|
||
p.
|
||
For instance, there’s a mistake in the parse above: “John’s school
|
||
calls” is structured wrongly, according to the Stanford annotation
|
||
guidelines. The structure of that part of the sentence is how the
|
||
annotators were instructed to parse an example like “John’s school
|
||
clothes”.
|
||
|
||
p
|
||
| It’s worth dwelling on this point a bit. We could, in theory, have
|
||
| written our guidelines so that the “correct” parses were reversed.
|
||
| There’s good reason to believe the parsing task will be harder if we
|
||
| reversed our convention, as it’d be less consistent with the rest of
|
||
| the grammar.
|
||
sup: a(href='#note-2') [2]
|
||
| But we could test that empirically, and we’d be pleased to gain an
|
||
| advantage by reversing the policy.
|
||
|
||
p
|
||
| We definitely do want that distinction in the guidelines — we don’t
|
||
| want both to receive the same structure, or our output will be less
|
||
| useful. The annotation guidelines strike a balance between what
|
||
| distinctions downstream applications will find useful, and what
|
||
| parsers will be able to predict easily.
|
||
|
||
h4 Projective trees
|
||
|
||
p
|
||
| There’s a particularly useful simplification that we can make, when
|
||
| deciding what we want the graph to look like: we can restrict the
|
||
| graph structures we’ll be dealing with. This doesn’t just give us a
|
||
| likely advantage in learnability; it can have deep algorithmic
|
||
| implications. We follow most work on English in constraining the
|
||
| dependency graphs to be
|
||
em projective trees
|
||
| :
|
||
|
||
ol
|
||
li Tree. Every word has exactly one head, except for the dummy ROOT symbol.
|
||
li
|
||
| Projective. For every pair of dependencies (a1, a2) and (b1, b2),
|
||
| with a1 < a2 and b1 < b2, if a1 < b1 < a2, then b2 <= a2. In other words,
| dependencies cannot “cross” (see the sketch after this list).
|
||
| You can’t have a pair of dependencies that goes a1 b1 a2 b2, or
|
||
| b1 a1 b2 a2.
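p.
A quick way to make the projectivity condition concrete is to check it in code.
This helper isn’t part of parser.py; it’s just the definition above, with each
arc stored as a (smaller index, larger index) pair:

pre.language-python
code
| def is_projective(heads):
|     """True if no two arcs cross. heads[i] is the head index of word i."""
|     arcs = [tuple(sorted((i, h))) for i, h in enumerate(heads) if h is not None]
|     for a1, a2 in arcs:
|         for b1, b2 in arcs:
|             # If b starts strictly inside a's span, it must also end inside it.
|             if a1 < b1 < a2 and b2 > a2:
|                 return False
|     return True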
|
||
|
||
p
|
||
| There’s a rich literature on parsing non-projective trees, and a
|
||
| smaller literature on parsing DAGs. But the parsing algorithm I’ll
|
||
| be explaining deals with projective trees.
|
||
|
||
h3 Greedy transition-based parsing
|
||
|
||
p
|
||
| Our parser takes as input a list of string tokens, and outputs a
|
||
| list of head indices, representing edges in the graph. If the
|
||
|
||
em i
|
||
|
||
| th member of heads is
|
||
|
||
em j
|
||
|
||
| , the dependency parse contains an edge (j, i). A transition-based
|
||
| parser is a finite-state transducer; it maps an array of N words
|
||
| onto an output array of N head indices:
|
||
|
||
table.center
|
||
tbody
|
||
tr
|
||
td
|
||
em start
|
||
td MSNBC
|
||
td reported
|
||
td that
|
||
td Facebook
|
||
td bought
|
||
td WhatsApp
|
||
td for
|
||
td $16bn
|
||
td
|
||
em root
|
||
tr
|
||
td 0
|
||
td 2
|
||
td 9
|
||
td 2
|
||
td 4
|
||
td 2
|
||
td 4
|
||
td 4
|
||
td 7
|
||
td 0
|
||
p
|
||
| The heads array denotes that the head of
|
||
em MSNBC
|
||
| is
|
||
em reported
|
||
| :
|
||
em MSNBC
|
||
| is word 1, and
|
||
em reported
|
||
| is word 2, and
|
||
code.language-python heads[1] == 2
|
||
| . You can already see why parsing a tree is handy — this data structure
|
||
| wouldn’t work if we had to output a DAG, where words may have multiple
|
||
| heads.
|
||
|
||
p
|
||
| Although
|
||
code.language-python heads
|
||
| can be represented as an array, we’d actually like to maintain some
|
||
| alternate ways to access the parse, to make it easy and efficient to
|
||
| extract features. Our
|
||
|
||
code.language-python Parse
|
||
| class looks like this:
|
||
|
||
pre.language-python
|
||
code
|
||
| class Parse(object):
|
||
| def __init__(self, n):
|
||
| self.n = n
|
||
| self.heads = [None] * (n-1)
|
||
| self.lefts = []
|
||
| self.rights = []
|
||
| for i in range(n+1):
|
||
| self.lefts.append(DefaultList(0))
|
||
| self.rights.append(DefaultList(0))
|
||
|
|
||
| def add_arc(self, head, child):
|
||
| self.heads[child] = head
|
||
| if child < head:
|
||
| self.lefts[head].append(child)
|
||
| else:
|
||
| self.rights[head].append(child)
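p
| The
code.language-python DefaultList
| class comes from the POS tagger post, so it isn’t repeated here. It’s simply
| a list that hands back a default value instead of raising IndexError,
| something like:
pre.language-python
code
| class DefaultList(list):
|     """A list that returns a default value for out-of-range indices."""
|     def __init__(self, default=None):
|         self.default = default
|         list.__init__(self)
|
|     def __getitem__(self, index):
|         try:
|             return list.__getitem__(self, index)
|         except IndexError:
|             return self.default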
|
||
|
||
p
|
||
| As well as the parse, we also have to keep track of where we’re up
|
||
| to in the sentence. We’ll do this with an index into the
|
||
code.language-python words
|
||
| array, and a stack, to which we’ll push words, before popping them
|
||
| once their head is set. So our state data structure is fundamentally:
|
||
|
||
ul
|
||
li An index, i, into the list of tokens;
|
||
li The dependencies added so far, in Parse
|
||
li
|
||
| A stack, containing words that occurred before i, for which we’re
|
||
| yet to assign a head.
|
||
|
||
p Each step of the parsing process applies one of three actions to the state:
|
||
|
||
pre.language-python
|
||
code
|
||
| SHIFT = 0; RIGHT = 1; LEFT = 2
|
||
| MOVES = [SHIFT, RIGHT, LEFT]
|
||
|
|
||
| def transition(move, i, stack, parse):
|
||
| global SHIFT, RIGHT, LEFT
|
||
| if move == SHIFT:
|
||
| stack.append(i)
|
||
| return i + 1
|
||
| elif move == RIGHT:
|
||
| parse.add_arc(stack[-2], stack.pop())
|
||
| return i
|
||
| elif move == LEFT:
|
||
| parse.add_arc(i, stack.pop())
|
||
| return i
|
||
| raise ValueError("Unknown move: %d" % move)
|
||
|
||
|
||
|
||
p
|
||
| The
|
||
code.language-python LEFT
|
||
| and
|
||
code.language-python RIGHT
|
||
| actions add dependencies and pop the stack, while
|
||
code.language-python SHIFT
|
||
| pushes the stack and advances i into the buffer.
|
||
p.
|
||
So, the parser starts with an empty stack, and a buffer index at 0, with
|
||
no dependencies recorded. It chooses one of the (valid) actions, and
|
||
applies it to the state. It continues choosing actions and applying
|
||
them until the stack is empty and the buffer index is at the end of
|
||
the input. (It’s hard to understand this sort of algorithm without
|
||
stepping through it. Try coming up with a sentence, drawing a projective
|
||
parse tree over it, and then try to reach the parse tree by choosing
|
||
the right sequence of transitions.)
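p
| To make that concrete, here’s a tiny hand-driven trace using the
code.language-python transition
| function above. The sentence and the indices are purely illustrative, and
| the real parser also pads the sentence with start and root tokens, so the
| bookkeeping here is simplified:
pre.language-python
code
| # words: 0='They', 1='ate', 2='pizza'; target arcs: They<-ate, pizza<-ate
| stack, parse, i = [], Parse(4), 0    # oversized Parse: the real parser counts padding in n
| i = transition(SHIFT, i, stack, parse)   # stack=[0], i=1
| i = transition(LEFT, i, stack, parse)    # head of 'They' is now 'ate'; stack=[]
| i = transition(SHIFT, i, stack, parse)   # stack=[1], i=2
| i = transition(SHIFT, i, stack, parse)   # stack=[1, 2], i=3
| i = transition(RIGHT, i, stack, parse)   # head of 'pizza' is now 'ate'; stack=[1]
| # 'ate' stays on the stack; in the full parser a final LEFT attaches it to the root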
|
||
|
||
p Here’s what the parsing loop looks like in code:
|
||
|
||
pre.language-python
|
||
code
|
||
| class Parser(object):
|
||
| ...
|
||
| def parse(self, words):
|
||
| tags = self.tagger.tag(words)
|
||
| n = len(words)
|
||
| idx = 1
|
||
| stack = [0]
|
||
| deps = Parse(n)
|
||
| while stack or idx < n:
|
||
| features = extract_features(words, tags, idx, n, stack, deps)
|
||
| scores = self.model.score(features)
|
||
| valid_moves = get_valid_moves(idx, n, len(stack))
|
||
| next_move = max(valid_moves, key=lambda move: scores[move])
|
||
| idx = transition(next_move, idx, stack, deps)
|
||
| return tags, deps
|
||
|
|
||
| def get_valid_moves(i, n, stack_depth):
|
||
| moves = []
|
||
| if i < n:
|
||
| moves.append(SHIFT)
|
||
| if stack_depth >= 2:
|
||
| moves.append(RIGHT)
|
||
| if stack_depth >= 1:
|
||
| moves.append(LEFT)
|
||
| return moves
|
||
|
||
p.
|
||
We start by tagging the sentence, and initializing the state. We then
|
||
map the state to a set of features, which we score using a linear model.
|
||
We then find the best-scoring valid move, and apply it to the state.
|
||
|
||
p
|
||
| The model scoring works the same as it did in
|
||
a(href=urls.pos_post) the POS tagger.
|
||
| If you’re confused about the idea of extracting features and scoring
|
||
| them with a linear model, you should review that post. Here’s a reminder
|
||
| of how the model scoring works:
|
||
|
||
pre.language-python
|
||
code
|
||
| class Perceptron(object):
|
||
| ...
|
||
| def score(self, features):
|
||
| all_weights = self.weights
|
||
| scores = dict((clas, 0) for clas in self.classes)
|
||
| for feat, value in features.items():
|
||
| if value == 0:
|
||
| continue
|
||
| if feat not in all_weights:
|
||
| continue
|
||
| weights = all_weights[feat]
|
||
| for clas, weight in weights.items():
|
||
| scores[clas] += value * weight
|
||
| return scores
|
||
|
||
p.
|
||
It’s just summing the class-weights for each feature. This is often
|
||
expressed as a dot-product, but when you’re dealing with multiple
|
||
classes, that gets awkward, I find.
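p.
As a made-up numeric example: with two active features and the weights below,
each class’s score is just the sum of that class’s weights for those features.
(The feature keys follow the templates used later; the weight values are
invented, and the class ids are the moves SHIFT=0, RIGHT=1, LEFT=2.)

pre.language-python
code
| weights = {'s0w=ate, n0w=pizza': {0: 0.25, 2: -0.1},
|            'tt s0=VBD n0=NN':    {0: 0.5,  1: 0.2}}
| features = {'s0w=ate, n0w=pizza': 1, 'tt s0=VBD n0=NN': 1}
| # scores: {0: 0.75, 1: 0.2, 2: -0.1} -- so SHIFT (class 0) wins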
|
||
|
||
p.
|
||
The beam parser (RedShift) tracks multiple candidates, and only decides
|
||
on the best one at the very end. We’re going to trade away accuracy
|
||
in favour of efficiency and simplicity. We’ll only follow a single
|
||
analysis. Our search strategy will be entirely greedy, as it was with
|
||
the POS tagger. We’ll lock-in our choices at every step.
|
||
|
||
p.
|
||
If you read the POS tagger post carefully, you might see the underlying
|
||
similarity. What we’ve done is mapped the parsing problem onto a
|
||
sequence-labelling problem, which we address using a “flat”, or unstructured,
|
||
learning algorithm (by doing greedy search).
|
||
|
||
h3 Features
|
||
p.
|
||
Feature extraction code is always pretty ugly. The features for the parser
|
||
refer to a few tokens from the context:
|
||
|
||
ul
|
||
li The first three words of the buffer (n0, n1, n2)
|
||
li The top three words of the stack (s0, s1, s2)
|
||
li The two leftmost children of s0 (s0b1, s0b2);
|
||
li The two rightmost children of s0 (s0f1, s0f2);
|
||
li The two leftmost children of n0 (n0b1, n0b2)
|
||
|
||
p.
|
||
For these 12 tokens, we refer to the word-form, the part-of-speech tag,
|
||
and the number of left and right children attached to the token.
|
||
|
||
p.
|
||
Because we’re using a linear model, we have our features refer to pairs
|
||
and triples of these atomic properties.
|
||
|
||
pre.language-python
|
||
code
|
||
| def extract_features(words, tags, n0, n, stack, parse):
|
||
| def get_stack_context(depth, stack, data):
|
||
| if depth >= 3:
|
||
| return data[stack[-1]], data[stack[-2]], data[stack[-3]]
|
||
| elif depth >= 2:
|
||
| return data[stack[-1]], data[stack[-2]], ''
|
||
| elif depth == 1:
|
||
| return data[stack[-1]], '', ''
|
||
| else:
|
||
| return '', '', ''
|
||
|
|
||
| def get_buffer_context(i, n, data):
|
||
| if i + 1 >= n:
|
||
| return data[i], '', ''
|
||
| elif i + 2 >= n:
|
||
| return data[i], data[i + 1], ''
|
||
| else:
|
||
| return data[i], data[i + 1], data[i + 2]
|
||
|
|
||
| def get_parse_context(word, deps, data):
|
||
| if word == -1:
|
||
| return 0, '', ''
|
||
| deps = deps[word]
|
||
| valency = len(deps)
|
||
| if not valency:
|
||
| return 0, '', ''
|
||
| elif valency == 1:
|
||
| return 1, data[deps[-1]], ''
|
||
| else:
|
||
| return valency, data[deps[-1]], data[deps[-2]]
|
||
|
|
||
| features = {}
|
||
| # Set up the context pieces --- the word, W, and tag, T, of:
|
||
| # S0-2: Top three words on the stack
|
||
| # N0-2: First three words of the buffer
|
||
| # n0b1, n0b2: Two leftmost children of the first word of the buffer
|
||
| # s0b1, s0b2: Two leftmost children of the top word of the stack
|
||
| # s0f1, s0f2: Two rightmost children of the top word of the stack
|
||
|
|
||
| depth = len(stack)
|
||
| s0 = stack[-1] if depth else -1
|
||
|
|
||
| Ws0, Ws1, Ws2 = get_stack_context(depth, stack, words)
|
||
| Ts0, Ts1, Ts2 = get_stack_context(depth, stack, tags)
|
||
|
|
||
| Wn0, Wn1, Wn2 = get_buffer_context(n0, n, words)
|
||
| Tn0, Tn1, Tn2 = get_buffer_context(n0, n, tags)
|
||
|
|
||
| Vn0b, Wn0b1, Wn0b2 = get_parse_context(n0, parse.lefts, words)
|
||
| Vn0b, Tn0b1, Tn0b2 = get_parse_context(n0, parse.lefts, tags)
|
||
|
|
||
| Vn0f, Wn0f1, Wn0f2 = get_parse_context(n0, parse.rights, words)
|
||
| _, Tn0f1, Tn0f2 = get_parse_context(n0, parse.rights, tags)
|
||
|
|
||
| Vs0b, Ws0b1, Ws0b2 = get_parse_context(s0, parse.lefts, words)
|
||
| _, Ts0b1, Ts0b2 = get_parse_context(s0, parse.lefts, tags)
|
||
|
|
||
| Vs0f, Ws0f1, Ws0f2 = get_parse_context(s0, parse.rights, words)
|
||
| _, Ts0f1, Ts0f2 = get_parse_context(s0, parse.rights, tags)
|
||
|
|
||
| # Cap numeric features at 5?
|
||
| # String-distance
|
||
| Ds0n0 = min((n0 - s0, 5)) if s0 != 0 else 0
|
||
|
|
||
| features['bias'] = 1
|
||
| # Add word and tag unigrams
|
||
| for w in (Wn0, Wn1, Wn2, Ws0, Ws1, Ws2, Wn0b1, Wn0b2, Ws0b1, Ws0b2, Ws0f1, Ws0f2):
|
||
| if w:
|
||
| features['w=%s' % w] = 1
|
||
| for t in (Tn0, Tn1, Tn2, Ts0, Ts1, Ts2, Tn0b1, Tn0b2, Ts0b1, Ts0b2, Ts0f1, Ts0f2):
|
||
| if t:
|
||
| features['t=%s' % t] = 1
|
||
|
|
||
| # Add word/tag pairs
|
||
| for i, (w, t) in enumerate(((Wn0, Tn0), (Wn1, Tn1), (Wn2, Tn2), (Ws0, Ts0))):
|
||
| if w or t:
|
||
| features['%d w=%s, t=%s' % (i, w, t)] = 1
|
||
|
|
||
| # Add some bigrams
|
||
| features['s0w=%s, n0w=%s' % (Ws0, Wn0)] = 1
|
||
| features['wn0tn0-ws0 %s/%s %s' % (Wn0, Tn0, Ws0)] = 1
|
||
| features['wn0tn0-ts0 %s/%s %s' % (Wn0, Tn0, Ts0)] = 1
|
||
| features['ws0ts0-wn0 %s/%s %s' % (Ws0, Ts0, Wn0)] = 1
|
||
| features['ws0-ts0 tn0 %s/%s %s' % (Ws0, Ts0, Tn0)] = 1
|
||
| features['wt-wt %s/%s %s/%s' % (Ws0, Ts0, Wn0, Tn0)] = 1
|
||
| features['tt s0=%s n0=%s' % (Ts0, Tn0)] = 1
|
||
| features['tt n0=%s n1=%s' % (Tn0, Tn1)] = 1
|
||
|
|
||
| # Add some tag trigrams
|
||
| trigrams = ((Tn0, Tn1, Tn2), (Ts0, Tn0, Tn1), (Ts0, Ts1, Tn0),
|
||
| (Ts0, Ts0f1, Tn0), (Ts0, Ts0f1, Tn0), (Ts0, Tn0, Tn0b1),
|
||
| (Ts0, Ts0b1, Ts0b2), (Ts0, Ts0f1, Ts0f2), (Tn0, Tn0b1, Tn0b2),
|
||
| (Ts0, Ts1, Ts1))
|
||
| for i, (t1, t2, t3) in enumerate(trigrams):
|
||
| if t1 or t2 or t3:
|
||
| features['ttt-%d %s %s %s' % (i, t1, t2, t3)] = 1
|
||
|
|
||
| # Add some valency and distance features
|
||
| vw = ((Ws0, Vs0f), (Ws0, Vs0b), (Wn0, Vn0b))
|
||
| vt = ((Ts0, Vs0f), (Ts0, Vs0b), (Tn0, Vn0b))
|
||
| d = ((Ws0, Ds0n0), (Wn0, Ds0n0), (Ts0, Ds0n0), (Tn0, Ds0n0),
|
||
| ('t' + Tn0+Ts0, Ds0n0), ('w' + Wn0+Ws0, Ds0n0))
|
||
| for i, (w_t, v_d) in enumerate(vw + vt + d):
|
||
| if w_t or v_d:
|
||
| features['val/d-%d %s %d' % (i, w_t, v_d)] = 1
|
||
| return features
|
||
|
||
|
||
h3 Training
|
||
|
||
p.
|
||
Weights are learned using the same algorithm, averaged perceptron, that
|
||
we used for part-of-speech tagging. Its key strength is that it’s an
|
||
online learning algorithm: examples stream in one-by-one, we make our
|
||
prediction, check the actual answer, and adjust our beliefs (weights)
|
||
if we were wrong.
|
||
|
||
p The training loop looks like this:
|
||
|
||
pre.language-python
|
||
code
|
||
| class Parser(object):
|
||
| ...
|
||
| def train_one(self, itn, words, gold_tags, gold_heads):
|
||
| n = len(words)
|
||
| i = 2; stack = [1]; parse = Parse(n)
|
||
| tags = self.tagger.tag(words)
|
||
| while stack or (i + 1) < n:
|
||
| features = extract_features(words, tags, i, n, stack, parse)
|
||
| scores = self.model.score(features)
|
||
| valid_moves = get_valid_moves(i, n, len(stack))
|
||
| guess = max(valid_moves, key=lambda move: scores[move])
|
||
| gold_moves = get_gold_moves(i, n, stack, parse.heads, gold_heads)
|
||
| best = max(gold_moves, key=lambda move: scores[move])
|
||
| self.model.update(best, guess, features)
|
||
| i = transition(guess, i, stack, parse)
|
||
| # Return number correct
|
||
| return len([i for i in range(n-1) if parse.heads[i] == gold_heads[i]])
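p
| The
code.language-python update
| method isn’t shown here; it’s the same averaged perceptron update as in the
| POS tagger post. Stripped of the weight-averaging bookkeeping, its core is
| something like:
pre.language-python
code
| class Perceptron(object):
|     ...
|     def update(self, truth, guess, features):
|         # Nothing to learn if the guessed move was already a gold one.
|         if truth == guess:
|             return
|         for feat, value in features.items():
|             weights = self.weights.setdefault(feat, {})
|             weights[truth] = weights.get(truth, 0) + value
|             weights[guess] = weights.get(guess, 0) - value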
|
||
|
||
|
||
|
||
p
| The most interesting part of the training process is in
code.language-python get_gold_moves
| . The performance of our parser is made possible by an advance by Goldberg
| and Nivre (2012), who showed that we’d been doing this wrong for years.
|
||
|
||
p
|
||
| In the POS-tagging post, I cautioned that during training you need to
|
||
| make sure you pass in the last two
|
||
em predicted
|
||
| tags as features for the current tag, not the last two
|
||
em gold
|
||
| tags. At test time you’ll only have the predicted tags, so if you
|
||
| base your features on the gold sequence during training, your training
|
||
| contexts won’t resemble your test-time contexts, so you’ll learn the
|
||
| wrong weights.
|
||
|
||
p
| In parsing, the problem was that we didn’t know
|
||
em how
|
||
| to pass in the predicted sequence! Training worked by taking the
|
||
| gold-standard tree, and finding a transition sequence that led to it.
|
||
| i.e., you got back a sequence of moves, with the guarantee that if
|
||
| you followed those moves, you’d get the gold-standard dependencies.
|
||
|
||
p
|
||
| The problem is, we didn’t know how to define the “correct” move to
|
||
| teach a parser to make if it was in any state that
|
||
em wasn’t
|
||
| along that gold-standard sequence. Once the parser had made a mistake,
|
||
| we didn’t know how to train from that example.
|
||
|
||
p
|
||
| That was a big problem, because it meant that once the parser started
|
||
| making mistakes, it would end up in states unlike any in its training
|
||
| data – leading to yet more mistakes. The problem was specific
|
||
| to greedy parsers: once you use a beam, there’s a natural way to do
|
||
| structured prediction.
|
||
p
|
||
| The solution seems obvious once you know it, like all the best breakthroughs.
|
||
| What we do is define a function that asks “How many gold-standard
|
||
| dependencies can be recovered from this state?”. If you can define
|
||
| that function, then you can apply each move in turn, and ask, “How
|
||
| many gold-standard dependencies can be recovered from
|
||
em this
|
||
| state?”. If the action you applied allows
|
||
em fewer
|
||
| gold-standard dependencies to be reached, then it is sub-optimal.
|
||
|
||
p That’s a lot to take in.
|
||
|
||
p
|
||
| So we have this function
|
||
code.language-python Oracle(state)
|
||
| :
|
||
pre
|
||
code
|
||
| Oracle(state) = | gold_arcs ∩ reachable_arcs(state) |
|
||
p
|
||
| We also have a set of actions, each of which returns a new state.
|
||
| We want to know:
|
||
|
||
ul
|
||
li shift_cost = Oracle(state) – Oracle(shift(state))
|
||
li right_cost = Oracle(state) – Oracle(right(state))
|
||
li left_cost = Oracle(state) – Oracle(left(state))
|
||
|
||
p
|
||
| Now, at least one of those costs
|
||
em has
|
||
| to be zero. Oracle(state) is asking, “what’s the cost of the best
|
||
| path forward?”, and the first action of that best path has to be
|
||
| shift, right, or left.
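p
| In code, the naive version of that idea would look something like this. (A
| sketch only: it assumes helper functions that copy the state and count the
| reachable gold arcs, which is exactly the expensive part the implementation
| below avoids.)
pre.language-python
code
| def zero_cost_moves(state, valid_moves, apply_move, oracle):
|     # A move is "gold" if taking it loses none of the still-reachable
|     # gold-standard dependencies.
|     best = max(oracle(apply_move(state, move)) for move in valid_moves)
|     return [m for m in valid_moves if oracle(apply_move(state, m)) == best]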
|
||
|
||
p
|
||
| It turns out that we can derive Oracle fairly simply for many transition
|
||
| systems. The derivation for the transition system we’re using, Arc
|
||
| Hybrid, is in Goldberg and Nivre (2013).
|
||
|
||
p
|
||
| We’re going to implement the oracle as a function that returns the
|
||
| zero-cost moves, rather than implementing a function Oracle(state).
|
||
| This prevents us from doing a bunch of costly copy operations.
|
||
| Hopefully the reasoning in the code isn’t too hard to follow, but
|
||
| you can also consult Goldberg and Nivre’s papers if you’re confused
|
||
| and want to get to the bottom of this.
|
||
|
||
pre.language-python
|
||
code
|
||
| def get_gold_moves(n0, n, stack, heads, gold):
|
||
| def deps_between(target, others, gold):
|
||
| for word in others:
|
||
| if gold[word] == target or gold[target] == word:
|
||
| return True
|
||
| return False
|
||
|
|
||
| valid = get_valid_moves(n0, n, len(stack))
|
||
| if not stack or (SHIFT in valid and gold[n0] == stack[-1]):
|
||
| return [SHIFT]
|
||
| if gold[stack[-1]] == n0:
|
||
| return [LEFT]
|
||
| costly = set([m for m in MOVES if m not in valid])
|
||
| # If the word behind s0 is its gold head, Left is incorrect
|
||
| if len(stack) >= 2 and gold[stack[-1]] == stack[-2]:
|
||
| costly.add(LEFT)
|
||
| # If there are any dependencies between n0 and the stack,
|
||
| # pushing n0 will lose them.
|
||
| if SHIFT not in costly and deps_between(n0, stack, gold):
|
||
| costly.add(SHIFT)
|
||
| # If there are any dependencies between s0 and the buffer, popping
|
||
| # s0 will lose them.
|
||
| if deps_between(stack[-1], range(n0+1, n-1), gold):
|
||
| costly.add(LEFT)
|
||
| costly.add(RIGHT)
|
||
| return [m for m in MOVES if m not in costly]
|
||
|
||
|
||
|
||
p
|
||
| Doing this “dynamic oracle” training procedure makes a big difference
|
||
| to accuracy — typically 1-2%, with no difference to the way the run-time
|
||
| works. The old “static oracle” greedy training procedure is fully
|
||
| obsolete; there’s no reason to do it that way any more.
|
||
|
||
h3 Conclusion
|
||
|
||
p
|
||
| I have the sense that language technologies, particularly those relating
|
||
| to grammar, are particularly mysterious. I can imagine having no idea
|
||
| what the program might even do.
|
||
|
||
p
|
||
| I think it therefore seems natural to people that the best solutions
|
||
| would be overwhelmingly complicated. A 200,000-line Java package
|
||
| feels appropriate.
|
||
p
|
||
| But, algorithmic code is usually short, when only a single algorithm
|
||
| is implemented. And when you only implement one algorithm, and you
|
||
| know exactly what you want to write before you write a line, you
|
||
| also don’t pay for any unnecessary abstractions, which can have a
|
||
| big performance impact.
|
||
|
||
h3 Notes
|
||
p
|
||
a(name='note-1')
|
||
| [1] I wasn’t really sure how to count the lines of code in the Stanford
|
||
| parser. Its jar file ships over 200k lines, but there are a lot of different
|
||
| models in it. It’s not important, but over 50k seems safe.
|
||
|
||
p
|
||
a(name='note-2')
|
||
| [2] For instance, how would you parse, “John’s school of music calls”?
|
||
| You want to make sure the phrase “John’s school” has a consistent
|
||
| structure in both “John’s school calls” and “John’s school of music
|
||
| calls”. Reasoning about the different “slots” you can put a phrase
|
||
| into is a key way we reason about what syntactic analyses look like.
|
||
| You can think of each phrase as having a different shaped connector,
|
||
| which you need to plug into different slots — which each phrase also
|
||
| has a certain number of, each of a different shape. We’re trying to
|
||
| figure out what connectors are where, so we can figure out how the
|
||
| sentences are put together.
|
||
|
||
h3 Idle speculation
|
||
p
|
||
| For a long time, incremental language processing algorithms were
|
||
| primarily of scientific interest. If you want to write a parser to
|
||
| test a theory about how the human sentence processor might work, well,
|
||
| that parser needs to build partial interpretations. There’s a wealth
|
||
| of evidence, including commonsense introspection, that establishes
|
||
| that we don’t buffer input and analyse it once the speaker has finished.
|
||
|
||
p
|
||
| But now algorithms with that neat scientific feature are winning!
|
||
| As best as I can tell, the secret to that success is to be:
|
||
|
||
ul
|
||
li Incremental. Earlier words constrain the search.
|
||
li
|
||
| Error-driven. Training involves a working hypothesis, which is
|
||
| updated as it makes mistakes.
|
||
|
||
p
|
||
| The links to human sentence processing seem tantalising. I look
|
||
| forward to seeing whether these engineering breakthroughs lead to
|
||
| any psycholinguistic advances.
|
||
|
||
h3 Bibliography
|
||
|
||
p
|
||
| The NLP literature is almost entirely open access. All of the relevant
|
||
| papers can be found
|
||
a(href=urls.acl_anthology, rel='nofollow') here
|
||
| .
|
||
p
|
||
| The parser I’ve described is an implementation of the dynamic-oracle
|
||
| Arc-Hybrid system here:
|
||
|
||
span.bib-item
|
||
| Goldberg, Yoav; Nivre, Joakim.
|
||
em Training Deterministic Parsers with Non-Deterministic Oracles
|
||
| . TACL 2013
|
||
p
|
||
| However, I wrote my own features for it. The arc-hybrid system was
|
||
| originally described here:
|
||
|
||
span.bib-item
|
||
| Kuhlmann, Marco; Gomez-Rodriguez, Carlos; Satta, Giorgio. Dynamic
|
||
| programming algorithms for transition-based dependency parsers. ACL 2011
|
||
|
||
p
|
||
| The dynamic oracle training method was first described here:
|
||
span.bib-item
|
||
| A Dynamic Oracle for Arc-Eager Dependency Parsing. Goldberg, Yoav;
|
||
| Nivre, Joakim. COLING 2012
|
||
|
||
p
|
||
| This work depended on a big break-through in accuracy for transition-based
|
||
| parsers, when beam-search was properly explored by Zhang and Clark.
|
||
| They have several papers, but the preferred citation is:
|
||
|
||
span.bib-item
|
||
| Zhang, Yue; Clark, Stephen. Syntactic Processing Using the Generalized
|
||
| Perceptron and Beam Search. Computational Linguistics 2011 (1)
|
||
p
|
||
| Another important paper was this little feature engineering paper,
|
||
| which further improved the accuracy:
|
||
|
||
span.bib-item
|
||
| Zhang, Yue; Nivre, Joakim. Transition-based Dependency Parsing with
|
||
| Rich Non-local Features. ACL 2011
|
||
|
||
p
|
||
| The generalised perceptron, which is the learning framework for these
|
||
| beam parsers, is from this paper:
|
||
span.bib-item
|
||
| Collins, Michael. Discriminative Training Methods for Hidden Markov
|
||
| Models: Theory and Experiments with Perceptron Algorithms. EMNLP 2002
|
||
|
||
h3 Experimental details
|
||
p
|
||
| The results at the start of the post refer to Section 22 of the Wall
|
||
| Street Journal corpus. The Stanford parser was run as follows:
|
||
|
||
pre.language-bash
|
||
code
|
||
| java -mx10000m -cp "$scriptdir/*:" edu.stanford.nlp.parser.lexparser.LexicalizedParser \
|
||
| -outputFormat "penn" edu/stanford/nlp/models/lexparser/englishFactored.ser.gz $*
|
||
|
||
|
||
|
||
p
|
||
| A small post-process was applied, to undo the fancy tokenisation
|
||
| Stanford adds for numbers, to make them match the PTB tokenisation:
|
||
|
||
pre.language-python
|
||
code
|
||
| """Stanford parser retokenises numbers. Split them."""
|
||
| import sys
|
||
| import re
|
||
|
|
||
| qp_re = re.compile('\xc2\xa0')
|
||
| for line in sys.stdin:
|
||
| line = line.rstrip()
|
||
| if qp_re.search(line):
|
||
| line = line.replace('(CD', '(QP (CD', 1) + ')'
|
||
| line = line.replace('\xc2\xa0', ') (CD ')
|
||
| print line
|
||
|
||
p
|
||
| The resulting PTB-format files were then converted into dependencies
|
||
| using the Stanford converter:
|
||
|
||
pre.language-bash
|
||
code
|
||
| ./scripts/train.py -x zhang+stack -k 8 -p ~/data/stanford/train.conll ~/data/parsers/tmp
|
||
| ./scripts/parse.py ~/data/parsers/tmp ~/data/stanford/devi.txt /tmp/parse/
|
||
| ./scripts/evaluate.py /tmp/parse/parses ~/data/stanford/dev.conll
|
||
p
|
||
| I can’t easily read that anymore, but it should just convert every
|
||
| .mrg file in a folder to a CoNLL-format Stanford basic dependencies
|
||
| file, using the settings common in the dependency literature.
|
||
|
||
p
|
||
| I then converted the gold-standard trees from WSJ 22, for the evaluation.
|
||
| Accuracy scores refer to unlabelled attachment score (i.e. the head index)
|
||
| of all non-punctuation tokens.
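p
| In other words, the score is computed roughly like this (a sketch; the real
| evaluation script also has to read the CoNLL files and decide what counts
| as punctuation):
pre.language-python
code
| def uas(gold_heads, pred_heads, is_punct):
|     # Unlabelled attachment score over non-punctuation tokens.
|     total = correct = 0
|     for i, (gold, pred) in enumerate(zip(gold_heads, pred_heads)):
|         if is_punct[i]:
|             continue
|         total += 1
|         correct += gold == pred
|     return float(correct) / total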
|
||
|
||
p
|
||
| To train parser.py, I fed the gold-standard PTB trees for WSJ 02-21
|
||
| into the same conversion script.
|
||
|
||
p
|
||
| In a nutshell: The Stanford model and parser.py are trained on the
|
||
| same set of sentences, and they each make their predictions on a
|
||
| held-out test set, for which we know the answers. Accuracy refers
|
||
| to how many of the words’ heads we got correct.
|
||
|
||
p
|
||
| Speeds were measured on a 2.4GHz Xeon. I ran the experiments on a
|
||
| server, to give the Stanford parser more memory. The parser.py system
|
||
| runs fine on my MacBook Air. I used PyPy for the parser.py experiments;
|
||
| CPython was about half as fast on an early benchmark.
|
||
|
||
p
|
||
| One of the reasons parser.py is so fast is that it does unlabelled
|
||
| parsing. Based on previous experiments, a labelled parser would likely
|
||
| be about 40x slower, and about 1% more accurate. Adapting the program
|
||
| to labelled parsing would be a good exercise for the reader, if you
|
||
| have access to the data.
|
||
|
||
p
|
||
| The result from the Redshift parser was produced from commit
|
||
code.language-python b6b624c9900f3bf
|
||
| , which was run as follows:
|
||
pre.language-python.
|
||
footer.meta(role='contentinfo')
|
||
a.button.button-twitter(href=urls.share_twitter, title='Share on Twitter', rel='nofollow') Share on Twitter
|
||
.discuss
|
||
a.button.button-hn(href='#', title='Discuss on Hacker News', rel='nofollow') Discuss on Hacker News
|
||
a.button.button-reddit(href='#', title='Discuss on Reddit', rel='nofollow') Discuss on Reddit
|
||
footer(role='contentinfo')
|
||
script(src='js/prism.js')
|
||
|