diff --git a/docs/redesign/blog_parser.jade b/docs/redesign/blog_parser.jade new file mode 100644 index 000000000..34d312a1c --- /dev/null +++ b/docs/redesign/blog_parser.jade @@ -0,0 +1,923 @@ +- + var urls = { + 'pos_post': 'https://honnibal.wordpress.com/2013/09/11/a-good-part-of-speechpos-tagger-in-about-200-lines-of-python/', + 'google_ngrams': "http://googleresearch.blogspot.com.au/2013/05/syntactic-ngrams-over-time.html", + 'implementation': 'https://gist.github.com/syllog1sm/10343947', + 'redshift': 'http://github.com/syllog1sm/redshift', + 'tasker': 'https://play.google.com/store/apps/details?id=net.dinglisch.android.taskerm', + 'acl_anthology': 'http://aclweb.org/anthology/', + 'share_twitter': 'http://twitter.com/share?text=[ARTICLE HEADLINE]&url=[ARTICLE LINK]&via=honnibal' + } + + +doctype html +html(lang='en') + head + meta(charset='utf-8') + title spaCy Blog + meta(name='description', content='') + meta(name='author', content='Matthew Honnibal') + link(rel='stylesheet', href='css/style.css') + //if lt IE 9 + script(src='http://html5shiv.googlecode.com/svn/trunk/html5.js') + body#blog + header(role='banner') + h1.logo spaCy Blog + .slogan Blog + main#content(role='main') + article.post + header + h2 Parsing English with 500 lines of Python + .subhead + | by + a(href='#', rel='author') Matthew Honnibal + | on + time(datetime='2013-12-18') December 18, 2013 + p + | A + a(href=urls.google_ngrams) syntactic parser + | describes a sentence’s grammatical structure, to help another + | application reason about it. Natural languages introduce many unexpected + | ambiguities, which our world-knowledge immediately filters out. 
A + | favourite example: + + p.example They ate the pizza with anchovies + + p + img(src='img/blog01.png', alt='Eat-with pizza-with ambiguity') + p + | A correct parse links “with” to “pizza”, while an incorrect parse + | links “with” to “eat”: + + .displacy + iframe(src='displacy/anchovies_bad.html', height='275') + + .displacy + iframe.displacy(src='displacy/anchovies_good.html', height='275') + a.view-displacy(href='#') View on displaCy + + p + | The Natural Language Processing (NLP) community has made big progress + | in syntactic parsing over the last few years. It’s now possible for + | a tiny Python implementation to perform better than the widely-used + | Stanford PCFG parser. + + p + strong Update! + | The Stanford CoreNLP library now includes a greedy transition-based + | dependency parser, similar to the one described in this post, but with + | an improved learning strategy. It is much faster and more accurate + | than this simple Python implementation. + + table + thead + tr + th Parser + th Accuracy + th Speed (w/s) + th Language + th LOC + tbody + tr + td Stanford + td 89.6% + td 19 + td Java + td + | > 50,000 + sup + a(href='#note-1') [1] + tr + td + strong parser.py + td 89.8% + td 2,020 + td Python + td + strong ~500 + tr + td Redshift + td + strong 93.6% + td + strong 2,580 + td Cython + td ~4,000 + p + | The rest of the post sets up the problem, and then takes you through + a(href=urls.implementation) a concise implementation + | , prepared for this post. The first 200 lines of parser.py, the + | part-of-speech tagger and learner, are described + a(href=urls.pos_post) here + | . You should probably at least skim that + | post before reading this one, unless you’re very familiar with NLP + | research. + p + | The Cython system, Redshift, was written for my current research. 
I + | plan to improve it for general use in June, after my contract ends + | at Macquarie University. The current version is + a(href=urls.redshift) hosted on GitHub + | . + h3 Problem Description + + p It’d be nice to type an instruction like this into your phone: + + p.example + Set volume to zero when I’m in a meeting, unless John’s school calls. + p + | And have it set the appropriate policy. On Android you can do this + | sort of thing with + a(href=urls.tasker) Tasker + | , but an NL interface would be much better. It’d be especially nice + | to receive a meaning representation you could edit, so you could see + | what it thinks you said, and correct it. + p + | There are lots of problems to solve to make that work, but some sort + | of syntactic representation is definitely necessary. We need to know that: + + p.example + Unless John’s school calls, when I’m in a meeting, set volume to zero + + p is another way of phrasing the first instruction, while: + + p.example + Unless John’s school, call when I’m in a meeting + + p means something completely different. + + p + | A dependency parser returns a graph of word-word relationships, + | intended to make such reasoning easier. Our graphs will be trees – + | edges will be directed, and every node (word) will have exactly one + | incoming arc (one dependency, with its head), except one. + + h4 Example usage + + pre.language-python. + + p. + The idea is that it should be slightly easier to reason from the parse, + than it was from the string. The parse-to-meaning mapping is hopefully + simpler than the string-to-meaning mapping. + + p. + The most confusing thing about this problem area is that “correctness” + is defined by convention — by annotation guidelines. If you haven’t + read the guidelines and you’re not a linguist, you can’t tell whether + the parse is “wrong” or “right”, which makes the whole task feel weird + and artificial. + + p. 
+ For instance, there’s a mistake in the parse above: “John’s school + calls” is structured wrongly, according to the Stanford annotation + guidelines. The structure of that part of the sentence is how the + annotators were instructed to parse an example like “John’s school + clothes”. + + p + | It’s worth dwelling on this point a bit. We could, in theory, have + | written our guidelines so that the “correct” parses were reversed. + | There’s good reason to believe the parsing task would be harder if we + | reversed our convention, as it’d be less consistent with the rest of + | the grammar. + sup: a(href='#note-2') [2] + | But we could test that empirically, and we’d be pleased to gain an + | advantage by reversing the policy. + + p + | We definitely do want that distinction in the guidelines: we don’t + | want both to receive the same structure, or our output will be less + | useful. The annotation guidelines strike a balance between what + | distinctions downstream applications will find useful, and what + | parsers will be able to predict easily. + + h4 Projective trees + + p + | There’s a particularly useful simplification that we can make, when + | deciding what we want the graph to look like: we can restrict the + | graph structures we’ll be dealing with. This doesn’t just give us a + | likely advantage in learnability; it can have deep algorithmic + | implications. We follow most work on English in constraining the + | dependency graphs to be + em projective trees + | : + + ol + li Tree. Every word has exactly one head, except for the dummy ROOT symbol. + li + | Projective. For every pair of dependencies (a1, a2) and (b1, b2), + | each written with its left endpoint first, if a1 < b1 < a2, then + | b2 <= a2. In other words, dependencies cannot “cross”. + | You can’t have a pair of dependencies that goes a1 b1 a2 b2, or + | b1 a1 b2 a2. + + p + | There’s a rich literature on parsing non-projective trees, and a + | smaller literature on parsing DAGs. 
But the parsing algorithm I’ll + | be explaining deals with projective trees. + + h3 Greedy transition-based parsing + + p + | Our parser takes as input a list of string tokens, and outputs a + | list of head indices, representing edges in the graph. If the + + em i + + | th member of heads is + + em j + + | , the dependency parse contains an edge (j, i). A transition-based + | parser is a finite-state transducer; it maps an array of N words + | onto an output array of N head indices: + + table.center + tbody + tr + td + em start + td MSNBC + td reported + td that + td Facebook + td bought + td WhatsApp + td for + td $16bn + td + em root + tr + td 0 + td 2 + td 9 + td 2 + td 4 + td 2 + td 4 + td 4 + td 7 + td 0 + p + | The heads array denotes that the head of + em MSNBC + | is + em reported + | : + em MSNBC + | is word 1, and + em reported + | is word 2, and + code.language-python heads[1] == 2 + | . You can already see why parsing a tree is handy — this data structure + | wouldn’t work if we had to output a DAG, where words may have multiple + | heads. + + p + | Although + code.language-python heads + | can be represented as an array, we’d actually like to maintain some + | alternate ways to access the parse, to make it easy and efficient to + | extract features. Our + + code.language-python Parse + | class looks like this: + + pre.language-python + code + | class Parse(object): + | def __init__(self, n): + | self.n = n + | self.heads = [None] * (n-1) + | self.lefts = [] + | self.rights = [] + | for i in range(n+1): + | self.lefts.append(DefaultList(0)) + | self.rights.append(DefaultList(0)) + | + | def add_arc(self, head, child): + | self.heads[child] = head + | if child < head: + | self.lefts[head].append(child) + | else: + | self.rights[head].append(child) + + p + | As well as the parse, we also have to keep track of where we’re up + | to in the sentence. 
We’ll do this with an index into the + code.language-python words + | array, and a stack, to which we’ll push words, before popping them + | once their head is set. So our state data structure is fundamentally: + + ul + li An index, i, into the list of tokens; + li The dependencies added so far, in Parse + li + | A stack, containing words that occurred before i, for which we’re + | yet to assign a head. + + p Each step of the parsing process applies one of three actions to the state: + + pre.language-python + code + | SHIFT = 0; RIGHT = 1; LEFT = 2 + | MOVES = [SHIFT, RIGHT, LEFT] + | + | def transition(move, i, stack, parse): + | global SHIFT, RIGHT, LEFT + | if move == SHIFT: + | stack.append(i) + | return i + 1 + | elif move == RIGHT: + | parse.add_arc(stack[-2], stack.pop()) + | return i + | elif move == LEFT: + | parse.add_arc(i, stack.pop()) + | return i + | raise GrammarError("Unknown move: %d" % move) + + + + p + | The + code.language-python LEFT + | and + code.language-python RIGHT + | actions add dependencies and pop the stack, while + code.language-python SHIFT + | pushes the stack and advances i into the buffer. + p. + So, the parser starts with an empty stack, and a buffer index at 0, with + no dependencies recorded. It chooses one of the (valid) actions, and + applies it to the state. It continues choosing actions and applying + them until the stack is empty and the buffer index is at the end of + the input. (It’s hard to understand this sort of algorithm without + stepping through it. Try coming up with a sentence, drawing a projective + parse tree over it, and then try to reach the parse tree by choosing + the right sequence of transitions.) + + p Here’s what the parsing loop looks like in code: + + pre.language-python + code + | class Parser(object): + | ... 
+ |     def parse(self, words): + |         tags = self.tagger(words) + |         n = len(words) + |         idx = 1 + |         stack = [0] + |         deps = Parse(n) + |         while stack or idx < n: + |             features = extract_features(words, tags, idx, n, stack, deps) + |             scores = self.model.score(features) + |             valid_moves = get_valid_moves(idx, n, len(stack)) + |             # Greedy: take the best-scoring move that is valid in this state + |             next_move = max(valid_moves, key=lambda move: scores[move]) + |             idx = transition(next_move, idx, stack, deps) + |         return tags, deps + | + | def get_valid_moves(i, n, stack_depth): + |     moves = [] + |     if i < n: + |         moves.append(SHIFT) + |     # RIGHT attaches s0 to the word beneath it, so it needs two stack items + |     if stack_depth >= 2: + |         moves.append(RIGHT) + |     # LEFT attaches s0 to the next buffer word, so it needs one stack item + |     if stack_depth >= 1: + |         moves.append(LEFT) + |     return moves + + p. + We start by tagging the sentence, and initializing the state. We then + map the state to a set of features, which we score using a linear model. + We then find the best-scoring valid move, and apply it to the state. + + p + | The model scoring works the same as it did in + a(href=urls.pos_post) the POS tagger + | . If you’re confused about the idea of extracting features and scoring + | them with a linear model, you should review that post. Here’s a reminder + | of how the model scoring works: + + pre.language-python + code + | class Perceptron(object): + |     ... + |     def score(self, features): + |         all_weights = self.weights + |         scores = dict((clas, 0) for clas in self.classes) + |         for feat, value in features.items(): + |             if value == 0: + |                 continue + |             if feat not in all_weights: + |                 continue + |             weights = all_weights[feat] + |             for clas, weight in weights.items(): + |                 scores[clas] += value * weight + |         return scores + + p. + It’s just summing the class-weights for each feature. This is often + expressed as a dot-product, but when you’re dealing with multiple + classes, that gets awkward, I find. + + p. + The beam parser (Redshift) tracks multiple candidates, and only decides + on the best one at the very end. We’re going to trade away accuracy + in favour of efficiency and simplicity. 
We’ll only follow a single + analysis. Our search strategy will be entirely greedy, as it was with + the POS tagger. We’ll lock-in our choices at every step. + + p. + If you read the POS tagger post carefully, you might see the underlying + similarity. What we’ve done is mapped the parsing problem onto a + sequence-labelling problem, which we address using a “flat”, or unstructured, + learning algorithm (by doing greedy search). + + h3 Features + p. + Feature extraction code is always pretty ugly. The features for the parser + refer to a few tokens from the context: + + ul + li The first three words of the buffer (n0, n1, n2) + li The top three words of the stack (s0, s1, s2) + li The two leftmost children of s0 (s0b1, s0b2); + li The two rightmost children of s0 (s0f1, s0f2); + li The two leftmost children of n0 (n0b1, n0b2) + + p. + For these 12 tokens, we refer to the word-form, the part-of-speech tag, + and the number of left and right children attached to the token. + + p. + Because we’re using a linear model, we have our features refer to pairs + and triples of these atomic properties. 
+ + pre.language-python + code + | def extract_features(words, tags, n0, n, stack, parse): + | def get_stack_context(depth, stack, data): + | if depth >= 3: + | return data[stack[-1]], data[stack[-2]], data[stack[-3]] + | elif depth >= 2: + | return data[stack[-1]], data[stack[-2]], '' + | elif depth == 1: + | return data[stack[-1]], '', '' + | else: + | return '', '', '' + | + | def get_buffer_context(i, n, data): + | if i + 1 >= n: + | return data[i], '', '' + | elif i + 2 >= n: + | return data[i], data[i + 1], '' + | else: + | return data[i], data[i + 1], data[i + 2] + | + | def get_parse_context(word, deps, data): + | if word == -1: + | return 0, '', '' + | deps = deps[word] + | valency = len(deps) + | if not valency: + | return 0, '', '' + | elif valency == 1: + | return 1, data[deps[-1]], '' + | else: + | return valency, data[deps[-1]], data[deps[-2]] + | + | features = {} + | # Set up the context pieces --- the word, W, and tag, T, of: + | # S0-2: Top three words on the stack + | # N0-2: First three words of the buffer + | # n0b1, n0b2: Two leftmost children of the first word of the buffer + | # s0b1, s0b2: Two leftmost children of the top word of the stack + | # s0f1, s0f2: Two rightmost children of the top word of the stack + | + | depth = len(stack) + | s0 = stack[-1] if depth else -1 + | + | Ws0, Ws1, Ws2 = get_stack_context(depth, stack, words) + | Ts0, Ts1, Ts2 = get_stack_context(depth, stack, tags) + | + | Wn0, Wn1, Wn2 = get_buffer_context(n0, n, words) + | Tn0, Tn1, Tn2 = get_buffer_context(n0, n, tags) + | + | Vn0b, Wn0b1, Wn0b2 = get_parse_context(n0, parse.lefts, words) + | Vn0b, Tn0b1, Tn0b2 = get_parse_context(n0, parse.lefts, tags) + | + | Vn0f, Wn0f1, Wn0f2 = get_parse_context(n0, parse.rights, words) + | _, Tn0f1, Tn0f2 = get_parse_context(n0, parse.rights, tags) + | + | Vs0b, Ws0b1, Ws0b2 = get_parse_context(s0, parse.lefts, words) + | _, Ts0b1, Ts0b2 = get_parse_context(s0, parse.lefts, tags) + | + | Vs0f, Ws0f1, Ws0f2 = 
get_parse_context(s0, parse.rights, words) + | _, Ts0f1, Ts0f2 = get_parse_context(s0, parse.rights, tags) + | + | # Cap numeric features at 5? + | # String-distance + | Ds0n0 = min((n0 - s0, 5)) if s0 != 0 else 0 + | + | features['bias'] = 1 + | # Add word and tag unigrams + | for w in (Wn0, Wn1, Wn2, Ws0, Ws1, Ws2, Wn0b1, Wn0b2, Ws0b1, Ws0b2, Ws0f1, Ws0f2): + | if w: + | features['w=%s' % w] = 1 + | for t in (Tn0, Tn1, Tn2, Ts0, Ts1, Ts2, Tn0b1, Tn0b2, Ts0b1, Ts0b2, Ts0f1, Ts0f2): + | if t: + | features['t=%s' % t] = 1 + | + | # Add word/tag pairs + | for i, (w, t) in enumerate(((Wn0, Tn0), (Wn1, Tn1), (Wn2, Tn2), (Ws0, Ts0))): + | if w or t: + | features['%d w=%s, t=%s' % (i, w, t)] = 1 + | + | # Add some bigrams + | features['s0w=%s, n0w=%s' % (Ws0, Wn0)] = 1 + | features['wn0tn0-ws0 %s/%s %s' % (Wn0, Tn0, Ws0)] = 1 + | features['wn0tn0-ts0 %s/%s %s' % (Wn0, Tn0, Ts0)] = 1 + | features['ws0ts0-wn0 %s/%s %s' % (Ws0, Ts0, Wn0)] = 1 + | features['ws0-ts0 tn0 %s/%s %s' % (Ws0, Ts0, Tn0)] = 1 + | features['wt-wt %s/%s %s/%s' % (Ws0, Ts0, Wn0, Tn0)] = 1 + | features['tt s0=%s n0=%s' % (Ts0, Tn0)] = 1 + | features['tt n0=%s n1=%s' % (Tn0, Tn1)] = 1 + | + | # Add some tag trigrams + | trigrams = ((Tn0, Tn1, Tn2), (Ts0, Tn0, Tn1), (Ts0, Ts1, Tn0), + | (Ts0, Ts0f1, Tn0), (Ts0, Ts0f1, Tn0), (Ts0, Tn0, Tn0b1), + | (Ts0, Ts0b1, Ts0b2), (Ts0, Ts0f1, Ts0f2), (Tn0, Tn0b1, Tn0b2), + | (Ts0, Ts1, Ts1)) + | for i, (t1, t2, t3) in enumerate(trigrams): + | if t1 or t2 or t3: + | features['ttt-%d %s %s %s' % (i, t1, t2, t3)] = 1 + | + | # Add some valency and distance features + | vw = ((Ws0, Vs0f), (Ws0, Vs0b), (Wn0, Vn0b)) + | vt = ((Ts0, Vs0f), (Ts0, Vs0b), (Tn0, Vn0b)) + | d = ((Ws0, Ds0n0), (Wn0, Ds0n0), (Ts0, Ds0n0), (Tn0, Ds0n0), + | ('t' + Tn0+Ts0, Ds0n0), ('w' + Wn0+Ws0, Ds0n0)) + | for i, (w_t, v_d) in enumerate(vw + vt + d): + | if w_t or v_d: + | features['val/d-%d %s %d' % (i, w_t, v_d)] = 1 + | return features + + + h3 Training + + p. 
+ Weights are learned using the same algorithm, averaged perceptron, that + we used for part-of-speech tagging. Its key strength is that it’s an + online learning algorithm: examples stream in one-by-one, we make our + prediction, check the actual answer, and adjust our beliefs (weights) + if we were wrong. + + p The training loop looks like this: + + pre.language-python + code + | class Parser(object): + |     ... + |     def train_one(self, itn, words, gold_tags, gold_heads): + |         n = len(words) + |         i = 2; stack = [1]; parse = Parse(n) + |         tags = self.tagger.tag(words) + |         while stack or (i + 1) < n: + |             features = extract_features(words, tags, i, n, stack, parse) + |             scores = self.model.score(features) + |             valid_moves = get_valid_moves(i, n, len(stack)) + |             guess = max(valid_moves, key=lambda move: scores[move]) + |             gold_moves = get_gold_moves(i, n, stack, parse.heads, gold_heads) + |             best = max(gold_moves, key=lambda move: scores[move]) + |             self.model.update(best, guess, features) + |             # Follow the predicted move, not the gold one + |             i = transition(guess, i, stack, parse) + |         # Return number correct + |         return len([w for w in range(n-1) if parse.heads[w] == gold_heads[w]]) + + + + p + | The most interesting part of the training process is in + code.language-python get_gold_moves + | . The performance of our parser is made possible by an advance by Goldberg + | and Nivre (2012), who showed that we’d been doing this wrong for years. + + p + | In the POS-tagging post, I cautioned that during training you need to + | make sure you pass in the last two + em predicted + | tags as features for the current tag, not the last two + em gold + | tags. At test time you’ll only have the predicted tags, so if you + | base your features on the gold sequence during training, your training + | contexts won’t resemble your test-time contexts, so you’ll learn the + | wrong weights. + + p + | In parsing, the problem was that we didn’t know + em how + | to pass in the predicted sequence! 
Training worked by taking the + | gold-standard tree, and finding a transition sequence that led to it. + | i.e., you got back a sequence of moves, with the guarantee that if + | you followed those moves, you’d get the gold-standard dependencies. + + p + | The problem is, we didn’t know how to define the “correct” move to + | teach a parser to make if it was in any state that + em wasn’t + | along that gold-standard sequence. Once the parser had made a mistake, + | we didn’t know how to train from that example. + + p + | That was a big problem, because it meant that once the parser started + | making mistakes, it would end up in states unlike any in its training + | data – leading to yet more mistakes. The problem was specific + | to greedy parsers: once you use a beam, there’s a natural way to do + | structured prediction. + p + | The solution seems obvious once you know it, like all the best breakthroughs. + | What we do is define a function that asks “How many gold-standard + | dependencies can be recovered from this state?”. If you can define + | that function, then you can apply each move in turn, and ask, “How + | many gold-standard dependencies can be recovered from + em this + | state?”. If the action you applied allows + em fewer + | gold-standard dependencies to be reached, then it is sub-optimal. + + p That’s a lot to take in. + + p + | So we have this function + code.language-python Oracle(state) + | : + pre + code + Oracle(state) = | gold_arcs ∩ reachable_arcs(state) | + p + | We also have a set of actions, each of which returns a new state. + | We want to know: + + ul + li shift_cost = Oracle(state) – Oracle(shift(state)) + li right_cost = Oracle(state) – Oracle(right(state)) + li left_cost = Oracle(state) – Oracle(left(state)) + + p + | Now, at least one of those costs + em has + | to be zero. Oracle(state) is asking, “what’s the cost of the best + | path forward?”, and the first action of that best path has to be + | shift, right, or left. 
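+ + p + | To make the cost idea concrete, here is a minimal brute-force sketch. It is not how parser.py computes anything: it simply tries every transition sequence, which is exactly the costly copying a real implementation avoids. The names valid_moves, apply_move, oracle and move_costs are made up for this illustration, arcs are (head, child) pairs, and the last token is assumed to be the ROOT symbol.

```python
SHIFT, RIGHT, LEFT = 0, 1, 2


def valid_moves(i, n, stack):
    """Moves allowed in a state; token n-1 is assumed to be the ROOT symbol."""
    moves = []
    if i + 1 < n:
        moves.append(SHIFT)
    if len(stack) >= 2:
        moves.append(RIGHT)
    if len(stack) >= 1:
        moves.append(LEFT)
    return moves


def apply_move(move, i, stack, arcs):
    """Return the successor state as new objects (nothing is mutated)."""
    if move == SHIFT:
        return i + 1, stack + (i,), arcs
    child = stack[-1]
    head = stack[-2] if move == RIGHT else i
    return i, stack[:-1], arcs | frozenset([(head, child)])


def oracle(i, stack, arcs, gold_arcs, n):
    """Most gold arcs recoverable from this state, by exhaustive search."""
    moves = valid_moves(i, n, stack)
    if not moves:  # terminal state: count the gold arcs we built
        return len(arcs & gold_arcs)
    best = 0
    for move in moves:
        ni, nstack, narcs = apply_move(move, i, stack, arcs)
        best = max(best, oracle(ni, nstack, narcs, gold_arcs, n))
    return best


def move_costs(i, stack, arcs, gold_arcs, n):
    """Cost of a move = gold arcs that become unreachable by taking it."""
    here = oracle(i, stack, arcs, gold_arcs, n)
    costs = {}
    for move in valid_moves(i, n, stack):
        ni, nstack, narcs = apply_move(move, i, stack, arcs)
        costs[move] = here - oracle(ni, nstack, narcs, gold_arcs, n)
    return costs


# "They ate pizza ROOT", with gold arcs ate->They, ate->pizza, ROOT->ate.
gold = frozenset([(1, 0), (1, 2), (3, 1)])
start_total = oracle(0, (), frozenset(), gold, 4)
# "They" is on the stack and "ate" is the next word in the buffer.
costs = move_costs(1, (0,), frozenset(), gold, 4)
```

+ p + | In this toy state LEFT costs nothing, because it attaches “They” to “ate”. SHIFT costs one arc: once “ate” is pushed on top of “They”, that dependency can no longer be built.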
+ + p + | It turns out that we can derive Oracle fairly simply for many transition + | systems. The derivation for the transition system we’re using, Arc + | Hybrid, is in Goldberg and Nivre (2013). + + p + | We’re going to implement the oracle as a function that returns the + | zero-cost moves, rather than implementing a function Oracle(state). + | This prevents us from doing a bunch of costly copy operations. + | Hopefully the reasoning in the code isn’t too hard to follow, but + | you can also consult Goldberg and Nivre’s papers if you’re confused + | and want to get to the bottom of this. + + pre.language-python + code + | def get_gold_moves(n0, n, stack, heads, gold): + | def deps_between(target, others, gold): + | for word in others: + | if gold[word] == target or gold[target] == word: + | return True + | return False + | + | valid = get_valid_moves(n0, n, len(stack)) + | if not stack or (SHIFT in valid and gold[n0] == stack[-1]): + | return [SHIFT] + | if gold[stack[-1]] == n0: + | return [LEFT] + | costly = set([m for m in MOVES if m not in valid]) + | # If the word behind s0 is its gold head, Left is incorrect + | if len(stack) >= 2 and gold[stack[-1]] == stack[-2]: + | costly.add(LEFT) + | # If there are any dependencies between n0 and the stack, + | # pushing n0 will lose them. + | if SHIFT not in costly and deps_between(n0, stack, gold): + | costly.add(SHIFT) + | # If there are any dependencies between s0 and the buffer, popping + | # s0 will lose them. + | if deps_between(stack[-1], range(n0+1, n-1), gold): + | costly.add(LEFT) + | costly.add(RIGHT) + | return [m for m in MOVES if m not in costly] + + + + p + | Doing this “dynamic oracle” training procedure makes a big difference + | to accuracy — typically 1-2%, with no difference to the way the run-time + | works. The old “static oracle” greedy training procedure is fully + | obsolete; there’s no reason to do it that way any more. 
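+ + p + | As a rough sketch of a single training step under this scheme: the model is updated towards the best-scoring zero-cost move, but the parser then follows its own guess. TinyPerceptron and dynamic_oracle_step are hypothetical names, and the update below is a plain, unaveraged perceptron, which is simpler than the averaged perceptron parser.py uses.

```python
from collections import defaultdict


class TinyPerceptron(object):
    """Unaveraged multi-class perceptron (a simplification for illustration)."""

    def __init__(self, classes):
        self.classes = list(classes)
        # feature -> class -> weight
        self.weights = defaultdict(lambda: defaultdict(float))

    def score(self, features):
        scores = dict((clas, 0.0) for clas in self.classes)
        for feat, value in features.items():
            if value == 0 or feat not in self.weights:
                continue
            for clas, weight in self.weights[feat].items():
                scores[clas] += value * weight
        return scores

    def update(self, truth, guess, features):
        """Reward the correct class and penalise the wrong guess."""
        if truth == guess:
            return
        for feat, value in features.items():
            self.weights[feat][truth] += value
            self.weights[feat][guess] -= value


def dynamic_oracle_step(model, features, valid_moves, gold_moves):
    """One training step: learn from the oracle, then follow the prediction."""
    scores = model.score(features)
    guess = max(valid_moves, key=lambda move: scores[move])
    best = max(gold_moves, key=lambda move: scores[move])
    model.update(best, guess, features)
    return guess  # a static oracle would return a gold move here instead


SHIFT, RIGHT, LEFT = 0, 1, 2
model = TinyPerceptron([SHIFT, RIGHT, LEFT])
feats = {'bias': 1, 't=NN': 1}
first = dynamic_oracle_step(model, feats, [SHIFT, LEFT], [LEFT])
second = dynamic_oracle_step(model, feats, [SHIFT, LEFT], [LEFT])
```

+ p + | With all weights at zero the first guess is SHIFT, which is wrong here, so the weights move towards LEFT. On the second call the same features already select LEFT.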
+ + h3 Conclusion + + p + | I have the sense that language technologies, particularly those relating + | to grammar, are particularly mysterious. I can imagine having no idea + | what the program might even do. + + p + | I think it therefore seems natural to people that the best solutions + | would be over-whelmingly complicated. A 200,000 line Java package + | feels appropriate. + p + | But, algorithmic code is usually short, when only a single algorithm + | is implemented. And when you only implement one algorithm, and you + | know exactly what you want to write before you write a line, you + | also don’t pay for any unnecessary abstractions, which can have a + | big performance impact. + + h3 Notes + p + a(name='note-1') + | [1] I wasn’t really sure how to count the lines of code in the Stanford + | parser. Its jar file ships over 200k, but there are a lot of different + | models in it. It’s not important, but over 50k seems safe. + + p + a(name='note-2') + | [2] For instance, how would you parse, “John’s school of music calls”? + | You want to make sure the phrase “John’s school” has a consistent + | structure in both “John’s school calls” and “John’s school of music + | calls”. Reasoning about the different “slots” you can put a phrase + | into is a key way we reason about what syntactic analyses look like. + | You can think of each phrase as having a different shaped connector, + | which you need to plug into different slots — which each phrase also + | has a certain number of, each of a different shape. We’re trying to + | figure out what connectors are where, so we can figure out how the + | sentences are put together. + + h3 Idle speculation + p + | For a long time, incremental language processing algorithms were + | primarily of scientific interest. If you want to write a parser to + | test a theory about how the human sentence processor might work, well, + | that parser needs to build partial interpretations. 
There’s a wealth + | of evidence, including commonsense introspection, that establishes + | that we don’t buffer input and analyse it once the speaker has finished. + + p + | But now algorithms with that neat scientific feature are winning! + | As best I can tell, the secret to that success is to be: + + ul + li Incremental. Earlier words constrain the search. + li + | Error-driven. Training involves a working hypothesis, which is + | updated as it makes mistakes. + + p + | The links to human sentence processing seem tantalising. I look + | forward to seeing whether these engineering breakthroughs lead to + | any psycholinguistic advances. + + h3 Bibliography + + p + | The NLP literature is almost entirely open access. All of the relevant + | papers can be found + a(href=urls.acl_anthology, rel='nofollow') here + | . + p + | The parser I’ve described is an implementation of the dynamic-oracle + | Arc-Hybrid system here: + + span.bib-item + | Goldberg, Yoav; Nivre, Joakim. + em Training Deterministic Parsers with Non-Deterministic Oracles + | . TACL 2013 + p + | However, I wrote my own features for it. The arc-hybrid system was + | originally described here: + + span.bib-item + | Kuhlmann, Marco; Gomez-Rodriguez, Carlos; Satta, Giorgio. Dynamic + | programming algorithms for transition-based dependency parsers. ACL 2011 + + p + | The dynamic oracle training method was first described here: + span.bib-item + | Goldberg, Yoav; Nivre, Joakim. A Dynamic Oracle for Arc-Eager + | Dependency Parsing. COLING 2012 + + p + | This work depended on a big breakthrough in accuracy for transition-based + | parsers, when beam-search was properly explored by Zhang and Clark. + | They have several papers, but the preferred citation is: + + span.bib-item + | Zhang, Yue; Clark, Stephen. Syntactic Processing Using the Generalized + | Perceptron and Beam Search. 
Computational Linguistics 2011 (1) + p + | Another important paper was this little feature engineering paper, + | which further improved the accuracy: + + span.bib-item + | Zhang, Yue; Nivre, Joakim. Transition-based Dependency Parsing with + | Rich Non-local Features. ACL 2011 + + p + | The generalised perceptron, which is the learning framework for these + | beam parsers, is from this paper: + span.bib-item + | Collins, Michael. Discriminative Training Methods for Hidden Markov + | Models: Theory and Experiments with Perceptron Algorithms. EMNLP 2002 + + h3 Experimental details + p + | The results at the start of the post refer to Section 22 of the Wall + | Street Journal corpus. The Stanford parser was run as follows: + + pre.language-bash + code + | java -mx10000m -cp "$scriptdir/*:" edu.stanford.nlp.parser.lexparser.LexicalizedParser \ + | -outputFormat "penn" edu/stanford/nlp/models/lexparser/englishFactored.ser.gz $* + + + + p + | A small post-process was applied, to undo the fancy tokenisation + | Stanford adds for numbers, to make them match the PTB tokenisation: + + pre.language-python + code + | """Stanford parser retokenises numbers. 
Split them.""" + | import sys + | import re + | + | # Stanford joins the parts of a number with a non-breaking space (U+00A0) + | qp_re = re.compile('\xa0') + | for line in sys.stdin: + |     # Strip ASCII whitespace only, so a trailing U+00A0 is not lost + |     line = line.rstrip(' \t\r\n') + |     if qp_re.search(line): + |         line = line.replace('(CD', '(QP (CD', 1) + ')' + |         line = line.replace('\xa0', ') (CD ') + |     print(line) + + p + | The resulting PTB-format files were then converted into dependencies + | using the Stanford converter: + + pre.language-bash. + p + | I can’t easily read that anymore, but it should just convert every + | .mrg file in a folder to a CoNLL-format Stanford basic dependencies + | file, using the settings common in the dependency literature. + + p + | I then converted the gold-standard trees from WSJ 22, for the evaluation. + | Accuracy scores refer to unlabelled attachment score (i.e. the head index) + | of all non-punctuation tokens. + + p + | To train parser.py, I fed the gold-standard PTB trees for WSJ 02-21 + | into the same conversion script. + + p + | In a nutshell: The Stanford model and parser.py are trained on the + | same set of sentences, and they each make their predictions on a + | held-out test set, for which we know the answers. Accuracy refers + | to how many of the words’ heads we got correct. + + p + | Speeds were measured on a 2.4GHz Xeon. I ran the experiments on a + | server, to give the Stanford parser more memory. The parser.py system + | runs fine on my MacBook Air. I used PyPy for the parser.py experiments; + | CPython was about half as fast on an early benchmark. + + p + | One of the reasons parser.py is so fast is that it does unlabelled + | parsing. Based on previous experiments, a labelled parser would likely + | be about 40x slower, and about 1% more accurate. 
Adapting the program + | to labelled parsing would be a good exercise for the reader, if you + | have access to the data. + + p + | The result from the Redshift parser was produced from commit + code.language-python b6b624c9900f3bf + | , which was run as follows: + pre.language-bash + code + | ./scripts/train.py -x zhang+stack -k 8 -p ~/data/stanford/train.conll ~/data/parsers/tmp + | ./scripts/parse.py ~/data/parsers/tmp ~/data/stanford/devi.txt /tmp/parse/ + | ./scripts/evaluate.py /tmp/parse/parses ~/data/stanford/dev.conll + footer.meta(role='contentinfo') + a.button.button-twitter(href=urls.share_twitter, title='Share on Twitter', rel='nofollow') Share on Twitter + .discuss + a.button.button-hn(href='#', title='Discuss on Hacker News', rel='nofollow') Discuss on Hacker News + a.button.button-reddit(href='#', title='Discuss on Reddit', rel='nofollow') Discuss on Reddit + footer(role='contentinfo') + script(src='js/prism.js') +