spaCy/website/blog/german-model.jade

2016-09-30 18:29:03 +00:00
include ../_includes/_mixins
+lead Many people have asked us to make spaCy available for their language. Since we're based in Berlin, German was an obvious choice for our first second language. Now spaCy can do all the cool things you use it for on English text on German text too. But more importantly, teaching spaCy to speak German required us to drop some comfortable but English-specific assumptions about how language works, which makes spaCy fit to learn more languages in the future.
p The current release features high-accuracy syntactic dependency parsing, named entity recognition, part-of-speech tagging, token and sentence segmentation, and noun phrase chunking. It also comes with word vector representations, produced with word2vec. As you'll see below, #[a(href="#run-spacy") installation and usage] work much the same for both German and English. However, there are some small differences that follow from the two languages' differing linguistic structure.
+h2("german-like-english") German is like English but different
p On the evolutionary tree of languages, German and English are close cousins, on the Germanic branch of the Indo-European family. They share a relatively recent common ancestor, so they're structurally similar. And where they differ, it's mostly English that's weird, not German. The algorithmic changes needed to process German are an important step towards processing many other languages.
p English has very simple rules for word formation (aka #[b morphology]), and very strict rules for #[b word order]. This means that an English-only NLP system can get away with some very useful simplifying assumptions. German is the perfect language to unwind these. German word order and morphology are still relatively restricted, so we can make the necessary algorithmic changes without being overwhelmed by the additional complexity.
+aside("Word order and morphology") While the division between word order and morphology is useful for thinking about the problem, it is an artificial one. In reality, both phenomena are two sides of the same coin and interact heavily with each other.
+h2("word-order") Word order
p When Germans learn English in school, one of the first things they are taught to memorize is #[em Subject-Verb-Object] or SVO. In English, the subject comes first, then the verb, then the object. If you change that order, the meaning of the sentence changes. #[em The dog bites the man] means something different from #[em The man bites the dog] even though both sentences use the exact same words. In German — as in many other languages — this is not the case. German allows for any order of subject and object in a sentence and only restricts the position of the verb. In German, you can say #[em Der Hund beißt den Mann] and #[em Den Mann beißt der Hund] and both sentences mean #[em The dog bites the man].
+aside("Subject and Object") In a prototypical sentence, e.g., #[em John hits the ball], the #[a(href="https://en.wikipedia.org/wiki/Subject_(grammar)" target="_blank") grammatical subject] maps to the person/thing that acts (#[em John]) whereas the #[a(href="https://en.wikipedia.org/wiki/Object_(grammar)" target="_blank") grammatical object] maps to the person/thing that is acted upon (#[em the ball]). All languages have mechanisms to change this mapping though.
+quote("Sherlock Holmes, in <em>A Scandal in Bohemia</em>")
    | Do you note the peculiar construction of the sentence &mdash; 'This account of you we have from all quarters received.'
    br
    br
    | A Frenchman or Russian could not have written that. It is the German who is so uncourteous to his verbs.
p One of the more difficult things for people who learn German as a second language is to figure out where to put the verb. German verbs usually come at the end of a sentence, but under certain circumstances the verb or a part of it moves to the first or second position. For instance, compare the English sentence in the example below to its German counterpart. While all the parts of the English verb stay together, the German verb is distributed over the sentence. The main part (the one carrying the meaning) is at the end of the sentence and the other part (the auxiliary verb) comes in second position after the subject.
+image("german_verb_align.svg", "", "In German, verbs are put at the end of a sentence.", "small").text-center
p The fact that German verbs come at the end of the sentence, or are split as in the example above, has some implications for language understanding technologies. So far, the syntactic structures that spaCy predicted for English sentences were always #[i projective], which means that you could draw them without ever having to cross two arcs. In order to accommodate languages with less restrictive word order than English &mdash; for example German &mdash; the parser now also predicts non-projective structures, i.e., structures where arcs may cross.
+aside("Non-projectivity") Formally, an arc from word #[i h] to word #[i d] is called non-projective if there is at least one word #[i k] between #[i h] and #[i d] which is not a direct or indirect descendant of #[i h]. A tree is non-projective if it has at least one non-projective arc.
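p The definition in the aside can be checked mechanically. The following is a minimal sketch, not spaCy's implementation. It assumes a sentence is given as a list #[code heads] in which entry #[code i] holds the index of the head of token #[code i], and the root points to itself. The function names are made up for illustration, and the example heads array is the one listed for #[em Den Berliner hat der Hund nicht gebissen.] in the caveats section of this post.

```python
def descendants(heads, h):
    """Return the set of direct and indirect descendants of token h."""
    found = set()
    frontier = [h]
    while frontier:
        parent = frontier.pop()
        for child, head in enumerate(heads):
            # the root points to itself, so skip that self-loop
            if head == parent and child != parent and child not in found:
                found.add(child)
                frontier.append(child)
    return found


def is_nonprojective_arc(heads, d):
    """True if some word between heads[d] and d is not a descendant of heads[d]."""
    h = heads[d]
    if h == d:
        return False  # the root has no incoming arc
    below = descendants(heads, h)
    return any(k not in below for k in range(min(h, d) + 1, max(h, d)))


def nonprojective_arcs(heads):
    """Indices of all tokens that are attached with a non-projective arc."""
    return [d for d in range(len(heads)) if is_nonprojective_arc(heads, d)]


# Den Berliner hat der Hund nicht gebissen .
heads = [1, 6, 2, 4, 2, 6, 2, 2]
print(nonprojective_arcs(heads))  # [1]: only 'Berliner' crosses another arc
```

p A tree is projective exactly when #[code nonprojective_arcs] returns an empty list.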
p To illustrate the difference, consider the example below. We want the syntactic structure to represent the fact that it is the flight that was booked the day before, hence we want the parser to predict an arc between #[em flight] and #[em booked]. And of course we want the parser to predict the same arc also for the German counterpart, in this case between #[em Flug] and #[em gebucht habe]. However, because the German verb comes last, there can be crossing arcs in the German structure. This is not the only type of German construction that leaves us with a non-projective parse, but it is a frequent one. Moreover, unlike some cases where you could change your linguistic theory to avoid the crossing arcs, this one is well motivated by data and very difficult to avoid without losing information.
+image("german_english_proj.svg", "", "Syntactic structures for English are usually projective.", "small").text-center
+image("german_german_nonproj.svg", "", "Syntactic structure for German. Non-projective arcs marked in blue.", "small").text-center
p To summarize the above, we need non-projective trees to represent the information that we are interested in when parsing natural language. Okay, so we want crossing arcs. What's the problem? The problem is that crossing arcs force us to give up a very useful constraint. The set of possible non-projective trees is considerably larger than the set of possible projective trees. What's more, the algorithm spaCy uses to search for a projective tree is both simpler and more efficient than the equivalent non-projective algorithms, so restricting spaCy to projective dependency parsing has given us a win on two fronts: we've been able to do less work computationally, while also encoding important prior knowledge about the problem space into the system.
p Unfortunately, this &lsquo;prior knowledge&rsquo; was never quite true. It's a simplifying assumption. In the same way that a physicist might assume a frictionless surface or a #[a(href="https://en.wikipedia.org/wiki/Spherical_cow") spherical cow], sometimes it's useful for computational linguists to assume projective trees and context-free grammars. For English, projective trees are a good-value simplification &mdash; the cow of English is not quite a perfect sphere, but it's close. The cow of German is considerably less round, and we can make our model more accurate by taking this into account.
+h2("pseudoproj-parsing") Pseudo-projective parsing
p Luckily for us, the problem of predicting non-projective structures has received a lot of attention over the last decade. One observation that was made early on is that these non-projective arcs are rare. Usually, only a few percent of the arcs in #[a(href="https://en.wikipedia.org/wiki/Treebank" target="_blank") linguistic treebanks] are non-projective, even for languages with unrestrictive word order. This means that we can afford to use approaches with a higher worst-case complexity because the worst case basically never occurs and therefore has virtually no impact on the efficiency of our NLP systems.
p Several methods have been proposed for dealing with non-projective arcs. Most change the parsing algorithm to search through the full space of possible structures directly, or at least a large part of it. In spaCy, we opted for a more indirect approach. #[a(href="http://www.aclweb.org/anthology/P05-1013" target="_blank") Nivre and Nilsson (2005)] propose a simple procedure they call #[b pseudo-projective parsing]. The parser is trained on projective structures that are produced from non-projective structures by reattaching the non-projective arcs higher in the tree until they are projective. The original attachment site is encoded in the label of the respective arc. The parser thus learns to predict projective structures with specially decorated arc labels. The output of the parser is then post-processed to reattach decorated arcs to their proper syntactic head according to their arc label, thereby re-introducing non-projective arcs.
+aside("Decoration schemes") #[a(href="http://www.aclweb.org/anthology/P05-1013" target="_blank") Nivre and Nilsson (2005)] test three different ways of decorating labels. spaCy currently uses the #[em head] decoration scheme because it is a good compromise between the amount of encoded information and the increase in the number of arc labels.
+image("german_pseudoproj.svg", "", "Pseudo-projective parsing. Training data is projectivized and decorated before training the model. Decorated arcs in parser output are re-attached to their non-projective head in a post-processing step.", "large").text-center
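p The round trip described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than spaCy's actual code: it assumes a single-rooted tree given as a #[code heads] list whose root points to itself, it lifts the shortest non-projective arc first, it uses #[code ||] as a made-up separator for the #[em head] decoration, and it restores arcs with a breadth-first search below the projective head. The dependency labels for the example sentence are a plausible TiGer-style annotation chosen by hand, not parser output.

```python
DELIMITER = '||'  # assumed separator between own label and head label


def descendants(heads, h):
    """All direct and indirect descendants of token h."""
    found, frontier = set(), [h]
    while frontier:
        parent = frontier.pop()
        for child, head in enumerate(heads):
            if head == parent and child != parent and child not in found:
                found.add(child)
                frontier.append(child)
    return found


def is_nonprojective_arc(heads, d):
    h = heads[d]
    below = descendants(heads, h)
    return any(k not in below for k in range(min(h, d) + 1, max(h, d)))


def projectivize(heads, labels):
    """Lift non-projective arcs until none remain, then decorate the
    label of every lifted arc with the label of its original head."""
    proj_heads = list(heads)
    while True:
        bad = [d for d in range(len(proj_heads))
               if is_nonprojective_arc(proj_heads, d)]
        if not bad:
            break
        # lift the shortest non-projective arc one level up the tree
        d = min(bad, key=lambda i: abs(i - proj_heads[i]))
        proj_heads[d] = proj_heads[proj_heads[d]]
    deco_labels = [
        label if proj_heads[d] == heads[d]
        else label + DELIMITER + labels[heads[d]]
        for d, label in enumerate(labels)
    ]
    return proj_heads, deco_labels


def deprojectivize(proj_heads, deco_labels):
    """Reattach every decorated arc to the closest token below its
    projective head that carries the encoded head label."""
    heads = list(proj_heads)
    labels = [label.split(DELIMITER)[0] for label in deco_labels]
    for d, label in enumerate(deco_labels):
        if DELIMITER not in label:
            continue
        head_label = label.split(DELIMITER)[1]
        # breadth-first search, starting with the siblings of d
        queue = [c for c, h in enumerate(proj_heads)
                 if h == proj_heads[d] and c != h]
        while queue:
            cand = queue.pop(0)
            if cand == d:
                continue  # never search inside d's own subtree
            if labels[cand] == head_label:
                heads[d] = cand
                break
            queue.extend(c for c, h in enumerate(proj_heads) if h == cand)
    return heads, labels


# Den Berliner hat der Hund nicht gebissen .
heads = [1, 6, 2, 4, 2, 6, 2, 2]
labels = ['nk', 'oa', 'ROOT', 'nk', 'sb', 'ng', 'oc', 'punct']  # hand-assigned

proj_heads, deco_labels = projectivize(heads, labels)
print(proj_heads)   # [1, 2, 2, 4, 2, 6, 2, 2]
print(deco_labels)  # ['nk', 'oa||oc', 'ROOT', 'nk', 'sb', 'ng', 'oc', 'punct']
print(deprojectivize(proj_heads, deco_labels) == (heads, labels))  # True
```

p In this example, #[em Berliner] is lifted from #[em gebissen] to #[em hat] for training, and the decorated label tells the post-processing step to look for a token labeled #[code oc] below #[em hat] when restoring the original attachment.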
p Using pseudo-projective parsing allows spaCy to produce non-projective structures without having to sacrifice the efficient parsing algorithm, which is restricted to projective structures. And because non-projective arcs are rare, the post-processing step only ever has to reattach one or two arcs in every other sentence, which makes its impact on the overall parsing speed negligible even though its worst-case complexity is higher than the parser's. In fact, we didn't notice any difference in speed when parsing German with this approach. And when we know that our training data is projective, we just switch it off.
+h3("accuracy") Evaluation of pseudo-projective parsing
p Pseudo-projective parsing makes a big difference in German because the parser can recover arcs that a purely projective model cannot. The numbers in the table show the percentage of arcs that were correctly attached by the parser (unlabeled attachment score (UAS) ignores the label, labeled attachment score (LAS) takes it into account). We train and evaluate the German model on the TiGer treebank (see #[a(href="#Data-sources") below]).
+table(["System", "UAS", "LAS"])
    +row
        +cell German, forcing projective structures
        +cell 90.86%
        +cell 88.60%
    +row
        +cell German, allowing non-projective structures
        +cell 92.22%
        +cell 90.14%
    //- +row
    //-     +cell English
    //-     +cell 91.15%
    //-     +cell 89.13%
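p For readers who want to compute the two scores themselves, here is a minimal sketch. It assumes gold and predicted analyses are given as parallel lists of head indices and labels, and it counts every token. Evaluation scripts often differ in details such as whether punctuation is counted, so this is not necessarily how the numbers in the table were produced. The gold analysis is the one shown for #[em Ich bin ein Berliner.] further down, and the predicted analysis is invented to contain one wrong head and one wrong label.

```python
def attachment_scores(gold_heads, gold_labels, pred_heads, pred_labels):
    """Return (UAS, LAS) as percentages over all tokens."""
    n = len(gold_heads)
    correct_head = 0
    correct_both = 0
    for gh, gl, ph, pl in zip(gold_heads, gold_labels, pred_heads, pred_labels):
        if gh == ph:
            correct_head += 1      # counts towards UAS
            if gl == pl:
                correct_both += 1  # head and label correct: counts towards LAS
    return 100.0 * correct_head / n, 100.0 * correct_both / n


# Ich bin ein Berliner .
gold_heads = [1, 1, 3, 1, 1]
gold_labels = ['sb', 'ROOT', 'nk', 'pd', 'punct']
# invented prediction: 'ein' gets the wrong head, 'Berliner' the wrong label
pred_heads = [1, 1, 1, 1, 1]
pred_labels = ['sb', 'ROOT', 'nk', 'oa', 'punct']

uas, las = attachment_scores(gold_heads, gold_labels, pred_heads, pred_labels)
print(uas, las)  # 80.0 60.0
```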
+h2("morphology") Morphology
p One other important difference between English and German is the richer morphology of German words. German words can change their form depending on their grammatical function in a sentence. English words do this too, for example by appending an #[i s] to a noun to mark plural (#[i ticket] &rarr; #[i tickets]). However, in most languages word forms of the same word show much more variety than in English, and this is also the case in German. German is also famous for its capacity to form #[a(href="https://en.wikipedia.org/wiki/Rinderkennzeichnungs-_und_Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz" target="_blank") really long words], another process that is driven by the morphological system.
p While German is clearly a language with rich morphology, morphology isn't the most crucial aspect for natural language processing of German (depending on the task, of course &#128521;). While processing languages like Hungarian, Turkish, or Czech is hopeless without a proper treatment of morphological processes, German can be processed reasonably well without one. We therefore released the German model without a morphological component &mdash; for now. We're working on adding such a component to spaCy, not just to improve the German model but also to take the next step towards learning more languages.
+h2("run-spacy") Showtime
p As for English, spaCy now provides a pretrained model for processing German. This model currently provides functionality for tokenization, part-of-speech tagging, syntactic parsing, and named entity recognition. In addition, #[code spacy.de] also comes with pre-trained word representations, in the form of word vectors and hierarchical cluster IDs.
p Installing the German model on your machine is as easy as for English:
+code('bash','Install German model').
    pip install spacy
    python -m spacy.de.download
p Once installed you can use it from Python like the English model. If you've been loading spaCy using the #[code English()] class directly, now's a good time to switch over to the newer #[code spacy.load()] function:
+code('python','Parse German').
    from __future__ import print_function, unicode_literals
    import spacy
    nlp = spacy.load('de')
    doc = nlp(u'Ich bin ein Berliner.')
    # show universal pos tags
    print(' '.join('{word}/{tag}'.format(word=t.orth_, tag=t.pos_) for t in doc))
    # output: Ich/PRON bin/AUX ein/DET Berliner/NOUN ./PUNCT
    # show German specific pos tags (STTS)
    print(' '.join('{word}/{tag}'.format(word=t.orth_, tag=t.tag_) for t in doc))
    # output: Ich/PPER bin/VAFIN ein/ART Berliner/NN ./$.
    # show dependency arcs
    print('\n'.join('{child:&lt;8} &lt;{label:-^7} {head}'.format(child=t.orth_, label=t.dep_, head=t.head.orth_) for t in doc))
    # output: (sb: subject, nk: noun kernel, pd: predicate)
    # Ich &lt;--sb--- bin
    # bin &lt;-ROOT-- bin
    # ein &lt;--nk--- Berliner
    # Berliner &lt;--pd--- bin
    # . &lt;-punct- bin
p As for English, the German model provides named entities and a noun chunk iterator to extract basic information from the data. The NER model can currently distinguish persons, locations, and organizations. We are looking into ways of extending this to more classes.
+code('python','Named entity recognition').
    # show named entities
    for ent in doc.ents:
        print(ent.text)
    # output:
    # Berliner
p The noun chunk iterator provides easy access to base noun phrases. It requires the dependency structure to be present and returns all noun phrases that the parser recognized.
+code('python','Noun chunks').
    # show noun chunks
    for chunk in doc.noun_chunks:
        print(chunk.text)
    # output:
    # ein Berliner
    # noun chunks include so-called measure constructions ...
    doc = nlp(u'Ich möchte gern zum Essen eine Tasse Kaffee bestellen.')
    print([chunk for chunk in doc.noun_chunks])
    # output:
    # [Essen, eine Tasse Kaffee]
    # ... and close appositions
    doc = nlp(u'Der Senator vermeidet das Thema Flughafen.')
    print([chunk for chunk in doc.noun_chunks])
    # output:
    # [Der Senator, das Thema Flughafen]
p The German model comes with word vectors trained on a mix of text from Wikipedia and the Open Subtitles corpus. The vectors were produced with the skip-gram with negative sampling variant of word2vec, as implemented in #[a(href="https://radimrehurek.com/gensim/" target="_blank") Gensim], with a context window of 2.
p You can use the vector representation with the #[code .vector] attribute and the #[code .similarity()] method on spaCy's #[code Lexeme], #[code Token], #[code Span] and #[code Doc] objects.
+code('python','Word vectors').
    # Use word vectors
    de = spacy.load('de')
    doc = de(u'Der Apfel und die Orange sind ähnlich')
    # a Doc vector has the same dimensionality as a Token vector
    assert len(doc.vector) == len(doc[0].vector)
    der_apfel = doc[:2]
    die_orange = doc[3:5]
    der_apfel.similarity(die_orange)
    # output:
    # 0.63665210991205579
    der, apfel = der_apfel
    der.similarity(apfel)
    # output:
    # 0.24995991403916812
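p To make the similarity scores less of a black box, here is a small sketch of the computation, assuming the default behaviour described in spaCy's documentation: the vector of a #[code Span] or #[code Doc] is the average of its token vectors, and #[code .similarity()] is the cosine of the angle between two vectors. The three-dimensional vectors below are made up for illustration and have nothing to do with the real German vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# toy vectors, invented for this example
toy = {
    'der': [1.0, 0.0, 0.0],
    'Apfel': [0.0, 1.0, 1.0],
    'die': [1.0, 0.0, 0.0],
    'Orange': [0.0, 1.0, 0.0],
}

der_apfel = average([toy['der'], toy['Apfel']])
die_orange = average([toy['die'], toy['Orange']])
print(cosine(der_apfel, die_orange))         # similarity of the two phrases
print(cosine(toy['der'], toy['Apfel']))      # 0.0: the toy vectors are orthogonal
```

p Because phrase vectors are averages, frequent function words such as articles pull the vectors of short phrases towards each other, which is worth keeping in mind when comparing spans.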
p While we try to always provide good defaults in spaCy, the word2vec family of algorithms gives you a lot of knobs to twiddle, so you might benefit from custom trained vectors. You can get expert help on this by #[a(href="mailto:" + email) contacting us] about consulting.
//- The developers of Gensim, #[a(href="https://rare-technologies.com/" target="_blank") RaRe Technologies], also offer excellent services to help you put word2vec into production.
//- +h2("Performance") Performance
//- +table(["", "POS acc", "Dep (UAS/LAS)", "NER (Prec/Rec/F1)"])
//- +row
//- +cell spaCy German
//- +cell 97.56
//- +cell 92.22/90.14
//- +cell 82.95/73.76/78.08
+h2("Caveats") Caveats
p With the German parser potentially returning non-projective structures, some assumptions about syntactic structures that would hold for the English parser don't hold for the German one. For example, the subtree of a particular token doesn't necessarily span a consecutive substring of the input sentence anymore. Furthermore, a token may have no direct left dependents but can still have a left edge (the left-most descendant of the token) that is further left of the token.
+code('python','Caveats with non-projectivity').
    doc = nlp(u'Den Berliner hat der Hund nicht gebissen.')
    # heads array: [1, 6, 2, 4, 2, 6, 2, 2] (second token is attached with a non-projective arc)
    # most subtrees cover a consecutive span of the input
    print([(t.i, t.orth_) for t in doc[4].subtree])
    # output:
    # [(3, u'der'), (4, u'Hund')]
    # but some subtrees have gaps
    print([(t.i, t.orth_) for t in doc[6].subtree])
    # output:
    # [(0, u'Den'), (1, u'Berliner'), (5, u'nicht'), (6, u'gebissen')]
    # the root has no left dependents:
    print(doc[2].n_lefts)
    # output:
    # 0
    # but the root's left-most descendant is not the root itself but a token further left
    print((doc[2].left_edge.i, doc[2].left_edge.orth_))
    # output:
    # (0, u'Den')
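p The same effects can be reproduced without a model, directly from the heads array in the comment above. The sketch below uses hypothetical helper functions, not spaCy API. It may be useful as a guard in code that turns a subtree into a slice of the document, since such a slice silently includes unrelated tokens whenever the subtree has gaps.

```python
def subtree(heads, i):
    """Indices of token i and all of its descendants, in sentence order."""
    members = {i}
    changed = True
    while changed:
        changed = False
        for child, head in enumerate(heads):
            if head in members and child not in members:
                members.add(child)
                changed = True
    return sorted(members)


def is_contiguous(indices):
    """True if the sorted indices form a gap-free range."""
    return indices == list(range(indices[0], indices[-1] + 1))


def n_lefts(heads, i):
    """Number of direct dependents to the left of token i."""
    return sum(1 for child in range(i) if heads[child] == i)


# Den Berliner hat der Hund nicht gebissen .
heads = [1, 6, 2, 4, 2, 6, 2, 2]

print(subtree(heads, 4), is_contiguous(subtree(heads, 4)))  # [3, 4] True
print(subtree(heads, 6), is_contiguous(subtree(heads, 6)))  # [0, 1, 5, 6] False
print(n_lefts(heads, 2))     # 0: the root has no direct left dependents
print(subtree(heads, 2)[0])  # 0: but its left edge is the first token
```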
+h2("Data-sources") Data sources
p The German model is trained on the German #[a(href="http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger.html" target="_blank") TiGer treebank] converted to dependencies. The language-specific part-of-speech tags use the #[a(href="http://www.sfs.uni-tuebingen.de/resources/stts-1999.pdf" target="_blank") Stuttgart-Tübingen Tag Set (STTS)] (document in German). The model for named entity recognition is trained on the #[a(href="https://www.lt.tu-darmstadt.de/de/data/german-named-entity-recognition/" target="_blank") German Named Entity Recognition Data] from the TU Darmstadt. For estimating word probabilities we rely on data provided by the #[a(href="http://hpsg.fu-berlin.de/cow/" target="_blank") COW project]. Word vectors and Brown clusters are computed on a combination of the German Wikipedia and the German part of #[a(href="http://opus.lingfil.uu.se/OpenSubtitles2016.php" target="_blank") OpenSubtitles2016] which is based on data from #[a(href="http://www.opensubtitles.org/" target="_blank") opensubtitles.org]. Buy these people a beer and a cookie when you meet them :).
+h2("call") Want spaCy to speak your language?
p There is still a lot to do for German, but that doesn't mean spaCy can't start learning another language in the meantime. You can advocate for what languages should be added next on the #[a(href="https://reddit.com/r/" + profiles.reddit) spaCy subreddit], or #[a(href="mailto:" + email) get in touch] about sponsoring development.