//- ----------------------------------
//- 💫 DOCS > ANNOTATION SPECS
//- ----------------------------------
+section("annotation")
    +h(2, "annotation").
        Annotation Specifications

    p.
        This document describes the target annotations spaCy is trained to predict.
        This is currently a work in progress. Please ask questions on the
        #[+a("https://github.com/" + SOCIAL.github + "/spaCy/issues") issue tracker],
        so that the answers can be integrated here to improve the documentation.
    +section("annotation-tokenization")
        +h(3, "annotation-tokenization").
            Tokenization

        p.
            Tokenization standards are based on the OntoNotes 5 corpus. The
            tokenizer differs from most by including tokens for significant
            whitespace. Any sequence of whitespace characters beyond a single
            space (' ') is included as a token. For instance:
        +code.
            from spacy.en import English
            # The parser is not needed to inspect tokenization.
            nlp = English(parser=False)
            # Note the two spaces between 'spaces' and 'and'.
            tokens = nlp('Some\nspaces  and\ttab characters')
            print([t.orth_ for t in tokens])
        p Which produces:

        +code.
            ['Some', '\n', 'spaces', ' ', 'and', '\t', 'tab', 'characters']
        p.
            Whitespace tokens are useful for much the same reason punctuation
            tokens are: whitespace is often an important delimiter in the text.
            By preserving it in the token output, we are able to maintain a
            simple alignment between the tokens and the original string, and
            we ensure that no information is lost during processing.
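        p.
            The alignment property described above can be illustrated with a
            minimal sketch in plain Python. This is not spaCy's tokenizer: the
            function names and the (text, trailing space) pair representation
            are assumptions made for illustration, and punctuation handling is
            ignored entirely. It only shows how keeping extra whitespace as
            tokens makes the original string recoverable.

```python
import re


def tokenize_keep_whitespace(text):
    """Split on whitespace, keeping whitespace beyond a single space as tokens.

    Returns (token_text, followed_by_space) pairs. Hypothetical helper,
    not part of spaCy.
    """
    tokens = []
    for match in re.finditer(r"\S+|\s+", text):
        chunk = match.group()
        if chunk.isspace():
            # One leading space is recorded on the previous token, not as a token.
            if chunk.startswith(" ") and tokens and not tokens[-1][1]:
                tokens[-1][1] = True
                chunk = chunk[1:]
            # Whatever whitespace remains becomes a token of its own.
            if chunk:
                tokens.append([chunk, False])
        else:
            tokens.append([chunk, False])
    return [(token_text, space) for token_text, space in tokens]


def detokenize(tokens):
    """Rebuild the original string from (token_text, followed_by_space) pairs."""
    return "".join(text + (" " if space else "") for text, space in tokens)


text = 'Some\nspaces  and\ttab characters'
tokens = tokenize_keep_whitespace(text)
print([token_text for token_text, _ in tokens])
print(detokenize(tokens) == text)
```

        p.
            Because every character of the input ends up either in a token or
            in a trailing-space flag, joining the pairs back together
            reproduces the input exactly.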
    +section("annotation-sentence-boundary")
        +h(3, "annotation-sentence-boundary").
            Sentence boundary detection

        p.
            Sentence boundaries are calculated from the syntactic parse tree, so
            features such as punctuation and capitalisation play an important but
            non-decisive role in determining the sentence boundaries. Usually
            this means that the sentence boundaries will at least coincide with
            clause boundaries, even given poorly punctuated text.
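        p.
            To make the idea concrete, here is a small sketch of how sentence
            spans could be read off a dependency tree. It is an illustration
            only, not spaCy's implementation: the head indices are written by
            hand rather than produced by a parser, the function name is
            hypothetical, and the sketch assumes that each root's subtree
            covers a contiguous run of tokens.

```python
def sentence_spans(heads):
    """Group tokens into (start, end) spans, one per dependency root.

    heads[i] is the index of token i's head; a root is its own head.
    Hypothetical helper for illustration.
    """
    def root_of(i):
        # Follow head links upwards until a root is reached.
        while heads[i] != i:
            i = heads[i]
        return i

    spans = []
    start = 0
    for i in range(1, len(heads) + 1):
        # Close the current span at the end of input or when the root changes.
        if i == len(heads) or root_of(i) != root_of(start):
            spans.append((start, i))
            start = i
    return spans


# Unpunctuated, lowercased text with two clauses.
words = ["the", "dog", "barked", "the", "cat", "ran"]
heads = [1, 2, 2, 4, 5, 5]  # hand-written heads, not parser output

for start, end in sentence_spans(heads):
    print(words[start:end])
```

        p.
            The text contains no punctuation or capitalisation, yet the two
            roots ("barked" and "ran") still split it into two spans.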
    +section("annotation-pos-tagging")
        +h(3, "annotation-pos-tagging").
            Part-of-speech Tagging

        p.
            The part-of-speech tagger uses the OntoNotes 5 version of the Penn
            Treebank tag set. We also map the tags to the simpler Google Universal
            POS tag set. See the
            #[+a("https://github.com/" + SOCIAL.github + "/spaCy/blob/master/spacy/tagger.pyx") tagger source]
            for details.
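        p.
            As a rough illustration of what mapping fine-grained tags to
            coarse-grained ones looks like, the sketch below maps a handful of
            Penn Treebank tags to universal tags. The table is a small excerpt
            chosen for illustration and the "X" fallback is an assumption;
            spaCy's own mapping is more complete and may differ in detail.

```python
# A few Penn Treebank tags and their universal counterparts (excerpt only).
PTB_TO_UNIVERSAL = {
    "NN": "NOUN",
    "NNS": "NOUN",
    "VB": "VERB",
    "VBD": "VERB",
    "VBZ": "VERB",
    "JJ": "ADJ",
    "RB": "ADV",
    "DT": "DET",
    "IN": "ADP",
    "PRP": "PRON",
    "CD": "NUM",
    "CC": "CONJ",
}


def to_universal(ptb_tag):
    """Return the coarse tag for a Penn Treebank tag.

    Falls back to "X" for tags missing from this excerpt (an assumption).
    """
    return PTB_TO_UNIVERSAL.get(ptb_tag, "X")


# Hand-written tags for illustration, not tagger output.
tagged = [("The", "DT"), ("dogs", "NNS"), ("barked", "VBD"), ("loudly", "RB")]
print([(word, tag, to_universal(tag)) for word, tag in tagged])
```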
    +section("annotation-lemmatization")
        +h(3, "annotation-lemmatization").
            Lemmatization

        p A "lemma" is the uninflected form of a word. In English, this means:

        +list
            +item #[strong Adjectives:] The form like "happy", not "happier" or "happiest"
            +item #[strong Adverbs:] The form like "badly", not "worse" or "worst"
            +item #[strong Nouns:] The form like "dog", not "dogs"; like "child", not "children"
            +item #[strong Verbs:] The form like "write", not "writes", "writing", "wrote" or "written"

        p.
            The lemmatization data is taken from WordNet. However, we also add a
            special case for pronouns: all pronouns are lemmatized to the special
            token #[code -PRON-].
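        p.
            The rules above can be sketched as a lookup with a pronoun special
            case. This is an illustration under stated assumptions, not
            spaCy's lemmatizer: the table contains only the examples from the
            list above rather than WordNet data, and the function name and the
            coarse part-of-speech labels passed in are hypothetical.

```python
PRON_LEMMA = "-PRON-"

# (lowercased word, coarse part of speech) -> lemma, from the examples above.
LEMMA_TABLE = {
    ("happier", "ADJ"): "happy",
    ("happiest", "ADJ"): "happy",
    ("worse", "ADV"): "badly",
    ("worst", "ADV"): "badly",
    ("dogs", "NOUN"): "dog",
    ("children", "NOUN"): "child",
    ("writes", "VERB"): "write",
    ("writing", "VERB"): "write",
    ("wrote", "VERB"): "write",
    ("written", "VERB"): "write",
}


def lemmatize(word, pos):
    """Return the lemma for a word given its coarse part of speech."""
    # Special case: every pronoun maps to the same placeholder lemma.
    if pos == "PRON":
        return PRON_LEMMA
    # Words missing from the toy table are returned unchanged.
    return LEMMA_TABLE.get((word.lower(), pos), word)


for word, pos in [("She", "PRON"), ("wrote", "VERB"), ("children", "NOUN")]:
    print(word, "->", lemmatize(word, pos))
```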
    +section("annotation-dependency")
        +h(3, "annotation-dependency").
            Syntactic Dependency Parsing

        p.
            The parser is trained on data produced by the ClearNLP converter.
            Details of the annotation scheme can be found
            #[+a("http://www.mathcs.emory.edu/~choi/doc/clear-dependency-2012.pdf") here].
    +section("annotation-ner")
        +h(3, "annotation-ner").
            Named Entity Recognition

        +table(["Entity Type", "Description"])
            +row
                +cell PERSON
                +cell People, including fictional.

            +row
                +cell NORP
                +cell Nationalities or religious or political groups.

            +row
                +cell FAC
                +cell Facilities, such as buildings, airports, highways, bridges, etc.

            +row
                +cell ORG
                +cell Companies, agencies, institutions, etc.

            +row
                +cell GPE
                +cell Countries, cities, states.

            +row
                +cell LOC
                +cell Non-GPE locations, mountain ranges, bodies of water.

            +row
                +cell PRODUCT
                +cell Vehicles, weapons, foods, etc. (not services).

            +row
                +cell EVENT
                +cell Named hurricanes, battles, wars, sports events, etc.

            +row
                +cell WORK_OF_ART
                +cell Titles of books, songs, etc.

            +row
                +cell LAW
                +cell Named documents made into laws.

            +row
                +cell LANGUAGE
                +cell Any named language.
        p The following values are also annotated in a style similar to names:

        +table(["Entity Type", "Description"])
            +row
                +cell DATE
                +cell Absolute or relative dates or periods.

            +row
                +cell TIME
                +cell Times smaller than a day.

            +row
                +cell PERCENT
                +cell Percentage (including “%”).

            +row
                +cell MONEY
                +cell Monetary values, including unit.

            +row
                +cell QUANTITY
                +cell Measurements, as of weight or distance.

            +row
                +cell ORDINAL
                +cell "first", "second", etc.

            +row
                +cell CARDINAL
                +cell Numerals that do not fall under another type.
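        p.
            For readers who want to work with these labels programmatically,
            the sketch below shows one way to represent entity annotations and
            check them against the two tables above. The (start, end, label)
            tuple format, the function names and the example annotations are
            assumptions made for illustration. They are hand-written, not
            spaCy output, and this is not spaCy's API.

```python
# Labels from the first table (names) and the second table (values).
NAME_TYPES = {
    "PERSON", "NORP", "FAC", "ORG", "GPE", "LOC",
    "PRODUCT", "EVENT", "WORK_OF_ART", "LAW", "LANGUAGE",
}
VALUE_TYPES = {
    "DATE", "TIME", "PERCENT", "MONEY", "QUANTITY", "ORDINAL", "CARDINAL",
}


def label_kind(label):
    """Classify a label as a "name" or a "value"; reject unknown labels."""
    if label in NAME_TYPES:
        return "name"
    if label in VALUE_TYPES:
        return "value"
    raise ValueError("unknown entity type: %r" % label)


def describe(text, spans):
    """Resolve (start, end, label) character spans into (substring, label, kind)."""
    return [(text[start:end], label, label_kind(label))
            for start, end, label in spans]


text = "Apple paid $1 billion in 2016"
# Hand-written character offsets, for illustration only.
spans = [(0, 5, "ORG"), (11, 21, "MONEY"), (25, 29, "DATE")]
for entry in describe(text, spans):
    print(entry)
```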