spaCy/website/docs/api/annotation.jade

//- 💫 DOCS > API > ANNOTATION SPECS
include ../../_includes/_mixins
p This document describes the target annotations spaCy is trained to predict.
+h(2, "tokenization") Tokenization
p
| Tokenization standards are based on the
| #[+a("https://catalog.ldc.upenn.edu/LDC2013T19") OntoNotes 5] corpus.
| The tokenizer differs from most by including tokens for significant
| whitespace. Any sequence of whitespace characters beyond a single space
| (#[code ' ']) is included as a token.
+aside-code("Example").
from spacy.en import English
nlp = English(parser=False)
tokens = nlp('Some\nspaces  and\ttab characters')
print([t.orth_ for t in tokens])
# ['Some', '\n', 'spaces', ' ', 'and', '\t', 'tab', 'characters']
p
| The whitespace tokens are useful for much the same reason punctuation is:
| whitespace is often an important delimiter in the text. By preserving it in the
| token output, we are able to maintain a simple alignment between the
| tokens and the original string, and we ensure that no information is
| lost during processing.
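p
| The alignment property can be sketched in plain Python. This is a
| simplified illustration, not spaCy's tokenizer: the helper names
| #[code tokenize_keep_whitespace] and #[code rebuild] are hypothetical, and
| the rule assumed here is that a single space after a word is stored as a
| flag on that word, while any remaining whitespace becomes its own token.
```python
import re


def tokenize_keep_whitespace(text):
    """Return (token, has_trailing_space) pairs for `text`.

    Hypothetical sketch: words are runs of non-whitespace characters.
    """
    pairs = []
    for match in re.finditer(r"\S+|\s+", text):
        chunk = match.group()
        if chunk.isspace():
            # A single space directly after a word belongs to that word.
            if pairs and chunk[0] == " ":
                pairs[-1] = (pairs[-1][0], True)
                chunk = chunk[1:]
            # Whatever whitespace is left over becomes a token of its own.
            if chunk:
                pairs.append((chunk, False))
        else:
            pairs.append((chunk, False))
    return pairs


def rebuild(pairs):
    """Reassemble the original string from the (token, flag) pairs."""
    return "".join(tok + (" " if space else "") for tok, space in pairs)


text = "Some\nspaces  and\ttab characters"
pairs = tokenize_keep_whitespace(text)
print([tok for tok, _ in pairs])
print(rebuild(pairs) == text)
```
p
| Because every character of the input ends up either in a token or in a
| trailing-space flag, the original string can be recovered exactly.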
+h(2, "sentence-boundary") Sentence boundary detection
p
| Sentence boundaries are calculated from the syntactic parse tree, so
| features such as punctuation and capitalisation play an important but
| non-decisive role in determining the sentence boundaries. Usually this
| means that the sentence boundaries will at least coincide with clause
| boundaries, even given poorly punctuated text.
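p
| To make the idea concrete, here is a minimal sketch of reading sentence
| spans off a dependency tree. It assumes a hypothetical representation in
| which #[code heads[i]] is the index of token #[code i]'s head and a root
| token is its own head; the function name #[code sentence_spans] is
| illustrative and this is not spaCy's implementation.
```python
def sentence_spans(heads):
    """Group tokens into (start, end) spans that share the same root.

    Assumes a well-formed forest (no cycles) in which each sentence
    covers a contiguous run of tokens.
    """
    def root_of(i):
        # Follow head links until reaching a token that heads itself.
        while heads[i] != i:
            i = heads[i]
        return i

    spans = []
    start = 0
    for i in range(1, len(heads) + 1):
        # Close the current span at the end of the input or when the
        # next token hangs off a different root.
        if i == len(heads) or root_of(i) != root_of(start):
            spans.append((start, i))
            start = i
    return spans


# Unpunctuated input: two clauses, each with its own root verb.
words = ["I", "left", "she", "stayed"]
heads = [1, 1, 3, 3]
for start, end in sentence_spans(heads):
    print(words[start:end])
```
p
| In this example the boundary falls between the two clauses even though
| the text contains no punctuation or capitalisation cue at that point.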
+h(2, "pos-tagging") Part-of-speech Tagging
p
| The part-of-speech tagger uses the
| #[+a("https://catalog.ldc.upenn.edu/LDC2013T19") OntoNotes 5] version of
| the Penn Treebank tag set. We also map the tags to the simpler Google
| Universal POS Tag set. See
| #[+src(gh("spaCy", "spacy/tagger.pyx")) tagger.pyx] for details.
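p
| The fine-to-coarse mapping can be pictured as a lookup table. The sketch
| below covers only a handful of Penn Treebank tags and is not the table
| spaCy ships; the names #[code PTB_TO_UNIVERSAL] and #[code to_universal]
| are hypothetical, and falling back to #[code X] for unlisted tags is a
| choice made for this example.
```python
# Partial, illustrative mapping from Penn Treebank tags to universal tags.
PTB_TO_UNIVERSAL = {
    "NN": "NOUN", "NNS": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB",
    "VBN": "VERB", "VBP": "VERB", "VBZ": "VERB",
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
    "RB": "ADV", "RBR": "ADV", "RBS": "ADV",
    "DT": "DET", "IN": "ADP", "CD": "NUM",
    "CC": "CONJ", "PRP": "PRON",
}


def to_universal(tag):
    """Map a fine-grained tag to a coarse one, defaulting to 'X'."""
    return PTB_TO_UNIVERSAL.get(tag, "X")


tagged = [("dogs", "NNS"), ("bark", "VBP"), ("loudly", "RB")]
coarse = [(word, to_universal(tag)) for word, tag in tagged]
print(coarse)
```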
+h(2, "lemmatization") Lemmatization
p A "lemma" is the uninflected form of a word. In English, this means:
+list
+item #[strong Adjectives]: The form like "happy", not "happier" or "happiest"
+item #[strong Adverbs]: The form like "badly", not "worse" or "worst"
+item #[strong Nouns]: The form like "dog", not "dogs"; like "child", not "children"
+item #[strong Verbs]: The form like "write", not "writes", "writing", "wrote" or "written"
p
| The lemmatization data is taken from
| #[+a("https://wordnet.princeton.edu") WordNet]. However, we also add a
| special case for pronouns: all pronouns are lemmatized to the special
| token #[code -PRON-].
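p
| The rules above can be sketched as a small lookup-based lemmatizer. The
| tables here are hand-written toy data for the examples in this section,
| not the WordNet data spaCy uses, and #[code lemmatize],
| #[code IRREGULAR] and #[code SUFFIX_RULES] are hypothetical names.
```python
# Toy exception table keyed by (lowercased word, coarse part of speech).
IRREGULAR = {
    ("happier", "ADJ"): "happy",
    ("happiest", "ADJ"): "happy",
    ("worse", "ADV"): "badly",
    ("worst", "ADV"): "badly",
    ("children", "NOUN"): "child",
    ("wrote", "VERB"): "write",
    ("written", "VERB"): "write",
}

# Toy suffix rules: (suffix to strip, replacement). Far from complete.
SUFFIX_RULES = {
    "NOUN": [("s", "")],
    "VERB": [("ing", "e"), ("s", "")],
}


def lemmatize(word, pos):
    """Return a lowercased lemma for `word` given its part of speech."""
    if pos == "PRON":
        # Special case: every pronoun maps to the same placeholder.
        return "-PRON-"
    lower = word.lower()
    if (lower, pos) in IRREGULAR:
        return IRREGULAR[(lower, pos)]
    for suffix, replacement in SUFFIX_RULES.get(pos, []):
        if lower.endswith(suffix) and len(lower) > len(suffix):
            return lower[: -len(suffix)] + replacement
    return lower


for word, pos in [("dogs", "NOUN"), ("children", "NOUN"),
                  ("writing", "VERB"), ("She", "PRON")]:
    print(word, "->", lemmatize(word, pos))
```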
+h(2, "dependency-parsing") Syntactic Dependency Parsing
p
| The parser is trained on data produced by the
| #[+a("http://www.clearnlp.com") ClearNLP] converter. Details of the
| annotation scheme can be found
| #[+a("http://www.mathcs.emory.edu/~choi/doc/clear-dependency-2012.pdf") here].
+h(2, "named-entities") Named Entity Recognition
+table([ "Type", "Description" ])
+row
+cell #[code PERSON]
+cell People, including fictional
+row
+cell #[code NORP]
+cell Nationalities or religious or political groups
+row
+cell #[code FACILITY]
+cell Buildings, airports, highways, bridges, etc.
+row
+cell #[code ORG]
+cell Companies, agencies, institutions, etc.
+row
+cell #[code GPE]
+cell Countries, cities, states
+row
+cell #[code LOC]
+cell Non-GPE locations, mountain ranges, bodies of water
+row
+cell #[code PRODUCT]
+cell Objects, vehicles, foods, etc. (not services)
+row
+cell #[code EVENT]
+cell Named hurricanes, battles, wars, sports events, etc.
+row
+cell #[code WORK_OF_ART]
+cell Titles of books, songs, etc.
+row
+cell #[code LANGUAGE]
+cell Any named language
p The following values are also annotated in a style similar to names:
+table([ "Type", "Description" ])
+row
+cell #[code DATE]
+cell Absolute or relative dates or periods
+row
+cell #[code TIME]
+cell Times smaller than a day
+row
+cell #[code PERCENT]
+cell Percentage, including "%"
+row
+cell #[code MONEY]
+cell Monetary values, including unit
+row
+cell #[code QUANTITY]
+cell Measurements, as of weight or distance
+row
+cell #[code ORDINAL]
+cell "first", "second", etc.
+row
+cell #[code CARDINAL]
+cell Numerals that do not fall under another type