mirror of https://github.com/explosion/spaCy.git
* Add spec.jade
parent
b57a3ddd7e
commit
ba00c72505
@@ -0,0 +1,123 @@
extends ./outline.jade

mixin columns(...names)
  tr
    each name in names
      th= name


mixin row(...cells)
  tr
    each cell in cells
      td= cell


block body_block
  article(class="page docs-page")
    p.
      This document describes the target annotations spaCy is trained to predict.
      This is currently a work in progress. Please ask questions on the issue tracker,
      so that the answers can be integrated here to improve the documentation.

    h2 Tokenization

    p Tokenization standards are based on the OntoNotes 5 corpus.

    p.
      The tokenizer differs from most by including tokens for significant
      whitespace. Any sequence of whitespace characters beyond a single space
      (' ') is included as a token. For instance:

    pre.language-python
      code
        | from spacy.en import English
        | nlp = English(parse=False)
        | tokens = nlp('Some\nspaces  and\ttab characters')
        | print([t.orth_ for t in tokens])

    p Which produces:

    pre.language-python
      code
        | ['Some', '\n', 'spaces', ' ', 'and', '\t', 'tab', 'characters']

    p.
      The whitespace tokens are useful for much the same reason punctuation is:
      it's often an important delimiter in the text. By preserving
      it in the token output, we are able to maintain a simple alignment
      between the tokens and the original string, and we ensure that no
      information is lost during processing.
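
The whitespace rule and the alignment property can be illustrated with a small standalone sketch. This is a toy tokenizer written for this document, not spaCy's implementation: the function name `toy_tokenize` and its splitting rule (one space after a word is treated as a plain delimiter, any other whitespace becomes a token) are assumptions chosen to mirror the behaviour described above. It also ignores punctuation splitting entirely.

```python
import re


def toy_tokenize(text):
    """Toy illustration only (not spaCy's tokenizer).

    Returns (token, start_offset) pairs. A single space that follows a
    word is treated as an ordinary delimiter and dropped; any other
    whitespace is kept as a token of its own.
    """
    tokens = []
    for match in re.finditer(r"\S+|\s+", text):
        chunk, start = match.group(), match.start()
        if chunk.isspace():
            # Consume one leading space as the normal word delimiter.
            if chunk.startswith(" ") and start > 0:
                chunk, start = chunk[1:], start + 1
            if not chunk:
                continue
        tokens.append((chunk, start))
    return tokens


text = "Some\nspaces  and\ttab characters"
tokens = toy_tokenize(text)
print([tok for tok, _ in tokens])
# ['Some', '\n', 'spaces', ' ', 'and', '\t', 'tab', 'characters']

# Every token maps back to an exact slice of the original string.
print(all(text[start:start + len(tok)] == tok for tok, start in tokens))
# True
```

Because each token carries its character offset, a consumer can always recover the exact source span for a token, which is the alignment the paragraph above refers to.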

    h3 Sentence Boundary Detection

    p.
      Sentence boundaries are calculated from the syntactic parse tree, so
      features such as punctuation and capitalisation play an important but
      non-decisive role in determining the sentence boundaries. Usually this
      means that the sentence boundaries will at least coincide with clause
      boundaries, even given poorly punctuated text.
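
To picture how boundaries can fall out of a parse, suppose each token records the index of its syntactic head, so that every root of the resulting forest defines one sentence. The sketch below is only an illustration under that assumption: the `heads` list representation (a root is its own head) and the function name `sentence_spans` are hypothetical and are not spaCy's API.

```python
def sentence_spans(heads):
    """Group token indices into sentences, one per root of the parse.

    heads[i] is the index of token i's head; a root points at itself.
    Assumes the head links contain no cycles other than root self-loops.
    """
    def root_of(i):
        # Follow head links upwards until a root is reached.
        while heads[i] != i:
            i = heads[i]
        return i

    spans = []
    for i in range(len(heads)):
        root = root_of(i)
        if spans and spans[-1][0] == root:
            spans[-1][1].append(i)
        else:
            # The root changed, so a new sentence starts at this token.
            spans.append((root, [i]))
    return [indices for _, indices in spans]


# Unpunctuated input: "I came home I slept"
words = ["I", "came", "home", "I", "slept"]
heads = [1, 1, 1, 4, 4]  # "came" and "slept" are the two roots
print([[words[i] for i in span] for span in sentence_spans(heads)])
# [['I', 'came', 'home'], ['I', 'slept']]
```

Note that no punctuation or capitalisation was needed here; the boundary comes entirely from the hypothetical parse.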

    h3 Part-of-speech Tagging

    p.
      The part-of-speech tagger uses the OntoNotes 5 version of the Penn Treebank
      tag set. We also map the tags to the simpler Google Universal POS Tag set.

      Details here: https://github.com/honnibal/spaCy/blob/master/spacy/en/pos.pyx#L124
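
The idea of mapping fine-grained Penn Treebank tags onto coarse universal tags can be sketched with a lookup table. The table below is a small, partial illustration based on the published Universal POS tag set; it is an assumption for this example, and the mapping spaCy actually uses is the one in the file linked above and may differ. The names `PTB_TO_UNIVERSAL` and `to_universal`, and the fallback to `X`, are choices made for this sketch.

```python
# Partial, illustrative mapping only; not copied from spaCy's source.
PTB_TO_UNIVERSAL = {
    "NN": "NOUN", "NNS": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBZ": "VERB",
    "JJ": "ADJ", "JJR": "ADJ",
    "RB": "ADV",
    "DT": "DET",
    "IN": "ADP",
}


def to_universal(tag):
    """Return the coarse tag, or "X" for tags missing from this sketch."""
    return PTB_TO_UNIVERSAL.get(tag, "X")


tagged = [("The", "DT"), ("dogs", "NNS"), ("barked", "VBD"), ("loudly", "RB")]
print([(word, to_universal(tag)) for word, tag in tagged])
# [('The', 'DET'), ('dogs', 'NOUN'), ('barked', 'VERB'), ('loudly', 'ADV')]
```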

    h3 Lemmatization

    p.
      A "lemma" is the uninflected form of a word. In English, this means:

    ul
      li Adjectives: The form like "happy", not "happier" or "happiest"
      li Adverbs: The form like "badly", not "worse" or "worst"
      li Nouns: The form like "dog", not "dogs"; like "child", not "children"
      li Verbs: The form like "write", not "writes", "writing", "wrote" or "written"

    p.
      The lemmatization data is taken from WordNet. However, we also add a
      special case for pronouns: all pronouns are lemmatized to the special
      token -PRON-.
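
The pronoun special case can be illustrated with a toy lookup. Everything here is an assumption for illustration: the tiny exception table stands in for the WordNet data (which is not reproduced), and the names `TOY_LEMMAS` and `toy_lemmatize`, the coarse tag strings, and the lowercasing fallback are hypothetical rather than part of spaCy.

```python
# Hand-written stand-in for the WordNet-derived data; illustration only.
TOY_LEMMAS = {
    ("ADJ", "happier"): "happy",
    ("ADJ", "happiest"): "happy",
    ("NOUN", "dogs"): "dog",
    ("NOUN", "children"): "child",
    ("VERB", "wrote"): "write",
    ("VERB", "written"): "write",
}

PRONOUN_LEMMA = "-PRON-"


def toy_lemmatize(word, pos):
    """Look up a lemma, mapping every pronoun to the special token."""
    if pos == "PRON":
        return PRONOUN_LEMMA
    # Fall back to the lowercased word when no exception is listed.
    return TOY_LEMMAS.get((pos, word.lower()), word.lower())


print(toy_lemmatize("She", "PRON"))       # -PRON-
print(toy_lemmatize("wrote", "VERB"))     # write
print(toy_lemmatize("children", "NOUN"))  # child
```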


    h3 Syntactic Dependency Parsing

    p.
      The parser is trained on data produced by the ClearNLP converter. Details
      of the annotation scheme can be found here: http://www.mathcs.emory.edu/~choi/doc/clear-dependency-2012.pdf

    h3 Named Entity Recognition

    table
      thead
        +columns("Entity Type", "Description")

      tbody
        +row("PERSON", "People, including fictional.")
        +row("NORP", "Nationalities or religious or political groups.")
        +row("FACILITY", "Buildings, airports, highways, bridges, etc.")
        +row("ORG", "Companies, agencies, institutions, etc.")
        +row("GPE", "Countries, cities, states.")
        +row("LOC", "Non-GPE locations, mountain ranges, bodies of water.")
        +row("PRODUCT", "Vehicles, weapons, foods, etc. (Not services.)")
        +row("EVENT", "Named hurricanes, battles, wars, sports events, etc.")
        +row("WORK_OF_ART", "Titles of books, songs, etc.")
        +row("LAW", "Named documents made into laws.")
        +row("LANGUAGE", "Any named language.")

    p The following values are also annotated in a style similar to names:

    table
      thead
        +columns("Entity Type", "Description")

      tbody
        +row("DATE", "Absolute or relative dates or periods")
        +row("TIME", "Times smaller than a day")
        +row("PERCENT", 'Percentage (including “%”)')
        +row("MONEY", "Monetary values, including unit")
        +row("QUANTITY", "Measurements, as of weight or distance")
        +row("ORDINAL", '"first", "second"')
        +row("CARDINAL", "Numerals that do not fall under another type")