====================
Annotation Standards
====================

This document describes the target annotations spaCy is trained to predict.

This is currently a work in progress. Please ask questions on the issue tracker,
so that the answers can be integrated here to improve the documentation.

https://github.com/honnibal/spaCy/issues

English
=======

Tokenization
------------

Tokenization standards are based on the OntoNotes 5 corpus.

The tokenizer differs from most by including tokens for significant whitespace.
Any sequence of whitespace characters beyond a single space (' ') is included
as a token. For instance:

>>> from spacy.en import English
>>> nlp = English(parse=False)
>>> # Note the two spaces after 'spaces': the second one becomes a token.
>>> tokens = nlp(u'Some\nspaces  and\ttab characters')
>>> print [t.orth_ for t in tokens]
[u'Some', u'\n', u'spaces', u' ', u'and', u'\t', u'tab', u'characters']

The whitespace tokens are useful for much the same reason punctuation is:
whitespace is often an important delimiter in the text. By preserving it in the
token output, we maintain a simple alignment between the tokens and the
original string, and we ensure that the token stream does not lose information.

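The whitespace rule and the alignment property can be sketched in plain Python.
This is a minimal illustration under stated assumptions, not spaCy's tokenizer:
it splits on whitespace only, it treats a single space after a word as an
ordinary separator, and the names ``tokenize_keep_whitespace`` and
``detokenize`` are hypothetical.

```python
import re


def tokenize_keep_whitespace(text):
    """Return (token, start_offset) pairs, keeping significant whitespace."""
    tokens = []
    for match in re.finditer(r"\S+|\s+", text):
        chunk, start = match.group(), match.start()
        if chunk.isspace():
            # Assumption: one space directly after a word is the ordinary
            # separator, so it is dropped; anything beyond it is a token.
            if chunk.startswith(" ") and start > 0:
                chunk, start = chunk[1:], start + 1
            if not chunk:
                continue
        tokens.append((chunk, start))
    return tokens


def detokenize(tokens):
    """Rebuild the string; gaps between offsets are the dropped single spaces."""
    pieces = []
    position = 0
    for token, start in tokens:
        pieces.append(" " * (start - position))
        pieces.append(token)
        position = start + len(token)
    return "".join(pieces)


text = "Some\nspaces  and\ttab characters"
tokens = tokenize_keep_whitespace(text)
words = [token for token, _ in tokens]
```

Because each token carries its start offset, ``text[start:start + len(token)]``
recovers the token, and ``detokenize(tokens)`` rebuilds the example string. A
single trailing space at the very end of a text would be lost by this sketch.
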
Sentence boundary detection
---------------------------

Sentence boundaries are calculated from the syntactic parse tree, so features
such as punctuation and capitalisation play an important but non-decisive role
in determining the sentence boundaries. Usually this means that the sentence
boundaries will at least coincide with clause boundaries, even given poorly
punctuated text.

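To illustrate how boundaries can fall out of a parse, the sketch below groups
tokens by the root of their dependency tree. It is not spaCy's implementation:
the head indices are written by hand for an unpunctuated example, the
convention that a root token is its own head is an assumption of the sketch,
and ``find_root`` and ``split_sentences`` are hypothetical names.

```python
def find_root(heads, index):
    """Follow head pointers until reaching a token that is its own head."""
    while heads[index] != index:
        index = heads[index]
    return index


def split_sentences(words, heads):
    """Start a new sentence whenever the root of the current token changes."""
    sentences = []
    current, current_root = [], None
    for i, word in enumerate(words):
        root = find_root(heads, i)
        if current and root != current_root:
            sentences.append(current)
            current = []
        current.append(word)
        current_root = root
    if current:
        sentences.append(current)
    return sentences


# Hand-written parse of unpunctuated, uncapitalised text (hypothetical data).
words = ["i", "saw", "her", "she", "waved"]
heads = [1, 1, 1, 4, 4]  # "saw" and "waved" are the two roots
sentences = split_sentences(words, heads)
```

Here the two clauses are separated even though the text contains no
punctuation or capitalisation, because each clause has its own root.
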
Part-of-speech Tagging
----------------------

The part-of-speech tagger uses the OntoNotes 5 version of the Penn Treebank
tag set. We also map the tags to the simpler Google Universal POS Tag set.

Details here: https://github.com/honnibal/spaCy/blob/master/spacy/en/pos.pyx#L124

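The idea of mapping fine-grained tags to coarse ones can be sketched with a
lookup table. The table below is a small, partial illustration that follows
the published Universal POS mapping for a handful of Penn Treebank tags. It is
not a copy of spaCy's table, and the fallback to ``X`` for unlisted tags, as
well as the name ``to_universal``, are assumptions of this sketch.

```python
# Partial Penn Treebank -> Universal POS mapping, for illustration only.
PENN_TO_UNIVERSAL = {
    "NN": "NOUN",
    "NNS": "NOUN",
    "VB": "VERB",
    "VBD": "VERB",
    "VBZ": "VERB",
    "JJ": "ADJ",
    "RB": "ADV",
    "DT": "DET",
    "IN": "ADP",
    "PRP": "PRON",
    "CD": "NUM",
    "CC": "CONJ",
}


def to_universal(penn_tag):
    """Map a Penn Treebank tag to a coarse tag; unlisted tags fall back to X."""
    return PENN_TO_UNIVERSAL.get(penn_tag, "X")


# Hand-tagged example (hypothetical data, not tagger output).
tagged = [("The", "DT"), ("dogs", "NNS"), ("barked", "VBD"), ("loudly", "RB")]
coarse = [(word, to_universal(tag)) for word, tag in tagged]
```

Several fine-grained tags collapse into one coarse tag, which is what makes
the universal set simpler to work with.
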
Lemmatization
-------------

A "lemma" is the uninflected form of a word. In English, this means:

* Adjectives: The form like "happy", not "happier" or "happiest"
* Adverbs: The form like "badly", not "worse" or "worst"
* Nouns: The form like "dog", not "dogs"; like "child", not "children"
* Verbs: The form like "write", not "writes", "writing", "wrote" or "written"

The lemmatization data is taken from WordNet. However, we also add a special
case for pronouns: all pronouns are lemmatized to the special token -PRON-.

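The behaviour described above can be sketched as an exception table plus a few
suffix rules, with the pronoun special case checked first. The tables here are
tiny and hand-written from the examples in this section. They are not the
WordNet data, the coarse part-of-speech names are assumptions, and
``lemmatize`` is a hypothetical helper rather than spaCy's API.

```python
PRONOUN_LEMMA = "-PRON-"

# Irregular forms, keyed by (part of speech, word). Illustration only.
EXCEPTIONS = {
    ("adj", "happier"): "happy",
    ("adj", "happiest"): "happy",
    ("adv", "worse"): "badly",
    ("adv", "worst"): "badly",
    ("noun", "children"): "child",
    ("verb", "wrote"): "write",
    ("verb", "written"): "write",
}

# (suffix, replacement) pairs tried in order. Far cruder than real rules.
SUFFIX_RULES = {
    "noun": [("s", "")],
    "verb": [("ing", "e"), ("s", "")],
}


def lemmatize(word, pos):
    """Return the lemma of word, given a coarse part-of-speech label."""
    if pos == "pron":
        # Special case: every pronoun shares one lemma.
        return PRONOUN_LEMMA
    word = word.lower()
    if (pos, word) in EXCEPTIONS:
        return EXCEPTIONS[(pos, word)]
    for suffix, replacement in SUFFIX_RULES.get(pos, []):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


lemmas = [lemmatize(w, p) for w, p in [("dogs", "noun"), ("writing", "verb"), ("She", "pron")]]
```

The suffix rules are applied without checking the result against a lexicon,
so they only behave sensibly on the example words used here.
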
Syntactic Dependency Parsing
----------------------------

The parser is trained on data produced by the ClearNLP converter. Details of
the annotation scheme can be found here:

http://www.mathcs.emory.edu/~choi/doc/clear-dependency-2012.pdf

Named Entity Recognition
------------------------

+--------------+-----------------------------------------------------+
| Type         | Description                                         |
+==============+=====================================================+
| PERSON       | People, including fictional                         |
+--------------+-----------------------------------------------------+
| NORP         | Nationalities or religious or political groups      |
+--------------+-----------------------------------------------------+
| FACILITY     | Buildings, airports, highways, bridges, etc.        |
+--------------+-----------------------------------------------------+
| ORGANIZATION | Companies, agencies, institutions, etc.             |
+--------------+-----------------------------------------------------+
| GPE          | Countries, cities, states                           |
+--------------+-----------------------------------------------------+
| LOCATION     | Non-GPE locations, mountain ranges, bodies of water |
+--------------+-----------------------------------------------------+
| PRODUCT      | Vehicles, weapons, foods, etc. (Not services)       |
+--------------+-----------------------------------------------------+
| EVENT        | Named hurricanes, battles, wars, sports events, etc.|
+--------------+-----------------------------------------------------+
| WORK OF ART  | Titles of books, songs, etc.                        |
+--------------+-----------------------------------------------------+
| LAW          | Named documents made into laws                      |
+--------------+-----------------------------------------------------+
| LANGUAGE     | Any named language                                  |
+--------------+-----------------------------------------------------+

The following values are also annotated in a style similar to names:

+--------------+---------------------------------------------+
| Type         | Description                                 |
+==============+=============================================+
| DATE         | Absolute or relative dates or periods       |
+--------------+---------------------------------------------+
| TIME         | Times smaller than a day                    |
+--------------+---------------------------------------------+
| PERCENT      | Percentage (including "%")                  |
+--------------+---------------------------------------------+
| MONEY        | Monetary values, including unit             |
+--------------+---------------------------------------------+
| QUANTITY     | Measurements, as of weight or distance      |
+--------------+---------------------------------------------+
| ORDINAL      | "first", "second"                           |
+--------------+---------------------------------------------+
| CARDINAL     | Numerals that do not fall under another type|
+--------------+---------------------------------------------+