* Add draft doc describing annotation standards

2015-07-08 10:27:35 +02:00 · 99e84488da
parent 68eff957a5
commit 99e84488da
1 changed file with 116 additions and 0 deletions
--- a/docs/source/annotation.rst
+++ b/docs/source/annotation.rst
@@ -0,0 +1,116 @@
+====================
+Annotation Standards
+====================
+
+This document describes the target annotations spaCy is trained to predict.
+
+This is currently a work in progress. Please ask questions on the issue tracker,
+so that the answers can be integrated here to improve the documentation.
+
+https://github.com/honnibal/spaCy/issues
+
+English
+=======
+
+Tokenization
+------------
+
+Tokenization standards are based on the OntoNotes 5 corpus.
+
+The tokenizer differs from most by including tokens for significant whitespace.
+Any sequence of whitespace characters beyond a single space (' ') is included
+as a token. For instance:
+
+    >>> from spacy.en import English
+    >>> nlp = English(parse=False)
+    >>> tokens = nlp(u'Some\nspaces  and\ttab characters')
+    >>> print [t.orth_ for t in tokens]
+    [u'Some', u'\n', u'spaces', u' ', u'and', u'\t', u'tab', u'characters']
+
+The whitespace tokens are useful for much the same reason punctuation is: whitespace
+is often an important delimiter in the text.  By preserving it in the token output,
+we are able to maintain a simple alignment between the tokens and the original
+string, and we ensure that the token stream does not lose information.
+
+Sentence boundary detection
+---------------------------
+
+Sentence boundaries are calculated from the syntactic parse tree, so features
+such as punctuation and capitalisation play an important but non-decisive role
+in determining the sentence boundaries.  Usually this means that the sentence
+boundaries will at least coincide with clause boundaries, even given poorly
+punctuated text.
+
+Part-of-speech Tagging
+----------------------
+
+The part-of-speech tagger uses the OntoNotes 5 version of the Penn Treebank
+tag set.  We also map the tags to the simpler Google Universal POS Tag set.
+
+Details here: https://github.com/honnibal/spaCy/blob/master/spacy/en/pos.pyx#L124
+
+Lemmatization
+-------------
+
+A "lemma" is the uninflected form of a word. In English, this means:
+
+* Adjectives: The form like "happy", not "happier" or "happiest"
+* Adverbs: The form like "badly", not "worse" or "worst"
+* Nouns: The form like "dog", not "dogs"; like "child", not "children"
+* Verbs: The form like "write", not "writes", "writing", "wrote" or "written" 
+
+The lemmatization data is taken from WordNet. However, we also add a special
+case for pronouns: all pronouns are lemmatized to the special token -PRON-.
+
+Syntactic Dependency Parsing
+----------------------------
+
+The parser is trained on data produced by the ClearNLP converter. Details of
+the annotation scheme can be found here: 
+
+http://www.mathcs.emory.edu/~choi/doc/clear-dependency-2012.pdf
+
+Named Entity Recognition
+------------------------
+
+ +--------------+-----------------------------------------------------+
+ | PERSON       | People, including fictional                         |
+ +--------------+-----------------------------------------------------+
+ | NORP         | Nationalities or religious or political groups      |
+ +--------------+-----------------------------------------------------+
+ | FACILITY     | Buildings, airports, highways, bridges, etc.        |
+ +--------------+-----------------------------------------------------+
+ | ORGANIZATION | Companies, agencies, institutions, etc.             |
+ +--------------+-----------------------------------------------------+
+ | GPE          | Countries, cities, states                           |
+ +--------------+-----------------------------------------------------+
+ | LOCATION     | Non-GPE locations, mountain ranges, bodies of water |
+ +--------------+-----------------------------------------------------+
+ | PRODUCT      | Vehicles, weapons, foods, etc. (Not services)       |
+ +--------------+-----------------------------------------------------+
+ | EVENT        | Named hurricanes, battles, wars, sports events, etc.|
+ +--------------+-----------------------------------------------------+
+ | WORK OF ART  | Titles of books, songs, etc.                        |
+ +--------------+-----------------------------------------------------+
+ | LAW          | Named documents made into laws                      |
+ +--------------+-----------------------------------------------------+
+ | LANGUAGE     | Any named language                                  |
+ +--------------+-----------------------------------------------------+
+
+The following values are also annotated in a style similar to names:
+
+ +--------------+---------------------------------------------+
+ | DATE         | Absolute or relative dates or periods       |
+ +--------------+---------------------------------------------+
+ | TIME         | Times smaller than a day                    |
+ +--------------+---------------------------------------------+
+ | PERCENT      | Percentage (including “%”)                  |
+ +--------------+---------------------------------------------+
+ | MONEY        | Monetary values, including unit             |
+ +--------------+---------------------------------------------+
+ | QUANTITY     | Measurements, as of weight or distance      |
+ +--------------+---------------------------------------------+
+ | ORDINAL      | "first", "second"                           |
+ +--------------+---------------------------------------------+
+ | CARDINAL     | Numerals that do not fall under another type|
+ +--------------+---------------------------------------------+
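
Note on the Tokenization section: the whitespace rule and the alignment claim can be illustrated with a small stand-alone sketch. This is not spaCy's tokenizer. The helper names `tokenize_keep_whitespace` and `rebuild` are hypothetical, punctuation is not split off, and the rule is an assumption inferred from the example in the doc: one ordinary space after a token is treated as a separator, and any other run of whitespace becomes a token.

```python
import re

WHITESPACE_RUN = re.compile(r"\s+")
NON_WHITESPACE_RUN = re.compile(r"\S+")


def tokenize_keep_whitespace(text):
    """Split text into (token, has_trailing_space) pairs.

    Hypothetical helper for illustration only; it does not split punctuation.
    """
    pairs = []
    pos = 0
    while pos < len(text):
        # Every character is either whitespace or not, so one pattern matches.
        match = WHITESPACE_RUN.match(text, pos) or NON_WHITESPACE_RUN.match(text, pos)
        pos = match.end()
        # A single ordinary space after a token is a separator, not a token.
        has_space = pos < len(text) and text[pos] == " "
        if has_space:
            pos += 1
        pairs.append((match.group(), has_space))
    return pairs


def rebuild(pairs):
    """Reconstruct the original string from the (token, has_trailing_space) pairs."""
    return "".join(token + (" " if has_space else "") for token, has_space in pairs)


text = "Some\nspaces  and\ttab characters"
pairs = tokenize_keep_whitespace(text)
tokens = [token for token, _ in pairs]
print(tokens)
# ['Some', '\n', 'spaces', ' ', 'and', '\t', 'tab', 'characters']
print(rebuild(pairs) == text)
# True
```

Because every character of the input ends up either in a token or in a recorded trailing space, the round trip is lossless, which is the property the section describes.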
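
Note on the Part-of-speech Tagging section: mapping fine-grained Penn Treebank tags to the coarser universal tag set amounts to a table lookup. The sketch below covers only a handful of tags and follows the commonly published Penn-to-universal mapping. The table spaCy actually uses lives in the linked `pos.pyx` and may differ; `PENN_TO_UNIVERSAL` and `to_universal` are hypothetical names.

```python
# Partial, illustrative mapping; not copied from spaCy's source.
PENN_TO_UNIVERSAL = {
    "NN": "NOUN",
    "NNS": "NOUN",
    "VB": "VERB",
    "VBD": "VERB",
    "VBZ": "VERB",
    "JJ": "ADJ",
    "RB": "ADV",
    "DT": "DET",
    "IN": "ADP",
    "CD": "NUM",
    "PRP": "PRON",
}


def to_universal(penn_tag, default="X"):
    """Return the coarse tag for a Penn tag, or `default` for tags not in the table."""
    return PENN_TO_UNIVERSAL.get(penn_tag, default)


# Hand-tagged example sentence (the tags are supplied here, not predicted).
tagged = [("She", "PRP"), ("wrote", "VBD"), ("two", "CD"), ("letters", "NNS")]
coarse = [(word, to_universal(tag)) for word, tag in tagged]
print(coarse)
# [('She', 'PRON'), ('wrote', 'VERB'), ('two', 'NUM'), ('letters', 'NOUN')]
```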
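
Note on the Lemmatization section: the rules listed there, including the pronoun special case, can be sketched as a lookup with one override. The table below is a toy stand-in built from the examples in the doc, whereas the real data comes from WordNet. The function name `lemmatize`, the use of coarse tags as keys, and the fallback of returning the word unchanged are all assumptions made for illustration.

```python
PRONOUN_LEMMA = "-PRON-"

# Toy exception table keyed by (word, coarse tag); stands in for WordNet data.
LEMMA_TABLE = {
    ("happier", "ADJ"): "happy",
    ("happiest", "ADJ"): "happy",
    ("worse", "ADV"): "badly",
    ("worst", "ADV"): "badly",
    ("dogs", "NOUN"): "dog",
    ("children", "NOUN"): "child",
    ("writes", "VERB"): "write",
    ("writing", "VERB"): "write",
    ("wrote", "VERB"): "write",
    ("written", "VERB"): "write",
}


def lemmatize(word, coarse_tag):
    """Return the lemma for a word, given its coarse part-of-speech tag."""
    # Special case from the doc: every pronoun maps to one placeholder lemma.
    if coarse_tag == "PRON":
        return PRONOUN_LEMMA
    # Fall back to the word itself when the toy table has no entry.
    return LEMMA_TABLE.get((word.lower(), coarse_tag), word)


print(lemmatize("children", "NOUN"))  # child
print(lemmatize("wrote", "VERB"))     # write
print(lemmatize("She", "PRON"))       # -PRON-
```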