diff --git a/docs/redesign/spec.jade b/docs/redesign/spec.jade
new file mode 100644
index 000000000..a61a4f356
--- /dev/null
+++ b/docs/redesign/spec.jade
@@ -0,0 +1,123 @@
+extends ./outline.jade
+
+mixin columns(...names)
+    tr
+        each name in names
+            th= name
+
+
+mixin row(...cells)
+    tr
+        each cell in cells
+            td= cell
+
+
+block body_block
+    article(class="page docs-page")
+        p.
+            This document describes the target annotations spaCy is trained to
+            predict. This is currently a work in progress. Please ask questions
+            on the issue tracker, so that the answers can be integrated here to
+            improve the documentation.
+
+        h2 Tokenization
+
+        p Tokenization standards are based on the OntoNotes 5 corpus.
+
+        p.
+            The tokenizer differs from most by including tokens for significant
+            whitespace. Any sequence of whitespace characters beyond a single
+            space (' ') is included as a token. For instance:
+
+        pre.language-python
+            code
+                | from spacy.en import English
+                | nlp = English(parse=False)
+                | tokens = nlp('Some\nspaces  and\ttab characters')
+                | print([t.orth_ for t in tokens])
+
+        p Which produces:
+
+        pre.language-python
+            code
+                | ['Some', '\n', 'spaces', ' ', 'and', '\t', 'tab', 'characters']
+
+        p.
+            The whitespace tokens are useful for much the same reason punctuation
+            is – it's often an important delimiter in the text. By preserving
+            them in the token output, we are able to maintain a simple alignment
+            between the tokens and the original string, and we ensure that no
+            information is lost during processing.
+
+        h3 Sentence boundary detection
+
+        p.
+            Sentence boundaries are calculated from the syntactic parse tree, so
+            features such as punctuation and capitalisation play an important but
+            non-decisive role in determining the sentence boundaries. Usually this
+            means that the sentence boundaries will at least coincide with clause
+            boundaries, even given poorly punctuated text.
+
+        h3 Part-of-speech Tagging
+
+        p.
+            The part-of-speech tagger uses the OntoNotes 5 version of the Penn
+            Treebank tag set. We also map the tags to the simpler Google
+            Universal POS Tag set. Details here:
+            https://github.com/honnibal/spaCy/blob/master/spacy/en/pos.pyx#L124
+
+        h3 Lemmatization
+
+        p.
+            A "lemma" is the uninflected form of a word. In English, this means:
+
+        ul
+            li Adjectives: The form like "happy", not "happier" or "happiest"
+            li Adverbs: The form like "badly", not "worse" or "worst"
+            li Nouns: The form like "dog", not "dogs"; like "child", not "children"
+            li Verbs: The form like "write", not "writes", "writing", "wrote" or "written"
+
+        p.
+            The lemmatization data is taken from WordNet. However, we also add a
+            special case for pronouns: all pronouns are lemmatized to the special
+            token -PRON-.
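+
+        p.
+            The sketch below shows how these annotations can be inspected. The
+            tag_, pos_ and lemma_ attribute names are assumed here, and the
+            expected output noted in the comments depends on the model:
+
+        pre.language-python
+            code
+                | from spacy.en import English
+                | nlp = English(parse=False)
+                | tokens = nlp('She wrote about the children')
+                | for t in tokens:
+                |     print(t.orth_, t.tag_, t.pos_, t.lemma_)
+                | # 'She' should be lemmatized to '-PRON-', 'wrote' to 'write',
+                | # and 'children' to 'child'; t.tag_ gives the Treebank tag
+                | # (e.g. 'VBD' for 'wrote') and t.pos_ the universal tag.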
(Not services") + +row("EVENT", "Named hurricanes, battles, wars, sports events, etc.") + +row("WORK_OF_ART", "Titles of books, songs, etc.") + +row("LAW", "Named documents made into laws") + +row("LANGUAGE", "Any named language") + + p The following values are also annotated in a style similar to names: + + table + thead + +columns("Entity Type", "Description") + + tbody + +row("DATE", "Absolute or relative dates or periods") + +row("TIME", "Times smaller than a day") + +row("PERCENT", 'Percentage (including “%”)') + +row("MONEY", "Monetary values, including unit") + +row("QUANTITY", "Measurements, as of weight or distance") + +row("ORDINAL", 'first", "second"') + +row("CARDINAL", "Numerals that do not fall under another type")