
//- 💫 DOCS > API > ANNOTATION SPECS

include ../_includes/_mixins

p This document describes the target annotations spaCy is trained to predict.


+section("tokenization")
    +h(2, "tokenization") Tokenization

    p
        |  Tokenization standards are based on the
        |  #[+a("https://catalog.ldc.upenn.edu/LDC2013T19") OntoNotes 5] corpus.
        |  Unlike most tokenizers, spaCy's also produces tokens for significant
        |  whitespace: any sequence of whitespace characters beyond a single
        |  space (#[code ' ']) is included as a token.

    +aside-code("Example").
        from spacy.lang.en import English
        nlp = English()
        tokens = nlp('Some\nspaces  and\ttab characters')
        tokens_text = [t.text for t in tokens]
        assert tokens_text == ['Some', '\n', 'spaces', ' ', 'and',
                            '\t', 'tab', 'characters']

    p
        |  Whitespace tokens are useful for much the same reason punctuation
        |  tokens are: whitespace is often an important delimiter in the text.
        |  By preserving it in the token output, we maintain a simple alignment
        |  between the tokens and the original string, and we ensure that no
        |  information is lost during processing.
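
    p
        |  The whitespace rule and the alignment property can be sketched in a
        |  few lines of plain Python. This is a toy illustration only, not
        |  spaCy's tokenizer: the helper #[code toy_tokenize] is hypothetical,
        |  it ignores punctuation and special cases, and it assumes that a
        |  single space directly after a word is stored as that word's trailing
        |  space while any remaining whitespace becomes a token.

    +code("Sketch: whitespace tokens and alignment").
        import re


        def toy_tokenize(text):
            """Return (token, has_trailing_space) pairs for a string."""
            tokens = []
            # Walk over alternating runs of non-whitespace and whitespace.
            for match in re.finditer(r'\S+|\s+', text):
                chunk = match.group()
                if chunk.isspace() and tokens and chunk[0] == ' ':
                    # One space right after a word belongs to that word
                    # instead of becoming a token of its own.
                    tokens[-1] = (tokens[-1][0], True)
                    chunk = chunk[1:]
                if chunk:
                    tokens.append((chunk, False))
            return tokens


        text = 'Some\nspaces  and\ttab characters'
        tokens = toy_tokenize(text)
        assert [t for t, _ in tokens] == ['Some', '\n', 'spaces', ' ', 'and',
                                          '\t', 'tab', 'characters']
        # Nothing is lost: joining the pieces restores the original string.
        assert ''.join(t + (' ' if ws else '') for t, ws in tokens) == text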

+section("sbd")
    +h(2, "sentence-boundary") Sentence boundary detection

    p
        |  Sentence boundaries are derived from the syntactic parse tree, so
        |  features such as punctuation and capitalization play an important but
        |  non-decisive role in determining them. In practice, this means that
        |  sentence boundaries will at least coincide with clause boundaries,
        |  even in poorly punctuated text.
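
    p
        |  One way to picture this: every token belongs to the sentence of the
        |  root it ultimately attaches to. The sketch below is a simplified
        |  illustration, not spaCy's implementation. The function
        |  #[code sentence_spans] and the hand-written parse are hypothetical,
        |  and the sketch assumes that each head is an absolute token index,
        |  that a root is its own head, and that each sentence covers a
        |  contiguous span of tokens.

    +code("Sketch: sentence spans from head indices").
        def sentence_spans(heads):
            """Return (start, end) token spans, one per root, end exclusive."""
            def find_root(i):
                # Follow head links until reaching a token that heads itself.
                while heads[i] != i:
                    i = heads[i]
                return i

            spans = []
            start = 0
            for i in range(1, len(heads) + 1):
                # Close the span at the end of the input, or when the next
                # token attaches to a different root.
                if i == len(heads) or find_root(i) != find_root(start):
                    spans.append((start, i))
                    start = i
            return spans


        # Unpunctuated text, with a parse annotated by hand for this example.
        words = ['I', 'like', 'cats', 'she', 'likes', 'dogs']
        heads = [1, 1, 1, 4, 4, 4]
        assert sentence_spans(heads) == [(0, 3), (3, 6)]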

+section("pos-tagging")
    +h(2, "pos-tagging") Part-of-speech Tagging

    +aside("Tip: Understanding tags")
        |  You can also use #[code spacy.explain()] to get the description for the
        |  string representation of a tag. For example,
        |  #[code spacy.explain("RB")] will return "adverb".

    include _annotation/_pos-tags

+section("lemmatization")
    +h(2, "lemmatization") Lemmatization

    p A "lemma" is the uninflected form of a word. In English, this means:

    +list
        +item #[strong Adjectives]: The form like "happy", not "happier" or "happiest"
        +item #[strong Adverbs]: The form like "badly", not "worse" or "worst"
        +item #[strong Nouns]: The form like "dog", not "dogs"; like "child", not "children"
        +item #[strong Verbs]: The form like "write", not "writes", "writing", "wrote" or "written"

    p
        |  The lemmatization data is taken from
        |  #[+a("https://wordnet.princeton.edu") WordNet]. Pronouns are handled
        |  as a special case: all pronouns are lemmatized to the special token
        |  #[code -PRON-].

    +infobox("About spaCy's custom pronoun lemma")
        |  Unlike verbs and common nouns, there's no clear base form of a personal
        |  pronoun. Should the lemma of "me" be "I", or should we normalize person
        |  as well, giving "it" — or maybe "he"? spaCy's solution is to introduce a
        |  novel symbol, #[code -PRON-], which is used as the lemma for
        |  all personal pronouns.
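
    p
        |  The lookup behaviour described above can be sketched as follows.
        |  This is a minimal illustration under stated assumptions, not spaCy's
        |  lemmatizer: #[code LEMMA_TABLE] and #[code toy_lemma] are
        |  hypothetical, the table holds only the examples from this section
        |  rather than WordNet data, and falling back to the lowercased word is
        |  an assumption made for the sketch.

    +code("Sketch: lemma lookup with the pronoun special case").
        # Hypothetical table keyed by (part of speech, inflected form).
        LEMMA_TABLE = {
            ('ADJ', 'happier'): 'happy', ('ADJ', 'happiest'): 'happy',
            ('ADV', 'worse'): 'badly', ('ADV', 'worst'): 'badly',
            ('NOUN', 'dogs'): 'dog', ('NOUN', 'children'): 'child',
            ('VERB', 'writes'): 'write', ('VERB', 'writing'): 'write',
            ('VERB', 'wrote'): 'write', ('VERB', 'written'): 'write',
        }


        def toy_lemma(word, pos):
            """Look up a lemma; pronouns map to the -PRON- symbol."""
            if pos == 'PRON':
                return '-PRON-'
            # Assumption: unknown words fall back to their lowercased form.
            return LEMMA_TABLE.get((pos, word.lower()), word.lower())


        assert toy_lemma('written', 'VERB') == 'write'
        assert toy_lemma('children', 'NOUN') == 'child'
        assert toy_lemma('me', 'PRON') == '-PRON-'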

+section("dependency-parsing")
    +h(2, "dependency-parsing") Syntactic Dependency Parsing

    +aside("Tip: Understanding labels")
        |  You can also use #[code spacy.explain()] to get the description for the
        |  string representation of a label. For example,
        |  #[code spacy.explain("prt")] will return "particle".

    include _annotation/_dep-labels

+section("named-entities")
    +h(2, "named-entities") Named Entity Recognition

    +aside("Tip: Understanding entity types")
        |  You can also use #[code spacy.explain()] to get the description for the
        |  string representation of an entity label. For example,
        |  #[code spacy.explain("LANGUAGE")] will return "any named language".

    include _annotation/_named-entities

    +h(3, "biluo") BILUO Scheme

    include _annotation/_biluo
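
    p
        |  As a rough illustration of the scheme, the sketch below converts
        |  token-level entity spans into BILUO tags (Begin, In, Last, Unit,
        |  Out). The function #[code spans_to_biluo] is hypothetical and is not
        |  part of spaCy's API. It assumes non-empty, non-overlapping spans
        |  given as token indices with an exclusive end.

    +code("Sketch: entity spans to BILUO tags").
        def spans_to_biluo(n_tokens, entities):
            """Tag n_tokens tokens from (start, end, label) entity spans."""
            tags = ['O'] * n_tokens
            for start, end, label in entities:
                if end - start == 1:
                    # A single-token entity is a Unit.
                    tags[start] = 'U-' + label
                else:
                    tags[start] = 'B-' + label
                    for i in range(start + 1, end - 1):
                        tags[i] = 'I-' + label
                    tags[end - 1] = 'L-' + label
            return tags


        words = ['I', 'flew', 'to', 'San', 'Francisco', 'Valley', 'via', 'London']
        entities = [(3, 6, 'LOC'), (7, 8, 'GPE')]
        assert spans_to_biluo(len(words), entities) == [
            'O', 'O', 'O', 'B-LOC', 'I-LOC', 'L-LOC', 'O', 'U-GPE']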

+section("training")
    +h(2, "json-input") JSON input format for training

    +under-construction

    p spaCy takes training data in the following format:

    +code("Example structure").
        doc: {
            id: string,
            paragraphs: [{
                raw: string,
                sents: [int],
                tokens: [{
                    start: int,
                    tag: string,
                    head: int,
                    dep: string
                }],
                ner: [{
                    start: int,
                    end: int,
                    label: string
                }],
                brackets: [{
                    start: int,
                    end: int,
                    label: string
                }]
            }]
        }
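
    p
        |  To make the structure more concrete, the sketch below builds one
        |  object with the fields shown above and checks the field types. It is
        |  an illustration only: #[code check_doc] and #[code has_fields] are
        |  hypothetical helpers, and the integer values are placeholders chosen
        |  to satisfy the types. How spaCy interprets each offset is not
        |  specified in this section, so treating them as character offsets
        |  here is an assumption.

    +code("Sketch: building and type-checking an example object").
        import json

        TOKEN_FIELDS = {'start': int, 'tag': str, 'head': int, 'dep': str}
        SPAN_FIELDS = {'start': int, 'end': int, 'label': str}


        def has_fields(obj, fields):
            """Check that obj has every listed key with the listed type."""
            return all(isinstance(obj.get(k), t) for k, t in fields.items())


        def check_doc(doc):
            """Check types only; missing lists are treated as empty."""
            if not isinstance(doc.get('id'), str):
                return False
            for para in doc.get('paragraphs', []):
                spans = para.get('ner', []) + para.get('brackets', [])
                ok = (isinstance(para.get('raw'), str)
                      and all(isinstance(i, int) for i in para.get('sents', []))
                      and all(has_fields(t, TOKEN_FIELDS)
                              for t in para.get('tokens', []))
                      and all(has_fields(s, SPAN_FIELDS) for s in spans))
                if not ok:
                    return False
            return True


        # Placeholder values; offsets are assumed to be character offsets.
        doc = {
            'id': 'example-0',
            'paragraphs': [{
                'raw': 'London calls',
                'sents': [0],
                'tokens': [
                    {'start': 0, 'tag': 'NNP', 'head': 1, 'dep': 'nsubj'},
                    {'start': 7, 'tag': 'VBZ', 'head': 1, 'dep': 'ROOT'},
                ],
                'ner': [{'start': 0, 'end': 6, 'label': 'GPE'}],
                'brackets': [],
            }],
        }
        # The object survives a JSON round trip with its types intact.
        assert check_doc(json.loads(json.dumps(doc)))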