include ../../_includes/_mixins

p
    |  This tutorial describes how to train new statistical models for spaCy's
    |  part-of-speech tagger, named entity recognizer and dependency parser.

p
    |  I'll start with some quick code examples that show how to train each
    |  model. I'll then provide a bit of background about the algorithms,
    |  and explain how the data and feature templates work.

+h(2, "train-pos-tagger") Training the part-of-speech tagger

+code.
    from spacy.vocab import Vocab
    from spacy.tagger import Tagger
    from spacy.tokens import Doc
    from spacy.gold import GoldParse


    vocab = Vocab(tag_map={'N': {'pos': 'NOUN'}, 'V': {'pos': 'VERB'}})
    tagger = Tagger(vocab)

    doc = Doc(vocab, words=['I', 'like', 'stuff'])
    gold = GoldParse(doc, tags=['N', 'V', 'N'])
    tagger.update(doc, gold)

    tagger.model.end_training()

p
    +button(gh("spaCy", "examples/training/train_tagger.py"), false, "secondary") Full example

+h(2, "train-entity") Training the named entity recognizer

+code.
    from spacy.vocab import Vocab
    from spacy.pipeline import EntityRecognizer
    from spacy.tokens import Doc

    vocab = Vocab()
    entity = EntityRecognizer(vocab, entity_types=['PERSON', 'LOC'])

    doc = Doc(vocab, words=['Who', 'is', 'Shaka', 'Khan', '?'])
    entity.update(doc, ['O', 'O', 'B-PERSON', 'L-PERSON', 'O'])

    entity.model.end_training()

p
    +button(gh("spaCy", "examples/training/train_ner.py"), false, "secondary") Full example
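
p
    |  The tags passed to #[code entity.update] above follow the BILUO scheme:
    |  #[code B] marks the first token of an entity, #[code I] a token inside
    |  it, #[code L] its last token, #[code U] a single-token entity and
    |  #[code O] a token outside any entity. As a rough sketch of how token
    |  spans map onto those tags, here is a small helper.
    |  #[code spans_to_biluo] is a hypothetical name used for illustration
    |  only. It is not part of spaCy.

```python
def spans_to_biluo(n_tokens, spans):
    """Convert (start, end, label) token spans into BILUO tags.

    Hypothetical helper, not a spaCy function. `end` is exclusive.
    """
    tags = ['O'] * n_tokens
    for start, end, label in spans:
        if end - start == 1:
            tags[start] = 'U-' + label        # single-token entity
        else:
            tags[start] = 'B-' + label        # first token of the entity
            for i in range(start + 1, end - 1):
                tags[i] = 'I-' + label        # tokens inside the entity
            tags[end - 1] = 'L-' + label      # last token of the entity
    return tags


words = ['Who', 'is', 'Shaka', 'Khan', '?']
tags = spans_to_biluo(len(words), [(2, 4, 'PERSON')])
print(tags)
```

p
    |  For the sentence above this yields the same tag sequence that is
    |  passed to #[code entity.update].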

+h(2, "train-parser") Training the dependency parser

+code.
    from spacy.vocab import Vocab
    from spacy.pipeline import DependencyParser
    from spacy.tokens import Doc

    vocab = Vocab()
    parser = DependencyParser(vocab, labels=['nsubj', 'compound', 'dobj', 'punct'])

    doc = Doc(vocab, words=['Who', 'is', 'Shaka', 'Khan', '?'])
    parser.update(doc, [(1, 'nsubj'), (1, 'ROOT'), (3, 'compound'), (1, 'dobj'),
                        (1, 'punct')])

    parser.model.end_training()

p
    +button(gh("spaCy", "examples/training/train_parser.py"), false, "secondary") Full example
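
p
    |  One way to read the annotations in the example above: the tuple at
    |  position #[code i] gives the index of the head of token #[code i] and
    |  the label of the arc, and the root token points to itself. The sketch
    |  below spells the arcs out under that assumption.
    |  #[code describe_arcs] is a hypothetical helper, not a spaCy function.

```python
def describe_arcs(words, annotations):
    """Return (child, label, head) triples; the head is None for the root.

    Assumes each annotation is (head_index, label) for the token at the
    same position, and that the root token is its own head.
    """
    arcs = []
    for i, (head, label) in enumerate(annotations):
        head_word = None if head == i else words[head]
        arcs.append((words[i], label, head_word))
    return arcs


words = ['Who', 'is', 'Shaka', 'Khan', '?']
annotations = [(1, 'nsubj'), (1, 'ROOT'), (3, 'compound'), (1, 'dobj'),
               (1, 'punct')]
arcs = describe_arcs(words, annotations)
for child, label, head in arcs:
    print(child, label, head)
```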

+h(2, 'feature-templates') Customizing the feature extraction

p
    |  spaCy currently uses linear models for the tagger, parser and entity
    |  recognizer, with weights learned using the
    |  #[+a("https://explosion.ai/blog/part-of-speech-pos-tagger-in-python") Averaged Perceptron algorithm].

+aside("Linear Model Feature Scheme")
    |  For a list of the available feature atoms, see the #[+a("/docs/api/features") Linear Model Feature Scheme].
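
p
    |  To give a feel for the learning algorithm, here is a minimal
    |  pure-Python sketch of an averaged perceptron. It is not spaCy's
    |  implementation, which differs in many details. The class, its
    |  attributes and the toy features are made up for illustration, and the
    |  #[code end_training] method is assumed to play the same role as the
    |  #[code end_training()] calls in the examples above, namely replacing
    |  each weight by its average over all training steps.

```python
class AveragedPerceptron(object):
    """Illustrative sketch only; names and layout are assumptions."""

    def __init__(self, classes):
        self.classes = sorted(classes)
        self.weights = {}    # feature -> {class: weight}
        self._totals = {}    # (feature, class) -> accumulated weight
        self._stamps = {}    # (feature, class) -> step of the last change
        self.i = 0           # number of training steps seen

    def predict(self, features):
        scores = dict((c, 0.0) for c in self.classes)
        for feat in features:
            for clas, weight in self.weights.get(feat, {}).items():
                scores[clas] += weight
        # Break ties deterministically by class name
        return max(self.classes, key=lambda c: (scores[c], c))

    def update(self, truth, guess, features):
        self.i += 1
        if truth == guess:
            return
        for feat in features:
            self._adjust(feat, truth, 1.0)    # reward the correct class
            self._adjust(feat, guess, -1.0)   # penalise the wrong guess

    def _adjust(self, feat, clas, delta):
        key = (feat, clas)
        by_class = self.weights.setdefault(feat, {})
        old = by_class.get(clas, 0.0)
        # Credit the old weight for every step it stayed unchanged
        steps = self.i - self._stamps.get(key, 0)
        self._totals[key] = self._totals.get(key, 0.0) + steps * old
        self._stamps[key] = self.i
        by_class[clas] = old + delta

    def end_training(self):
        if self.i == 0:
            return
        for feat, by_class in self.weights.items():
            for clas in list(by_class):
                key = (feat, clas)
                steps = self.i - self._stamps.get(key, 0)
                total = self._totals.get(key, 0.0) + steps * by_class[clas]
                by_class[clas] = total / float(self.i)


# Toy data: (features, tag) pairs with made-up feature names
train = [
    (['word=I'], 'N'),
    (['word=like', 'prev=I'], 'V'),
    (['word=stuff', 'prev=like'], 'N'),
]
model = AveragedPerceptron(['N', 'V'])
for epoch in range(5):
    for features, tag in train:
        guess = model.predict(features)
        model.update(tag, guess, features)
model.end_training()
predictions = [model.predict(features) for features, _ in train]
print(predictions)
```

p
    |  Averaging makes the final weights less sensitive to the last few
    |  updates, which is why the averaging step is run once training is
    |  complete.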

p
    |  A linear model scores each feature independently, so it's important
    |  for accuracy to build conjunction features out of the atomic
    |  predictors. Let's say you have two atomic predictors asking, "What is
    |  the part-of-speech of the previous token?", and "What is the
    |  part-of-speech of the token before that?". These predictors will
    |  introduce a number of features, e.g. #[code Prev-pos=NN],
    |  #[code Prev-pos=VBZ], etc. A conjunction template introduces features
    |  such as #[code Prev-pos=NN&Prev-prev-pos=VBZ].
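
p
    |  As a small sketch of the idea, the snippet below builds feature
    |  strings from a dictionary of atomic predictor values and a list of
    |  templates. The predictor names and the #[code extract_features]
    |  helper are hypothetical. The real models work with integer IDs
    |  rather than strings.

```python
def extract_features(atoms, templates):
    """Build one feature string per template.

    atoms: dict mapping a predictor name to its value for this token.
    templates: list of tuples of predictor names to conjoin.
    Hypothetical helper for illustration only.
    """
    features = []
    for template in templates:
        parts = ['%s=%s' % (name, atoms[name]) for name in template]
        features.append('&'.join(parts))
    return features


atoms = {'Prev-pos': 'NN', 'Prev-prev-pos': 'VBZ'}
templates = [
    ('Prev-pos',),                   # atomic feature
    ('Prev-prev-pos',),              # atomic feature
    ('Prev-pos', 'Prev-prev-pos'),   # conjunction of the two
]
features = extract_features(atoms, templates)
print(features)
```

p
    |  The conjunction gets its own weight, so the model can learn that a
    |  particular pair of tags is informative even when neither tag is
    |  informative on its own.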

p
    |  The feature extraction proceeds in two passes. In the first pass, we
    |  fill an array with the values of all of the atomic predictors. In the
    |  second pass, we iterate over the feature templates, and fill a small
    |  temporary array with the predictors that will be combined into a
    |  conjunction feature. Finally, we hash this array into a 64-bit integer,
    |  using the MurmurHash algorithm. You can see this at work in the
    |  #[+a(gh("thinc", "thinc/linear/features.pyx", "94dbe06fd3c8f24d86ab0f5c7984e52dbfcdc6cb")) #[code thinc.linear.features]] module.
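
p
    |  The two passes can be sketched in pure Python as follows. This is an
    |  illustration, not the code in #[code thinc]: the slot names and
    |  helper functions are hypothetical, #[code hashlib.blake2b] stands in
    |  for MurmurHash (which is not in the standard library), and mixing the
    |  template index into the hashed buffer is a choice made for this
    |  sketch so that different templates over equal values get different
    |  keys.

```python
import hashlib
import struct

# Slot numbers for the atomic predictors (hypothetical names)
P2_POS, P1_POS, W_ORTH = range(3)


def fill_atoms(word_ids, tag_ids, i):
    """First pass: evaluate every atomic predictor for token i.

    Values are integer IDs; 0 stands for "no such token".
    """
    atoms = [0] * 3
    atoms[P2_POS] = tag_ids[i - 2] if i >= 2 else 0
    atoms[P1_POS] = tag_ids[i - 1] if i >= 1 else 0
    atoms[W_ORTH] = word_ids[i]
    return atoms


def hash64(values):
    """Hash a sequence of integers into one unsigned 64-bit integer.

    Uses blake2b as a stand-in for MurmurHash (requires Python 3.6+).
    """
    data = struct.pack('<%dq' % len(values), *values)
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return struct.unpack('<Q', digest)[0]


def extract(atoms, templates):
    """Second pass: one hashed feature ID per template."""
    feature_ids = []
    for template_id, template in enumerate(templates):
        # Small temporary buffer holding the atoms this template combines
        buffer = [template_id] + [atoms[slot] for slot in template]
        feature_ids.append(hash64(buffer))
    return feature_ids


word_ids = [11, 12, 13]   # made-up IDs for three words
tag_ids = [1, 2, 1]       # made-up IDs for their tags
atoms = fill_atoms(word_ids, tag_ids, 2)
templates = [(P1_POS,), (P2_POS,), (P1_POS, P2_POS)]
feature_ids = extract(atoms, templates)
print(atoms, feature_ids)
```

p
    |  Hashing means the model never needs a dictionary of feature strings.
    |  The 64-bit key is used directly to look up the feature's weights.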

p
    |  It's very easy to change the feature templates, to create novel
    |  combinations of the existing atomic predictors. There's currently no API
    |  available to add new atomic predictors, though. You'll have to create a
    |  subclass of the model, and write your own #[code set_featuresC] method.

p
    |  The feature templates are passed in using the #[code features] keyword
    |  argument to the constructors of the #[+api("tagger") #[code Tagger]],
    |  #[+api("dependencyparser") #[code DependencyParser]] and
    |  #[+api("entityrecognizer") #[code EntityRecognizer]]:

+code.
    from spacy.vocab import Vocab
    from spacy.pipeline import Tagger
    from spacy.tagger import P2_orth, P1_orth
    from spacy.tagger import P2_cluster, P1_cluster, W_orth, N1_orth, N2_orth

    vocab = Vocab(tag_map={'N': {'pos': 'NOUN'}, 'V': {'pos': 'VERB'}})
    tagger = Tagger(vocab, features=[(P2_orth, P2_cluster), (P1_orth, P1_cluster),
                                     (P2_orth,), (P1_orth,), (W_orth,),
                                     (N1_orth,), (N2_orth,)])

p
    |  Custom feature templates can be passed to the #[code DependencyParser]
    |  and #[code EntityRecognizer] in the same way, using the
    |  #[code features] keyword argument of the constructor.