spaCy/website/docs/usage/training-ner.jade

include ../../_includes/_mixins

p
    |  All #[+a("/docs/usage/models") spaCy models] support online learning, so
    |  you can update a pre-trained model with new examples. You can even add
    |  new classes to an existing model, to recognise a new entity type,
    |  part-of-speech, or syntactic relation. Updating an existing model is
    |  particularly useful as a "quick and dirty solution", if you have only a
    |  few corrections or annotations.

+h(2, "improving-accuracy") Improving accuracy on existing entity types

p
    |  To update the model, you first need to create an instance of
    |  #[+api("goldparse") #[code spacy.gold.GoldParse]], with the entity labels
    |  you want to learn. You will then pass this instance to the
    |  #[+api("entityrecognizer#update") #[code EntityRecognizer.update()]]
    |  method. For example:

+code.
    import spacy
    from spacy.gold import GoldParse

    nlp = spacy.load('en')
    doc = nlp.make_doc(u'Facebook released React in 2014')
    gold = GoldParse(doc, entities=['U-ORG', 'O', 'U-TECHNOLOGY', 'O', 'U-DATE'])
    nlp.entity.update(doc, gold)

p
    |  You'll usually need to provide many examples to meaningfully improve the
    |  system — a few hundred is a good start, although more is better. You
    |  should avoid iterating over the same few examples multiple times, or the
    |  model is likely to "forget" how to annotate other examples. If you
    |  iterate over the same few examples, you're effectively changing the loss
    |  function. The optimizer will find a way to minimize the loss on your
    |  examples, without regard for the consequences on the examples it's no
    |  longer paying attention to.

p
    |  One way to avoid this "catastrophic forgetting" problem is to "remind"
    |  the model of other examples by augmenting your annotations with sentences
    |  annotated with entities automatically recognised by the original model.
    |  Ultimately, this is an empirical process: you'll need to
    |  #[strong experiment on your own data] to find a solution that works best
    |  for you.

+h(2, "adding") Adding a new entity type

p
    |  You can add new entity types to an existing model. Let's say we want to
    |  recognise the category #[code TECHNOLOGY]. The new category will include
    |  programming languages, frameworks and platforms. First, we need to
    |  register the new entity type:

+code.
    nlp.entity.add_label('TECHNOLOGY')

p
    |  Next, iterate over your examples, calling #[code entity.update()]. As
    |  above, we want to avoid iterating over only a small number of sentences.
    |  A useful compromise is to run the model over a number of plain-text
    |  sentences, and pass the entities to #[code GoldParse], as "true"
    |  annotations. This encourages the optimizer to find a solution that
    |  predicts the new category with minimal difference from the previous
    |  output.

+h(2, "example") Example: Adding and training an #[code ANIMAL] entity

+under-construction

p
    |  This script shows how to add a new entity type to an existing pre-trained
    |  NER model. To keep the example short and simple, only four sentences are
    |  provided as examples. In practice, you'll need many more —
    |  #[strong a few hundred] would be a good start. You will also likely need
    |  to mix in #[strong examples of other entity types], which might be
    |  obtained by running the entity recognizer over unlabelled sentences, and
    |  adding their annotations to the training set.

p
    |  For the full, runnable script of this example, see
    |  #[+src(gh("spacy", "examples/training/train_new_entity_type.py")) train_new_entity_type.py].

+code("Training the entity recognizer").
    import spacy
    from spacy.pipeline import EntityRecognizer
    from spacy.gold import GoldParse
    from spacy.tagger import Tagger
    import random

    model_name = 'en'
    entity_label = 'ANIMAL'
    output_directory = '/path/to/model'
    train_data = [
        ("Horses are too tall and they pretend to care about your feelings",
        [(0, 6, 'ANIMAL')]),
        ("horses are too tall and they pretend to care about your feelings",
        [(0, 6, 'ANIMAL')]),
        ("horses pretend to care about your feelings",
        [(0, 6, 'ANIMAL')]),
        ("they pretend to care about your feelings, those horses",
        [(48, 54, 'ANIMAL')])
    ]

    nlp = spacy.load(model_name)
    nlp.entity.add_label(entity_label)
    ner = train_ner(nlp, train_data, output_directory)

    def train_ner(nlp, train_data, output_dir):
        # Add new words to vocab
        for raw_text, _ in train_data:
            doc = nlp.make_doc(raw_text)
            for word in doc:
                _ = nlp.vocab[word.orth]

        for itn in range(20):
            random.shuffle(train_data)
            for raw_text, entity_offsets in train_data:
                gold = GoldParse(doc, entities=entity_offsets)
                doc = nlp.make_doc(raw_text)
                nlp.tagger(doc)
                loss = nlp.entity.update(doc, gold)
        nlp.save_to_directory(output_dir)

p
    +button(gh("spaCy", "examples/training/train_new_entity_type.py"), false, "secondary") Full example

p
    |  The actual training is performed by looping over the examples, and
    |  calling #[code nlp.entity.update()]. The #[code update()] method steps
    |  through the words of the input. At each word, it makes a prediction. It
    |  then consults the annotations provided on the #[code GoldParse] instance,
    |  to see whether it was right. If it was wrong, it adjusts its weights so
    |  that the correct action will score higher next time.

p
    |  After training your model, you can
    |  #[+a("/docs/usage/saving-loading") save it to a directory]. We recommend
    |  wrapping models as Python packages, for ease of deployment.

+h(2, "saving-loading") Saving and loading

p
    |  After training our model, you'll usually want to save its state, and load
    |  it back later. You can do this with the
    |  #[+api("language#to_disk") #[code Language.to_disk()]] method:

+code.
    nlp.to_disk('/home/me/data/en_technology')

p
    |  To make the model more convenient to deploy, we recommend wrapping it as
    |  a Python package, so that you can install it via pip and load it as a
    |  module. spaCy comes with a handy #[+api("cli#package") #[code package]]
    |  CLI command to create all required files and directories.

+code(false, "bash").
    python -m spacy package /home/me/data/en_technology /home/me/my_models

p
    |  To build the package and create a #[code .tar.gz] archive, run
    |  #[code python setup.py sdist] from within its directory.

+infobox("Saving and loading models")
    |  For more information and a detailed guide on how to package your model,
    |  see the documentation on
    |  #[+a("/docs/usage/saving-loading#models") saving and loading models].
Add NER training docs 2017-04-16 18:35:47 +00:00			`include ../../_includes/_mixins`

			`p`
			`\| All #[+a("/docs/usage/models") spaCy models] support online learning, so`
			`\| you can update a pre-trained model with new examples. You can even add`
			`\| new classes to an existing model, to recognise a new entity type,`
			`\| part-of-speech, or syntactic relation. Updating an existing model is`
			`\| particularly useful as a "quick and dirty solution", if you have only a`
			`\| few corrections or annotations.`

			`+h(2, "improving-accuracy") Improving accuracy on existing entity types`

			`p`
			`\| To update the model, you first need to create an instance of`
			`\| #[+api("goldparse") #[code spacy.gold.GoldParse]], with the entity labels`
			`\| you want to learn. You will then pass this instance to the`
			`\| #[+api("entityrecognizer#update") #[code EntityRecognizer.update()]]`
			`\| method. For example:`

			`+code.`
			`import spacy`
			`from spacy.gold import GoldParse`

			`nlp = spacy.load('en')`
			`doc = nlp.make_doc(u'Facebook released React in 2014')`
			`gold = GoldParse(doc, entities=['U-ORG', 'O', 'U-TECHNOLOGY', 'O', 'U-DATE'])`
			`nlp.entity.update(doc, gold)`

			`p`
			`\| You'll usually need to provide many examples to meaningfully improve the`
			`\| system — a few hundred is a good start, although more is better. You`
			`\| should avoid iterating over the same few examples multiple times, or the`
			`\| model is likely to "forget" how to annotate other examples. If you`
			`\| iterate over the same few examples, you're effectively changing the loss`
			`\| function. The optimizer will find a way to minimize the loss on your`
			`\| examples, without regard for the consequences on the examples it's no`
			`\| longer paying attention to.`

			`p`
			`\| One way to avoid this "catastrophic forgetting" problem is to "remind"`
			`\| the model of other examples by augmenting your annotations with sentences`
			`\| annotated with entities automatically recognised by the original model.`
			`\| Ultimately, this is an empirical process: you'll need to`
			`\| #[strong experiment on your own data] to find a solution that works best`
			`\| for you.`

			`+h(2, "adding") Adding a new entity type`

			`p`
			`\| You can add new entity types to an existing model. Let's say we want to`
			`\| recognise the category #[code TECHNOLOGY]. The new category will include`
			`\| programming languages, frameworks and platforms. First, we need to`
			`\| register the new entity type:`

			`+code.`
			`nlp.entity.add_label('TECHNOLOGY')`

			`p`
			`\| Next, iterate over your examples, calling #[code entity.update()]. As`
			`\| above, we want to avoid iterating over only a small number of sentences.`
			`\| A useful compromise is to run the model over a number of plain-text`
			`\| sentences, and pass the entities to #[code GoldParse], as "true"`
			`\| annotations. This encourages the optimizer to find a solution that`
			`\| predicts the new category with minimal difference from the previous`
			`\| output.`

			`+h(2, "example") Example: Adding and training an #[code ANIMAL] entity`

Update usage docs and ddd "under construction" 2017-05-26 11:17:48 +00:00			`+under-construction`

Add NER training docs 2017-04-16 18:35:47 +00:00			`p`
			`\| This script shows how to add a new entity type to an existing pre-trained`
			`\| NER model. To keep the example short and simple, only four sentences are`
			`\| provided as examples. In practice, you'll need many more —`
			`\| #[strong a few hundred] would be a good start. You will also likely need`
			`\| to mix in #[strong examples of other entity types], which might be`
			`\| obtained by running the entity recognizer over unlabelled sentences, and`
			`\| adding their annotations to the training set.`

			`p`
			`\| For the full, runnable script of this example, see`
			`\| #[+src(gh("spacy", "examples/training/train_new_entity_type.py")) train_new_entity_type.py].`

			`+code("Training the entity recognizer").`
			`import spacy`
			`from spacy.pipeline import EntityRecognizer`
			`from spacy.gold import GoldParse`
			`from spacy.tagger import Tagger`
			`import random`

			`model_name = 'en'`
			`entity_label = 'ANIMAL'`
			`output_directory = '/path/to/model'`
			`train_data = [`
			`("Horses are too tall and they pretend to care about your feelings",`
			`[(0, 6, 'ANIMAL')]),`
			`("horses are too tall and they pretend to care about your feelings",`
			`[(0, 6, 'ANIMAL')]),`
			`("horses pretend to care about your feelings",`
			`[(0, 6, 'ANIMAL')]),`
			`("they pretend to care about your feelings, those horses",`
			`[(48, 54, 'ANIMAL')])`
			`]`

			`nlp = spacy.load(model_name)`
			`nlp.entity.add_label(entity_label)`
			`ner = train_ner(nlp, train_data, output_directory)`

			`def train_ner(nlp, train_data, output_dir):`
			`# Add new words to vocab`
			`for raw_text, _ in train_data:`
			`doc = nlp.make_doc(raw_text)`
			`for word in doc:`
			`_ = nlp.vocab[word.orth]`

			`for itn in range(20):`
			`random.shuffle(train_data)`
			`for raw_text, entity_offsets in train_data:`
			`gold = GoldParse(doc, entities=entity_offsets)`
			`doc = nlp.make_doc(raw_text)`
			`nlp.tagger(doc)`
			`loss = nlp.entity.update(doc, gold)`
			`nlp.save_to_directory(output_dir)`

			`p`
			`+button(gh("spaCy", "examples/training/train_new_entity_type.py"), false, "secondary") Full example`

			`p`
			`\| The actual training is performed by looping over the examples, and`
			`\| calling #[code nlp.entity.update()]. The #[code update()] method steps`
			`\| through the words of the input. At each word, it makes a prediction. It`
			`\| then consults the annotations provided on the #[code GoldParse] instance,`
			`\| to see whether it was right. If it was wrong, it adjusts its weights so`
			`\| that the correct action will score higher next time.`

			`p`
			`\| After training your model, you can`
Update usage docs and ddd "under construction" 2017-05-26 11:17:48 +00:00			`\| #[+a("/docs/usage/saving-loading") save it to a directory]. We recommend`
			`\| wrapping models as Python packages, for ease of deployment.`

			`+h(2, "saving-loading") Saving and loading`

			`p`
			`\| After training our model, you'll usually want to save its state, and load`
			`\| it back later. You can do this with the`
			`\| #[+api("language#to_disk") #[code Language.to_disk()]] method:`

			`+code.`
			`nlp.to_disk('/home/me/data/en_technology')`

			`p`
			`\| To make the model more convenient to deploy, we recommend wrapping it as`
			`\| a Python package, so that you can install it via pip and load it as a`
			`\| module. spaCy comes with a handy #[+api("cli#package") #[code package]]`
			`\| CLI command to create all required files and directories.`

			`+code(false, "bash").`
			`python -m spacy package /home/me/data/en_technology /home/me/my_models`

			`p`
			`\| To build the package and create a #[code .tar.gz] archive, run`
			`\| #[code python setup.py sdist] from within its directory.`

			`+infobox("Saving and loading models")`
			`\| For more information and a detailed guide on how to package your model,`
			`\| see the documentation on`
			`\| #[+a("/docs/usage/saving-loading#models") saving and loading models].`