spaCy/website/usage/_training/_basics.jade

//- 💫 DOCS > USAGE > TRAINING > BASICS

include ../_spacy-101/_training

+h(3, "spacy-train-cli") Training via the command-line interface

p
    |  For most purposes, the best way to train spaCy is via the command-line
    |  interface. The #[+api("cli#train") #[code spacy train]] command takes
    |  care of many details for you, including making sure that the data is
    |  minibatched and shuffled correctly, progress is printed, and models are
    |  saved after each epoch. You can prepare your data for use in
    |  #[+api("cli#train") #[code spacy train]] using the
    |  #[+api("cli#convert") #[code spacy convert]] command, which accepts many
    |  common NLP data formats, including #[code .iob] for named entities, and
    |  the CoNLL format for dependencies:

+code("Example", "bash").
    git clone https://github.com/UniversalDependencies/UD_Spanish-AnCora
    mkdir ancora-json
    python -m spacy convert UD_Spanish-AnCora/es_ancora-ud-train.conllu ancora-json
    python -m spacy convert UD_Spanish-AnCora/es_ancora-ud-dev.conllu ancora-json
    mkdir models
    python -m spacy train es models ancora-json/es_ancora-ud-train.json ancora-json/es_ancora-ud-dev.json

+h(3, "training-data") How do I get training data?

p
    |  Collecting training data may sound incredibly painful – and it can be,
    |  if you're planning a large-scale annotation project. However, if your main
    |  goal is to update an existing model's predictions – for example, spaCy's
    |  named entity recognition – the hard part is usually not creating the
    |  actual annotations. It's finding representative examples and
    |  #[strong extracting potential candidates]. The good news is, if you've
    |  been noticing bad performance on your data, you likely
    |  already have some relevant text, and you can use spaCy to
    |  #[strong bootstrap a first set of training examples]. For example,
    |  after processing a few sentences, you may end up with the following
    |  entities, some correct, some incorrect.

+aside("How many examples do I need?")
    |  As a rule of thumb, you should allocate at least 10% of your project
    |  resources to creating training and evaluation data. If you're looking to
    |  improve an existing model, you might be able to start off with only a
    |  handful of examples. Keep in mind that you'll always want a lot more than
    |  that for #[strong evaluation] – especially previous errors the model has
    |  made. Otherwise, you won't be able to sufficiently verify that the model
    |  has actually made the #[strong correct generalisations] required for your
    |  use case.

+table(["Text", "Entity", "Start", "End", "Label", ""])
    - var style = [0, 0, 1, 1, 1]
    +annotation-row(["Uber blew through $1 million a week", "Uber", 0, 4, "ORG"], style)
        +cell #[+procon("yes", "right", true)]
    +annotation-row(["Android Pay expands to Canada", "Android", 0, 7, "PERSON"], style)
        +cell #[+procon("no", "wrong", true)]
    +annotation-row(["Android Pay expands to Canada", "Canada", 23, 29, "GPE"], style)
        +cell #[+procon("yes", "right", true)]
    +annotation-row(["Spotify steps up Asia expansion", "Spotify", 0, 7, "ORG"], style)
        +cell #[+procon("yes", "right", true)]
    +annotation-row(["Spotify steps up Asia expansion", "Asia", 17, 21, "NORP"], style)
        +cell #[+procon("no", "wrong", true)]

p
    |  Alternatively, the
    |  #[+a("/usage/linguistic-features#rule-based-matching") rule-based matcher]
    |  can be a useful tool to extract tokens or combinations of tokens, as
    |  well as their start and end index in a document. In this case, we'll
    |  extract mentions of Google and assume they're an #[code ORG].

+table(["Text", "Entity", "Start", "End", "Label", ""])
    - var style = [0, 0, 1, 1, 1]
    +annotation-row(["let me google this for you", "google", 7, 13, "ORG"], style)
        +cell #[+procon("no", "wrong", true)]
    +annotation-row(["Google Maps launches location sharing", "Google", 0, 6, "ORG"], style)
        +cell #[+procon("no", "wrong", true)]
    +annotation-row(["Google rebrands its business apps", "Google", 0, 6, "ORG"], style)
        +cell #[+procon("yes", "right", true)]
    +annotation-row(["look what i found on google! 😂", "google", 21, 27, "ORG"], style)
        +cell #[+procon("no", "wrong", true)]
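
p
    |  To make the candidate extraction step concrete, here is a minimal
    |  sketch that uses Python's built-in #[code re] module as a plain
    |  stand-in for the rule-based matcher. The helper name
    |  #[code find_candidates] and the default #[code ORG] label are
    |  assumptions made for illustration. It only shows how mentions and their
    |  character offsets could be collected for manual review.

```python
import re

texts = [
    "let me google this for you",
    "Google Maps launches location sharing",
    "Google rebrands its business apps",
    "look what i found on google! 😂",
]

def find_candidates(text, pattern=r"\bgoogle\b", label="ORG"):
    """Return (entity text, start, end, label) for each case-insensitive match.

    Hypothetical helper: every match is only a candidate that still needs
    to be accepted or rejected by a human.
    """
    return [(m.group(), m.start(), m.end(), label)
            for m in re.finditer(pattern, text, flags=re.IGNORECASE)]

for text in texts:
    print(text, find_candidates(text))
```

p
    |  Every match is assigned the same label, so the output still has to be
    |  reviewed by hand, exactly like the rows in the table above.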

p
    |  Based on the few examples above, you can already create six training
    |  sentences with eight entities in total. Of course, what you consider a
    |  "correct annotation" will always depend on
    |  #[strong what you want the model to learn]. While there are some entity
    |  annotations that are more or less universally correct – like Canada being
    |  a geopolitical entity – your application may have its very own definition
    |  of the #[+a("/api/annotation#named-entities") NER annotation scheme].

+code.
    train_data = [
        ("Uber blew through $1 million a week", [(0, 4, 'ORG')]),
        ("Android Pay expands to Canada", [(0, 11, 'PRODUCT'), (23, 29, 'GPE')]),
        ("Spotify steps up Asia expansion", [(0, 7, "ORG"), (17, 21, "LOC")]),
        ("Google Maps launches location sharing", [(0, 11, "PRODUCT")]),
        ("Google rebrands its business apps", [(0, 6, "ORG")]),
        ("look what i found on google! 😂", [(21, 27, "PRODUCT")])]
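
p
    |  Character offsets are easy to get wrong by one, so it can help to check
    |  them before training. The following is a small sketch in plain Python.
    |  The helper name #[code span_texts] is hypothetical and not part of
    |  spaCy. It slices each annotated span out of its text so you can confirm
    |  it matches the entity you meant to label.

```python
train_data = [
    ("Uber blew through $1 million a week", [(0, 4, "ORG")]),
    ("Google Maps launches location sharing", [(0, 11, "PRODUCT")]),
    ("Google rebrands its business apps", [(0, 6, "ORG")]),
    ("look what i found on google! 😂", [(21, 27, "PRODUCT")]),
]

def span_texts(text, entities):
    """Return (substring, label) for every (start, end, label) annotation.

    Hypothetical helper: raises ValueError if an offset pair does not fit
    inside the text.
    """
    spans = []
    for start, end, label in entities:
        if not 0 <= start < end <= len(text):
            raise ValueError("offsets (%d, %d) fall outside the text" % (start, end))
        spans.append((text[start:end], label))
    return spans

for text, entities in train_data:
    print(span_texts(text, entities))
```

p
    |  An end offset that is one character too large selects a trailing space
    |  or runs past the end of the text, and both cases are easy to spot in
    |  the printed output.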

+infobox("Tip: Try the Prodigy annotation tool")
    +infobox-logos(["prodigy", 100, 29, "https://prodi.gy"])
    |  If you need to label a lot of data, check out
    |  #[+a("https://prodi.gy", true) Prodigy], a new, active learning-powered
    |  annotation tool we've developed. Prodigy is fast and extensible, and
    |  comes with a modern  #[strong web application] that helps you collect
    |  training data faster. It integrates seamlessly with spaCy, pre-selects
    |  the #[strong most relevant examples] for annotation, and lets you
    |  train and evaluate ready-to-use spaCy models.

+h(3, "annotations") Training with annotations

p
    |  The #[+api("goldparse") #[code GoldParse]] object collects the annotated
    |  training examples, also called the #[strong gold standard]. It's
    |  initialised with the #[+api("doc") #[code Doc]] object it refers to,
    |  and keyword arguments specifying the annotations, like #[code tags]
    |  or #[code entities]. Its job is to encode the annotations, keep them
    |  aligned and create the C-level data structures required for efficient access.
    |  Here's an example of a simple #[code GoldParse] for part-of-speech tags:

+code.
    from spacy.vocab import Vocab
    from spacy.tokens import Doc
    from spacy.gold import GoldParse
    vocab = Vocab(tag_map={'N': {'pos': 'NOUN'}, 'V': {'pos': 'VERB'}})
    doc = Doc(vocab, words=['I', 'like', 'stuff'])
    gold = GoldParse(doc, tags=['N', 'V', 'N'])

p
    |  Using the #[code Doc] and its gold-standard annotations, the model can be
    |  updated to learn a sentence of three words with their assigned
    |  part-of-speech tags. The #[+a("/usage/adding-languages#tag-map") tag map]
    |  is part of the vocabulary and defines the annotation scheme. If you're
    |  training a new language model, this will let you map the tags present in
    |  the treebank you train on to spaCy's tag scheme.

+code.
    doc = Doc(Vocab(), words=['Facebook', 'released', 'React', 'in', '2014'])
    gold = GoldParse(doc, entities=['U-ORG', 'O', 'U-TECHNOLOGY', 'O', 'U-DATE'])

p
    |  The same goes for named entities. The letters added before the labels
    |  refer to the tags of the
    |  #[+a("/usage/linguistic-features#updating-biluo") BILUO scheme] –
    |  #[code O] is a token outside an entity, #[code U] a single-token entity unit,
    |  #[code B] the beginning of an entity, #[code I] a token inside an entity
    |  and #[code L] the last token of an entity.
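
p
    |  To illustrate how the BILUO tags relate to character offsets, here is a
    |  rough sketch in plain Python. The helpers #[code whitespace_tokens] and
    |  #[code biluo_tags] are hypothetical and not spaCy functions. The sketch
    |  assumes tokens are separated by single spaces and that entity offsets
    |  line up exactly with token boundaries.

```python
def whitespace_tokens(text):
    """Split on single spaces and record each token's character offsets."""
    tokens, pos = [], 0
    for word in text.split(" "):
        tokens.append((word, pos, pos + len(word)))
        pos += len(word) + 1
    return tokens

def biluo_tags(tokens, entities):
    """Tag (text, start, end) tokens using (start, end, label) entity offsets."""
    tags = ["O"] * len(tokens)
    for ent_start, ent_end, label in entities:
        # Indices of all tokens that lie fully inside the entity span.
        inside = [i for i, (_, start, end) in enumerate(tokens)
                  if start >= ent_start and end <= ent_end]
        if not inside:
            continue
        if len(inside) == 1:
            tags[inside[0]] = "U-" + label
        else:
            tags[inside[0]] = "B-" + label
            for i in inside[1:-1]:
                tags[i] = "I-" + label
            tags[inside[-1]] = "L-" + label
    return tags

text = "Facebook released React in 2014"
entities = [(0, 8, "ORG"), (18, 23, "TECHNOLOGY"), (27, 31, "DATE")]
print(biluo_tags(whitespace_tokens(text), entities))

text = "Google Maps launches location sharing"
print(biluo_tags(whitespace_tokens(text), [(0, 11, "PRODUCT")]))
```

p
    |  The first call reproduces the single-token #[code U-] tags from the
    |  example above. The second shows a two-token entity, which is tagged
    |  with #[code B-] and #[code L-].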

+aside
    |  #[strong Training data]: The training examples.#[br]
    |  #[strong Text and label]: The current example.#[br]
    |  #[strong Doc]: A #[code Doc] object created from the example text.#[br]
    |  #[strong GoldParse]: A #[code GoldParse] object of the #[code Doc] and label.#[br]
    |  #[strong nlp]: The #[code nlp] object with the model.#[br]
    |  #[strong Optimizer]: A function that holds state between updates.#[br]
    |  #[strong Update]: Update the model's weights.

+graphic("/assets/img/training-loop.svg")
    include ../../assets/img/training-loop.svg

p
    |  Of course, it's not enough to only show a model a single example once.
    |  Especially if you only have a few examples, you'll want to train for a
    |  #[strong number of iterations]. At each iteration, the training data is
    |  #[strong shuffled] to ensure the model doesn't make any generalisations
    |  based on the order of examples. Another technique to improve the learning
    |  results is to set a #[strong dropout rate], a rate at which to randomly
    |  "drop" individual features and representations. This makes it harder for
    |  the model to memorise the training data. For example, a #[code 0.25]
    |  dropout means that each feature or internal representation has a 1/4
    |  likelihood of being dropped.
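
p
    |  As a toy illustration of what a dropout rate means, the sketch below
    |  applies a #[code 0.25] rate to a plain list of numbers. The function
    |  name #[code apply_dropout] is hypothetical. Rescaling the surviving
    |  values by #[code 1 / (1 - drop)] is a common convention that is assumed
    |  here, and the sketch makes no claim about how spaCy implements dropout
    |  internally.

```python
import random

def apply_dropout(features, drop=0.25, rng=random):
    """Zero each value with probability `drop` and rescale the survivors."""
    keep = 1.0 - drop
    return [x / keep if rng.random() >= drop else 0.0 for x in features]

rng = random.Random(0)  # fixed seed so repeated runs behave the same
features = [1.0] * 10000
dropped = apply_dropout(features, drop=0.25, rng=rng).count(0.0)
print("dropped %d of %d features" % (dropped, len(features)))
```

p
    |  With enough features, roughly a quarter of them end up as zero on each
    |  pass, and a different quarter is chosen every time.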

+aside
    |  #[+api("language#begin_training") #[code begin_training()]]: Start the
    |  training and return an optimizer function to update the model's weights.
    |  Can take an optional function converting the training data to spaCy's
    |  training format.#[br]
    |  #[+api("language#update") #[code update()]]: Update the model with the
    |  training example and gold data.#[br]
    |  #[+api("language#to_disk") #[code to_disk()]]: Save the updated model to
    |  a directory.

+code("Example training loop").
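    import random
    from spacy.gold import GoldParse

    # Assumes nlp is a Language object with an entity recognizer and train_data
    # is a list of (text, entity offsets) tuples like the one shown earlier.
    optimizer = nlp.begin_training()
    for itn in range(100):
        random.shuffle(train_data)
        for raw_text, entity_offsets in train_data:
            doc = nlp.make_doc(raw_text)
            gold = GoldParse(doc, entities=entity_offsets)
            nlp.update([doc], [gold], drop=0.5, sgd=optimizer)
    nlp.to_disk('/model')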

p
    |  The #[+api("language#update") #[code nlp.update]] method takes the
    |  following arguments:

+table(["Name", "Description"])
    +row
        +cell #[code docs]
        +cell
            |  #[+api("doc") #[code Doc]] objects. The #[code update] method
            |  takes a sequence of them, so you can batch up your training
            |  examples. Alternatively, you can also pass in a sequence of
            |  raw texts.

    +row
        +cell #[code golds]
        +cell
            |  #[+api("goldparse") #[code GoldParse]] objects. The #[code update]
            |  method takes a sequence of them, so you can batch up your
            |  training examples. Alternatively, you can also pass in a
            |  dictionary containing the annotations.

    +row
        +cell #[code drop]
        +cell
            |  Dropout rate. Makes it harder for the model to just memorise
            |  the data.

    +row
        +cell #[code sgd]
        +cell
            |  An optimizer, i.e. a callable to update the model's weights. If
            |  not set, spaCy will create a new one and save it for further use.

p
    |  Instead of writing your own training loop, you can also use the
    |  built-in #[+api("cli#train") #[code train]] command, which expects data
    |  in spaCy's #[+a("/api/annotation#json-input") JSON format]. After each
    |  epoch, the model is saved to the output directory. After training, you can
    |  use the #[+api("cli#package") #[code package]] command to generate an
    |  installable Python package from your model.

+code(false, "bash").
    python -m spacy convert /tmp/train.conllu /tmp/data
    python -m spacy train en /tmp/model /tmp/data/train.json -n 5

+h(3, "training-simple-style") Simple training style
    +tag-new(2)

p
    |  Instead of sequences of #[code Doc] and #[code GoldParse] objects,
    |  you can also use the "simple training style" and pass
    |  #[strong raw texts] and #[strong dictionaries of annotations]
    |  to #[+api("language#update") #[code nlp.update]].
    |  The dictionaries can have the keys #[code entities], #[code heads],
    |  #[code deps], #[code tags] and #[code cats]. This is generally
    |  recommended, as it removes one layer of abstraction, and avoids
    |  unnecessary imports. It also makes it easier to structure and load
    |  your training data.

+aside-code("Example Annotations").
    {
        'entities': [(0, 4, 'ORG')],
        'heads': [1, 1, 1, 5, 5, 2, 7, 5],
        'deps': ['nsubj', 'ROOT', 'prt', 'quantmod', 'compound', 'pobj', 'det', 'npadvmod'],
        'tags': ['PROPN', 'VERB', 'ADP', 'SYM', 'NUM', 'NUM', 'DET', 'NOUN'],
        'cats': {'BUSINESS': 1.0}
    }

+code("Simple training loop").
    import random
    import spacy

    TRAIN_DATA = [
         ("Uber blew through $1 million a week", {'entities': [(0, 4, 'ORG')]}),
         ("Google rebrands its business apps", {'entities': [(0, 6, "ORG")]})]

    nlp = spacy.blank('en')
    # A blank model has no pipeline components, so add an entity recognizer
    ner = nlp.create_pipe('ner')
    nlp.add_pipe(ner)
    ner.add_label('ORG')
    optimizer = nlp.begin_training()
    for i in range(20):
        random.shuffle(TRAIN_DATA)
        for text, annotations in TRAIN_DATA:
            nlp.update([text], [annotations], sgd=optimizer)
    nlp.to_disk('/model')

p
    |  The above training loop leaves out a few details that can really
    |  improve accuracy – but the principle really is #[em that] simple. Once
    |  you've got your pipeline together and you want to tune the accuracy,
    |  you usually want to process your training examples in batches, and
    |  experiment with #[+api("top-level#util.minibatch") #[code minibatch]]
    |  sizes and dropout rates, set via the #[code drop] keyword argument. See
    |  the #[+api("language") #[code Language]] and #[+api("pipe") #[code Pipe]]
    |  API docs for available options.
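
p
    |  To show what processing examples in batches looks like, here is a
    |  minimal sketch in plain Python. The generator #[code batches] is a
    |  hypothetical stand-in for spaCy's #[code minibatch] utility, and the
    |  batch size and dropout value in the comment are arbitrary
    |  illustrations. The #[code nlp.update] call is left as a comment so the
    |  snippet runs without a model.

```python
import random

def batches(items, size):
    """Yield consecutive lists of at most `size` items (hypothetical helper)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

TRAIN_DATA = [
    ("Uber blew through $1 million a week", {"entities": [(0, 4, "ORG")]}),
    ("Google rebrands its business apps", {"entities": [(0, 6, "ORG")]}),
    ("Google Maps launches location sharing", {"entities": [(0, 11, "PRODUCT")]}),
]

random.shuffle(TRAIN_DATA)
for batch in batches(TRAIN_DATA, size=2):
    # Split the batch into parallel sequences of texts and annotations.
    texts, annotations = zip(*batch)
    # With a pipeline and optimizer set up as in the loop above, the update
    # would go here, for example:
    # nlp.update(texts, annotations, drop=0.35, sgd=optimizer)
    print(len(texts), "texts in this batch")
```

p
    |  Batch size and dropout rate interact, so it is usually worth trying a
    |  few combinations and comparing them on held-out evaluation data.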
-												Update usage documentation

											
										
										
											2017-10-03 12:26:20 +00:00
+								//- 💫 DOCS > USAGE > TRAINING > BASICS
-												Add customizing tokenizer and training workflow

											
										
										
											2016-11-05 19:40:11 +00:00
-												Update usage documentation

											
										
										
											2017-10-03 12:26:20 +00:00
+								include ../_spacy-101/_training
-												Add Training 101 stub

											
										
										
											2017-05-25 09:18:02 +00:00
-												💫 Interactive code examples, spaCy Universe and various docs improvements (#2274)

* Integrate Python kernel via Binder

* Add live model test for languages with examples

* Update docs and code examples

* Adjust margin (if not bootstrapped)

* Add binder version to global config

* Update terminal and executable code mixins

* Pass attributes through infobox and section

* Hide v-cloak

* Fix example

* Take out model comparison for now

* Add meta text for compat

* Remove chart.js dependency

* Tidy up and simplify JS and port big components over to Vue

* Remove chartjs example

* Add Twitter icon

* Add purple stylesheet option

* Add utility for hand cursor (special cases only)

* Add transition classes

* Add small option for section

* Add thumb object for small round thumbnail images

* Allow unset code block language via "none" value

(workaround to still allow unset language to default to DEFAULT_SYNTAX)

* Pass through attributes

* Add syntax highlighting definitions for Julia, R and Docker

* Add website icon

* Remove user survey from navigation

* Don't hide GitHub icon on small screens

* Make top navigation scrollable on small screens

* Remove old resources page and references to it

* Add Universe

* Add helper functions for better page URL and title

* Update site description

* Increment versions

* Update preview images

* Update mentions of resources

* Fix image

* Fix social images

* Fix problem with cover sizing and floats

* Add divider and move badges into heading

* Add docstrings

* Reference converting section

* Add section on converting word vectors

* Move converting section to custom section and fix formatting

* Remove old fastText example

* Move extensions content to own section

Keep weird ID to not break permalinks for now (we don't want to rewrite URLs if not absolutely necessary)

* Use better component example and add factories section

* Add note on larger model

* Use better example for non-vector

* Remove similarity in context section

Only works via small models with tensors so has always been kind of confusing

* Add note on init-model command

* Fix lightning tour examples and make excutable if possible

* Add spacy train CLI section to train

* Fix formatting and add video

* Fix formatting

* Fix textcat example description (resolves #2246)

* Add dummy file to try resolve conflict

* Delete dummy file

* Tidy up [ci skip]

* Ensure sufficient height of loading container

* Add loading animation to universe

* Update Thebelab build and use better startup message

* Fix asset versioning

* Fix typo [ci skip]

* Add note on project idea label

											
										
										
											2018-04-29 00:06:46 +00:00
+								+h(3, "spacy-train-cli") Training via the command-line interface
 								p
 								    |  For most purposes, the best way to train spaCy is via the command-line
 								    |  interface. The #[+api("cli#train") #[code spacy train]] command takes
 								    |  care of many details for you, including making sure that the data is
 								    |  minibatched and shuffled correctly, progress is printed, and models are
 								    |  saved after each epoch. You can prepare your data for use in
 								    |  #[+api("cli#train") #[code spacy train]] using the
 								    |  #[+api("cli#convert") #[code spacy convert]] command, which accepts many
 								    |  common NLP data formats, including #[code .iob] for named entities, and
 								    |  the CoNLL format for dependencies:
 								+code("Example", "bash").
 								    git clone https://github.com/UniversalDependencies/UD_Spanish-AnCora
 								    mkdir ancora-json
-												fix UD data file extensions (#2425)

* fix UD data files extension

* add contributor agreement for msklvsk

											
										
										
											2018-06-08 12:26:12 +00:00
+								    python -m spacy convert UD_Spanish-AnCora/es_ancora-ud-train.conllu ancora-json
 								    python -m spacy convert UD_Spanish-AnCora/es_ancora-ud-dev.conllu ancora-json
-												💫 Interactive code examples, spaCy Universe and various docs improvements (#2274)

* Integrate Python kernel via Binder

* Add live model test for languages with examples

* Update docs and code examples

* Adjust margin (if not bootstrapped)

* Add binder version to global config

* Update terminal and executable code mixins

* Pass attributes through infobox and section

* Hide v-cloak

* Fix example

* Take out model comparison for now

* Add meta text for compat

* Remove chart.js dependency

* Tidy up and simplify JS and port big components over to Vue

* Remove chartjs example

* Add Twitter icon

* Add purple stylesheet option

* Add utility for hand cursor (special cases only)

* Add transition classes

* Add small option for section

* Add thumb object for small round thumbnail images

* Allow unset code block language via "none" value

(workaround to still allow unset language to default to DEFAULT_SYNTAX)

* Pass through attributes

* Add syntax highlighting definitions for Julia, R and Docker

* Add website icon

* Remove user survey from navigation

* Don't hide GitHub icon on small screens

* Make top navigation scrollable on small screens

* Remove old resources page and references to it

* Add Universe

* Add helper functions for better page URL and title

* Update site description

* Increment versions

* Update preview images

* Update mentions of resources

* Fix image

* Fix social images

* Fix problem with cover sizing and floats

* Add divider and move badges into heading

* Add docstrings

* Reference converting section

* Add section on converting word vectors

* Move converting section to custom section and fix formatting

* Remove old fastText example

* Move extensions content to own section

Keep weird ID to not break permalinks for now (we don't want to rewrite URLs if not absolutely necessary)

* Use better component example and add factories section

* Add note on larger model

* Use better example for non-vector

* Remove similarity in context section

Only works via small models with tensors so has always been kind of confusing

* Add note on init-model command

* Fix lightning tour examples and make excutable if possible

* Add spacy train CLI section to train

* Fix formatting and add video

* Fix formatting

* Fix textcat example description (resolves #2246)

* Add dummy file to try resolve conflict

* Delete dummy file

* Tidy up [ci skip]

* Ensure sufficient height of loading container

* Add loading animation to universe

* Update Thebelab build and use better startup message

* Fix asset versioning

* Fix typo [ci skip]

* Add note on project idea label

											
										
										
											2018-04-29 00:06:46 +00:00
+								    mkdir models
 								    python -m spacy train es models ancora-json/es_ancora-ud-train.json ancora-json/es_ancora-ud-dev.json
-												Update training guide

											
										
										
											2017-06-01 09:53:23 +00:00
+								+h(3, "training-data") How do I get training data?
-												Add customizing tokenizer and training workflow

											
										
										
											2016-11-05 19:40:11 +00:00
-												Update training guide

											
										
										
											2017-06-01 09:53:23 +00:00
+								p
 								    |  Collecting training data may sound incredibly painful – and it can be,
 								    |  if you're planning a large-scale annotation project. However, if your main
 								    |  goal is to update an existing model's predictions – for example, spaCy's
-												Small Grammar Fix to _basics.jade

Fixed an incorrect word order.
											
										
										
											2018-01-11 14:26:47 +00:00
+								    |  named entity recognition – the hard part is usually not creating the
-												Update training guide

											
										
										
											2017-06-01 09:53:23 +00:00
+								    |  actual annotations. It's finding representative examples and
 								    |  #[strong extracting potential candidates]. The good news is, if you've
 								    |  been noticing bad performance on your data, you likely
 								    |  already have some relevant text, and you can use spaCy to
 								    |  #[strong bootstrap a first set of training examples]. For example,
 								    |  after processing a few sentences, you may end up with the following
 								    |  entities, some correct, some incorrect.
 								+aside("How many examples do I need?")
 								    |  As a rule of thumb, you should allocate at least 10% of your project
 								    |  resources to creating training and evaluation data. If you're looking to
 								    |  improve an existing model, you might be able to start off with only a
 								    |  handful of examples. Keep in mind that you'll always want a lot more than
 								    |  that for #[strong evaluation] – especially previous errors the model has
 								    |  made. Otherwise, you won't be able to sufficiently verify that the model
 								    |  has actually made the #[strong correct generalisations] required for your
 								    |  use case.
 								+table(["Text", "Entity", "Start", "End", "Label", ""])
 								    - var style = [0, 0, 1, 1, 1]
 								    +annotation-row(["Uber blew through $1 million a week", "Uber", 0, 4, "ORG"], style)
-												Improve style/accessibility of yes/no/neutral icons (see #1471)

Use distinctive icons instead of only colour, add proper handling of labels (hidden or visible, but always present) with optional custom text.

											
										
										
											2017-10-28 23:14:30 +00:00
+								        +cell #[+procon("yes", "right", true)]
-												Update training guide

											
										
										
											2017-06-01 09:53:23 +00:00
+								    +annotation-row(["Android Pay expands to Canada", "Android", 0, 7, "PERSON"], style)
-												Improve style/accessibility of yes/no/neutral icons (see #1471)

Use distinctive icons instead of only colour, add proper handling of labels (hidden or visible, but always present) with optional custom text.

											
										
										
											2017-10-28 23:14:30 +00:00
+								        +cell #[+procon("no", "wrong", true)]
-												Update training guide

											
										
										
											2017-06-01 09:53:23 +00:00
+								    +annotation-row(["Android Pay expands to Canada", "Canada", 23, 30, "GPE"], style)
-												Improve style/accessibility of yes/no/neutral icons (see #1471)

Use distinctive icons instead of only colour, add proper handling of labels (hidden or visible, but always present) with optional custom text.

											
										
										
											2017-10-28 23:14:30 +00:00
+								        +cell #[+procon("yes", "right", true)]
-												Update training guide

											
										
										
											2017-06-01 09:53:23 +00:00
+								    +annotation-row(["Spotify steps up Asia expansion", "Spotify", 0, 8, "ORG"], style)
-												Improve style/accessibility of yes/no/neutral icons (see #1471)

Use distinctive icons instead of only colour, add proper handling of labels (hidden or visible, but always present) with optional custom text.

											
										
										
											2017-10-28 23:14:30 +00:00
+								        +cell #[+procon("yes", "right", true)]
-												Update training guide

											
										
										
											2017-06-01 09:53:23 +00:00
+								    +annotation-row(["Spotify steps up Asia expansion", "Asia", 17, 21, "NORP"], style)
-												Improve style/accessibility of yes/no/neutral icons (see #1471)

Use distinctive icons instead of only colour, add proper handling of labels (hidden or visible, but always present) with optional custom text.

											
										
										
											2017-10-28 23:14:30 +00:00
+								        +cell #[+procon("no", "wrong", true)]
-												Add customizing tokenizer and training workflow

											
										
										
											2016-11-05 19:40:11 +00:00
 								p
-												Update training guide

											
										
										
											2017-06-01 09:53:23 +00:00
+								    |  Alternatively, the
-												Update usage documentation

											
										
										
											2017-10-03 12:26:20 +00:00
+								    |  #[+a("/usage/linguistic-features#rule-based-matching") rule-based matcher]
-												Update training guide

											
										
										
											2017-06-01 09:53:23 +00:00
+								    |  can be a useful tool to extract tokens or combinations of tokens, as
 								    |  well as their start and end index in a document. In this case, we'll
 								    |  extract mentions of Google and assume they're an #[code ORG].
 								+table(["Text", "Entity", "Start", "End", "Label", ""])
 								    - var style = [0, 0, 1, 1, 1]
 								    +annotation-row(["let me google this for you", "google", 7, 13, "ORG"], style)
-												Improve style/accessibility of yes/no/neutral icons (see #1471)

Use distinctive icons instead of only colour, add proper handling of labels (hidden or visible, but always present) with optional custom text.

											
										
										
											2017-10-28 23:14:30 +00:00
+								        +cell #[+procon("no", "wrong", true)]
-												Update training guide

											
										
										
											2017-06-01 09:53:23 +00:00
+								    +annotation-row(["Google Maps launches location sharing", "Google", 0, 6, "ORG"], style)
-												Improve style/accessibility of yes/no/neutral icons (see #1471)

Use distinctive icons instead of only colour, add proper handling of labels (hidden or visible, but always present) with optional custom text.

											
										
										
											2017-10-28 23:14:30 +00:00
+								        +cell #[+procon("no", "wrong", true)]
-												Update training guide

											
										
										
											2017-06-01 09:53:23 +00:00
+								    +annotation-row(["Google rebrands its business apps", "Google", 0, 6, "ORG"], style)
-												Improve style/accessibility of yes/no/neutral icons (see #1471)

Use distinctive icons instead of only colour, add proper handling of labels (hidden or visible, but always present) with optional custom text.

											
										
										
											2017-10-28 23:14:30 +00:00
+								        +cell #[+procon("yes", "right", true)]
-												Update training guide

											
										
										
											2017-06-01 09:53:23 +00:00
+								    +annotation-row(["look what i found on google! 😂", "google", 21, 27, "ORG"], style)
-												Improve style/accessibility of yes/no/neutral icons (see #1471)

Use distinctive icons instead of only colour, add proper handling of labels (hidden or visible, but always present) with optional custom text.

											
										
										
											2017-10-28 23:14:30 +00:00
+								        +cell #[+procon("no", "wrong", true)]
-												Add customizing tokenizer and training workflow

											
										
										
											2016-11-05 19:40:11 +00:00
-												Update training guide

											
										
										
											2017-06-01 09:53:23 +00:00
+								p
 								    |  Based on the few examples above, you can already create six training
 								    |  sentences with eight entities in total. Of course, what you consider a
 								    |  "correct annotation" will always depend on
 								    |  #[strong what you want the model to learn]. While there are some entity
 								    |  annotations that are more or less universally correct – like Canada being
 								    |  a geopolitical entity – your application may have its very own definition
-												Update usage documentation

											
										
										
											2017-10-03 12:26:20 +00:00
+								    |  of the #[+a("/api/annotation#named-entities") NER annotation scheme].
-												Add customizing tokenizer and training workflow

											
										
										
											2016-11-05 19:40:11 +00:00
 								+code.
-												Update training guide

											
										
										
											2017-06-01 09:53:23 +00:00
+								    train_data = [
 								        ("Uber blew through $1 million a week", [(0, 4, 'ORG')]),
 								        ("Android Pay expands to Canada", [(0, 11, 'PRODUCT'), (23, 30, 'GPE')]),
 								        ("Spotify steps up Asia expansion", [(0, 8, "ORG"), (17, 21, "LOC")]),
 								        ("Google Maps launches location sharing", [(0, 11, "PRODUCT")]),
 								        ("Google rebrands its business apps", [(0, 6, "ORG")]),
 								        ("look what i found on google! 😂", [(21, 27, "PRODUCT")])]
-												Add customizing tokenizer and training workflow

											
										
										
											2016-11-05 19:40:11 +00:00
-												Update training docs

											
										
										
											2017-10-30 19:28:00 +00:00
+								+infobox("Tip: Try the Prodigy annotation tool")
 								    +infobox-logos(["prodigy", 100, 29, "https://prodi.gy"])
 								    |  If you need to label a lot of data, check out
 								    |  #[+a("https://prodi.gy", true) Prodigy], a new, active learning-powered
 								    |  annotation tool we've developed. Prodigy is fast and extensible, and
 								    |  comes with a modern  #[strong web application] that helps you collect
 								    |  training data faster. It integrates seamlessly with spaCy, pre-selects
 								    |  the #[strong most relevant examples] for annotation, and lets you
 								    |  train and evaluate ready-to-use spaCy models.
 								+h(3, "annotations") Training with annotations
 								p
 								    |  The #[+api("goldparse") #[code GoldParse]] object collects the annotated
 								    |  training examples, also called the #[strong gold standard]. It's
 								    |  initialised with the #[+api("doc") #[code Doc]] object it refers to,
 								    |  and keyword arguments specifying the annotations, like #[code tags]
 								    |  or #[code entities]. Its job is to encode the annotations, keep them
 								    |  aligned and create the C-level data structures required for efficient access.
 								    |  Here's an example of a simple #[code GoldParse] for part-of-speech tags:
 								+code.
 								    from spacy.vocab import Vocab
 								    from spacy.tokens import Doc
 								    from spacy.gold import GoldParse
 								    vocab = Vocab(tag_map={'N': {'pos': 'NOUN'}, 'V': {'pos': 'VERB'}})
 								    doc = Doc(vocab, words=['I', 'like', 'stuff'])
 								    gold = GoldParse(doc, tags=['N', 'V', 'N'])
 								p
 								    |  Using the #[code Doc] and its gold-standard annotations, the model can be
 								    |  updated to learn a sentence of three words with their assigned
 								    |  part-of-speech tags. The #[+a("/usage/adding-languages#tag-map") tag map]
 								    |  is part of the vocabulary and defines the annotation scheme. If you're
 								    |  training a new language model, this will let you map the tags present in
 								    |  the treebank you train on to spaCy's tag scheme.
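 								p
 								    |  To make the idea of a tag map concrete, here is a plain-Python sketch
 								    |  that reuses the #[code tag_map] dictionary from the example above. It
 								    |  only illustrates the lookup from treebank tags to coarse-grained
 								    |  part-of-speech values and makes no claim about how spaCy implements
 								    |  it internally.

```python
# Same tag map as in the GoldParse example: treebank tag -> attributes.
tag_map = {'N': {'pos': 'NOUN'}, 'V': {'pos': 'VERB'}}
words = ['I', 'like', 'stuff']
tags = ['N', 'V', 'N']

# Illustrative dictionary lookup from each treebank tag to its coarse tag.
coarse = [tag_map[tag]['pos'] for tag in tags]
for word, tag, pos in zip(words, tags, coarse):
    print(word, tag, pos)
```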
 								+code.
 								    doc = Doc(Vocab(), words=['Facebook', 'released', 'React', 'in', '2014'])
 								    gold = GoldParse(doc, entities=['U-ORG', 'O', 'U-TECHNOLOGY', 'O', 'U-DATE'])
 								p
 								    |  The same goes for named entities. The letters added before the labels
 								    |  refer to the tags of the
 								    |  #[+a("/usage/linguistic-features#updating-biluo") BILUO scheme] –
 								    |  #[code O] is a token outside an entity, #[code U] a single-token entity (unit),
 								    |  #[code B] the beginning of an entity, #[code I] a token inside an entity
 								    |  and #[code L] the last token of an entity.
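 								p
 								    |  The relationship between character offsets and BILUO tags can be
 								    |  sketched in a few lines of plain Python. The helper below,
 								    |  #[code offsets_to_biluo], is hypothetical and assumes simple
 								    |  whitespace tokenization, which is cruder than spaCy's tokenizer.
 								    |  It is meant to show the scheme, not to replace spaCy's own
 								    |  alignment logic.

```python
def offsets_to_biluo(text, entities):
    """Hypothetical helper: whitespace tokens + (start, end, label) -> BILUO tags."""
    # Record the character span of every whitespace-separated token.
    tokens, pos = [], 0
    for word in text.split():
        start = text.index(word, pos)
        pos = start + len(word)
        tokens.append((start, pos))
    tags = ['O'] * len(tokens)
    for ent_start, ent_end, label in entities:
        # Indices of the tokens that lie entirely inside the entity span.
        inside = [i for i, (s, e) in enumerate(tokens)
                  if s >= ent_start and e <= ent_end]
        if not inside:
            continue
        if len(inside) == 1:
            tags[inside[0]] = 'U-' + label      # single-token entity
        else:
            tags[inside[0]] = 'B-' + label      # first token
            tags[inside[-1]] = 'L-' + label     # last token
            for i in inside[1:-1]:
                tags[i] = 'I-' + label          # tokens in between
    return tags

print(offsets_to_biluo("Facebook released React in 2014",
                       [(0, 8, 'ORG'), (18, 23, 'TECHNOLOGY'), (27, 31, 'DATE')]))
```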
 								+aside
 								    |  #[strong Training data]: The training examples.#[br]
 								    |  #[strong Text and label]: The current example.#[br]
 								    |  #[strong Doc]: A #[code Doc] object created from the example text.#[br]
 								    |  #[strong GoldParse]: A #[code GoldParse] object of the #[code Doc] and label.#[br]
 								    |  #[strong nlp]: The #[code nlp] object with the model.#[br]
 								    |  #[strong Optimizer]: A function that holds state between updates.#[br]
 								    |  #[strong Update]: Update the model's weights.#[br]
 								+graphic("/assets/img/training-loop.svg")
 								    include ../../assets/img/training-loop.svg
 								p
 								    |  Of course, it's not enough to only show a model a single example once.
 								    |  Especially if you only have a few examples, you'll want to train for a
 								    |  #[strong number of iterations]. At each iteration, the training data is
 								    |  #[strong shuffled] to ensure the model doesn't make any generalisations
 								    |  based on the order of examples. Another technique to improve the learning
 								    |  results is to set a #[strong dropout rate], a rate at which to randomly
 								    |  "drop" individual features and representations. This makes it harder for
 								    |  the model to memorise the training data. For example, a #[code 0.25]
 								    |  dropout means that each feature or internal representation has a 1/4
 								    |  likelihood of being dropped.
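 								p
 								    |  As a rough illustration of what a #[code 0.25] dropout rate means,
 								    |  the sketch below zeroes out each value in a list with probability
 								    |  1/4. #[code apply_dropout] is a hypothetical helper written for
 								    |  this example. Real implementations typically operate on arrays and
 								    |  often rescale the remaining values, which is left out here.

```python
import random

def apply_dropout(features, rate, rng):
    """Zero out each feature independently with probability `rate` (hypothetical helper)."""
    return [0.0 if rng.random() < rate else value for value in features]

rng = random.Random(0)  # fixed seed so the run is repeatable
features = [1.0] * 10000
dropped = apply_dropout(features, 0.25, rng)
fraction = dropped.count(0.0) / float(len(dropped))
print("fraction dropped: %.3f" % fraction)  # roughly a quarter of the values
```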
 								+aside
 								    |  #[+api("language#begin_training") #[code begin_training()]]: Start the
 								    |  training and return an optimizer function to update the model's weights.
 								    |  Can take an optional function converting the training data to spaCy's
 								    |  training format.#[br]
 								    |  #[+api("language#update") #[code update()]]: Update the model with the
 								    |  training example and gold data.#[br]
 								    |  #[+api("language#to_disk") #[code to_disk()]]: Save the updated model to
 								    |  a directory.
 								+code("Example training loop").
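 								    import random
 								    from spacy.gold import GoldParse
 								    # Assumes nlp is a loaded Language object, train_data is a list of
 								    # (text, entity_offsets) tuples and get_data is an optional function
 								    # that returns the training data in spaCy's training format.
 								    optimizer = nlp.begin_training(get_data)
 								    for itn in range(100):
 								        random.shuffle(train_data)  # new order on every iteration
 								        for raw_text, entity_offsets in train_data:
 								            doc = nlp.make_doc(raw_text)  # tokenize only, no predictions
 								            gold = GoldParse(doc, entities=entity_offsets)
 								            nlp.update([doc], [gold], drop=0.5, sgd=optimizer)
 								    nlp.to_disk('/model')  # save the updated model to a directory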
 								p
 								    |  The #[+api("language#update") #[code nlp.update]] method takes the
 								    |  following arguments:
 								+table(["Name", "Description"])
 								    +row
 								        +cell #[code docs]
 								        +cell
 								            |  #[+api("doc") #[code Doc]] objects. The #[code update] method
 								            |  takes a sequence of them, so you can batch up your training
 								            |  examples. Alternatively, you can also pass in a sequence of
 								            |  raw texts.
 								    +row
 								        +cell #[code golds]
 								        +cell
 								            |  #[+api("goldparse") #[code GoldParse]] objects. The #[code update]
 								            |  method takes a sequence of them, so you can batch up your
 								            |  training examples. Alternatively, you can also pass in a
 								            |  dictionary containing the annotations.
 								    +row
 								        +cell #[code drop]
 								        +cell
 								            |  Dropout rate. Makes it harder for the model to just memorise
 								            |  the data.
 								    +row
 								        +cell #[code sgd]
 								        +cell
 								            |  An optimizer, i.e. a callable to update the model's weights. If
 								            |  not set, spaCy will create a new one and save it for further use.
 								p
 								    |  Instead of writing your own training loop, you can also use the
 								    |  built-in #[+api("cli#train") #[code train]] command, which expects data
 								    |  in spaCy's #[+a("/api/annotation#json-input") JSON format]. After each
 								    |  epoch, the model is saved to the output directory. After training, you can
 								    |  use the #[+api("cli#package") #[code package]] command to generate an
 								    |  installable Python package from your model.
 								+code(false, "bash").
 								    python -m spacy convert /tmp/train.conllu /tmp/data
 								    python -m spacy train en /tmp/model /tmp/data/train.json -n 5
 								+h(3, "training-simple-style") Simple training style
 								    +tag-new(2)
 								p
 								    |  Instead of sequences of #[code Doc] and #[code GoldParse] objects,
 								    |  you can also use the "simple training style" and pass
 								    |  #[strong raw texts] and #[strong dictionaries of annotations]
 								    |  to #[+api("language#update") #[code nlp.update]].
 								    |  The dictionaries can have the keys #[code entities], #[code heads],
 								    |  #[code deps], #[code tags] and #[code cats]. This is generally
 								    |  recommended, as it removes one layer of abstraction, and avoids
 								    |  unnecessary imports. It also makes it easier to structure and load
 								    |  your training data.
 								+aside-code("Example Annotations").
 								    {
 								        'entities': [(0, 4, 'ORG')],
 								        'heads': [1, 1, 1, 5, 5, 2, 7, 5],
 								        'deps': ['nsubj', 'ROOT', 'prt', 'quantmod', 'compound', 'pobj', 'det', 'npadvmod'],
 								        'tags': ['PROPN', 'VERB', 'ADP', 'SYM', 'NUM', 'NUM', 'DET', 'NOUN'],
 								        'cats': {'BUSINESS': 1.0}
 								    }
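 								p
 								    |  Because the simple style uses only strings, lists and dictionaries,
 								    |  the data can be stored in a plain JSON file. The sketch below
 								    |  assumes a hypothetical layout of one #[code [text, annotations]]
 								    |  pair per example, and #[code load_train_data] is a helper invented
 								    |  for illustration. JSON has no tuples, so the entity offsets are
 								    |  converted back after decoding.

```python
import json

# Hypothetical JSON payload: one [text, annotations] pair per training example.
raw = '''[
  ["Uber blew through $1 million a week", {"entities": [[0, 4, "ORG"]]}],
  ["Google rebrands its business apps", {"entities": [[0, 6, "ORG"]]}]
]'''

def load_train_data(json_string):
    """Turn decoded JSON into (text, annotations) tuples with tuple offsets."""
    data = []
    for text, annotations in json.loads(json_string):
        annotations = dict(annotations)
        annotations['entities'] = [tuple(ent)
                                   for ent in annotations.get('entities', [])]
        data.append((text, annotations))
    return data

TRAIN_DATA = load_train_data(raw)
print(TRAIN_DATA[0])
```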
 								+code("Simple training loop").
 								    import random
 								    import spacy
 								    TRAIN_DATA = [
 								        ("Uber blew through $1 million a week", {'entities': [(0, 4, 'ORG')]}),
 								        ("Google rebrands its business apps", {'entities': [(0, 6, "ORG")]})]
 								    nlp = spacy.blank('en')
 								    # A blank pipeline has no components, so add an entity recognizer to train.
 								    ner = nlp.create_pipe('ner')
 								    nlp.add_pipe(ner)
 								    ner.add_label('ORG')
 								    optimizer = nlp.begin_training()
 								    for i in range(20):
 								        random.shuffle(TRAIN_DATA)
 								        for text, annotations in TRAIN_DATA:
 								            nlp.update([text], [annotations], sgd=optimizer)
 								    nlp.to_disk('/model')
 								p
 								    |  The above training loop leaves out a few details that can noticeably
 								    |  improve accuracy – but the principle really is #[em that] simple. Once
 								    |  you've got your pipeline together and you want to tune the accuracy,
 								    |  you usually want to process your training examples in batches, and
 								    |  experiment with #[+api("top-level#util.minibatch") #[code minibatch]]
 								    |  sizes and dropout rates, set via the #[code drop] keyword argument. See
 								    |  the #[+api("language") #[code Language]] and #[+api("pipe") #[code Pipe]]
 								    |  API docs for available options.
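 								p
 								    |  The batching step can be sketched without spaCy. In the example
 								    |  below, #[code simple_minibatch] is a hypothetical stand-in for a
 								    |  batching helper, the training examples are placeholders, and the
 								    |  #[code drop] value in the commented-out call is only an example
 								    |  setting, not a recommendation.

```python
import random

def simple_minibatch(items, size):
    """Hypothetical stand-in for a batching helper: yield lists of up to `size` items."""
    items = list(items)
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Placeholder examples in the simple training style.
TRAIN_DATA = [("text %d" % i, {'entities': []}) for i in range(10)]
random.shuffle(TRAIN_DATA)
for batch in simple_minibatch(TRAIN_DATA, size=4):
    texts, annotations = zip(*batch)
    # With a real pipeline, this is where the batch would be passed on, e.g.:
    # nlp.update(texts, annotations, drop=0.35, sgd=optimizer)
    print(len(texts), len(annotations))
```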