spaCy/website/usage/_training/_basics.jade

254 lines
12 KiB
Plaintext
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

//- 💫 DOCS > USAGE > TRAINING > BASICS
include ../_spacy-101/_training
+h(3, "training-data") How do I get training data?
p
| Collecting training data may sound incredibly painful and it can be,
| if you're planning a large-scale annotation project. However, if your main
| goal is to update an existing model's predictions for example, spaCy's
| named entity recognition the hard is part usually not creating the
| actual annotations. It's finding representative examples and
| #[strong extracting potential candidates]. The good news is, if you've
| been noticing bad performance on your data, you likely
| already have some relevant text, and you can use spaCy to
| #[strong bootstrap a first set of training examples]. For example,
| after processing a few sentences, you may end up with the following
| entities, some correct, some incorrect.
+aside("How many examples do I need?")
| As a rule of thumb, you should allocate at least 10% of your project
| resources to creating training and evaluation data. If you're looking to
| improve an existing model, you might be able to start off with only a
| handful of examples. Keep in mind that you'll always want a lot more than
| that for #[strong evaluation] especially previous errors the model has
| made. Otherwise, you won't be able to sufficiently verify that the model
| has actually made the #[strong correct generalisations] required for your
| use case.
+table(["Text", "Entity", "Start", "End", "Label", ""])
- var style = [0, 0, 1, 1, 1]
+annotation-row(["Uber blew through $1 million a week", "Uber", 0, 4, "ORG"], style)
+cell #[+procon("yes", "right", true)]
+annotation-row(["Android Pay expands to Canada", "Android", 0, 7, "PERSON"], style)
+cell #[+procon("no", "wrong", true)]
+annotation-row(["Android Pay expands to Canada", "Canada", 23, 30, "GPE"], style)
+cell #[+procon("yes", "right", true)]
+annotation-row(["Spotify steps up Asia expansion", "Spotify", 0, 8, "ORG"], style)
+cell #[+procon("yes", "right", true)]
+annotation-row(["Spotify steps up Asia expansion", "Asia", 17, 21, "NORP"], style)
+cell #[+procon("no", "wrong", true)]
p
| Alternatively, the
| #[+a("/usage/linguistic-features#rule-based-matching") rule-based matcher]
| can be a useful tool to extract tokens or combinations of tokens, as
| well as their start and end index in a document. In this case, we'll
| extract mentions of Google and assume they're an #[code ORG].
+table(["Text", "Entity", "Start", "End", "Label", ""])
- var style = [0, 0, 1, 1, 1]
+annotation-row(["let me google this for you", "google", 7, 13, "ORG"], style)
+cell #[+procon("no", "wrong", true)]
+annotation-row(["Google Maps launches location sharing", "Google", 0, 6, "ORG"], style)
+cell #[+procon("no", "wrong", true)]
+annotation-row(["Google rebrands its business apps", "Google", 0, 6, "ORG"], style)
+cell #[+procon("yes", "right", true)]
+annotation-row(["look what i found on google! 😂", "google", 21, 27, "ORG"], style)
+cell #[+procon("no", "wrong", true)]
p
| Based on the few examples above, you can already create six training
| sentences with eight entities in total. Of course, what you consider a
| "correct annotation" will always depend on
| #[strong what you want the model to learn]. While there are some entity
| annotations that are more or less universally correct like Canada being
| a geopolitical entity your application may have its very own definition
| of the #[+a("/api/annotation#named-entities") NER annotation scheme].
+code.
train_data = [
("Uber blew through $1 million a week", [(0, 4, 'ORG')]),
("Android Pay expands to Canada", [(0, 11, 'PRODUCT'), (23, 30, 'GPE')]),
("Spotify steps up Asia expansion", [(0, 8, "ORG"), (17, 21, "LOC")]),
("Google Maps launches location sharing", [(0, 11, "PRODUCT")]),
("Google rebrands its business apps", [(0, 6, "ORG")]),
("look what i found on google! 😂", [(21, 27, "PRODUCT")])]
+infobox("Tip: Try the Prodigy annotation tool")
+infobox-logos(["prodigy", 100, 29, "https://prodi.gy"])
| If you need to label a lot of data, check out
| #[+a("https://prodi.gy", true) Prodigy], a new, active learning-powered
| annotation tool we've developed. Prodigy is fast and extensible, and
| comes with a modern #[strong web application] that helps you collect
| training data faster. It integrates seamlessly with spaCy, pre-selects
| the #[strong most relevant examples] for annotation, and lets you
| train and evaluate ready-to-use spaCy models.
+h(3, "annotations") Training with annotations
p
| The #[+api("goldparse") #[code GoldParse]] object collects the annotated
| training examples, also called the #[strong gold standard]. It's
| initialised with the #[+api("doc") #[code Doc]] object it refers to,
| and keyword arguments specifying the annotations, like #[code tags]
| or #[code entities]. Its job is to encode the annotations, keep them
| aligned and create the C-level data structures required for efficient access.
| Here's an example of a simple #[code GoldParse] for part-of-speech tags:
+code.
vocab = Vocab(tag_map={'N': {'pos': 'NOUN'}, 'V': {'pos': 'VERB'}})
doc = Doc(vocab, words=['I', 'like', 'stuff'])
gold = GoldParse(doc, tags=['N', 'V', 'N'])
p
| Using the #[code Doc] and its gold-standard annotations, the model can be
| updated to learn a sentence of three words with their assigned
| part-of-speech tags. The #[+a("/usage/adding-languages#tag-map") tag map]
| is part of the vocabulary and defines the annotation scheme. If you're
| training a new language model, this will let you map the tags present in
| the treebank you train on to spaCy's tag scheme.
+code.
doc = Doc(Vocab(), words=['Facebook', 'released', 'React', 'in', '2014'])
gold = GoldParse(doc, entities=['U-ORG', 'O', 'U-TECHNOLOGY', 'O', 'U-DATE'])
p
| The same goes for named entities. The letters added before the labels
| refer to the tags of the
| #[+a("/usage/linguistic-features#updating-biluo") BILUO scheme]
| #[code O] is a token outside an entity, #[code U] an single entity unit,
| #[code B] the beginning of an entity, #[code I] a token inside an entity
| and #[code L] the last token of an entity.
+aside
| #[strong Training data]: The training examples.#[br]
| #[strong Text and label]: The current example.#[br]
| #[strong Doc]: A #[code Doc] object created from the example text.#[br]
| #[strong GoldParse]: A #[code GoldParse] object of the #[code Doc] and label.#[br]
| #[strong nlp]: The #[code nlp] object with the model.#[br]
| #[strong Optimizer]: A function that holds state between updates.#[br]
| #[strong Update]: Update the model's weights.#[br]
| #[strong ]
+graphic("/assets/img/training-loop.svg")
include ../../assets/img/training-loop.svg
p
| Of course, it's not enough to only show a model a single example once.
| Especially if you only have few examples, you'll want to train for a
| #[strong number of iterations]. At each iteration, the training data is
| #[strong shuffled] to ensure the model doesn't make any generalisations
| based on the order of examples. Another technique to improve the learning
| results is to set a #[strong dropout rate], a rate at which to randomly
| "drop" individual features and representations. This makes it harder for
| the model to memorise the training data. For example, a #[code 0.25]
| dropout means that each feature or internal representation has a 1/4
| likelihood of being dropped.
+aside
| #[+api("language#begin_training") #[code begin_training()]]: Start the
| training and return an optimizer function to update the model's weights.#[br]
| #[+api("language#update") #[code update()]]: Update the model with the
| training example and gold data.#[br]
| #[+api("language#to_disk") #[code to_disk()]]: Save the updated model to
| a directory.
+code("Example training loop").
optimizer = nlp.begin_training(get_data)
for itn in range(100):
random.shuffle(train_data)
for raw_text, entity_offsets in train_data:
doc = nlp.make_doc(raw_text)
gold = GoldParse(doc, entities=entity_offsets)
nlp.update([doc], [gold], drop=0.5, sgd=optimizer)
nlp.to_disk('/model')
+table(["Name", "Description"])
+row
+cell #[code train_data]
+cell The training data.
+row
+cell #[code get_data]
+cell
| An optional function converting the training data to spaCy's
| JSON format.
+row
+cell #[code doc]
+cell
| #[+api("doc") #[code Doc]] objects. The #[code update] method
| takes a sequence of them, so you can batch up your training
| examples.
+row
+cell #[code gold]
+cell
| #[+api("goldparse") #[code GoldParse]] objects. The #[code update]
| method takes a sequence of them, so you can batch up your
| training examples.
+row
+cell #[code drop]
+cell Dropout rate. Makes it harder for the model to just memorise the data.
+row
+cell #[code optimizer]
+cell Callable to update the model's weights.
p
| Instead of writing your own training loop, you can also use the
| built-in #[+api("cli#train") #[code train]] command, which expects data
| in spaCy's #[+a("/api/annotation#json-input") JSON format]. On each epoch,
| a model will be saved out to the directory. After training, you can
| use the #[+api("cli#package") #[code package]] command to generate an
| installable Python package from your model.
+h(3, "training-simple-style") Simple training style
+tag-new(2)
p
| Instead of sequences of #[code Doc] and #[code GoldParse] objects,
| you can also use the "simple training style" and pass
| #[strong raw texts] and #[strong dictionaries of annotations]
| to #[+api("language#update") #[code nlp.update]].
| The dictionaries can have the keys #[code entities], #[code heads],
| #[code deps], #[code tags] and #[code cats]. This is generally
| recommended, as it removes one layer of abstraction, and avoids
| unnecessary imports. It also makes it easier to structure and load
| your training data.
+aside-code("Example Annotations").
{
'entities': [(0, 4, 'ORG')],
'heads': [1, 1, 1, 5, 5, 2, 7, 5],
'deps': ['nsubj', 'ROOT', 'prt', 'quantmod', 'compound', 'pobj', 'det', 'npadvmod'],
'tags': ['PROPN', 'VERB', 'ADP', 'SYM', 'NUM', 'NUM', 'DET', 'NOUN'],
'cats': {'BUSINESS': 1.0}
}
+code("Simple training loop").
TRAIN_DATA = [
("Uber blew through $1 million a week", {'entities': [(0, 4, 'ORG')]}),
("Google rebrands its business apps", {'entities': [(0, 6, "ORG")]})]
nlp = spacy.blank('en')
optimizer = nlp.begin_training()
for i in range(20):
random.shuffle(TRAIN_DATA)
for text, annotations in TRAIN_DATA:
nlp.update([text], [annotations], sgd=optimizer)
nlp.to_disk('/model')
p
| The above training loop leaves out a few details that can really
| improve accuracy but the principle really is #[em that] simple. Once
| you've got your pipeline together and you want to tune the accuracy,
| you usually want to process your training examples in batches, and
| experiment with #[+api("top-level#util.minibatch") #[code minibatch]]
| sizes and dropout rates, set via the #[code drop] keyword argument. See
| the #[+api("language") #[code Language]] and #[+api("pipe") #[code Pipe]]
| API docs for available options.