mirror of https://github.com/explosion/spaCy.git
790 lines
40 KiB
Markdown
---
title: Training spaCy's Statistical Models
next: /usage/adding-languages
menu:
  - ['Basics', 'basics']
  - ['NER', 'ner']
  - ['Tagger & Parser', 'tagger-parser']
  - ['Text Classification', 'textcat']
  - ['Entity Linking', 'entity-linker']
  - ['Tips and Advice', 'tips']
---

This guide describes how to train new statistical models for spaCy's
part-of-speech tagger, named entity recognizer, dependency parser, text
classifier and entity linker. Once the model is trained, you can then
[save and load](/usage/saving-loading#models) it.

## Training basics {#basics}

import Training101 from 'usage/101/\_training.md'

<Training101 />

### Training via the command-line interface {#spacy-train-cli}

For most purposes, the best way to train spaCy is via the command-line
interface. The [`spacy train`](/api/cli#train) command takes care of many
details for you, including making sure that the data is minibatched and shuffled
correctly, progress is printed, and models are saved after each epoch. You can
prepare your data for use in [`spacy train`](/api/cli#train) using the
[`spacy convert`](/api/cli#convert) command, which accepts many common NLP data
formats, including `.iob` for named entities, and the CoNLL format for
dependencies:

```bash
git clone https://github.com/UniversalDependencies/UD_Spanish-AnCora
mkdir ancora-json
python -m spacy convert UD_Spanish-AnCora/es_ancora-ud-train.conllu ancora-json
python -m spacy convert UD_Spanish-AnCora/es_ancora-ud-dev.conllu ancora-json
mkdir models
python -m spacy train es models ancora-json/es_ancora-ud-train.json ancora-json/es_ancora-ud-dev.json
```

<Infobox title="Tip: Debug your data">

If you're running spaCy v2.2 or above, you can use the
[`debug-data` command](/api/cli#debug-data) to analyze and validate your
training and development data, get useful stats, and find problems like invalid
entity annotations, cyclic dependencies, low data labels and more.

```bash
$ python -m spacy debug-data en train.json dev.json --verbose
```

</Infobox>

You can also use the [`gold.docs_to_json`](/api/goldparse#docs_to_json) helper
to convert a list of `Doc` objects to spaCy's JSON training format.

#### Understanding the training output

When you train a model using the [`spacy train`](/api/cli#train) command, you'll
see a table showing metrics after each pass over the data. Here's what those
metrics mean:

> #### Tokenization metrics
>
> Note that if the development data has raw text, some of the gold-standard
> entities might not align to the predicted tokenization. These tokenization
> errors are **excluded from the NER evaluation**. If your tokenization makes it
> impossible for the model to predict 50% of your entities, your NER F-score
> might still look good.

| Name       | Description                                                                                       |
| ---------- | ------------------------------------------------------------------------------------------------- |
| `Dep Loss` | Training loss for dependency parser. Should decrease, but usually not to 0.                       |
| `NER Loss` | Training loss for named entity recognizer. Should decrease, but usually not to 0.                 |
| `UAS`      | Unlabeled attachment score for parser. The percentage of unlabeled correct arcs. Should increase. |
| `NER P.`   | NER precision on development data. Should increase.                                               |
| `NER R.`   | NER recall on development data. Should increase.                                                  |
| `NER F.`   | NER F-score on development data. Should increase.                                                 |
| `Tag %`    | Fine-grained part-of-speech tag accuracy on development data. Should increase.                    |
| `Token %`  | Tokenization accuracy on development data.                                                        |
| `CPU WPS`  | Prediction speed on CPU in words per second, if available. Should stay stable.                    |
| `GPU WPS`  | Prediction speed on GPU in words per second, if available. Should stay stable.                    |

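To make the scores in the table more concrete, here's a minimal sketch of how
precision, recall, F-score and UAS can be computed from gold-standard and
predicted annotations. The function names and the data layout (entity spans as
`(start, end, label)` tuples, heads as token indices) are assumptions made for
this illustration, and this is not spaCy's internal scorer.

```python
def prf(gold_spans, pred_spans):
    """Precision, recall and F-score over sets of (start, end, label) spans."""
    gold, pred = set(gold_spans), set(pred_spans)
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


def uas(gold_heads, pred_heads):
    """Share of tokens whose predicted head index matches the gold head."""
    correct = sum(g == p for g, p in zip(gold_heads, pred_heads))
    return correct / len(gold_heads)


gold = [(0, 4, "ORG"), (23, 29, "GPE")]
pred = [(0, 4, "ORG"), (0, 7, "PERSON")]
print(prf(gold, pred))  # (0.5, 0.5, 0.5)
print(uas([1, 1, 1, 2], [1, 1, 3, 2]))  # 0.75
```

A span only counts as correct here if start, end and label all match, which is
why a partially overlapping prediction lowers both precision and recall.
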
### Improving accuracy with transfer learning {#transfer-learning new="2.1"}

In most projects, you'll usually have a small amount of labelled data, and
access to a much bigger sample of raw text. The raw text contains a lot of
information about the language in general. Learning this general information
from the raw text can help your model use the smaller labelled data more
efficiently.

The two main ways to use raw text in your spaCy models are **word vectors** and
**language model pretraining**. Word vectors provide information about the
definitions of words. The vectors are a look-up table, so each word only has one
representation, regardless of its context. Language model pretraining lets you
learn contextualized word representations. Instead of initializing spaCy's
convolutional neural network layers with random weights, the `spacy pretrain`
command trains a language model to predict each word's word vector based on the
surrounding words. The information the model learns while solving this task is a
good starting point for other tasks such as named entity recognition, text
classification or dependency parsing.

<Infobox title="📖 Vectors and pretraining">

For more details, see the documentation on
[vectors and similarity](/usage/vectors-similarity) and the
[`spacy pretrain`](/api/cli#pretrain) command.

</Infobox>

### How do I get training data? {#training-data}

Collecting training data may sound incredibly painful – and it can be, if you're
planning a large-scale annotation project. However, if your main goal is to
update an existing model's predictions – for example, spaCy's named entity
recognition – the hard part is usually not creating the actual annotations. It's
finding representative examples and **extracting potential candidates**. The
good news is, if you've been noticing bad performance on your data, you likely
already have some relevant text, and you can use spaCy to **bootstrap a first
set of training examples**. For example, after processing a few sentences, you
may end up with the following entities, some correct, some incorrect.

> #### How many examples do I need?
>
> As a rule of thumb, you should allocate at least 10% of your project resources
> to creating training and evaluation data. If you're looking to improve an
> existing model, you might be able to start off with only a handful of
> examples. Keep in mind that you'll always want a lot more than that for
> **evaluation** – especially previous errors the model has made. Otherwise, you
> won't be able to sufficiently verify that the model has actually made the
> **correct generalizations** required for your use case.

| Text                                | Entity  | Start | End  | Label    |     |
| ----------------------------------- | ------- | ----- | ---- | -------- | --- |
| Uber blew through $1 million a week | Uber    | `0`   | `4`  | `ORG`    | ✅  |
| Android Pay expands to Canada       | Android | `0`   | `7`  | `PERSON` | ❌  |
| Android Pay expands to Canada       | Canada  | `23`  | `29` | `GPE`    | ✅  |
| Spotify steps up Asia expansion     | Spotify | `0`   | `7`  | `ORG`    | ✅  |
| Spotify steps up Asia expansion     | Asia    | `17`  | `21` | `NORP`   | ❌  |

Alternatively, the [rule-based matcher](/usage/rule-based-matching) can be a
useful tool to extract tokens or combinations of tokens, as well as their start
and end index in a document. In this case, we'll extract mentions of Google and
assume they're an `ORG`.

| Text                                  | Entity | Start | End  | Label | |
| ------------------------------------- | ------ | ----- | ---- | ----- | --- |
| let me google this for you            | google | `7`   | `13` | `ORG` | ❌  |
| Google Maps launches location sharing | Google | `0`   | `6`  | `ORG` | ❌  |
| Google rebrands its business apps     | Google | `0`   | `6`  | `ORG` | ✅  |
| look what i found on google! 😂       | google | `21`  | `27` | `ORG` | ✅  |

Based on the few examples above, you can already create six training sentences
with eight entities in total. Of course, what you consider a "correct
annotation" will always depend on **what you want the model to learn**. While
there are some entity annotations that are more or less universally correct –
like Canada being a geopolitical entity – your application may have its very own
definition of the [NER annotation scheme](/api/annotation#named-entities).

```python
train_data = [
    ("Uber blew through $1 million a week", [(0, 4, "ORG")]),
    ("Android Pay expands to Canada", [(0, 11, "PRODUCT"), (23, 29, "GPE")]),
    ("Spotify steps up Asia expansion", [(0, 7, "ORG"), (17, 21, "LOC")]),
    ("Google Maps launches location sharing", [(0, 11, "PRODUCT")]),
    ("Google rebrands its business apps", [(0, 6, "ORG")]),
    ("look what i found on google! 😂", [(21, 27, "PRODUCT")])]
```

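Character offsets are easy to get wrong by one, so it can help to check them
before training. The following is a small sketch of such a check: it slices the
text with each `(start, end)` pair so you can see the surface form every
annotation actually covers. The helper name `check_offsets` is hypothetical and
not part of spaCy.

```python
def check_offsets(text, entities):
    """Return (surface form, label) pairs and fail on out-of-range offsets."""
    found = []
    for start, end, label in entities:
        if not 0 <= start < end <= len(text):
            raise ValueError(
                f"Bad span ({start}, {end}) for text of length {len(text)}"
            )
        found.append((text[start:end], label))
    return found


text = "Android Pay expands to Canada"
print(check_offsets(text, [(0, 11, "PRODUCT"), (23, 29, "GPE")]))
# [('Android Pay', 'PRODUCT'), ('Canada', 'GPE')]
```

The end offset is exclusive, just like in Python slicing, so an end offset
larger than `len(text)` is reported as an error instead of being silently
truncated.
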
<Infobox title="Tip: Try the Prodigy annotation tool">

[![Prodigy: Radically efficient machine teaching](../images/prodigy.jpg)](https://prodi.gy)

If you need to label a lot of data, check out [Prodigy](https://prodi.gy), a
new, active learning-powered annotation tool we've developed. Prodigy is fast
and extensible, and comes with a modern **web application** that helps you
collect training data faster. It integrates seamlessly with spaCy, pre-selects
the **most relevant examples** for annotation, and lets you train and evaluate
ready-to-use spaCy models.

</Infobox>

### Training with annotations {#annotations}

The [`GoldParse`](/api/goldparse) object collects the annotated training
examples, also called the **gold standard**. It's initialized with the
[`Doc`](/api/doc) object it refers to, and keyword arguments specifying the
annotations, like `tags` or `entities`. Its job is to encode the annotations,
keep them aligned and create the C-level data structures required for efficient
access. Here's an example of a simple `GoldParse` for part-of-speech tags:

```python
from spacy.gold import GoldParse
from spacy.tokens import Doc
from spacy.vocab import Vocab
vocab = Vocab(tag_map={"N": {"pos": "NOUN"}, "V": {"pos": "VERB"}})
doc = Doc(vocab, words=["I", "like", "stuff"])
gold = GoldParse(doc, tags=["N", "V", "N"])
```

Using the `Doc` and its gold-standard annotations, the model can be updated to
learn a sentence of three words with their assigned part-of-speech tags. The
[tag map](/usage/adding-languages#tag-map) is part of the vocabulary and defines
the annotation scheme. If you're training a new language model, this will let
you map the tags present in the treebank you train on to spaCy's tag scheme.

```python
doc = Doc(Vocab(), words=["Facebook", "released", "React", "in", "2014"])
gold = GoldParse(doc, entities=["U-ORG", "O", "U-TECHNOLOGY", "O", "U-DATE"])
```

The same goes for named entities. The letters added before the labels refer to
the tags of the [BILUO scheme](/usage/linguistic-features#updating-biluo) – `O`
is a token outside an entity, `U` a single-token entity (unit), `B` the
beginning of an entity, `I` a token inside an entity and `L` the last token of
an entity.

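To illustrate how the five BILUO prefixes are assigned, here's a minimal sketch
that builds the tags from token-level spans. It assumes entities are given as
`(first_token, last_token, label)` with an inclusive last index, which is a
simplification chosen for this example. The `biluo_tags` function is
hypothetical and is not the helper that spaCy ships for character offsets.

```python
def biluo_tags(n_tokens, entities):
    """Build BILUO tags from (first_token, last_token, label) spans."""
    tags = ["O"] * n_tokens
    for first, last, label in entities:
        if first == last:
            # A single-token entity gets the "unit" prefix
            tags[first] = f"U-{label}"
            continue
        tags[first] = f"B-{label}"
        for i in range(first + 1, last):
            tags[i] = f"I-{label}"
        tags[last] = f"L-{label}"
    return tags


words = ["Facebook", "released", "React", "in", "2014"]
spans = [(0, 0, "ORG"), (2, 2, "TECHNOLOGY"), (4, 4, "DATE")]
print(biluo_tags(len(words), spans))
# ['U-ORG', 'O', 'U-TECHNOLOGY', 'O', 'U-DATE']

words = ["She", "moved", "to", "New", "York", "City"]
print(biluo_tags(len(words), [(3, 5, "GPE")]))
# ['O', 'O', 'O', 'B-GPE', 'I-GPE', 'L-GPE']
```
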
> - **Training data**: The training examples.
> - **Text and label**: The current example.
> - **Doc**: A `Doc` object created from the example text.
> - **GoldParse**: A `GoldParse` object of the `Doc` and label.
> - **nlp**: The `nlp` object with the model.
> - **Optimizer**: A function that holds state between updates.
> - **Update**: Update the model's weights.

![The training loop](../images/training-loop.svg)

Of course, it's not enough to only show a model a single example once.
Especially if you only have a few examples, you'll want to train for a **number
of iterations**. At each iteration, the training data is **shuffled** to ensure
the model doesn't make any generalizations based on the order of examples.
Another technique to improve the learning results is to set a **dropout rate**,
a rate at which to randomly "drop" individual features and representations. This
makes it harder for the model to memorize the training data. For example, a
`0.25` dropout means that each feature or internal representation has a 1/4
likelihood of being dropped.

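As a rough illustration of what a dropout rate means, the sketch below zeroes
each value in a list with the given probability. This is only a toy model of the
idea: the `apply_dropout` function is hypothetical, and neural network libraries
usually also rescale the values that are kept, which is left out here.

```python
import random


def apply_dropout(values, rate, rng):
    """Zero each value with probability `rate`; keep the others unchanged."""
    return [0.0 if rng.random() < rate else v for v in values]


rng = random.Random(0)  # fixed seed so repeated runs behave the same
features = [1.0] * 10000
dropped = apply_dropout(features, 0.25, rng)
share = dropped.count(0.0) / len(dropped)
print(share)  # roughly a quarter of the values are dropped
```

Because a different random subset is dropped on every update, the model can't
rely on any single feature always being present.
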
> - [`begin_training()`](/api/language#begin_training): Start the training and
>   return an optimizer function to update the model's weights. Can take an
>   optional function converting the training data to spaCy's training format.
> - [`update()`](/api/language#update): Update the model with the training
>   example and gold data.
> - [`to_disk()`](/api/language#to_disk): Save the updated model to a directory.

```python
### Example training loop
import random
from spacy.gold import GoldParse

# Assumes `nlp`, `train_data` and `get_data` are already defined
optimizer = nlp.begin_training(get_data)
for itn in range(100):
    random.shuffle(train_data)
    for raw_text, entity_offsets in train_data:
        doc = nlp.make_doc(raw_text)
        gold = GoldParse(doc, entities=entity_offsets)
        nlp.update([doc], [gold], drop=0.5, sgd=optimizer)
nlp.to_disk("/model")
```

The [`nlp.update`](/api/language#update) method takes the following arguments:

| Name    | Description                                                                                                                                                                                                   |
| ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `docs`  | [`Doc`](/api/doc) objects. The `update` method takes a sequence of them, so you can batch up your training examples. Alternatively, you can also pass in a sequence of raw texts.                              |
| `golds` | [`GoldParse`](/api/goldparse) objects. The `update` method takes a sequence of them, so you can batch up your training examples. Alternatively, you can also pass in a dictionary containing the annotations. |
| `drop`  | Dropout rate. Makes it harder for the model to just memorize the data.                                                                                                                                        |
| `sgd`   | An optimizer, i.e. a callable to update the model's weights. If not set, spaCy will create a new one and save it for further use.                                                                             |

Instead of writing your own training loop, you can also use the built-in
[`train`](/api/cli#train) command, which expects data in spaCy's
[JSON format](/api/annotation#json-input). On each epoch, a model will be saved
out to the directory. After training, you can use the
[`package`](/api/cli#package) command to generate an installable Python package
from your model.

```bash
python -m spacy convert /tmp/train.conllu /tmp/data
python -m spacy train en /tmp/model /tmp/data/train.json -n 5
```

### Simple training style {#training-simple-style new="2"}

Instead of sequences of `Doc` and `GoldParse` objects, you can also use the
"simple training style" and pass **raw texts** and **dictionaries of
annotations** to [`nlp.update`](/api/language#update). The dictionaries can have
the keys `entities`, `heads`, `deps`, `tags` and `cats`. This is generally
recommended, as it removes one layer of abstraction, and avoids unnecessary
imports. It also makes it easier to structure and load your training data.

> #### Example Annotations
>
> ```python
> {
>    "entities": [(0, 4, "ORG")],
>    "heads": [1, 1, 1, 5, 5, 2, 7, 5],
>    "deps": ["nsubj", "ROOT", "prt", "quantmod", "compound", "pobj", "det", "npadvmod"],
>    "tags": ["PROPN", "VERB", "ADP", "SYM", "NUM", "NUM", "DET", "NOUN"],
>    "cats": {"BUSINESS": 1.0},
> }
> ```

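The token-level keys (`heads`, `deps` and `tags`) need one value per token, and
each head has to point at an existing token. Here's a short sketch of a sanity
check for an annotation dictionary. The `check_annotations` helper is
hypothetical, and the word list is our assumption of how the example sentence is
tokenized.

```python
def check_annotations(words, annots):
    """Check that token-level annotations line up with the tokens."""
    n = len(words)
    for key in ("heads", "deps", "tags"):
        if key in annots and len(annots[key]) != n:
            raise ValueError(f"'{key}' has {len(annots[key])} values for {n} tokens")
    for head in annots.get("heads", []):
        if not 0 <= head < n:
            raise ValueError(f"Head index {head} is out of range")
    return True


# Assumed tokenization of "Uber blew through $1 million a week"
words = ["Uber", "blew", "through", "$", "1", "million", "a", "week"]
annots = {
    "heads": [1, 1, 1, 5, 5, 2, 7, 5],
    "deps": ["nsubj", "ROOT", "prt", "quantmod", "compound", "pobj", "det", "npadvmod"],
    "tags": ["PROPN", "VERB", "ADP", "SYM", "NUM", "NUM", "DET", "NOUN"],
}
print(check_annotations(words, annots))  # True
```
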
```python
### Simple training loop
import random
import spacy

TRAIN_DATA = [
    ("Uber blew through $1 million a week", {"entities": [(0, 4, "ORG")]}),
    ("Google rebrands its business apps", {"entities": [(0, 6, "ORG")]})]

nlp = spacy.blank("en")
optimizer = nlp.begin_training()
for i in range(20):
    random.shuffle(TRAIN_DATA)
    for text, annotations in TRAIN_DATA:
        nlp.update([text], [annotations], sgd=optimizer)
nlp.to_disk("/model")
```

The above training loop leaves out a few details that can really improve
accuracy – but the principle really is _that_ simple. Once you've got your
pipeline together and you want to tune the accuracy, you usually want to process
your training examples in batches, and experiment with
[`minibatch`](/api/top-level#util.minibatch) sizes and dropout rates, set via
the `drop` keyword argument. See the [`Language`](/api/language) and
[`Pipe`](/api/pipe) API docs for available options.

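To show what batching with growing batch sizes looks like, here's a
self-contained sketch. The two functions borrow the names of spaCy's `minibatch`
and `compounding` helpers, but the bodies are simplified re-implementations
written for this illustration, so details may differ from the library versions.

```python
import itertools


def compounding(start, stop, compound):
    """Yield start, start * compound, ... capped at stop (assumes growth)."""
    value = start
    while True:
        yield min(value, stop)
        value *= compound


def minibatch(items, sizes):
    """Cut items into batches whose sizes are drawn from the `sizes` iterator."""
    items = iter(items)
    for size in sizes:
        batch = list(itertools.islice(items, int(size)))
        if not batch:
            break
        yield batch


sizes = compounding(1, 4, 2)
print(list(minibatch(range(10), sizes)))
# [[0], [1, 2], [3, 4, 5, 6], [7, 8, 9]]
```

Starting with small batches and growing them over time means the model receives
many small updates early on and fewer, larger ones later. Whether that helps is
something to check on your own data.
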
## Training the named entity recognizer {#ner}

All [spaCy models](/models) support online learning, so you can update a
pretrained model with new examples. You'll usually need to provide many
**examples** to meaningfully improve the system – a few hundred is a good start,
although more is better.

You should avoid iterating over the same few examples multiple times, or the
model is likely to "forget" how to annotate other examples. If you iterate over
the same few examples, you're effectively changing the loss function. The
optimizer will find a way to minimize the loss on your examples, without regard
for the consequences on the examples it's no longer paying attention to. One way
to avoid this
["catastrophic forgetting" problem](https://explosion.ai/blog/pseudo-rehearsal-catastrophic-forgetting)
is to "remind" the model of other examples by augmenting your annotations with
sentences annotated with entities automatically recognized by the original
model. Ultimately, this is an empirical process: you'll need to **experiment on
your data** to find a solution that works best for you.

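The "reminding" step can be sketched as a small data-mixing function. In the
example below, `annotate` stands in for running the original model over a text
and reading off its predictions. It's a placeholder here, and the names
`mix_in_revision` and `fake_annotate` are hypothetical and not part of spaCy.

```python
import random


def mix_in_revision(new_examples, raw_texts, annotate, n_revision, rng):
    """Add texts labelled by the original model to the new training examples."""
    sample = rng.sample(raw_texts, min(n_revision, len(raw_texts)))
    revision = [(text, annotate(text)) for text in sample]
    mixed = list(new_examples) + revision
    rng.shuffle(mixed)
    return mixed


def fake_annotate(text):
    # Placeholder for the original model's predicted entities
    return {"entities": []}


new = [("Horses are too tall", {"entities": [(0, 6, "ANIMAL")]})]
raw = ["Uber blew through $1 million a week", "Google rebrands its business apps"]
mixed = mix_in_revision(new, raw, fake_annotate, n_revision=2, rng=random.Random(0))
print(len(mixed))  # 3
```

How many revision examples to mix in per new example is a setting you'll need
to tune empirically.
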
> #### Tip: Converting entity annotations
>
> You can train the entity recognizer with entity offsets or annotations in the
> [BILUO scheme](/api/annotation#biluo). The `spacy.gold` module also exposes
> [two helper functions](/api/goldparse#util) to convert offsets to BILUO tags,
> and BILUO tags to entity offsets.

### Updating the Named Entity Recognizer {#example-train-ner}

This example shows how to update spaCy's entity recognizer with your own
examples, starting off with an existing, pretrained model, or from scratch using
a blank `Language` class. To do this, you'll need **example texts** and the
**character offsets** and **labels** of each entity contained in the texts.

```python
https://github.com/explosion/spaCy/tree/master/examples/training/train_ner.py
```

#### Step by step guide {#step-by-step-ner}

1. **Load the model** you want to start with, or create an **empty model** using
   [`spacy.blank`](/api/top-level#spacy.blank) with the ID of your language. If
   you're using a blank model, don't forget to add the entity recognizer to the
   pipeline. If you're using an existing model, make sure to disable all other
   pipeline components during training using
   [`nlp.select_pipes`](/api/language#select_pipes). This way, you'll only be
   training the entity recognizer.
2. **Shuffle and loop over** the examples. For each example, **update the
   model** by calling [`nlp.update`](/api/language#update), which steps through
   the words of the input. At each word, it makes a **prediction**. It then
   consults the annotations to see whether it was right. If it was wrong, it
   adjusts its weights so that the correct action will score higher next time.
3. **Save** the trained model using [`nlp.to_disk`](/api/language#to_disk).
4. **Test** the model to make sure the entities in the training data are
   recognized correctly.

### Training an additional entity type {#example-new-entity-type}

This script shows how to add a new entity type `ANIMAL` to an existing
pretrained NER model, or an empty `Language` class. To keep the example short
and simple, only a few sentences are provided as examples. In practice, you'll
need many more – a few hundred would be a good start. You will also likely need
to mix in examples of other entity types, which might be obtained by running the
entity recognizer over unlabelled sentences, and adding their annotations to the
training set.

```python
https://github.com/explosion/spaCy/tree/master/examples/training/train_new_entity_type.py
```

<Infobox title="Important note" variant="warning">

If you're using an existing model, make sure to mix in examples of **other
entity types** that spaCy correctly recognized before. Otherwise, your model
might learn the new type, but "forget" what it previously knew. This is also
referred to as the "catastrophic forgetting" problem.

</Infobox>

#### Step by step guide {#step-by-step-ner-new}

1. **Load the model** you want to start with, or create an **empty model** using
   [`spacy.blank`](/api/top-level#spacy.blank) with the ID of your language. If
   you're using a blank model, don't forget to add the entity recognizer to the
   pipeline. If you're using an existing model, make sure to disable all other
   pipeline components during training using
   [`nlp.select_pipes`](/api/language#select_pipes). This way, you'll only be
   training the entity recognizer.
2. **Add the new entity label** to the entity recognizer using the
   [`add_label`](/api/entityrecognizer#add_label) method. You can access the
   entity recognizer in the pipeline via `nlp.get_pipe('ner')`.
3. **Loop over** the examples and call [`nlp.update`](/api/language#update),
   which steps through the words of the input. At each word, it makes a
   **prediction**. It then consults the annotations, to see whether it was
   right. If it was wrong, it adjusts its weights so that the correct action
   will score higher next time.
4. **Save** the trained model using [`nlp.to_disk`](/api/language#to_disk).
5. **Test** the model to make sure the new entity is recognized correctly.

## Training the tagger and parser {#tagger-parser}

### Updating the Dependency Parser {#example-train-parser}

This example shows how to train spaCy's dependency parser, starting off with an
existing model or a blank model. You'll need a set of **training examples** and
the respective **heads** and **dependency label** for each token of the example
texts.

```python
https://github.com/explosion/spaCy/tree/master/examples/training/train_parser.py
```

#### Step by step guide {#step-by-step-parser}

1. **Load the model** you want to start with, or create an **empty model** using
   [`spacy.blank`](/api/top-level#spacy.blank) with the ID of your language. If
   you're using a blank model, don't forget to add the parser to the pipeline.
   If you're using an existing model, make sure to disable all other pipeline
   components during training using
   [`nlp.select_pipes`](/api/language#select_pipes). This way, you'll only be
   training the parser.
2. **Add the dependency labels** to the parser using the
   [`add_label`](/api/dependencyparser#add_label) method. If you're starting off
   with a pretrained spaCy model, this is usually not necessary – but it doesn't
   hurt either, just to be safe.
3. **Shuffle and loop over** the examples. For each example, **update the
   model** by calling [`nlp.update`](/api/language#update), which steps through
   the words of the input. At each word, it makes a **prediction**. It then
   consults the annotations to see whether it was right. If it was wrong, it
   adjusts its weights so that the correct action will score higher next time.
4. **Save** the trained model using [`nlp.to_disk`](/api/language#to_disk).
5. **Test** the model to make sure the parser works as expected.

### Updating the Part-of-speech Tagger {#example-train-tagger}

In this example, we're training spaCy's part-of-speech tagger with a custom tag
map. We start off with a blank `Language` class, update its defaults with our
custom tags and then train the tagger. You'll need a set of **training
examples** and the respective **custom tags**, as well as a dictionary mapping
those tags to the
[Universal Dependencies scheme](http://universaldependencies.github.io/docs/u/pos/index.html).

```python
https://github.com/explosion/spaCy/tree/master/examples/training/train_tagger.py
```

#### Step by step guide {#step-by-step-tagger}

1. **Load the model** you want to start with, or create an **empty model** using
   [`spacy.blank`](/api/top-level#spacy.blank) with the ID of your language. If
   you're using a blank model, don't forget to add the tagger to the pipeline.
   If you're using an existing model, make sure to disable all other pipeline
   components during training using
   [`nlp.select_pipes`](/api/language#select_pipes). This way, you'll only be
   training the tagger.
2. **Add the tag map** to the tagger using the
   [`add_label`](/api/tagger#add_label) method. The first argument is the new
   tag name, the second the mapping to spaCy's coarse-grained tags, e.g.
   `{'pos': 'NOUN'}`.
3. **Shuffle and loop over** the examples. For each example, **update the
   model** by calling [`nlp.update`](/api/language#update), which steps through
   the words of the input. At each word, it makes a **prediction**. It then
   consults the annotations to see whether it was right. If it was wrong, it
   adjusts its weights so that the correct action will score higher next time.
4. **Save** the trained model using [`nlp.to_disk`](/api/language#to_disk).
5. **Test** the model to make sure the tagger works as expected.

### Training a parser for custom semantics {#intent-parser}

spaCy's parser component can be trained to predict any type of tree structure
over your input text – including **semantic relations** that are not syntactic
dependencies. This can be useful for **conversational applications**, which need
to predict trees over whole documents or chat logs, with connections between the
sentence roots used to annotate discourse structure. For example, you can train
spaCy's parser to label intents and their targets, like attributes, quality,
time and locations. The result could look like this:

![Custom dependencies](../images/displacy-custom-parser.svg)

```python
doc = nlp("find a hotel with good wifi")
print([(t.text, t.dep_, t.head.text) for t in doc if t.dep_ != '-'])
# [('find', 'ROOT', 'find'), ('hotel', 'PLACE', 'find'),
# ('good', 'QUALITY', 'wifi'), ('wifi', 'ATTRIBUTE', 'hotel')]
```

The above tree attaches "wifi" to "hotel" and assigns the dependency label
`ATTRIBUTE`. This may not be a correct syntactic dependency – but in this case,
it expresses exactly what we need: the user is looking for a hotel with the
attribute "wifi" of the quality "good". This query can then be processed by your
application and used to trigger the respective action – e.g. search the database
for hotels with high ratings for their wifi offerings.

> #### Tip: merge phrases and entities
>
> To achieve even better accuracy, try merging multi-word tokens and entities
> specific to your domain into one token before parsing your text. You can do
> this by running the entity recognizer or
> [rule-based matcher](/usage/rule-based-matching) to find relevant spans, and
> merging them using [`Doc.retokenize`](/api/doc#retokenize). You could even add
> your own custom
> [pipeline component](/usage/processing-pipelines#custom-components) to do this
> automatically – just make sure to add it `before='parser'`.

The following example shows a full implementation of a training loop for a
custom message parser for a common "chat intent": finding local businesses. Our
message semantics will have the following types of relations: `ROOT`, `PLACE`,
`QUALITY`, `ATTRIBUTE`, `TIME` and `LOCATION`.

```python
https://github.com/explosion/spaCy/tree/master/examples/training/train_intent_parser.py
```

#### Step by step guide {#step-by-step-parser-custom}

1. **Create the training data** consisting of words, their heads and their
   dependency labels in order. A token's head is the index of the token it is
   attached to. The heads don't need to be syntactically correct – they should
   express the **semantic relations** you want the parser to learn. For words
   that shouldn't receive a label, you can choose an arbitrary placeholder, for
   example `-`.
2. **Load the model** you want to start with, or create an **empty model** using
   [`spacy.blank`](/api/top-level#spacy.blank) with the ID of your language. If
   you're using a blank model, don't forget to add the custom parser to the
   pipeline. If you're using an existing model, make sure to **remove the old
   parser** from the pipeline, and disable all other pipeline components during
   training using [`nlp.select_pipes`](/api/language#select_pipes). This way,
   you'll only be training the parser.
3. **Add the dependency labels** to the parser using the
   [`add_label`](/api/dependencyparser#add_label) method.
4. **Shuffle and loop over** the examples. For each example, **update the
   model** by calling [`nlp.update`](/api/language#update), which steps through
   the words of the input. At each word, it makes a **prediction**. It then
   consults the annotations to see whether it was right. If it was wrong, it
   adjusts its weights so that the correct action will score higher next time.
5. **Save** the trained model using [`nlp.to_disk`](/api/language#to_disk).
6. **Test** the model to make sure the parser works as expected.

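Step 1 is where most mistakes happen, so it can be worth checking your heads
before training. The sketch below verifies that a list of head indices forms a
single tree, with the root pointing at itself. The `check_tree` helper is
hypothetical. The heads for "a" and "with" are our assumption, since those
tokens carry the placeholder label `-` and don't appear in the output shown
earlier.

```python
def check_tree(heads):
    """Check that absolute head indices form a single rooted tree."""
    roots = [i for i, head in enumerate(heads) if head == i]
    if len(roots) != 1:
        raise ValueError(f"Expected exactly one root, found {len(roots)}")
    for i in range(len(heads)):
        seen = set()
        while heads[i] != i:  # walk up until we reach the root
            if i in seen:
                raise ValueError(f"Cycle through token {i}")
            seen.add(i)
            i = heads[i]
    return roots[0]


words = ["find", "a", "hotel", "with", "good", "wifi"]
heads = [0, 2, 0, 5, 5, 2]  # index of the token each word attaches to
deps = ["ROOT", "-", "PLACE", "-", "QUALITY", "ATTRIBUTE"]
print(words[check_tree(heads)])  # find
```

This sketch requires exactly one root per example. If your annotations span
several sentences with separate roots, you'd need to relax that check.
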
## Training a text classification model {#textcat}
|
||
|
||
### Adding a text classifier to a spaCy model {#example-textcat new="2"}
|
||
|
||
This example shows how to train a convolutional neural network text classifier
|
||
on IMDB movie reviews, using spaCy's new
|
||
[`TextCategorizer`](/api/textcategorizer) component. The dataset will be loaded
|
||
automatically via Thinc's built-in dataset loader. Predictions are available via
|
||
[`Doc.cats`](/api/doc#attributes).
|
||
|
||
```python
|
||
https://github.com/explosion/spaCy/tree/master/examples/training/train_textcat.py
|
||
```
|
||
|
||
#### Step by step guide {#step-by-step-textcat}

1. **Load the model** you want to start with, or create an **empty model** using
   [`spacy.blank`](/api/top-level#spacy.blank) with the ID of your language. If
   you're using an existing model, make sure to disable all other pipeline
   components during training using
   [`nlp.select_pipes`](/api/language#select_pipes). This way, you'll only be
   training the text classifier.
2. **Add the text classifier** to the pipeline, and add the labels you want to
   train – for example, `POSITIVE`.
3. **Load and pre-process the dataset**, shuffle the data and split off a part
   of it to hold back for evaluation. This way, you'll be able to see results on
   each training iteration.
4. **Loop over** the training examples and partition them into batches using
   spaCy's [`minibatch`](/api/top-level#util.minibatch) and
   [`compounding`](/api/top-level#util.compounding) helpers.
5. **Update the model** by calling [`nlp.update`](/api/language#update), which
   steps through the examples and makes a **prediction**. It then consults the
   annotations to see whether it was right. If it was wrong, it adjusts its
   weights so that the correct prediction will score higher next time.
6. Optionally, you can also **evaluate the text classifier** on each iteration,
   by checking how it performs on the development data held back from the
   dataset. This lets you print the **precision**, **recall** and **F-score**.
7. **Save** the trained model using [`nlp.to_disk`](/api/language#to_disk).
8. **Test** the model to make sure the text classifier works as expected.

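The evaluation in step 6 can be sketched in plain Python. The helper below is a hypothetical illustration rather than the code from `train_textcat.py`: it assumes a single binary label, gold labels given as booleans, predicted scores between 0 and 1 (in the style of the values in `Doc.cats`), and a cutoff of `0.5`.

```python
def evaluate_binary(gold, scores, threshold=0.5):
    """Compute precision, recall and F-score for one binary label.

    gold: list of bools (True if the label applies to the document).
    scores: list of floats between 0 and 1, one predicted score per document.
    threshold: assumed cutoff at or above which a score counts as positive.
    """
    tp = fp = fn = 0
    for is_positive, score in zip(gold, scores):
        predicted = score >= threshold
        if predicted and is_positive:
            tp += 1  # true positive
        elif predicted and not is_positive:
            fp += 1  # false positive
        elif not predicted and is_positive:
            fn += 1  # false negative
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        f_score = 0.0
    else:
        f_score = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f_score": f_score}


# Toy development set: gold labels and made-up predicted scores.
dev_gold = [True, True, False, False, True]
dev_scores = [0.9, 0.4, 0.7, 0.1, 0.8]
print(evaluate_binary(dev_gold, dev_scores))
```

In this toy set, two of the three predicted positives are correct and two of the three gold positives are found, so precision, recall and F-score all come out at two thirds.
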
## Entity linking {#entity-linker}

To train an entity linking model, you first need to define a knowledge base
(KB).

### Creating a knowledge base {#kb}

A KB consists of a list of entities with unique identifiers. Each such entity
has an entity vector that will be used to measure similarity with the context in
which an entity is used. These vectors have a fixed length and are stored in the
KB.

The following example shows how to build a knowledge base from scratch, given a
list of entities and potential aliases. The script requires an `nlp` model with
pretrained word vectors to obtain an encoding of an entity's description as its
vector.

```python
https://github.com/explosion/spaCy/tree/master/examples/training/create_kb.py
```

#### Step by step guide {#step-by-step-kb}

1. **Load the model** you want to start with. It should contain pretrained word
   vectors.
2. **Obtain the entity embeddings** by running the descriptions of the entities
   through the `nlp` model and taking the average of all words with
   `nlp(desc).vector`. At this point, a custom encoding step can also be used.
3. **Construct the KB** by defining all entities with their embeddings, and all
   aliases with their prior probabilities.
4. **Save** the KB using [`kb.dump`](/api/kb#dump).
5. **Print** the contents of the KB to make sure the entities were added
   correctly.

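To make step 2 concrete, here is a minimal sketch of what "taking the average of all words" means. It uses plain Python lists instead of a real `nlp` model, and the word vectors and the `average_vectors` helper are made up for illustration.

```python
def average_vectors(word_vectors):
    """Return the element-wise mean of a list of equal-length vectors."""
    if not word_vectors:
        raise ValueError("need at least one word vector")
    length = len(word_vectors[0])
    totals = [0.0] * length
    for vector in word_vectors:
        if len(vector) != length:
            raise ValueError("all vectors must have the same length")
        for i, value in enumerate(vector):
            totals[i] += value
    return [total / len(word_vectors) for total in totals]


# Hypothetical 3-dimensional word vectors for a two-word description.
toy_vectors = {"science": [1.0, 0.0, 2.0], "writer": [3.0, 2.0, 0.0]}
description = ["science", "writer"]
entity_vector = average_vectors([toy_vectors[word] for word in description])
print(entity_vector)  # [2.0, 1.0, 1.0]
```

Because every entity vector is an average of vectors with the same dimensionality, all entity vectors end up with the same fixed length, which is what the KB expects.
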
### Training an entity linking model {#entity-linker-model}

This example shows how to create an entity linker pipe using a previously
created knowledge base. The entity linker is then trained with a set of custom
examples. To do so, you need to provide **example texts**, and the **character
offsets** and **knowledge base identifiers** of each entity contained in the
texts.

```python
https://github.com/explosion/spaCy/tree/master/examples/training/train_entity_linker.py
```

#### Step by step guide {#step-by-step-entity-linker}

1. **Load the KB** you want to start with, and specify the path to the `Vocab`
   object that was used to create this KB. Then, create an **empty model** using
   [`spacy.blank`](/api/top-level#spacy.blank) with the ID of your language. Add
   a component for recognizing sentences and one for identifying relevant
   entities. In practical applications, you will want a more advanced pipeline
   that also includes a component for
   [named entity recognition](/usage/training#ner). Then, create a new entity
   linker component, add the KB to it, and then add the entity linker to the
   pipeline. If you're using a model with additional components, make sure to
   disable all other pipeline components during training using
   [`nlp.select_pipes`](/api/language#select_pipes). This way, you'll only be
   training the entity linker.
2. **Shuffle and loop over** the examples. For each example, **update the
   model** by calling [`nlp.update`](/api/language#update), which steps through
   the annotated examples of the input. For each combination of a mention in
   text and a potential KB identifier, the model makes a **prediction** whether
   or not this is the correct match. It then consults the annotations to see
   whether it was right. If it was wrong, it adjusts its weights so that the
   correct combination will score higher next time.
3. **Save** the trained model using [`nlp.to_disk`](/api/language#to_disk).
4. **Test** the model to make sure the entities in the training data are
   recognized correctly.

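The combinations of mention and KB identifier described in step 2 can be pictured with a small sketch. The annotation layout used here (character offsets mapped to candidate identifiers, with `1.0` marking the correct one) is an assumption made for illustration, and the text, the identifiers and the `iter_link_candidates` helper are all hypothetical. Check the example script for the format it actually expects.

```python
def iter_link_candidates(text, links):
    """Yield (mention, kb_id, is_correct) for every annotated candidate.

    links: dict mapping (start_char, end_char) offsets to a dict of
    {kb_id: 1.0 or 0.0}. This layout is an assumption for illustration.
    """
    for (start, end), candidates in sorted(links.items()):
        mention = text[start:end]  # character offsets select the mention
        for kb_id, value in sorted(candidates.items()):
            yield mention, kb_id, value == 1.0


# Hypothetical example text and made-up knowledge base identifiers.
text = "Douglas wrote a famous novel."
links = {(0, 7): {"Q_AUTHOR": 1.0, "Q_ACTOR": 0.0}}

for mention, kb_id, is_correct in iter_link_candidates(text, links):
    print(mention, kb_id, is_correct)
```

Each yielded triple corresponds to one decision the model has to make: is this KB identifier the right match for this mention or not?
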
## Optimization tips and advice {#tips}

There are lots of conflicting "recipes" for training deep neural networks at the
moment. The cutting-edge models take a very long time to train, so most
researchers can't run enough experiments to figure out what's _really_ going on.
For what it's worth, here's a recipe that seems to work well on a lot of NLP
problems:

1. Initialize with batch size 1, and compound to a maximum determined by your
   data size and problem type.
2. Use the Adam solver with a fixed learning rate.
3. Use averaged parameters.
4. Use L2 regularization.
5. Clip gradients by L2 norm to 1.
6. On small data sizes, start at a high dropout rate, with linear decay.

This recipe has been cobbled together experimentally. Here's why the various
elements of the recipe made enough sense to try initially, and what you might
try changing, depending on your problem.

### Compounding batch size {#tips-batch-size}

The trick of increasing the batch size is starting to become quite popular (see
[Smith et al., 2017](https://arxiv.org/abs/1711.00489)). Their recipe is quite
different from how spaCy's models are being trained, but there are some
similarities. In training the various spaCy models, we haven't found much
advantage from decaying the learning rate – but starting with a low batch size
has definitely helped. You should try it out on your data, and see how you go.
Here's our current strategy:

```python
### Batch heuristic
from spacy.util import compounding, minibatch

def get_batches(train_data, model_type):
    # Upper bound on the batch size for each component type
    max_batch_sizes = {"tagger": 32, "parser": 16, "ner": 16, "textcat": 64}
    max_batch_size = max_batch_sizes[model_type]
    # Halve the maximum for small datasets, and again for very small ones
    if len(train_data) < 1000:
        max_batch_size /= 2
    if len(train_data) < 500:
        max_batch_size /= 2
    # Start at 1 and grow by a factor of 1.001 per batch, up to the maximum
    batch_size = compounding(1, max_batch_size, 1.001)
    batches = minibatch(train_data, size=batch_size)
    return batches
```

This will set the batch size to start at `1`, and increase each batch until it
reaches a maximum size. The tagger, parser and entity recognizer all take whole
sentences as input, so they're learning a lot of labels in a single example. You
therefore need smaller batches for them. The batch size for the text categorizer
should be somewhat larger, especially if your documents are long.

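To see how a compounding schedule turns into batches, the following standalone sketch mimics the behavior described above without importing spaCy. `compounding_sketch` and `minibatch_sketch` are hypothetical stand-ins written for this illustration, not the library functions, and the growth factor is set to `2.0` only so that the effect is visible on ten items.

```python
import itertools


def compounding_sketch(start, stop, compound):
    """Yield start, start * compound, ... and never exceed stop."""
    current = float(start)
    while True:
        yield min(current, stop)
        current *= compound


def minibatch_sketch(items, sizes):
    """Split items into batches whose sizes are drawn from sizes."""
    items = iter(items)
    while True:
        # Fractional sizes are truncated to whole numbers of examples.
        batch = list(itertools.islice(items, int(next(sizes))))
        if not batch:
            break
        yield batch


# A large factor (2.0 instead of 1.001) makes the growth easy to see.
sizes = compounding_sketch(1, 4, 2.0)
batches = list(minibatch_sketch(range(10), sizes))
print(batches)  # [[0], [1, 2], [3, 4, 5, 6], [7, 8, 9]]
```

With a factor of `1.001`, the growth is much slower: under the same assumptions, it takes roughly 700 batches before the truncated size first reaches `2`.
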
### Learning rate, regularization and gradient clipping {#tips-hyperparams}

By default, spaCy uses the Adam solver with its default settings
(`learn_rate=0.001`, `beta1=0.9`, `beta2=0.999`). Some researchers have said
they found these settings terrible on their problems – but they've always
performed very well in training spaCy's models, in combination with the rest of
our recipe. You can change these settings directly by modifying the
corresponding attributes on the `optimizer` object. You can also set environment
variables to adjust the defaults.

There are two other key hyper-parameters of the solver: `L2` **regularization**,
and **gradient clipping** (`max_grad_norm`). Gradient clipping is a hack that's
rarely discussed, but that everybody seems to be using. It's quite important in
helping to ensure the network doesn't diverge, which is a fancy way of saying
"fall over during training". The effect is sort of similar to setting the
learning rate low. It can also compensate for a large batch size (this is a good
example of how the choices of all these hyper-parameters intersect).

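As a rough illustration of what clipping by L2 norm does, here is a minimal pure-Python sketch. It is not the optimizer's actual implementation. The `clip_by_l2_norm` helper is hypothetical, and its `max_norm` argument plays the role that `max_grad_norm` plays in the text above.

```python
import math


def clip_by_l2_norm(gradient, max_norm=1.0):
    """Rescale gradient so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(value * value for value in gradient))
    if norm <= max_norm:
        # Small gradients pass through unchanged.
        return list(gradient)
    # Large gradients keep their direction but are shrunk to max_norm.
    scale = max_norm / norm
    return [value * scale for value in gradient]


print(clip_by_l2_norm([0.3, 0.4]))  # norm 0.5: returned unchanged
print(clip_by_l2_norm([3.0, 4.0]))  # norm 5.0: scaled down to norm 1.0
```

Only the length of the update is capped, not its direction. This is why the effect resembles a lower learning rate, but only on the steps where the gradient is unusually large.
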
### Dropout rate {#tips-dropout}

For small datasets, it's useful to set a **high dropout rate at first**, and
**decay** it down towards a more reasonable value. This helps avoid the network
immediately overfitting, while still encouraging it to learn some of the more
interesting things in your data. spaCy comes with a
[`decaying`](/api/top-level#util.decaying) utility function to facilitate this.
You might try setting:

```python
from spacy.util import decaying
dropout = decaying(0.6, 0.2, 1e-4)
```

You can then draw values from the iterator with `next(dropout)`, which you would
pass to the `drop` keyword argument of [`nlp.update`](/api/language#update).
It's pretty much always a good idea to use at least **some dropout**. All of the
models currently use Bernoulli dropout, for no particularly principled reason –
we just haven't experimented with another scheme like Gaussian dropout yet.

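The schedule can be pictured with a standalone sketch. `decaying_sketch` below is a hypothetical stand-in, not the spaCy function, and it assumes the linear decay mentioned in the recipe: each draw subtracts a fixed step until the floor is reached. The step is set to `0.1` only so the decay shows up within a few draws.

```python
def decaying_sketch(start, stop, decay):
    """Yield values that fall linearly from start and never drop below stop."""
    current = float(start)
    while True:
        yield max(current, stop)
        current -= decay


# A large step (0.1 instead of 1e-4) makes the decay visible in a few draws.
dropout = decaying_sketch(0.6, 0.2, 0.1)
rates = [next(dropout) for _ in range(6)]
print([round(rate, 2) for rate in rates])  # [0.6, 0.5, 0.4, 0.3, 0.2, 0.2]
```

Under the same linear assumption, a step of `1e-4` would take about 4,000 draws to go from `0.6` down to `0.2`.
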
### Parameter averaging {#tips-param-avg}

The last part of our optimization recipe is **parameter averaging**, an old
trick introduced by
[Freund and Schapire (1999)](https://cseweb.ucsd.edu/~yfreund/papers/LargeMarginsUsingPerceptron.pdf),
popularized in the NLP community by
[Collins (2002)](http://www.aclweb.org/anthology/P04-1015), and explained in
more detail by [Leon Bottou](http://leon.bottou.org/projects/sgd). Just about
the only other people who seem to be using this for neural network training are
the SyntaxNet team (one of whom is Michael Collins) – but it really seems to
work great on every problem.

The trick is to store the moving average of the weights during training. We
don't optimize this average – we just track it. Then when we want to actually
use the model, we use the averages, not the most recent value. In spaCy (and
[Thinc](https://github.com/explosion/thinc)) this is done by using a context
manager, [`use_params`](/api/language#use_params), to temporarily replace the
weights:

```python
with nlp.use_params(optimizer.averages):
    nlp.to_disk("/model")
```

The context manager is handy because you naturally want to evaluate and save the
model at various points during training (e.g. after each epoch). After
evaluating and saving, the context manager will exit and the weights will be
restored, so you resume training from the most recent value, rather than the
average. By evaluating the model after each epoch, you can remove one
hyper-parameter from consideration (the number of epochs). Having one less magic
number to guess is extremely nice – so having the averaging under a context
manager is very convenient.
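
The swap-and-restore behavior can be illustrated with a toy example that doesn't depend on spaCy or Thinc. Everything here is an assumption made for the sketch: the `AveragedWeights` class is hypothetical, it uses plain gradient descent instead of Adam, and it tracks a simple running mean, which may differ from the averaging scheme the real optimizer uses.

```python
from contextlib import contextmanager


class AveragedWeights:
    """Toy parameter store that tracks a running mean of its weights."""

    def __init__(self, weights):
        self.weights = list(weights)
        self.averages = list(weights)
        self.n_updates = 0

    def update(self, gradient, learn_rate=0.5):
        # Plain gradient descent step, then fold the new weights into the mean.
        self.weights = [w - learn_rate * g for w, g in zip(self.weights, gradient)]
        self.n_updates += 1
        self.averages = [
            avg + (w - avg) / (self.n_updates + 1)
            for avg, w in zip(self.averages, self.weights)
        ]

    @contextmanager
    def use_averages(self):
        # Swap in the averages, then restore the latest weights on exit.
        backup = self.weights
        self.weights = list(self.averages)
        try:
            yield self
        finally:
            self.weights = backup


params = AveragedWeights([1.0])
params.update([2.0])  # weights become [0.0]
with params.use_averages():
    inside = list(params.weights)  # the mean of 1.0 and 0.0
after = list(params.weights)  # training resumes from the latest value
print(inside, after)  # [0.5] [0.0]
```

Inside the `with` block, anything that reads the weights (evaluation, saving) sees the averages. Once the block exits, training continues from the most recent weights, as described above.
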