Update training docs [ci skip]

This commit is contained in:
Ines Montani 2019-09-12 15:32:39 +02:00
parent b544dcb3c5
commit a31e9e1cd5
1 changed file with 71 additions and 70 deletions


@@ -6,6 +6,7 @@ menu:
- ['NER', 'ner']
- ['Tagger & Parser', 'tagger-parser']
- ['Text Classification', 'textcat']
- ['Entity Linking', 'entity-linker']
- ['Tips and Advice', 'tips']
---
@@ -415,76 +416,6 @@ referred to as the "catastrophic forgetting" problem.
4. **Save** the trained model using [`nlp.to_disk`](/api/language#to_disk).
5. **Test** the model to make sure the new entity is recognized correctly.
## Training the tagger and parser {#tagger-parser}
### Updating the Dependency Parser {#example-train-parser}
@@ -665,6 +596,76 @@ https://github.com/explosion/spaCy/tree/master/examples/training/train_textcat.py
7. **Save** the trained model using [`nlp.to_disk`](/api/language#to_disk).
8. **Test** the model to make sure the text classifier works as expected.
## Entity linking {#entity-linker}
To train an entity linking model, you first need to define a knowledge base
(KB).
### Creating a knowledge base {#kb}
A KB consists of a list of entities with unique identifiers. Each such entity
has an entity vector that will be used to measure similarity with the context in
which an entity is used. These vectors are pretrained and stored in the KB
before the entity linking model is trained.
The following example shows how to build a knowledge base from scratch, given a
list of entities and potential aliases. The script further demonstrates how to
pretrain and store the entity vectors. To run this example, the script needs
access to a `vocab` instance or an `nlp` model with pretrained word embeddings.
```python
https://github.com/explosion/spaCy/tree/master/examples/training/pretrain_kb.py
```
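In condensed form, the core KB construction calls look roughly like this. This
is a minimal sketch, not the full script: the entity IDs, frequencies, aliases,
vectors and the vector length are made-up placeholders, and a real setup would
pretrain the entity vectors from the entity descriptions, as the script above
does.

```python
import spacy
from spacy.kb import KnowledgeBase

# Any model with pretrained word vectors will do; "en_core_web_md" is one example
nlp = spacy.load("en_core_web_md")

# The entity vector length must match the vectors stored for each entity
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=3)

# Each entity gets a unique ID, a frequency and a pretrained entity vector
kb.add_entity(entity="Q1004791", freq=6, entity_vector=[0.0, 3.0, 5.0])
kb.add_entity(entity="Q42", freq=342, entity_vector=[1.0, 9.0, -3.0])

# Each alias maps to candidate entities with prior probabilities P(entity | alias)
kb.add_alias(alias="Douglas", entities=["Q1004791", "Q42"], probabilities=[0.6, 0.3])
kb.add_alias(alias="Douglas Adams", entities=["Q42"], probabilities=[0.9])
```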
#### Step by step guide {#step-by-step-kb}
1. **Load the model** you want to start with, or create an **empty model** using
[`spacy.blank`](/api/top-level#spacy.blank) with the ID of your language and
a pre-defined [`vocab`](/api/vocab) object.
2. **Pretrain the entity embeddings** by running the descriptions of the
entities through a simple encoder-decoder network. The current implementation
requires the `nlp` model to have access to pretrained word embeddings, but a
custom implementation of this encoding step can also be used.
3. **Construct the KB** by defining all entities with their pretrained vectors,
and all aliases with their prior probabilities.
4. **Save** the KB using [`kb.dump`](/api/kb#dump).
5. **Test** the KB to make sure the entities were added correctly, for example
by checking the candidates it generates for an alias (see the sketch after
this list).
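Continuing from the sketch above, saving, reloading and spot-checking the KB
might look like this. The paths are placeholders, and the
`entity_vector_length` passed on reload is assumed to match the one the KB was
created with:

```python
from spacy.kb import KnowledgeBase
from spacy.vocab import Vocab

# Save the KB along with the vocab it was built on -- both are needed to reload it
kb.dump("my_kb")
nlp.vocab.to_disk("my_vocab")

# Reload the KB against the same vocab
vocab = Vocab().from_disk("my_vocab")
kb_reloaded = KnowledgeBase(vocab=vocab, entity_vector_length=3)
kb_reloaded.load_bulk("my_kb")

# Spot-check: each candidate pairs an entity ID with its prior probability
for candidate in kb_reloaded.get_candidates("Douglas"):
    print(candidate.entity_, candidate.prior_prob)
```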
### Training an entity linking model {#entity-linker-model}
This example shows how to create an entity linker pipe using a previously
created knowledge base. The entity linker pipe is then trained with your own
examples. To do so, you'll need to provide **example texts**, and the
**character offsets** and **knowledge base identifiers** of each entity
contained in the texts.
```python
https://github.com/explosion/spaCy/tree/master/examples/training/train_entity_linker.py
```
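The training data pairs each text with a `"links"` annotation: the character
offsets of each mention, mapped to one or more KB identifiers with a gold
probability of 1.0 for the correct entity and 0.0 for other candidates. A
sketch of the expected shape, reusing the invented entities from the sketches
above:

```python
TRAIN_DATA = [
    (
        "Douglas founded a company.",
        # (start_char, end_char) of the mention "Douglas", mapped to KB IDs
        # with 1.0 for the correct entity and 0.0 for the other candidates
        {"links": {(0, 7): {"Q1004791": 1.0, "Q42": 0.0}}},
    ),
    (
        "Douglas wrote a famous book.",
        {"links": {(0, 7): {"Q1004791": 0.0, "Q42": 1.0}}},
    ),
]
```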
#### Step by step guide {#step-by-step-entity-linker}
1. **Load the KB** you want to start with, and specify the path to the `Vocab`
object that was used to create this KB. Then, create an **empty model** using
[`spacy.blank`](/api/top-level#spacy.blank) with the ID of your language.
Don't forget to add the KB to the entity linker, and to add the entity linker
to the pipeline (a condensed sketch of these steps follows this list). In
practical applications, you will want a more advanced pipeline that also
includes a component for [named entity recognition](/usage/training#ner). If
you're using a model with additional components, make sure to disable all
other pipeline components during training using
[`nlp.disable_pipes`](/api/language#disable_pipes). This way, you'll only be
training the entity linker.
2. **Shuffle and loop over** the examples. For each example, **update the
model** by calling [`nlp.update`](/api/language#update), which steps through
the annotated examples of the input. For each combination of a mention in the
text and a potential KB identifier, the model makes a **prediction** of
whether or not this is the correct match. It then consults the annotations to
see whether it was right. If it was wrong, it adjusts its weights so that the
correct combination will score higher next time.
3. **Save** the trained model using [`nlp.to_disk`](/api/language#to_disk).
4. **Test** the model to make sure the entities in the training data are
linked to the correct KB identifiers.
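Putting these steps together, a condensed version of the training loop might
look like the following. This is a sketch under assumptions, not the full
script: the paths, the vector length, the iteration count and batch size are
placeholders, and `TRAIN_DATA` is expected in the format shown above.

```python
import random

import spacy
from spacy.kb import KnowledgeBase
from spacy.util import minibatch
from spacy.vocab import Vocab

# Load the vocab the KB was created with, then the KB itself
vocab = Vocab().from_disk("my_vocab")
kb = KnowledgeBase(vocab=vocab, entity_vector_length=3)
kb.load_bulk("my_kb")

# Empty model sharing that vocab, with an entity linker wired up to the KB
nlp = spacy.blank("en", vocab=vocab)
entity_linker = nlp.create_pipe("entity_linker")
entity_linker.set_kb(kb)
nlp.add_pipe(entity_linker, last=True)

# Only train the entity linker: disable everything else during the updates
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "entity_linker"]
with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.begin_training()
    for itn in range(20):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for batch in minibatch(TRAIN_DATA, size=4):
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, drop=0.2, sgd=optimizer, losses=losses)
        print(itn, losses)

# Save the trained model so it can be reloaded with spacy.load
nlp.to_disk("my_el_model")
```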
## Optimization tips and advice {#tips}
There are lots of conflicting "recipes" for training deep neural networks at the