From a31e9e1cd5176651942b32a7db5d00186aad97dc Mon Sep 17 00:00:00 2001
From: Ines Montani
Date: Thu, 12 Sep 2019 15:32:39 +0200
Subject: [PATCH] Update training docs [ci skip]

---
 website/docs/usage/training.md | 141 +++++++++++++++++----------
 1 file changed, 71 insertions(+), 70 deletions(-)

diff --git a/website/docs/usage/training.md b/website/docs/usage/training.md
index e3386a64f..9489615bc 100644
--- a/website/docs/usage/training.md
+++ b/website/docs/usage/training.md
@@ -6,6 +6,7 @@ menu:
   - ['NER', 'ner']
   - ['Tagger & Parser', 'tagger-parser']
   - ['Text Classification', 'textcat']
+  - ['Entity Linking', 'entity-linker']
   - ['Tips and Advice', 'tips']
 ---
 
@@ -415,76 +416,6 @@ referred to as the "catastrophic forgetting" problem.
 4. **Save** the trained model using [`nlp.to_disk`](/api/language#to_disk).
 5. **Test** the model to make sure the new entity is recognized correctly.
 
-## Entity linking {#entity-linker}
-
-To train an entity linking model, you first need to define a knowledge base
-(KB).
-
-### Creating a knowledge base {#kb}
-
-A KB consists of a list of entities with unique identifiers. Each such entity
-has an entity vector that will be used to measure similarity with the context
-in which an entity is used. These vectors are pretrained and stored in the KB
-before the entity linking model will be trained.
-
-The following example shows how to build a knowledge base from scratch, given
-a list of entities and potential aliases. The script further demonstrates how
-to pretrain and store the entity vectors. To run this example, the script
-needs access to a `vocab` instance or an `nlp` model with pretrained word
-embeddings.
-
-```python
-https://github.com/explosion/spaCy/tree/master/examples/training/pretrain_kb.py
-```
-
-#### Step by step guide {#step-by-step-kb}
-
-1. **Load the model** you want to start with, or create an **empty model**
-   using [`spacy.blank`](/api/top-level#spacy.blank) with the ID of your
-   language and a pre-defined [`vocab`](/api/vocab) object.
-2. **Pretrain the entity embeddings** by running the descriptions of the
-   entities through a simple encoder-decoder network. The current
-   implementation requires the `nlp` model to have access to pre-trained word
-   embeddings, but a custom implementation of this enoding step can also be
-   used.
-3. **Construct the KB** by defining all entities with their pretrained
-   vectors, and all aliases with their prior probabilities.
-4. **Save** the KB using [`kb.dump`](/api/kb#dump).
-5. **Test** the KB to make sure the entities were added correctly.
-
-### Training an entity linking model {#entity-linker-model}
-
-This example shows how to create an entity linker pipe using a previously
-created knowledge base. The entity linker pipe is then trained with your own
-examples. To do so, you'll need to provide **example texts**, and the
-**character offsets** and **knowledge base identifiers** of each entity
-contained in the texts.
-
-```python
-https://github.com/explosion/spaCy/tree/master/examples/training/train_entity_linker.py
-```
-
-#### Step by step guide {#step-by-step-entity-linker}
-
-1. **Load the KB** you want to start with, and specify the path to the
-   `Vocab` object that was used to create this KB. Then, create an **empty
-   model** using [`spacy.blank`](/api/top-level#spacy.blank) with the ID of
-   your language. Don't forget to add the KB to the entity linker, and to add
-   the entity linker to the pipeline. In practical applications, you will
-   want a more advanced pipeline including also a component for
-   [named entity recognition](/usage/training#ner). If you're using a model
-   with additional components, make sure to disable all other pipeline
-   components during training using
-   [`nlp.disable_pipes`](/api/language#disable_pipes). This way, you'll only
-   be training the entity linker.
-2. **Shuffle and loop over** the examples. For each example, **update the
-   model** by calling [`nlp.update`](/api/language#update), which steps
-   through the annotated examples of the input. For each combination of a
-   mention in text and a potential KB identifier, the model makes a
-   **prediction** whether or not this is the correct match. It then consults
-   the annotations to see whether it was right. If it was wrong, it adjusts
-   its weights so that the correct combination will score higher next time.
-3. **Save** the trained model using [`nlp.to_disk`](/api/language#to_disk).
-4. **Test** the model to make sure the entities in the training data are
-   recognized correctly.
-
 ## Training the tagger and parser {#tagger-parser}
 
 ### Updating the Dependency Parser {#example-train-parser}
@@ -665,6 +596,76 @@ https://github.com/explosion/spaCy/tree/master/examples/training/train_textcat.p
 7. **Save** the trained model using [`nlp.to_disk`](/api/language#to_disk).
 8. **Test** the model to make sure the text classifier works as expected.
 
+## Entity linking {#entity-linker}
+
+To train an entity linking model, you first need to define a knowledge base
+(KB).
+
+### Creating a knowledge base {#kb}
+
+A KB consists of a list of entities with unique identifiers. Each such entity
+has an entity vector that will be used to measure similarity with the context
+in which the entity is used. These vectors are pretrained and stored in the KB
+before the entity linking model is trained.
+
+The following example shows how to build a knowledge base from scratch, given
+a list of entities and potential aliases. The script further demonstrates how
+to pretrain and store the entity vectors. To run this example, the script
+needs access to a `vocab` instance or an `nlp` model with pretrained word
+embeddings.
+
+```python
+https://github.com/explosion/spaCy/tree/master/examples/training/pretrain_kb.py
+```
+
+#### Step by step guide {#step-by-step-kb}
+
+1. **Load the model** you want to start with, or create an **empty model**
+   using [`spacy.blank`](/api/top-level#spacy.blank) with the ID of your
+   language and a pre-defined [`vocab`](/api/vocab) object.
+2. **Pretrain the entity embeddings** by running the descriptions of the
+   entities through a simple encoder-decoder network. The current
+   implementation requires the `nlp` model to have access to pretrained word
+   embeddings, but a custom implementation of this encoding step can also be
+   used.
+3. **Construct the KB** by defining all entities with their pretrained
+   vectors, and all aliases with their prior probabilities, as shown in the
+   sketch after this guide.
+4. **Save** the KB using [`kb.dump`](/api/kb#dump).
+5. **Test** the KB to make sure the entities were added correctly.
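+
+To illustrate steps 3 and 4, here's a minimal sketch of how a KB might be
+built with the [`KnowledgeBase`](/api/kb) API. It's not part of the example
+script above: the entity ID, frequency, alias and file path are placeholder
+values, and for simplicity the entity vector is taken straight from the
+model's word vectors instead of a pretrained encoder:
+
+```python
+import spacy
+from spacy.kb import KnowledgeBase
+
+# Load a model with pretrained word vectors
+nlp = spacy.load("en_core_web_lg")
+
+# The entity vector length has to match the vectors added below. Here we
+# reuse the 300-dimensional word vectors of the loaded model.
+kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=300)
+
+# Add an entity with its unique ID, corpus frequency and entity vector.
+# As a stand-in for a pretrained encoder, use the vector of a description.
+desc_vector = nlp("Douglas Adams, British author and humorist").vector
+kb.add_entity(entity="Q42", freq=42, entity_vector=desc_vector)
+
+# Add an alias with prior probabilities for its candidate entities
+kb.add_alias(alias="Douglas Adams", entities=["Q42"], probabilities=[0.9])
+
+# Save the KB to disk
+kb.dump("my_kb")
+```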
+
+### Training an entity linking model {#entity-linker-model}
+
+This example shows how to create an entity linker pipe using a previously
+created knowledge base. The entity linker pipe is then trained with your own
+examples. To do so, you'll need to provide **example texts**, and the
+**character offsets** and **knowledge base identifiers** of each entity
+contained in the texts.
+
+```python
+https://github.com/explosion/spaCy/tree/master/examples/training/train_entity_linker.py
+```
+
+#### Step by step guide {#step-by-step-entity-linker}
+
+1. **Load the KB** you want to start with, and specify the path to the
+   `Vocab` object that was used to create this KB. Then, create an **empty
+   model** using [`spacy.blank`](/api/top-level#spacy.blank) with the ID of
+   your language. Don't forget to add the KB to the entity linker, and to
+   add the entity linker to the pipeline. In practical applications, you
+   will want a more advanced pipeline that also includes a component for
+   [named entity recognition](/usage/training#ner). If you're using a model
+   with additional components, make sure to disable all other pipeline
+   components during training using
+   [`nlp.disable_pipes`](/api/language#disable_pipes). This way, you'll only
+   be training the entity linker.
+2. **Shuffle and loop over** the examples. For each example, **update the
+   model** by calling [`nlp.update`](/api/language#update), which steps
+   through the annotated examples of the input. For each combination of a
+   mention in text and a potential KB identifier, the model makes a
+   **prediction** about whether or not this is the correct match. It then
+   consults the annotations to see whether it was right. If it was wrong, it
+   adjusts its weights so that the correct combination will score higher
+   next time. A condensed version of this loop is sketched after this guide.
+3. **Save** the trained model using [`nlp.to_disk`](/api/language#to_disk).
+4. **Test** the model to make sure the entities in the training data are
+   recognized correctly.
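+
+For illustration, the guide above might be condensed into the following
+sketch. The vocab path, KB path, training data and hyperparameters are all
+placeholder values, and the full example script linked above shows the
+complete setup, including data preparation:
+
+```python
+import random
+import spacy
+from spacy.kb import KnowledgeBase
+from spacy.util import minibatch, compounding
+from spacy.vocab import Vocab
+
+# Placeholder training data: character offsets mapped to KB identifiers
+TRAIN_DATA = [
+    (
+        "Douglas Adams wrote The Hitchhiker's Guide to the Galaxy.",
+        {"links": {(0, 13): {"Q42": 1.0}}},
+    ),
+]
+
+# Create an empty model with the vocab that was used to create the KB,
+# then load the KB ("my_vocab" and "my_kb" are placeholder paths)
+vocab = Vocab().from_disk("my_vocab")
+nlp = spacy.blank("en", vocab=vocab)
+kb = KnowledgeBase(vocab=nlp.vocab)
+kb.load_bulk("my_kb")
+
+# Create the entity linker, give it the KB and add it to the pipeline
+entity_linker = nlp.create_pipe("entity_linker")
+entity_linker.set_kb(kb)
+nlp.add_pipe(entity_linker, last=True)
+
+# Make sure we're only training the entity linker
+other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "entity_linker"]
+with nlp.disable_pipes(*other_pipes):
+    optimizer = nlp.begin_training()
+    for itn in range(10):
+        random.shuffle(TRAIN_DATA)
+        losses = {}
+        for batch in minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001)):
+            texts, annotations = zip(*batch)
+            nlp.update(texts, annotations, drop=0.2, sgd=optimizer,
+                       losses=losses)
+        print(itn, losses)
+
+nlp.to_disk("/path/to/model")
+```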
+
 ## Optimization tips and advice {#tips}
 
 There are lots of conflicting "recipes" for training deep neural networks at the