mirror of https://github.com/explosion/spaCy.git
Update docs [ci skip]
This commit is contained in: parent 6efb7688a6, commit a0b4389a38

## Pretraining {#pretraining}

The [`spacy pretrain`](/api/cli#pretrain) command lets you initialize your
models with **information from raw text**. Without pretraining, the models for
your components will usually be initialized randomly. The idea behind
pretraining is simple: random probably isn't optimal, so if we have some text to
learn from, we can probably find a way to get the model off to a better start.

Pretraining uses the same [`config.cfg`](/usage/training#config) file as the
regular training, which helps keep the settings and hyperparameters consistent.
The additional `[pretraining]` section has several configuration subsections
that are familiar from the training block: the `[pretraining.batcher]`,
`[pretraining.optimizer]` and `[pretraining.corpus]` all work the same way and
expect the same types of objects, although for pretraining your corpus does not
need to have any annotations, so you will often use a different reader, such as
the [`JsonlReader`](/api/toplevel#jsonlreader).

> #### Raw text format
>
> The raw text can be provided in spaCy's
> [binary `.spacy` format](/api/data-formats#training) consisting of serialized
> `Doc` objects, or as JSONL (newline-delimited JSON) with a key `"text"` per
> entry. This allows the data to be read in line by line, while also allowing
> you to include newlines in the texts.
>
> ```json
> {"text": "Can I ask where you work now and what you do, and if you enjoy it?"}
> {"text": "They may just pull out of the Seattle market completely, at least until they have autonomous vehicles."}
> ```
>
> You can also use your own custom corpus loader instead.
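
If you want to generate the JSONL file programmatically, here is a minimal
sketch using [`srsly`](https://github.com/explosion/srsly), the serialization
library spaCy itself uses; the file name and example texts are placeholders:

```python
import srsly

# Raw, unannotated texts to pretrain on (placeholders).
texts = [
    "Can I ask where you work now and what you do, and if you enjoy it?",
    "They may just pull out of the Seattle market completely.",
]

# Write one {"text": ...} object per line, matching the format shown above.
srsly.write_jsonl("text.jsonl", ({"text": text} for text in texts))
```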

You can add a `[pretraining]` block to your config by setting the
`--pretraining` flag on [`init config`](/api/cli#init-config) or
[`init fill-config`](/api/cli#init-fill-config):

```cli
$ python -m spacy init fill-config config.cfg config_pretrain.cfg --pretraining
```

You can then run [`spacy pretrain`](/api/cli#pretrain) with the updated config
and pass in optional config overrides, like the path to the raw text file:

```cli
$ python -m spacy pretrain config_pretrain.cfg ./output --paths.raw text.jsonl
```

### How pretraining works {#pretraining-details}

The impact of [`spacy pretrain`](/api/cli#pretrain) varies, but it will usually
be worth trying if you're **not using a transformer** model and you have
**relatively little training data** (for instance, fewer than 5,000 sentences).
A good rule of thumb is that pretraining will generally give you a similar
accuracy improvement to using word vectors in your model. If word vectors have
given you a 10% error reduction, pretraining with spaCy might give you another
10%, for a 20% error reduction in total.
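
To make the rule of thumb concrete, here is a small illustration with made-up
numbers, showing how two successive 10% error reductions compound to roughly a
20% total reduction:

```python
baseline_error = 10.0                   # e.g. 10% error without vectors or pretraining
with_vectors = baseline_error * 0.90    # 10% relative error reduction -> 9.0%
with_pretraining = with_vectors * 0.90  # another 10% on top -> 8.1%

total_reduction = 1 - with_pretraining / baseline_error
print(f"{total_reduction:.0%}")         # 19%, i.e. roughly a 20% total error reduction
```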

The [`spacy pretrain`](/api/cli#pretrain) command will take a **specific
subnetwork** within one of your components, and add additional layers to build a
network for a temporary task that forces the model to learn something about
sentence structure and word cooccurrence statistics. Pretraining produces a
**binary weights file** that can be loaded back in at the start of training. The
weights file specifies an initial set of weights. Training then proceeds as
normal.

You can only pretrain one subnetwork from your pipeline at a time, and the
subnetwork must be typed ~~Model[List[Doc], List[Floats2d]]~~ (i.e. it has to be
a "tok2vec" layer). The most common workflow is to use the
[`Tok2Vec`](/api/tok2vec) component to create a shared token-to-vector layer for
several components of your pipeline, and apply pretraining to its whole model.
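
As a quick sanity check that a layer has the right signature, you can run it
directly and confirm it maps a list of `Doc` objects to one per-token array per
doc. A minimal sketch, assuming a trained pipeline such as `en_core_web_sm` with
a `tok2vec` component:

```python
import spacy

nlp = spacy.load("en_core_web_sm")     # any pipeline with a "tok2vec" component
layer = nlp.get_pipe("tok2vec").model  # the subnetwork you would pretrain

docs = [nlp.make_doc("This is a sentence.")]
outputs = layer.predict(docs)          # List[Floats2d]: one array per doc
print(outputs[0].shape)                # (number of tokens, layer width)
```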

#### Configuring the pretraining {#pretraining-configure}

The [`spacy pretrain`](/api/cli#pretrain) command is configured using the
`[pretraining]` section of your [config file](/usage/training#config). The
`component` and `layer` settings tell spaCy how to **find the subnetwork** to
pretrain. The `layer` setting should be either the empty string (to use the
whole model), or a
[node reference](https://thinc.ai/docs/usage-models#model-state). Most of
spaCy's built-in model architectures have a reference named `"tok2vec"` that
will refer to the right layer.

```ini
### config.cfg
# 1. Use the whole model of the "tok2vec" component
[pretraining]
component = "tok2vec"
layer = ""

# 2. Pretrain the "tok2vec" node of the "textcat" component
[pretraining]
component = "textcat"
layer = "tok2vec"
```
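
If you're not sure whether a component exposes a `"tok2vec"` node reference, you
can inspect its model on the loaded pipeline. A minimal sketch, assuming a
pipeline directory `./my_pipeline` (a placeholder) that contains a `textcat`
component:

```python
import spacy

nlp = spacy.load("./my_pipeline")
model = nlp.get_pipe("textcat").model

# Thinc models list their named node references; "tok2vec" is the node that
# `component = "textcat"` and `layer = "tok2vec"` would point to.
print(model.ref_names)
print(model.get_ref("tok2vec"))
```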

#### Pretraining objectives {#pretraining-objectives}

Two pretraining objectives are available, both of which are variants of the
cloze task [Devlin et al. (2018)](https://arxiv.org/abs/1810.04805) introduced
for BERT. The objective can be defined and configured via the
`[pretraining.objective]` config block.

> ```ini
> ### Characters objective
> [pretraining.objective]
> type = "characters"
> n_characters = 4
> ```
>
> ```ini
> ### Vectors objective
> [pretraining.objective]
> type = "vectors"
> loss = "cosine"
> ```
- **Characters:** The `"characters"` objective asks the model to predict some
|
||||||
|
number of leading and trailing UTF-8 bytes for the words. For instance,
|
||||||
|
setting `n_characters = 2`, the model will try to predict the first two and
|
||||||
|
last two characters of the word.
|
||||||
|
|
||||||
|
- **Vectors:** The `"vectors"` objective asks the model to predict the word's
|
||||||
|
vector, from a static embeddings table. This requires a word vectors model to
|
||||||
|
be trained and loaded. The vectors objective can optimize either a cosine or
|
||||||
|
an L2 loss. We've generally found cosine loss to perform better.
|
||||||
|
|
||||||
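
A minimal sketch of the targets the characters objective asks the model to
predict, in plain Python for illustration only (spaCy's actual implementation
works on UTF-8 bytes rather than Python characters):

```python
def character_targets(word: str, n_characters: int = 2):
    # For each word, the target is its first and last n characters.
    return word[:n_characters], word[-n_characters:]

print(character_targets("completely"))  # ('co', 'ly')
print(character_targets("Seattle"))     # ('Se', 'le')
```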

These pretraining objectives use a trick that we term **language modelling with
approximate outputs (LMAO)**. The motivation for the trick is that predicting an
exact word ID introduces a lot of incidental complexity. You need a large output
layer, and even then, the vocabulary is too large, which motivates tokenization
schemes that do not align to actual word boundaries. At the end of training, the
output layer will be thrown away regardless: we just want a task that forces the
network to model something about word cooccurrence statistics. Predicting
leading and trailing characters does that more than adequately, as the exact
word sequence could be recovered with high accuracy if the initial and trailing
characters are predicted accurately. With the vectors objective, the pretraining
uses the embedding space learned by an algorithm such as
[GloVe](https://nlp.stanford.edu/projects/glove/) or
[Word2vec](https://code.google.com/archive/p/word2vec/), allowing the model to
focus on the contextual modelling we actually care about.
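
For the vectors objective, `loss = "cosine"` optimizes something along the lines
of one minus the cosine similarity between the predicted vector and the word's
static vector. A small illustration with made-up vectors, not spaCy's internal
implementation:

```python
import numpy as np

def cosine_loss(predicted: np.ndarray, target: np.ndarray) -> float:
    # 1 - cos(predicted, target): 0.0 when the vectors point the same way.
    cos = predicted @ target / (np.linalg.norm(predicted) * np.linalg.norm(target))
    return 1.0 - cos

predicted = np.array([0.2, 0.1, 0.7])  # model output for a word (made up)
static = np.array([0.25, 0.05, 0.65])  # the word's row in the vectors table (made up)
print(round(cosine_loss(predicted, static), 4))
```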