mirror of https://github.com/explosion/spaCy.git
Update docs [ci skip]
This commit is contained in: parent 6efb7688a6, commit a0b4389a38

## Pretraining {#pretraining}

The [`spacy pretrain`](/api/cli#pretrain) command lets you initialize your
models with **information from raw text**. Without pretraining, the models for
your components will usually be initialized randomly. The idea behind
pretraining is simple: random probably isn't optimal, so if we have some text to
learn from, we can probably find a way to get the model off to a better start.

Pretraining uses the same [`config.cfg`](/usage/training#config) file as the
regular training, which helps keep the settings and hyperparameters consistent.
The additional `[pretraining]` section has several configuration subsections
that are familiar from the training block: the `[pretraining.batcher]`,
`[pretraining.optimizer]` and `[pretraining.corpus]` all work the same way and
expect the same types of objects, although for pretraining your corpus does not
need to have any annotations, so you will often use a different reader, such as
the [`JsonlReader`](/api/toplevel#jsonlreader).

> #### Raw text format
>
> The raw text can be provided in spaCy's
> [binary `.spacy` format](/api/data-formats#training) consisting of serialized
> `Doc` objects, or as JSONL (newline-delimited JSON) with a key `"text"` per
> entry. This allows the data to be read in line by line, while also allowing
> you to include newlines in the texts.
>
> ```json
> {"text": "Can I ask where you work now and what you do, and if you enjoy it?"}
> {"text": "They may just pull out of the Seattle market completely, at least until they have autonomous vehicles."}
> ```
>
> You can also use your own custom corpus loader instead.
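
If you want to generate the JSONL file programmatically, here is a minimal
sketch using [`srsly`](https://github.com/explosion/srsly), the serialization
library spaCy itself uses; the file name and example texts are placeholders:

```python
import srsly

# Raw, unannotated texts to pretrain on (placeholders).
texts = [
    "Can I ask where you work now and what you do, and if you enjoy it?",
    "They may just pull out of the Seattle market completely.",
]

# Write one {"text": ...} object per line, matching the format shown above.
srsly.write_jsonl("text.jsonl", ({"text": text} for text in texts))
```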

You can add a `[pretraining]` block to your config by setting the
`--pretraining` flag on [`init config`](/api/cli#init-config) or
[`init fill-config`](/api/cli#init-fill-config):

```cli
$ python -m spacy init fill-config config.cfg config_pretrain.cfg --pretraining
```

You can then run [`spacy pretrain`](/api/cli#pretrain) with the updated config
and pass in optional config overrides, like the path to the raw text file:

```cli
$ python -m spacy pretrain config_pretrain.cfg ./output --paths.raw text.jsonl
```

### How pretraining works {#pretraining-details}

The impact of [`spacy pretrain`](/api/cli#pretrain) varies, but it will usually
be worth trying if you're **not using a transformer** model and you have
**relatively little training data** (for instance, fewer than 5,000 sentences).
A good rule of thumb is that pretraining will generally give you a similar
accuracy improvement to using word vectors in your model. If word vectors have
given you a 10% error reduction, pretraining with spaCy might give you another
10%, for a 20% error reduction in total.
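
To make the rule of thumb concrete, here is a small illustration with made-up
numbers, showing how two successive 10% error reductions compound to roughly a
20% total reduction:

```python
baseline_error = 10.0                   # e.g. 10% error without vectors or pretraining
with_vectors = baseline_error * 0.90    # 10% relative error reduction -> 9.0%
with_pretraining = with_vectors * 0.90  # another 10% on top -> 8.1%

total_reduction = 1 - with_pretraining / baseline_error
print(f"{total_reduction:.0%}")         # 19%, i.e. roughly a 20% total error reduction
```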

The [`spacy pretrain`](/api/cli#pretrain) command will take a **specific
subnetwork** within one of your components, and add additional layers to build a
network for a temporary task that forces the model to learn something about
sentence structure and word cooccurrence statistics. Pretraining produces a
**binary weights file** that can be loaded back in at the start of training. The
weights file specifies an initial set of weights. Training then proceeds as
normal.

You can only pretrain one subnetwork from your pipeline at a time, and the
subnetwork must be typed ~~Model[List[Doc], List[Floats2d]]~~ (i.e. it has to be
a "tok2vec" layer). The most common workflow is to use the
[`Tok2Vec`](/api/tok2vec) component to create a shared token-to-vector layer for
several components of your pipeline, and apply pretraining to its whole model.
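
As a quick sanity check that a layer has the right signature, you can run it
directly and confirm it maps a list of `Doc` objects to one per-token array per
doc. A minimal sketch, assuming a trained pipeline such as `en_core_web_sm` with
a `tok2vec` component:

```python
import spacy

nlp = spacy.load("en_core_web_sm")     # any pipeline with a "tok2vec" component
layer = nlp.get_pipe("tok2vec").model  # the subnetwork you would pretrain

docs = [nlp.make_doc("This is a sentence.")]
outputs = layer.predict(docs)          # List[Floats2d]: one array per doc
print(outputs[0].shape)                # (number of tokens, layer width)
```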

#### Configuring the pretraining {#pretraining-configure}

The [`spacy pretrain`](/api/cli#pretrain) command is configured using the
`[pretraining]` section of your [config file](/usage/training#config). The
`component` and `layer` settings tell spaCy how to **find the subnetwork** to
pretrain. The `layer` setting should be either the empty string (to use the
whole model), or a
[node reference](https://thinc.ai/docs/usage-models#model-state). Most of
spaCy's built-in model architectures have a reference named `"tok2vec"` that
will refer to the right layer.

```ini
### config.cfg
# 1. Use the whole model of the "tok2vec" component
[pretraining]
component = "tok2vec"
layer = ""

# 2. Pretrain the "tok2vec" node of the "textcat" component
[pretraining]
component = "textcat"
layer = "tok2vec"
```
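
If you're not sure whether a component exposes a `"tok2vec"` node reference, you
can inspect its model on the loaded pipeline. A minimal sketch, assuming a
pipeline directory `./my_pipeline` (a placeholder) that contains a `textcat`
component:

```python
import spacy

nlp = spacy.load("./my_pipeline")
model = nlp.get_pipe("textcat").model

# Thinc models list their named node references; "tok2vec" is the node that
# `component = "textcat"` and `layer = "tok2vec"` would point to.
print(model.ref_names)
print(model.get_ref("tok2vec"))
```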

#### Pretraining objectives {#pretraining-objectives}

Two pretraining objectives are available, both of which are variants of the
cloze task [Devlin et al. (2018)](https://arxiv.org/abs/1810.04805) introduced
for BERT. The objective can be defined and configured via the
`[pretraining.objective]` config block.

> ```ini
> ### Characters objective
> [pretraining.objective]
> type = "characters"
> n_characters = 4
> ```
>
> ```ini
> ### Vectors objective
> [pretraining.objective]
> type = "vectors"
> loss = "cosine"
> ```
- **Characters:** The `"characters"` objective asks the model to predict some
|
||||||
|
number of leading and trailing UTF-8 bytes for the words. For instance,
|
||||||
|
setting `n_characters = 2`, the model will try to predict the first two and
|
||||||
|
last two characters of the word.
|
||||||
|
|
||||||
|
- **Vectors:** The `"vectors"` objective asks the model to predict the word's
|
||||||
|
vector, from a static embeddings table. This requires a word vectors model to
|
||||||
|
be trained and loaded. The vectors objective can optimize either a cosine or
|
||||||
|
an L2 loss. We've generally found cosine loss to perform better.
|
||||||
|
|
||||||
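
A minimal sketch of the targets the characters objective asks the model to
predict, in plain Python for illustration only (spaCy's actual implementation
works on UTF-8 bytes rather than Python characters):

```python
def character_targets(word: str, n_characters: int = 2):
    # For each word, the target is its first and last n characters.
    return word[:n_characters], word[-n_characters:]

print(character_targets("completely"))  # ('co', 'ly')
print(character_targets("Seattle"))     # ('Se', 'le')
```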

These pretraining objectives use a trick that we term **language modelling with
approximate outputs (LMAO)**. The motivation for the trick is that predicting an
exact word ID introduces a lot of incidental complexity. You need a large output
layer, and even then, the vocabulary is too large, which motivates tokenization
schemes that do not align to actual word boundaries. At the end of training, the
output layer will be thrown away regardless: we just want a task that forces the
network to model something about word cooccurrence statistics. Predicting
leading and trailing characters does that more than adequately, as the exact
word sequence could be recovered with high accuracy if the initial and trailing
characters are predicted accurately. With the vectors objective, the pretraining
uses the embedding space learned by an algorithm such as
[GloVe](https://nlp.stanford.edu/projects/glove/) or
[Word2vec](https://code.google.com/archive/p/word2vec/), allowing the model to
focus on the contextual modelling we actually care about.
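
For the vectors objective, `loss = "cosine"` optimizes something along the lines
of one minus the cosine similarity between the predicted vector and the word's
static vector. A small illustration with made-up vectors, not spaCy's internal
implementation:

```python
import numpy as np

def cosine_loss(predicted: np.ndarray, target: np.ndarray) -> float:
    # 1 - cos(predicted, target): 0.0 when the vectors point the same way.
    cos = predicted @ target / (np.linalg.norm(predicted) * np.linalg.norm(target))
    return 1.0 - cos

predicted = np.array([0.2, 0.1, 0.7])  # model output for a word (made up)
static = np.array([0.25, 0.05, 0.65])  # the word's row in the vectors table (made up)
print(round(cosine_loss(predicted, static), 4))
```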