Update docs [ci skip]

Ines Montani 2020-09-20 12:30:53 +02:00
parent e863b3dc14
commit 554c9a2497
5 changed files with 32 additions and 11 deletions

View File

@@ -8,7 +8,11 @@ train = ""
dev = ""
[system]
-gpu_allocator = {{ "pytorch" if use_transformer else "" }}
+{% if use_transformer -%}
+gpu_allocator = "pytorch"
+{% else -%}
+gpu_allocator = null
+{% endif %}
[nlp]
lang = "{{ lang }}"
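For reference, here is roughly what the updated template renders to in each case. This is a sketch derived from the conditional above, not output copied from a generated config:

```ini
# use_transformer = true
[system]
gpu_allocator = "pytorch"

# use_transformer = false
[system]
gpu_allocator = null
```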

View File

@@ -60,7 +60,6 @@ your config and check that it's valid, you can run the
> [nlp]
> lang = "en"
> pipeline = ["tagger", "parser", "ner"]
-> load_vocab_data = true
> before_creation = null
> after_creation = null
> after_pipeline_creation = null
@@ -77,7 +76,6 @@ Defines the `nlp` object, its tokenizer and
| `lang` | Pipeline language [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes). Defaults to `null`. ~~str~~ |
| `pipeline` | Names of pipeline components in order. Should correspond to sections in the `[components]` block, e.g. `[components.ner]`. See docs on [defining components](/usage/training#config-components). Defaults to `[]`. ~~List[str]~~ |
| `disabled` | Names of pipeline components that are loaded but disabled by default and not run as part of the pipeline. Should correspond to components listed in `pipeline`. After a pipeline is loaded, disabled components can be enabled using [`Language.enable_pipe`](/api/language#enable_pipe). ~~List[str]~~ |
-| `load_vocab_data` | Whether to load additional lexeme and vocab data from [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) if available. Defaults to `true`. ~~bool~~ |
| `before_creation` | Optional [callback](/usage/training#custom-code-nlp-callbacks) to modify `Language` subclass before it's initialized. Defaults to `null`. ~~Optional[Callable[[Type[Language]], Type[Language]]]~~ |
| `after_creation` | Optional [callback](/usage/training#custom-code-nlp-callbacks) to modify `nlp` object right after it's initialized. Defaults to `null`. ~~Optional[Callable[[Language], Language]]~~ |
| `after_pipeline_creation` | Optional [callback](/usage/training#custom-code-nlp-callbacks) to modify `nlp` object after the pipeline components have been added. Defaults to `null`. ~~Optional[Callable[[Language], Language]]~~ |
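The three callback settings above point to registered functions. As a minimal sketch of how one is referenced from the config, assuming a function registered in the `callbacks` registry under the hypothetical name `customize_language_data` (see the [custom callbacks docs](/usage/training#custom-code-nlp-callbacks) linked in the table for the full pattern):

```ini
[nlp.before_creation]
@callbacks = "customize_language_data"
```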
@@ -189,9 +187,10 @@ process that are used when you run [`spacy train`](/api/cli#train).
| `dev_corpus` | Dot notation of the config location defining the dev corpus. Defaults to `corpora.dev`. ~~str~~ |
| `dropout` | The dropout rate. Defaults to `0.1`. ~~float~~ |
| `eval_frequency` | How often to evaluate during training (steps). Defaults to `200`. ~~int~~ |
-| `gpu_allocator` | Library for cupy to route GPU memory allocation to. Can be "pytorch" or "tensorflow". Defaults to variable `${system.gpu_allocator}`. ~~str~~ |
| `frozen_components` | Pipeline component names that are "frozen" and shouldn't be updated during training. See [here](/usage/training#config-components) for details. Defaults to `[]`. ~~List[str]~~ |
+| `gpu_allocator` | Library for cupy to route GPU memory allocation to. Can be `"pytorch"` or `"tensorflow"`. Defaults to variable `${system.gpu_allocator}`. ~~str~~ |
| `init_tok2vec` | Optional path to pretrained tok2vec weights created with [`spacy pretrain`](/api/cli#pretrain). Defaults to variable `${paths.init_tok2vec}`. ~~Optional[str]~~ |
+| `lookups` | Additional lexeme and vocab data from [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data). Defaults to `null`. ~~Optional[Lookups]~~ |
| `max_epochs` | Maximum number of epochs to train for. Defaults to `0`. ~~int~~ |
| `max_steps` | Maximum number of update steps to train for. Defaults to `20000`. ~~int~~ |
| `optimizer` | The optimizer. The learning rate schedule and other settings can be configured as part of the optimizer. Defaults to [`Adam`](https://thinc.ai/docs/api-optimizers#adam). ~~Optimizer~~ |
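Taken together, a `[training]` block that only sets the defaults listed above would look roughly like the sketch below. The values are copied from the table; a config filled with [`init fill-config`](/api/cli#init-fill-config) will contain more settings than shown here:

```ini
[training]
dropout = 0.1
eval_frequency = 200
max_epochs = 0
max_steps = 20000
frozen_components = []
gpu_allocator = ${system.gpu_allocator}
init_tok2vec = ${paths.init_tok2vec}
lookups = null
```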
@@ -476,7 +475,7 @@ lexical data.
Here's an example of the 20 most frequent lexemes in the English training data:
```json
-%%GITHUB_SPACY / extra / example_data / vocab - data.jsonl
+%%GITHUB_SPACY/extra/example_data/vocab-data.jsonl
```
## Pipeline meta {#meta}

View File

@@ -458,6 +458,16 @@ remain in the config file stored on your local system.
| `project_name` | The name of the project in the Weights & Biases interface. The project will be created automatically if it doesn't exist yet. ~~str~~ |
| `remove_config_values` | A list of values to exclude from the config before it is uploaded to W&B (default: empty). ~~List[str]~~ |
+<Project id="integrations/wandb">
+Get started with tracking your spaCy training runs in Weights & Biases using our
+project template. It trains on the IMDB Movie Review Dataset and includes a
+simple config with the built-in `WandbLogger`, as well as a custom example of
+creating variants of the config for a simple hyperparameter grid search and
+logging the results.
+</Project>
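To connect the table above to a concrete config, here is a minimal sketch of enabling this logger in `[training.logger]`. The `spacy.WandbLogger.v1` string is the registered logger name; the project name and removed values are placeholder assumptions:

```ini
[training.logger]
@loggers = "spacy.WandbLogger.v1"
project_name = "my_spacy_project"
remove_config_values = ["paths.train", "paths.dev"]
```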
## Readers {#readers source="spacy/training/corpus.py" new="3"}
Corpus readers are registered functions that load data and return a function

View File

@@ -655,6 +655,16 @@ and pass in optional config overrides, like the path to the raw text file:
$ python -m spacy pretrain config_pretrain.cfg ./output --paths.raw text.jsonl
```
+The following defaults are used for the `[pretraining]` block and merged into
+your existing config when you run [`init config`](/api/cli#init-config) or
+[`init fill-config`](/api/cli#init-fill-config) with `--pretraining`. If needed,
+you can [configure](#pretraining-configure) the settings and hyperparameters or
+change the [objective](#pretraining-details).
+```ini
+%%GITHUB_SPACY/spacy/default_config_pretraining.cfg
+```
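As a hedged illustration of what the merge produces, the filled config gains a `[pretraining]` section that can then be edited like any other part of the config. The keys and values below are assumptions chosen for illustration; the authoritative defaults are in the file referenced above:

```ini
[pretraining]
# Illustrative values only; see default_config_pretraining.cfg for the real defaults
max_epochs = 1000
dropout = 0.2
```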
### How pretraining works {#pretraining-details}
The impact of [`spacy pretrain`](/api/cli#pretrain) varies, but it will usually

View File

@@ -976,14 +976,12 @@ your results.
![Screenshot: Parameter importance using config values](../images/wandb2.jpg 'Parameter importance using config values')
-<!-- TODO:
<Project id="integrations/wandb">
Get started with tracking your spaCy training runs in Weights & Biases using our
-project template. It includes a simple config using the `WandbLogger`, as well
-as a custom logger implementation you can adjust for your specific use case.
+project template. It trains on the IMDB Movie Review Dataset and includes a
+simple config with the built-in `WandbLogger`, as well as a custom example of
+creating variants of the config for a simple hyperparameter grid search and
+logging the results.
</Project>
--->