Update docs [ci skip]

Ines Montani 2020-10-02 11:38:03 +02:00
parent c41a4332e4
commit 32cdc1c4f4
2 changed files with 129 additions and 2 deletions

@ -685,7 +685,11 @@ sequences in the batch.
## Augmenters {#augmenters source="spacy/training/augment.py" new="3"}
<!-- TODO: intro, explain data augmentation concept -->
Data augmentation is the process of applying small modifications to the training
data. It can be especially useful for punctuation and case replacement: for
example, if your corpus only uses smart quotes and you want to include
variations using regular quotes, or if you want to make the model less sensitive
to capitalization by including a mix of capitalized and lowercase examples. See
the [usage guide](/usage/training#data-augmentation) for details and examples.
### spacy.orth_variants.v1 {#orth_variants tag="registered function"}

@ -1011,9 +1011,132 @@ def filter_batch(size: int) -> Callable[[Iterable[Example]], Iterator[List[Examp
<!-- TODO:
* Custom corpus class
* Minibatching
* Data augmentation
-->
### Data augmentation {#data-augmentation}
Data augmentation is the process of applying small **modifications** to the
training data. It can be especially useful for punctuation and case replacement:
for example, if your corpus only uses smart quotes and you want to include
variations using regular quotes, or if you want to make the model less sensitive
to capitalization by including a mix of capitalized and lowercase examples.

The easiest way to use data augmentation during training is to provide an
`augmenter` to the training corpus, e.g. in the `[corpora.train]` section of
your config. The built-in [`orth_variants`](/api/top-level#orth_variants)
augmenter creates a data augmentation callback that uses orth-variant
replacement.
```ini
### config.cfg (excerpt) {highlight="8,14"}
[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
gold_preproc = false
max_length = 0
limit = 0

[corpora.train.augmenter]
@augmenters = "spacy.orth_variants.v1"
# Percentage of texts that will be augmented / lowercased
level = 0.1
lower = 0.5

[corpora.train.augmenter.orth_variants]
@readers = "srsly.read_json.v1"
path = "corpus/orth_variants.json"
```
The `orth_variants` argument lets you pass in a dictionary of replacement rules,
typically loaded from a JSON file. There are two types of orth variant rules:
`"single"` for single tokens that should be replaced (e.g. hyphens) and
`"paired"` for pairs of tokens (e.g. quotes).
<!-- prettier-ignore -->
```json
### orth_variants.json
{
  "single": [{ "tags": ["NFP"], "variants": ["…", "..."] }],
  "paired": [{ "tags": ["``", "''"], "variants": [["'", "'"], ["‘", "’"]] }]
}
```
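To make the difference between the two rule types more concrete, here is a
simplified sketch in plain Python. It is not spaCy's implementation: the
function name `apply_orth_variants`, the list-based token representation and
the sample sentence are assumptions made for illustration only. The idea is
that a `"single"` rule swaps each matching token independently, while a
`"paired"` rule picks one pair so that the opening and closing tokens stay
consistent.

```python
import random

# Hypothetical rules, mirroring the shape of orth_variants.json
RULES = {
    "single": [{"tags": ["NFP"], "variants": ["…", "..."]}],
    "paired": [{"tags": ["``", "''"], "variants": [["'", "'"], ["‘", "’"]]}],
}


def apply_orth_variants(words, tags, rules, rng):
    """Return a copy of `words` with matching tokens swapped for a variant."""
    out = list(words)
    # "single" rules: each matching token gets its own randomly chosen variant
    for rule in rules.get("single", []):
        for i, (word, tag) in enumerate(zip(words, tags)):
            if tag in rule["tags"] and word in rule["variants"]:
                out[i] = rng.choice(rule["variants"])
    # "paired" rules: choose one pair, so opening and closing tokens match
    for rule in rules.get("paired", []):
        open_tag, close_tag = rule["tags"]
        new_open, new_close = rng.choice(rule["variants"])
        opens = {pair[0] for pair in rule["variants"]}
        closes = {pair[1] for pair in rule["variants"]}
        for i, (word, tag) in enumerate(zip(words, tags)):
            if tag == open_tag and word in opens:
                out[i] = new_open
            elif tag == close_tag and word in closes:
                out[i] = new_close
    return out


# Illustrative tokens and fine-grained tags (assumed, not from a real corpus)
words = ["‘", "Wait", "…", "’"]
tags = ["``", "UH", "NFP", "''"]
augmented = apply_orth_variants(words, tags, RULES, random.Random(0))
print(augmented)
```

Tokens whose tag or text is not covered by a rule, like `"Wait"` here, are
left unchanged.
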
<Accordion title="Full examples for English and German" spaced>
```json
https://github.com/explosion/spacy-lookups-data/blob/master/spacy_lookups_data/data/en_orth_variants.json
```
```json
https://github.com/explosion/spacy-lookups-data/blob/master/spacy_lookups_data/data/de_orth_variants.json
```
</Accordion>
<Infobox title="Important note" variant="warning">
When adding data augmentation, keep in mind that it typically only makes sense
to apply it to the **training corpus**, not the development data.
</Infobox>
#### Writing custom data augmenters {#data-augmentation-custom}
Using the [`@spacy.augmenters`](/api/top-level#registry) registry, you can also
register your own data augmentation callbacks. The callback should be a function
that takes the current `nlp` object and a training [`Example`](/api/example) and
yields `Example` objects. Keep in mind that the augmenter should yield **all
examples** you want to use in your corpus, not only the augmented examples
(unless you want to augment all examples).
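
The "yield all examples" requirement can be illustrated without spaCy. In the
sketch below, plain strings stand in for `Example` objects and the callbacks
take only the text, whereas a real augmenter receives `nlp` and an `Example`.
The helper names `keep_and_lowercase`, `lowercase_only` and `build_corpus` are
hypothetical. The sketch assumes that whatever the callback yields is what ends
up in the corpus.

```python
from typing import Callable, Iterable, Iterator, List


def keep_and_lowercase(text: str) -> Iterator[str]:
    """Yield the original text followed by a lowercased copy."""
    yield text
    yield text.lower()


def lowercase_only(text: str) -> Iterator[str]:
    """Yield only the lowercased copy, so the original is dropped."""
    yield text.lower()


def build_corpus(
    texts: Iterable[str], augmenter: Callable[[str], Iterator[str]]
) -> List[str]:
    """Collect everything the augmenter yields; nothing else is kept."""
    corpus: List[str] = []
    for text in texts:
        corpus.extend(augmenter(text))
    return corpus


texts = ["Hello World", "Data Augmentation"]
with_originals = build_corpus(texts, keep_and_lowercase)
without_originals = build_corpus(texts, lowercase_only)
print(len(with_originals), len(without_originals))  # 4 2
```
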

Here's an example of a custom augmentation callback that produces text variants
in ["SpOnGeBoB cAsE"](https://knowyourmeme.com/memes/mocking-spongebob). The
registered function takes one argument `randomize` that can be set via the
config and decides whether the uppercase/lowercase transformation is applied
randomly or not. The augmenter yields two `Example` objects: the original
example and the augmented example.
> #### config.cfg
>
> ```ini
> [corpora.train.augmenter]
> @augmenters = "spongebob_augmenter.v1"
> randomize = false
> ```
```python
import spacy
import random

@spacy.registry.augmenters("spongebob_augmenter.v1")
def create_augmenter(randomize: bool = False):
    def augment(nlp, example):
        text = example.text
        if randomize:
            # Randomly uppercase/lowercase characters
            chars = [c.lower() if random.random() < 0.5 else c.upper() for c in text]
        else:
            # Uppercase followed by lowercase
            chars = [c.lower() if i % 2 else c.upper() for i, c in enumerate(text)]
        # Create augmented training example
        example_dict = example.to_dict()
        doc = nlp.make_doc("".join(chars))
        example_dict["token_annotation"]["ORTH"] = [t.text for t in doc]
        # Original example followed by augmented example
        yield example
        yield example.from_dict(doc, example_dict)

    return augment
```
An easy way to create modified `Example` objects is to use the
[`Example.from_dict`](/api/example#from_dict) method with a new reference
[`Doc`](/api/doc) created from the modified text. In this case, only the
capitalization changes, so only the `ORTH` values of the tokens will be
different between the original and augmented examples.
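
The casing logic used by the augmenter can also be tried out in isolation. The
following sketch copies the two list comprehensions into a standalone helper;
the name `spongebob_case` and the optional `rng` parameter are assumptions
added here so the random variant can be seeded, and are not part of spaCy.

```python
import random
from typing import Optional


def spongebob_case(
    text: str, randomize: bool = False, rng: Optional[random.Random] = None
) -> str:
    """Return `text` with alternating or randomly mixed character casing."""
    rng = rng or random.Random()
    if randomize:
        # Randomly uppercase/lowercase characters
        chars = [c.lower() if rng.random() < 0.5 else c.upper() for c in text]
    else:
        # Uppercase at even positions, lowercase at odd positions
        chars = [c.lower() if i % 2 else c.upper() for i, c in enumerate(text)]
    return "".join(chars)


original = "hello world"
alternating = spongebob_case(original)
randomized = spongebob_case(original, randomize=True, rng=random.Random(0))
print(alternating)  # HeLlO WoRlD
```

For plain ASCII text like this, lowercasing either variant gives back the
original string, which matches the point above that only the capitalization
differs.
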
<!-- TODO: mention alignment -->
## Parallel & distributed training with Ray {#parallel-training}
> #### Installation