Update docs [ci skip]

2020-10-02 11:38:03 +02:00 · 2020-10-02 11:38:03 +02:00 · 32cdc1c4f4
parent c41a4332e4
commit 32cdc1c4f4
2 changed files with 129 additions and 2 deletions
--- a/website/docs/api/top-level.md
+++ b/website/docs/api/top-level.md
@@ -685,7 +685,11 @@ sequences in the batch.
 ## Augmenters {#augmenters source="spacy/training/augment.py" new="3"}
-<!-- TODO: intro, explain data augmentation concept -->
+Data augmentation is the process of applying small modifications to the training
 data. It can be especially useful for punctuation and case replacement – for
 example, if your corpus only uses smart quotes and you want to include
 variations using regular quotes, or to make the model less sensitive to
 capitalization by including a mix of capitalized and lowercase examples. See the [usage guide](/usage/training#data-augmentation) for details and examples.
 ### spacy.orth_variants.v1 {#orth_variants tag="registered function"}
--- a/website/docs/usage/training.md
+++ b/website/docs/usage/training.md
@@ -1011,9 +1011,132 @@ def filter_batch(size: int) -> Callable[[Iterable[Example]], Iterator[List[Examp
 <!-- TODO:
 * Custom corpus class
 * Minibatching
 * Data augmentation
 -->
 ### Data augmentation {#data-augmentation}
 Data augmentation is the process of applying small **modifications** to the
 training data. It can be especially useful for punctuation and case replacement
 – for example, if your corpus only uses smart quotes and you want to include
 variations using regular quotes, or to make the model less sensitive to
 capitalization by including a mix of capitalized and lowercase examples.
 The easiest way to use data augmentation during training is to provide an
 `augmenter` to the training corpus, e.g. in the `[corpora.train]` section of
 your config. The built-in [`orth_variants`](/api/top-level#orth_variants)
 augmenter creates a data augmentation callback that uses orth-variant
 replacement.
 ```ini
 ### config.cfg (excerpt) {highlight="8,14"}
 [corpora.train]
@readers = "spacy.Corpus.v1"
 path = ${paths.train}
 gold_preproc = false
 max_length = 0
 limit = 0
 [corpora.train.augmenter]
@augmenters = "spacy.orth_variants.v1"
 # Proportion (0.0 to 1.0) of texts that will be augmented / lowercased
 level = 0.1
 lower = 0.5
 [corpora.train.augmenter.orth_variants]
@readers = "srsly.read_json.v1"
 path = "corpus/orth_variants.json"
 ```
 The `orth_variants` argument lets you pass in a dictionary of replacement rules,
 typically loaded from a JSON file. There are two types of orth variant rules:
 `"single"` for single tokens that should be replaced (e.g. hyphens) and
 `"paired"` for pairs of tokens (e.g. quotes).
 <!-- prettier-ignore -->
 ```json
 ### orth_variants.json
 {
  "single": [{ "tags": ["NFP"], "variants": ["…", "..."] }],
  "paired": [{ "tags": ["``", "''"], "variants": [["'", "'"], ["‘", "’"]] }]
 }
 ```
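To make the rule format more concrete, here is a simplified sketch of how a `"single"` rule could be applied to a tokenized text. It is an illustration only, not spaCy's implementation: the helper name `apply_single_rules` and the parallel `words`/`tags` lists are assumptions made for this example.

```python
import random
from typing import Dict, List


def apply_single_rules(
    words: List[str], tags: List[str], rules: List[Dict], rng: random.Random
) -> List[str]:
    """Hypothetical helper: swap tokens for a random variant from a matching rule."""
    out = list(words)
    for i, (word, tag) in enumerate(zip(words, tags)):
        for rule in rules:
            # A rule only applies if the tag matches and the token is a known variant
            if tag in rule["tags"] and word in rule["variants"]:
                out[i] = rng.choice(rule["variants"])
                break
    return out


rules = [{"tags": ["NFP"], "variants": ["…", "..."]}]
words = ["Wait", "...", "what"]
tags = ["VB", "NFP", "WP"]
augmented = apply_single_rules(words, tags, rules, random.Random(0))
print(augmented)
```

A `"paired"` rule would presumably be handled along the same lines, except that the opening and closing tokens need to be replaced together so the pair stays consistent.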
 <Accordion title="Full examples for English and German" spaced>
 ```json
 https://github.com/explosion/spacy-lookups-data/blob/master/spacy_lookups_data/data/en_orth_variants.json
 ```
 ```json
 https://github.com/explosion/spacy-lookups-data/blob/master/spacy_lookups_data/data/de_orth_variants.json
 ```
 </Accordion>
 <Infobox title="Important note" variant="warning">
 When adding data augmentation, keep in mind that it typically only makes sense
 to apply it to the **training corpus**, not the development data.
 </Infobox>
 #### Writing custom data augmenters {#data-augmentation-custom}
 Using the [`@spacy.augmenters`](/api/top-level#registry) registry, you can also
 register your own data augmentation callbacks. The callback should be a function
 that takes the current `nlp` object and a training [`Example`](/api/example) and
 yields `Example` objects. Keep in mind that the augmenter should yield **all
 examples** you want to use in your corpus, not only the augmented examples
 (unless you want to augment all examples).
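The shape of such a callback can be sketched without spaCy. In the following minimal example, plain strings stand in for `Example` objects, and the factory name `create_lowercase_augmenter` and its `level` and `rng` arguments are hypothetical. The point is only the generator contract: the original is always yielded, and an augmented copy is yielded in addition for some inputs.

```python
import random
from typing import Callable, Iterator


def create_lowercase_augmenter(level: float, rng: random.Random) -> Callable:
    """Hypothetical factory; plain strings stand in for Example objects."""

    def augment(nlp, example: str) -> Iterator[str]:
        # Always yield the original so it stays in the corpus
        yield example
        # Additionally yield a lowercased copy for a fraction of the examples
        if rng.random() < level:
            yield example.lower()

    return augment


augment = create_lowercase_augmenter(level=1.0, rng=random.Random(0))
results = list(augment(None, "Hello World"))
print(results)
```

If the callback yielded only the lowercased copy, the original text would silently drop out of the training corpus.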
 Here's an example of a custom augmentation callback that produces text variants
 in ["SpOnGeBoB cAsE"](https://knowyourmeme.com/memes/mocking-spongebob). The
 registered function takes one argument `randomize` that can be set via the
 config and decides whether the uppercase/lowercase transformation is applied
 randomly or not. The augmenter yields two `Example` objects: the original
 example and the augmented example.
 > #### config.cfg
 >
 > ```ini
 > [corpora.train.augmenter]
 > @augmenters = "spongebob_augmenter.v1"
 > randomize = false
 > ```
 ```python
 import spacy
 import random
@spacy.registry.augmenters("spongebob_augmenter.v1")
 def create_augmenter(randomize: bool = False):
    def augment(nlp, example):
        text = example.text
        if randomize:
            # Randomly uppercase/lowercase characters
            chars = [c.lower() if random.random() < 0.5 else c.upper() for c in text]
        else:
            # Uppercase followed by lowercase
            chars = [c.lower() if i % 2 else c.upper() for i, c in enumerate(text)]
        # Create augmented training example
        example_dict = example.to_dict()
        doc = nlp.make_doc("".join(chars))
        example_dict["token_annotation"]["ORTH"] = [t.text for t in doc]
        # Original example followed by augmented example
        yield example
        yield example.from_dict(doc, example_dict)
    return augment
 ```
 An easy way to create modified `Example` objects is to use the
 [`Example.from_dict`](/api/example#from_dict) method with a new reference
 [`Doc`](/api/doc) created from the modified text. In this case, only the
 capitalization changes, so only the `ORTH` values of the tokens will be
 different between the original and augmented examples.
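The casing logic can be checked in isolation. The sketch below copies the two branches of the augmenter above into a standalone function; the name `spongebob_case` and the `seed` argument are assumptions added for this example so that the random branch is reproducible.

```python
import random


def spongebob_case(text: str, randomize: bool = False, seed: int = 0) -> str:
    """Hypothetical helper mirroring the casing logic of the augmenter above."""
    if randomize:
        # Randomly uppercase/lowercase characters
        rng = random.Random(seed)
        chars = [c.lower() if rng.random() < 0.5 else c.upper() for c in text]
    else:
        # Uppercase followed by lowercase
        chars = [c.lower() if i % 2 else c.upper() for i, c in enumerate(text)]
    return "".join(chars)


original = "hello world"
augmented = spongebob_case(original)
print(augmented)  # HeLlO WoRlD
```

Because the transformation leaves every character in place and only changes its case, lowercasing the augmented text gives back the lowercased original.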
 <!-- TODO: mention alignment -->
 ## Parallel & distributed training with Ray {#parallel-training}
 > #### Installation