diff --git a/website/docs/api/corpus.md b/website/docs/api/corpus.md
index 38e19129d..5f639d050 100644
--- a/website/docs/api/corpus.md
+++ b/website/docs/api/corpus.md
@@ -6,30 +6,44 @@ source: spacy/gold/corpus.py
new: 3
---
-This class manages annotated corpora and can read training and development
-datasets in the [DocBin](/api/docbin) (`.spacy`) format.
+This class manages annotated corpora and can be used for training and
+development datasets in the [DocBin](/api/docbin) (`.spacy`) format. To
+customize the data loading during training, you can register your own
+[data readers and batchers](/usage/training#custom-code-readers-batchers).
## Corpus.\_\_init\_\_ {#init tag="method"}
-Create a `Corpus`. The input data can be a file or a directory of files.
+Create a `Corpus` for iterating [Example](/api/example) objects from a file or
+directory of [`.spacy` data files](/api/data-formats#binary-training). The
+`gold_preproc` setting lets you specify whether to set up the `Example` object
+with gold-standard sentences and tokens for the predictions. Gold preprocessing
+helps the annotations align to the tokenization, and may result in sequences of
+more consistent length. However, it may reduce runtime accuracy due to
+train/test skew.
> #### Example
>
> ```python
> from spacy.gold import Corpus
>
-> corpus = Corpus("./train.spacy", "./dev.spacy")
+> # With a single file
+> corpus = Corpus("./data/train.spacy")
+>
+> # With a directory
+> corpus = Corpus("./data", limit=10)
> ```
-| Name | Type | Description |
-| ------- | ------------ | ---------------------------------------------------------------- |
-| `train` | str / `Path` | Training data (`.spacy` file or directory of `.spacy` files). |
-| `dev` | str / `Path` | Development data (`.spacy` file or directory of `.spacy` files). |
-| `limit` | int | Maximum number of examples returned. `0` for no limit (default). |
+| Name | Type | Description |
+| --------------- | ------------ | ------------------------------------------------------------------------------------------------------------------------------------------- |
+| `path` | str / `Path` | The directory or filename to read from. |
+| _keyword-only_ | | |
+| `gold_preproc` | bool | Whether to set up the Example object with gold-standard sentences and tokens for the predictions. Defaults to `False`. |
+| `max_length` | int | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit. |
+| `limit` | int | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. |
-## Corpus.train_dataset {#train_dataset tag="method"}
+## Corpus.\_\_call\_\_ {#call tag="method"}
-Yield examples from the training data.
+Yield examples from the data.
> #### Example
>
@@ -37,60 +51,12 @@ Yield examples from the training data.
> from spacy.gold import Corpus
> import spacy
>
-> corpus = Corpus("./train.spacy", "./dev.spacy")
+> corpus = Corpus("./train.spacy")
> nlp = spacy.blank("en")
-> train_data = corpus.train_dataset(nlp)
+> train_data = corpus(nlp)
> ```
-| Name | Type | Description |
-| -------------- | ---------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
-| `nlp` | `Language` | The current `nlp` object. |
-| _keyword-only_ | | |
-| `shuffle` | bool | Whether to shuffle the examples. Defaults to `True`. |
-| `gold_preproc` | bool | Whether to train on gold-standard sentences and tokens. Defaults to `False`. |
-| `max_length` | int | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. `0` for no limit (default). |
-| **YIELDS** | `Example` | The examples. |
-
-## Corpus.dev_dataset {#dev_dataset tag="method"}
-
-Yield examples from the development data.
-
-> #### Example
->
-> ```python
-> from spacy.gold import Corpus
-> import spacy
->
-> corpus = Corpus("./train.spacy", "./dev.spacy")
-> nlp = spacy.blank("en")
-> dev_data = corpus.dev_dataset(nlp)
-> ```
-
-| Name | Type | Description |
-| -------------- | ---------- | ---------------------------------------------------------------------------- |
-| `nlp` | `Language` | The current `nlp` object. |
-| _keyword-only_ | | |
-| `gold_preproc` | bool | Whether to train on gold-standard sentences and tokens. Defaults to `False`. |
-| **YIELDS** | `Example` | The examples. |
-
-## Corpus.count_train {#count_train tag="method"}
-
-Get the word count of all training examples.
-
-> #### Example
->
-> ```python
-> from spacy.gold import Corpus
-> import spacy
->
-> corpus = Corpus("./train.spacy", "./dev.spacy")
-> nlp = spacy.blank("en")
-> word_count = corpus.count_train(nlp)
-> ```
-
-| Name | Type | Description |
-| ----------- | ---------- | ------------------------- |
-| `nlp` | `Language` | The current `nlp` object. |
-| **RETURNS** | int | The word count. |
-
-
+| Name | Type | Description |
+| ---------- | ---------- | ------------------------- |
+| `nlp` | `Language` | The current `nlp` object. |
+| **YIELDS** | `Example` | The examples. |
diff --git a/website/docs/api/top-level.md b/website/docs/api/top-level.md
index 882dfa193..5b3326739 100644
--- a/website/docs/api/top-level.md
+++ b/website/docs/api/top-level.md
@@ -4,7 +4,7 @@ menu:
- ['spacy', 'spacy']
- ['displacy', 'displacy']
- ['registry', 'registry']
- - ['Loaders & Batchers', 'loaders-batchers']
+ - ['Readers & Batchers', 'readers-batchers']
- ['Data & Alignment', 'gold']
- ['Utility Functions', 'util']
---
@@ -303,6 +303,9 @@ factories.
| `lookups` | Registry for large lookup tables available via `vocab.lookups`. |
| `displacy_colors` | Registry for custom color scheme for the [`displacy` NER visualizer](/usage/visualizers). Automatically reads from [entry points](/usage/saving-loading#entry-points). |
| `assets` | |
+| `callbacks` | Registry for custom callbacks to [modify the `nlp` object](/usage/training#custom-code-nlp-callbacks) before training. |
+| `readers` | Registry for training and evaluation [data readers](#readers-batchers). |
+| `batchers` | Registry for training and evaluation [data batchers](#readers-batchers). |
| `optimizers` | Registry for functions that create [optimizers](https://thinc.ai/docs/api-optimizers). |
| `schedules` | Registry for functions that create [schedules](https://thinc.ai/docs/api-schedules). |
| `layers` | Registry for functions that create [layers](https://thinc.ai/docs/api-layers). |
@@ -334,10 +337,113 @@ See the [`Transformer`](/api/transformer) API reference and
| [`span_getters`](/api/transformer#span_getters) | Registry for functions that take a batch of `Doc` objects and return a list of `Span` objects to process by the transformer, e.g. sentences. |
| [`annotation_setters`](/api/transformer#annotation_setters) | Registry for functions that create annotation setters. Annotation setters are functions that take a batch of `Doc` objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set additional annotations on the `Doc`. |
-## Training data loaders and batchers {#loaders-batchers new="3"}
+## Data readers and batchers {#readers-batchers new="3"}
+### spacy.Corpus.v1 {#corpus tag="registered function" source="spacy/gold/corpus.py"}
+
+Registered function that creates a [`Corpus`](/api/corpus) of training or
+evaluation data. It takes the same arguments as the `Corpus` class and returns a
+callable that yields [`Example`](/api/example) objects. You can replace it with
+your own registered function in the [`@readers` registry](#registry) to
+customize the data loading and streaming.
+
+> #### Example config
+>
+> ```ini
+> [paths]
+> train = "corpus/train.spacy"
+>
+> [training.train_corpus]
+> @readers = "spacy.Corpus.v1"
+> path = ${paths:train}
+> gold_preproc = false
+> max_length = 0
+> limit = 0
+> ```
+
+| Name | Type | Description |
+| --------------- | ------ | ----------------------------------------------------------------------------------------------------------------------------------------------- |
+| `path` | `Path` | The directory or filename to read from. Expects data in spaCy's binary [`.spacy` format](/api/data-formats#binary-training). |
+| `gold_preproc` | bool | Whether to set up the Example object with gold-standard sentences and tokens for the predictions. See [`Corpus`](/api/corpus#init) for details. |
+| `max_length` | int | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit. |
+| `limit` | int | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. |
+
+### Batchers {#batchers source="spacy/gold/batchers.py"}
+
+
+
+#### batch_by_words.v1 {#batch_by_words tag="registered function"}
+
+Create minibatches of roughly a given number of words. If any examples are
+longer than the specified batch length, they will appear in a batch by
+themselves, or be discarded if `discard_oversize` is set to `True`. The argument
+`docs` can be a list of strings, [`Doc`](/api/doc) objects or
+[`Example`](/api/example) objects.
+
+> #### Example config
+>
+> ```ini
+> [training.batcher]
+> @batchers = "batch_by_words.v1"
+> size = 100
+> tolerance = 0.2
+> discard_oversize = false
+> get_length = null
+> ```
+
+
+
+| Name | Type | Description |
+| ------------------ | ---------------------- | ----------------------------------------------------------------------------------------------------------------------------------- |
+| `size` | `Iterable[int]` / int | The batch size. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). |
+| `tolerance` | float | |
+| `discard_oversize` | bool                   | Whether to discard items that are longer than the specified batch length.                                                            |
+| `get_length` | `Callable[[Any], int]` | Optional function that receives a sequence and returns its length. Defaults to the built-in `len()` if not set. |
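+
+The created batcher can also be used directly in Python – a minimal sketch,
+assuming the registered function is looked up via the `batchers` registry and
+called with the same arguments as in the config block above:
+
+```python
+import spacy
+
+# A sketch: resolve the registered function and create the batcher with the
+# same settings as in the config example
+create_batcher = spacy.registry.batchers.get("batch_by_words.v1")
+batcher = create_batcher(size=100, tolerance=0.2, discard_oversize=False, get_length=None)
+
+nlp = spacy.blank("en")
+docs = [nlp("This is a sentence."), nlp("Another short one.")]
+for batch in batcher(docs):
+    print(len(batch))
+```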
+
+#### batch_by_sequence.v1 {#batch_by_sequence tag="registered function"}
+
+
+
+> #### Example config
+>
+> ```ini
+> [training.batcher]
+> @batchers = "batch_by_sequence.v1"
+> size = 32
+> get_length = null
+> ```
+
+
+
+| Name | Type | Description |
+| ------------ | ---------------------- | ----------------------------------------------------------------------------------------------------------------------------------- |
+| `size` | `Iterable[int]` / int | The batch size. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). |
+| `get_length` | `Callable[[Any], int]` | Optional function that receives a sequence and returns its length. Defaults to the built-in `len()` if not set. |
+
+#### batch_by_padded.v1 {#batch_by_padded tag="registered function"}
+
+
+
+> #### Example config
+>
+> ```ini
+> [training.batcher]
+> @batchers = "batch_by_padded.v1"
+> size = 100
+> buffer = TODO:
+> discard_oversize = false
+> get_length = null
+> ```
+
+| Name | Type | Description |
+| ------------------ | ---------------------- | ----------------------------------------------------------------------------------------------------------------------------------- |
+| `size` | `Iterable[int]` / int | The batch size. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). |
+| `buffer` | int | |
+| `discard_oversize` | bool                   | Whether to discard items that are longer than the specified batch length.                                                            |
+| `get_length` | `Callable[[Any], int]` | Optional function that receives a sequence and returns its length. Defaults to the built-in `len()` if not set. |
+
## Training data and alignment {#gold source="spacy/gold"}
### gold.docs_to_json {#docs_to_json tag="function"}
diff --git a/website/docs/usage/training.md b/website/docs/usage/training.md
index 7c9d50921..c0ec052b9 100644
--- a/website/docs/usage/training.md
+++ b/website/docs/usage/training.md
@@ -5,8 +5,8 @@ menu:
- ['Introduction', 'basics']
- ['Quickstart', 'quickstart']
- ['Config System', 'config']
- - ['Transfer Learning', 'transfer-learning']
- ['Custom Models', 'custom-models']
+ - ['Transfer Learning', 'transfer-learning']
- ['Parallel Training', 'parallel-training']
- ['Internal API', 'api']
---
@@ -315,6 +315,10 @@ stop = 1000
compound = 1.001
```
+### Using variable interpolation {#config-interpolation}
+
+
+
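+The config supports variable interpolation, so a value can be defined once and
+referenced from other blocks using the `${section:key}` syntax. A minimal
+sketch, assuming the training data path is defined once under a `[paths]`
+block:
+
+```ini
+### config.cfg (excerpt)
+[paths]
+train = "corpus/train.spacy"
+
+[training.train_corpus]
+@readers = "spacy.Corpus.v1"
+path = ${paths:train}
+```
+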
### Model architectures {#model-architectures}
@@ -384,41 +388,17 @@ still look good.
-## Transfer learning {#transfer-learning}
-
-### Using transformer models like BERT {#transformers}
-
-spaCy v3.0 lets you use almost any statistical model to power your pipeline. You
-can use models implemented in a variety of frameworks. A transformer model is
-just a statistical model, so the
-[`spacy-transformers`](https://github.com/explosion/spacy-transformers) package
-actually has very little work to do: it just has to provide a few functions that
-do the required plumbing. It also provides a pipeline component,
-[`Transformer`](/api/transformer), that lets you do multi-task learning and lets
-you save the transformer outputs for later use.
-
-
-
-Try out a BERT-based model pipeline using this project template: swap in your
-data, edit the settings and hyperparameters and train, evaluate, package and
-visualize your model.
-
-
-
-For more details on how to integrate transformer models into your training
-config and customize the implementations, see the usage guide on
-[training transformers](/usage/transformers#training).
-
-### Pretraining with spaCy {#pretraining}
-
-
-
## Custom model implementations and architectures {#custom-models}
### Training with custom code {#custom-code}
+> ```bash
+> ### Example {wrap="true"}
+> $ python -m spacy train train.spacy dev.spacy config.cfg --code functions.py
+> ```
+
The [`spacy train`](/api/cli#train) recipe lets you specify an optional argument
`--code` that points to a Python file. The file is imported before training and
allows you to add custom functions and architectures to the function registry
@@ -426,6 +406,120 @@ that can then be referenced from your `config.cfg`. This lets you train spaCy
models with custom components, without having to re-implement the whole training
workflow.
+#### Example: Modifying the nlp object {#custom-code-nlp-callbacks}
+
+For many use cases, you don't necessarily want to implement the whole `Language`
+subclass and language data from scratch – it's often enough to make a few small
+modifications, like adjusting the
+[tokenization rules](/usage/linguistic-features#native-tokenizer-additions) or
+[language defaults](/api/language#defaults) like stop words. The config lets you
+provide three optional **callback functions** that give you access to the
+language class and `nlp` object at different points of the lifecycle:
+
+| Callback | Description |
+| ------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `before_creation` | Called before the `nlp` object is created and receives the language subclass like `English` (not the instance). Useful for writing to the [`Language.Defaults`](/api/language#defaults). |
+| `after_creation`          | Called right after the `nlp` object is created, but before the pipeline components are added. Receives the `nlp` object. Useful for modifying the tokenizer.                              |
+| `after_pipeline_creation` | Called right after the pipeline components are created and added. Receives the `nlp` object. Useful for modifying pipeline components.                                                    |
+
+The `@spacy.registry.callbacks` decorator lets you register that function in the
+`callbacks` [registry](/api/top-level#registry) under a given name. You can then
+reference the function in a config block using the `@callbacks` key. If a block
+contains a key starting with an `@`, it's interpreted as a reference to a
+function. Because you've registered the function, spaCy knows how to create it
+when you reference `"customize_language_data"` in your config. Here's an example
+of a callback that runs before the `nlp` object is created and adds a few custom
+tokenization rules to the defaults:
+
+> #### config.cfg
+>
+> ```ini
+> [nlp.before_creation]
+> @callbacks = "customize_language_data"
+> ```
+
+```python
+### functions.py {highlight="3,6"}
+import spacy
+
+@spacy.registry.callbacks("customize_language_data")
+def create_callback():
+ def customize_language_data(lang_cls):
+ lang_cls.Defaults.suffixes = lang_cls.Defaults.suffixes + (r"-+$",)
+ return lang_cls
+
+ return customize_language_data
+```
+
+
+
+Remember that a registered function should always be a function that spaCy
+**calls to create something**. In this case, it **creates a callback** – it's
+not the callback itself.
+
+
+
+Any registered function – in this case `create_callback` – can also take
+**arguments** that can be **set by the config**. This lets you implement and
+keep track of different configurations, without having to hack at your code. You
+can choose any arguments that make sense for your use case. In this example,
+we're adding the arguments `extra_stop_words` (a list of strings) and `debug`
+(boolean) for printing additional info when the function runs.
+
+> #### config.cfg
+>
+> ```ini
+> [nlp.before_creation]
+> @callbacks = "customize_language_data"
+> extra_stop_words = ["ooh", "aah"]
+> debug = true
+> ```
+
+```python
+### functions.py {highlight="5,8-10"}
+from typing import List
+import spacy
+
+@spacy.registry.callbacks("customize_language_data")
+def create_callback(extra_stop_words: List[str] = [], debug: bool = False):
+ def customize_language_data(lang_cls):
+ lang_cls.Defaults.suffixes = lang_cls.Defaults.suffixes + (r"-+$",)
+        lang_cls.Defaults.stop_words.update(extra_stop_words)
+ if debug:
+ print("Updated stop words and tokenizer suffixes")
+ return lang_cls
+
+ return customize_language_data
+```
+
+
+
+spaCy's configs are powered by our machine learning library Thinc's
+[configuration system](https://thinc.ai/docs/usage-config), which supports
+[type hints](https://docs.python.org/3/library/typing.html) and even
+[advanced type annotations](https://thinc.ai/docs/usage-config#advanced-types)
+using [`pydantic`](https://github.com/samuelcolvin/pydantic). If your registered
+function provides type hints, the values that are passed in will be checked
+against the expected types. For example, `debug: bool` in the example above will
+ensure that the value received as the argument `debug` is a boolean. If the
+value can't be coerced into a boolean, spaCy will raise an error.
+`debug: pydantic.StrictBool` will force the value to be a boolean and raise an
+error if it's not – for instance, if your config defines `1` instead of `true`.
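+
+A minimal sketch of a stricter version of the callback above, only swapping the
+type hint of `debug` from `bool` to `pydantic.StrictBool`:
+
+```python
+### functions.py (excerpt)
+from typing import List
+from pydantic import StrictBool
+import spacy
+
+@spacy.registry.callbacks("customize_language_data")
+def create_callback(extra_stop_words: List[str] = [], debug: StrictBool = False):
+    def customize_language_data(lang_cls):
+        # Same idea as above, but "debug" now has to be a real boolean
+        lang_cls.Defaults.stop_words.update(extra_stop_words)
+        if debug:
+            print("Updated stop words")
+        return lang_cls
+
+    return customize_language_data
+```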
+
+
+
+With your `functions.py` defining additional code and the updated `config.cfg`,
+you can now run [`spacy train`](/api/cli#train) and point the argument `--code`
+to your Python file. Before loading the config, spaCy will import the
+`functions.py` module and your custom functions will be registered.
+
+```bash
+### Training with custom code {wrap="true"}
+python -m spacy train train.spacy dev.spacy config.cfg --output ./output --code ./functions.py
+```
+
+#### Example: Custom batch size schedule {#custom-code-schedule}
+
For example, let's say you've implemented your own batch size schedule to use
during training. The `@spacy.registry.schedules` decorator lets you register
that function in the `schedules` [registry](/api/top-level#registry) and assign
@@ -459,8 +553,6 @@ the functions need to be represented in the config. If your function defines
**default argument values**, spaCy is able to auto-fill your config when you run
[`init config`](/api/cli#init-config).
-
-
```ini
### config.cfg (excerpt)
[training.batch_size]
@@ -469,31 +561,9 @@ start = 2
factor = 1.005
```
-You can now run [`spacy train`](/api/cli#train) with the `config.cfg` and your
-custom `functions.py` as the argument `--code`. Before loading the config, spaCy
-will import the `functions.py` module and your custom functions will be
-registered.
+#### Example: Custom data reading and batching {#custom-code-readers-batchers}
-```bash
-### Training with custom code {wrap="true"}
-python -m spacy train train.spacy dev.spacy config.cfg --output ./output --code ./functions.py
-```
-
-
-
-spaCy's configs are powered by our machine learning library Thinc's
-[configuration system](https://thinc.ai/docs/usage-config), which supports
-[type hints](https://docs.python.org/3/library/typing.html) and even
-[advanced type annotations](https://thinc.ai/docs/usage-config#advanced-types)
-using [`pydantic`](https://github.com/samuelcolvin/pydantic). If your registered
-function provides type hints, the values that are passed in will be checked
-against the expected types. For example, `start: int` in the example above will
-ensure that the value received as the argument `start` is an integer. If the
-value can't be coerced into an integer, spaCy will raise an error.
-`start: pydantic.StrictInt` will force the value to be an integer and raise an
-error if it's not – for instance, if your config defines a float.
-
-
+
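+The built-in [`Corpus`](/api/corpus) reader and batchers can be swapped out for
+your own registered functions. The following is a rough sketch rather than a
+drop-in recipe: the function names, arguments and in-memory data are made up
+for illustration. The only contract assumed is that a reader returns a callable
+that takes the `nlp` object and yields [`Example`](/api/example) objects, and
+that a batcher returns a callable that turns a stream of items into batches.
+
+> #### config.cfg
+>
+> ```ini
+> [training.train_corpus]
+> @readers = "corpus_variants.v1"
+>
+> [training.batcher]
+> @batchers = "variable_minibatch.v1"
+> min_size = 1
+> max_size = 8
+> ```
+
+```python
+### functions.py
+import random
+from typing import Callable, Iterable, Iterator, List
+
+import spacy
+from spacy.gold import Example
+
+@spacy.registry.readers("corpus_variants.v1")
+def create_reader() -> Callable:
+    def read_examples(nlp) -> Iterator[Example]:
+        # Illustrative only: build examples from an in-memory list instead of
+        # reading a .spacy file from disk
+        data = [("I like cats", {"cats": {"POSITIVE": 1.0}})]
+        for text, annotations in data:
+            doc = nlp.make_doc(text)
+            yield Example.from_dict(doc, annotations)
+
+    return read_examples
+
+@spacy.registry.batchers("variable_minibatch.v1")
+def create_batcher(min_size: int = 1, max_size: int = 8) -> Callable:
+    def create_batches(items: Iterable) -> Iterator[List]:
+        items = list(items)
+        start = 0
+        while start < len(items):
+            # Vary the batch size randomly between min_size and max_size
+            size = random.randint(min_size, max_size)
+            yield items[start : start + size]
+            start += size
+
+    return create_batches
+```
+
+As with the other examples, the file is passed to [`spacy train`](/api/cli#train)
+via the `--code` argument so the functions are registered before the config is
+loaded.
+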
### Wrapping PyTorch and TensorFlow {#custom-frameworks}
@@ -511,6 +581,35 @@ mattis pretium.
+## Transfer learning {#transfer-learning}
+
+### Using transformer models like BERT {#transformers}
+
+spaCy v3.0 lets you use almost any statistical model to power your pipeline. You
+can use models implemented in a variety of frameworks. A transformer model is
+just a statistical model, so the
+[`spacy-transformers`](https://github.com/explosion/spacy-transformers) package
+actually has very little work to do: it just has to provide a few functions that
+do the required plumbing. It also provides a pipeline component,
+[`Transformer`](/api/transformer), that lets you do multi-task learning and lets
+you save the transformer outputs for later use.
+
+
+
+Try out a BERT-based model pipeline using this project template: swap in your
+data, edit the settings and hyperparameters and train, evaluate, package and
+visualize your model.
+
+
+
+For more details on how to integrate transformer models into your training
+config and customize the implementations, see the usage guide on
+[training transformers](/usage/transformers#training).
+
+### Pretraining with spaCy {#pretraining}
+
+
+
## Parallel Training with Ray {#parallel-training}
diff --git a/website/src/styles/aside.module.sass b/website/src/styles/aside.module.sass
index 7746451b4..0e73cc61a 100644
--- a/website/src/styles/aside.module.sass
+++ b/website/src/styles/aside.module.sass
@@ -24,10 +24,16 @@ $border-radius: 6px
&:last-child
margin: 0
+ &:first-child h4
+ margin-top: 0 !important
+
code
padding: 0
margin: 0
+ h4
+ margin-left: 0
+
p, ul, ol
font: inherit
margin-bottom: var(--spacing-sm)
diff --git a/website/src/styles/layout.sass b/website/src/styles/layout.sass
index 9a8640b63..9660363dd 100644
--- a/website/src/styles/layout.sass
+++ b/website/src/styles/layout.sass
@@ -373,7 +373,7 @@ body [id]:target
margin-right: -1.5em
margin-left: -1.5em
padding-right: 1.5em
- padding-left: 1.65em
+ padding-left: 1.1em
&:empty:before
// Fix issue where empty lines would disappear