diff --git a/website/docs/api/corpus.md b/website/docs/api/corpus.md index 38e19129d..5f639d050 100644 --- a/website/docs/api/corpus.md +++ b/website/docs/api/corpus.md @@ -6,30 +6,44 @@ source: spacy/gold/corpus.py new: 3 --- -This class manages annotated corpora and can read training and development -datasets in the [DocBin](/api/docbin) (`.spacy`) format. +This class manages annotated corpora and can be used for training and +development datasets in the [DocBin](/api/docbin) (`.spacy`) format. To +customize the data loading during training, you can register your own +[data readers and batchers](/usage/training#custom-code-readers-batchers). ## Corpus.\_\_init\_\_ {#init tag="method"} -Create a `Corpus`. The input data can be a file or a directory of files. +Create a `Corpus` for iterating [Example](/api/example) objects from a file or +directory of [`.spacy` data files](/api/data-formats#binary-training). The +`gold_preproc` setting lets you specify whether to set up the `Example` object +with gold-standard sentences and tokens for the predictions. Gold preprocessing +helps the annotations align to the tokenization, and may result in sequences of +more consistent length. However, it may reduce runtime accuracy due to +train/test skew. > #### Example > > ```python > from spacy.gold import Corpus > -> corpus = Corpus("./train.spacy", "./dev.spacy") +> # With a single file +> corpus = Corpus("./data/train.spacy") +> +> # With a directory +> corpus = Corpus("./data", limit=10) > ``` -| Name | Type | Description | -| ------- | ------------ | ---------------------------------------------------------------- | -| `train` | str / `Path` | Training data (`.spacy` file or directory of `.spacy` files). | -| `dev` | str / `Path` | Development data (`.spacy` file or directory of `.spacy` files). | -| `limit` | int | Maximum number of examples returned. `0` for no limit (default). 
| +| Name | Type | Description | +| --------------- | ------------ | ------------------------------------------------------------------------------------------------------------------------------------------- | +| `path` | str / `Path` | The directory or filename to read from. | +| _keyword-only_ | | | +|  `gold_preproc` | bool | Whether to set up the Example object with gold-standard sentences and tokens for the predictions. Defaults to `False`. | +| `max_length` | int | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit. | +| `limit` | int | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. | -## Corpus.train_dataset {#train_dataset tag="method"} +## Corpus.\_\_call\_\_ {#call tag="method"} -Yield examples from the training data. +Yield examples from the data. > #### Example > @@ -37,60 +51,12 @@ Yield examples from the training data. > from spacy.gold import Corpus > import spacy > -> corpus = Corpus("./train.spacy", "./dev.spacy") +> corpus = Corpus("./train.spacy") > nlp = spacy.blank("en") -> train_data = corpus.train_dataset(nlp) +> train_data = corpus(nlp) > ``` -| Name | Type | Description | -| -------------- | ---------- | ------------------------------------------------------------------------------------------------------------------------------------------ | -| `nlp` | `Language` | The current `nlp` object. | -| _keyword-only_ | | | -| `shuffle` | bool | Whether to shuffle the examples. Defaults to `True`. | -| `gold_preproc` | bool | Whether to train on gold-standard sentences and tokens. Defaults to `False`. | -| `max_length` | int | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. `0` for no limit (default).  | -| **YIELDS** | `Example` | The examples. | - -## Corpus.dev_dataset {#dev_dataset tag="method"} - -Yield examples from the development data. 
- -> #### Example -> -> ```python -> from spacy.gold import Corpus -> import spacy -> -> corpus = Corpus("./train.spacy", "./dev.spacy") -> nlp = spacy.blank("en") -> dev_data = corpus.dev_dataset(nlp) -> ``` - -| Name | Type | Description | -| -------------- | ---------- | ---------------------------------------------------------------------------- | -| `nlp` | `Language` | The current `nlp` object. | -| _keyword-only_ | | | -| `gold_preproc` | bool | Whether to train on gold-standard sentences and tokens. Defaults to `False`. | -| **YIELDS** | `Example` | The examples. | - -## Corpus.count_train {#count_train tag="method"} - -Get the word count of all training examples. - -> #### Example -> -> ```python -> from spacy.gold import Corpus -> import spacy -> -> corpus = Corpus("./train.spacy", "./dev.spacy") -> nlp = spacy.blank("en") -> word_count = corpus.count_train(nlp) -> ``` - -| Name | Type | Description | -| ----------- | ---------- | ------------------------- | -| `nlp` | `Language` | The current `nlp` object. | -| **RETURNS** | int | The word count. | - - +| Name | Type | Description | +| ---------- | ---------- | ------------------------- | +| `nlp` | `Language` | The current `nlp` object. | +| **YIELDS** | `Example` | The examples. | diff --git a/website/docs/api/top-level.md b/website/docs/api/top-level.md index 882dfa193..5b3326739 100644 --- a/website/docs/api/top-level.md +++ b/website/docs/api/top-level.md @@ -4,7 +4,7 @@ menu: - ['spacy', 'spacy'] - ['displacy', 'displacy'] - ['registry', 'registry'] - - ['Loaders & Batchers', 'loaders-batchers'] + - ['Readers & Batchers', 'readers-batchers'] - ['Data & Alignment', 'gold'] - ['Utility Functions', 'util'] --- @@ -303,6 +303,9 @@ factories. | `lookups` | Registry for large lookup tables available via `vocab.lookups`. | | `displacy_colors` | Registry for custom color scheme for the [`displacy` NER visualizer](/usage/visualizers). 
Automatically reads from [entry points](/usage/saving-loading#entry-points). | | `assets` | | +| `callbacks` | Registry for custom callbacks to [modify the `nlp` object](/usage/training#custom-code-nlp-callbacks) before training. | +| `readers` | Registry for training and evaluation [data readers](#readers-batchers). | +| `batchers` | Registry for training and evaluation [data batchers](#readers-batchers). | | `optimizers` | Registry for functions that create [optimizers](https://thinc.ai/docs/api-optimizers). | | `schedules` | Registry for functions that create [schedules](https://thinc.ai/docs/api-schedules). | | `layers` | Registry for functions that create [layers](https://thinc.ai/docs/api-layers). | @@ -334,10 +337,113 @@ See the [`Transformer`](/api/transformer) API reference and | [`span_getters`](/api/transformer#span_getters) | Registry for functions that take a batch of `Doc` objects and return a list of `Span` objects to process by the transformer, e.g. sentences. | | [`annotation_setters`](/api/transformer#annotation_setters) | Registry for functions that create annotation setters. Annotation setters are functions that take a batch of `Doc` objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set additional annotations on the `Doc`. | -## Training data loaders and batchers {#loaders-batchers new="3"} +## Data readers and batchers {#readers-batchers new="3"} +### spacy.Corpus.v1 {#corpus tag="registered function" source="spacy/gold/corpus.py"} + +Registered function that creates a [`Corpus`](/api/corpus) of training or +evaluation data. It takes the same arguments as the `Corpus` class and returns a +callable that yields [`Example`](/api/example) objects. You can replace it with +your own registered function in the [`@readers` registry](#registry) to +customize the data loading and streaming. 
+ +> #### Example config +> +> ```ini +> [paths] +> train = "corpus/train.spacy" +> +> [training.train_corpus] +> @readers = "spacy.Corpus.v1" +> path = ${paths:train} +> gold_preproc = false +> max_length = 0 +> limit = 0 +> ``` + +| Name | Type | Description | +| --------------- | ------ | ----------------------------------------------------------------------------------------------------------------------------------------------- | +| `path` | `Path` | The directory or filename to read from. Expects data in spaCy's binary [`.spacy` format](/api/data-formats#binary-training). | +|  `gold_preproc` | bool | Whether to set up the Example object with gold-standard sentences and tokens for the predictions. See [`Corpus`](/api/corpus#init) for details. | +| `max_length` | int | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit. | +| `limit` | int | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. | + +### Batchers {#batchers source="spacy/gold/batchers.py"} + + + +#### batch_by_words.v1 {#batch_by_words tag="registered function"} + +Create minibatches of roughly a given number of words. If any examples are +longer than the specified batch length, they will appear in a batch by +themselves, or be discarded if `discard_oversize` is set to `True`. The argument +`docs` can be a list of strings, [`Doc`](/api/doc) objects or +[`Example`](/api/example) objects. + +> #### Example config +> +> ```ini +> [training.batcher] +> @batchers = "batch_by_words.v1" +> size = 100 +> tolerance = 0.2 +> discard_oversize = false +> get_length = null +> ``` + + + +| Name | Type | Description | +| ------------------ | ---------------------- | ----------------------------------------------------------------------------------------------------------------------------------- | +| `size` | `Iterable[int]` / int | The batch size. 
Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). | +| `tolerance` | float | | +| `discard_oversize` | bool | Discard items that are longer than the specified batch length. | +| `get_length` | `Callable[[Any], int]` | Optional function that receives a sequence and returns its length. Defaults to the built-in `len()` if not set. | + +#### batch_by_sequence.v1 {#batch_by_sequence tag="registered function"} + + + +> #### Example config +> +> ```ini +> [training.batcher] +> @batchers = "batch_by_sequence.v1" +> size = 32 +> get_length = null +> ``` + + + +| Name | Type | Description | +| ------------ | ---------------------- | ----------------------------------------------------------------------------------------------------------------------------------- | +| `size` | `Iterable[int]` / int | The batch size. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). | +| `get_length` | `Callable[[Any], int]` | Optional function that receives a sequence and returns its length. Defaults to the built-in `len()` if not set. | + +#### batch_by_padded.v1 {#batch_by_padded tag="registered function"} + + + +> #### Example config +> +> ```ini +> [training.batcher] +> @batchers = "batch_by_padded.v1" +> size = 100 +> buffer = TODO: +> discard_oversize = false +> get_length = null +> ``` + +| Name | Type | Description | +| ------------------ | ---------------------- | ----------------------------------------------------------------------------------------------------------------------------------- | +| `size` | `Iterable[int]` / int | The batch size. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). | +| `buffer` | int | | +| `discard_oversize` | bool | Discard items that are longer than the specified batch length. 
| +| `get_length` | `Callable[[Any], int]` | Optional function that receives a sequence and returns its length. Defaults to the built-in `len()` if not set. | + ## Training data and alignment {#gold source="spacy/gold"} ### gold.docs_to_json {#docs_to_json tag="function"} diff --git a/website/docs/usage/training.md b/website/docs/usage/training.md index 7c9d50921..c0ec052b9 100644 --- a/website/docs/usage/training.md +++ b/website/docs/usage/training.md @@ -5,8 +5,8 @@ menu: - ['Introduction', 'basics'] - ['Quickstart', 'quickstart'] - ['Config System', 'config'] - - ['Transfer Learning', 'transfer-learning'] - ['Custom Models', 'custom-models'] + - ['Transfer Learning', 'transfer-learning'] - ['Parallel Training', 'parallel-training'] - ['Internal API', 'api'] --- @@ -315,6 +315,10 @@ stop = 1000 compound = 1.001 ``` +### Using variable interpolation {#config-interpolation} + + + ### Model architectures {#model-architectures} @@ -384,41 +388,17 @@ still look good. -## Transfer learning {#transfer-learning} - -### Using transformer models like BERT {#transformers} - -spaCy v3.0 lets you use almost any statistical model to power your pipeline. You -can use models implemented in a variety of frameworks. A transformer model is -just a statistical model, so the -[`spacy-transformers`](https://github.com/explosion/spacy-transformers) package -actually has very little work to do: it just has to provide a few functions that -do the required plumbing. It also provides a pipeline component, -[`Transformer`](/api/transformer), that lets you do multi-task learning and lets -you save the transformer outputs for later use. - - - -Try out a BERT-based model pipeline using this project template: swap in your -data, edit the settings and hyperparameters and train, evaluate, package and -visualize your model. 
- - - -For more details on how to integrate transformer models into your training -config and customize the implementations, see the usage guide on -[training transformers](/usage/transformers#training). - -### Pretraining with spaCy {#pretraining} - - - ## Custom model implementations and architectures {#custom-models} ### Training with custom code {#custom-code} +> ```bash +> ### Example {wrap="true"} +> $ python -m spacy train train.spacy dev.spacy config.cfg --code functions.py +> ``` + The [`spacy train`](/api/cli#train) recipe lets you specify an optional argument `--code` that points to a Python file. The file is imported before training and allows you to add custom functions and architectures to the function registry @@ -426,6 +406,120 @@ that can then be referenced from your `config.cfg`. This lets you train spaCy models with custom components, without having to re-implement the whole training workflow. +#### Example: Modifying the nlp object {#custom-code-nlp-callbacks} + +For many use cases, you don't necessarily want to implement the whole `Language` +subclass and language data from scratch – it's often enough to make a few small +modifications, like adjusting the +[tokenization rules](/usage/linguistic-features#native-tokenizer-additions) or +[language defaults](/api/language#defaults) like stop words. The config lets you +provide three optional **callback functions** that give you access to the +language class and `nlp` object at different points of the lifecycle: + +| Callback | Description | +| ------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `before_creation` | Called before the `nlp` object is created and receives the language subclass like `English` (not the instance). Useful for writing to the [`Language.Defaults`](/api/language#defaults). 
| +| `after_creation` | Called right after the `nlp` object is created, but before the pipeline components are added to the pipeline and receives the `nlp` object. Useful for modifying the tokenizer. | +| `after_pipeline_creation` | Called right after the pipeline components are created and added and receives the `nlp` object. Useful for modifying pipeline components. | + +The `@spacy.registry.callbacks` decorator lets you register that function in the +`callbacks` [registry](/api/top-level#registry) under a given name. You can then +reference the function in a config block using the `@callbacks` key. If a block +contains a key starting with an `@`, it's interpreted as a reference to a +function. Because you've registered the function, spaCy knows how to create it +when you reference `"customize_language_data"` in your config. Here's an example +of a callback that runs before the `nlp` object is created and adds a few custom +tokenization rules to the defaults: + +> #### config.cfg +> +> ```ini +> [nlp.before_creation] +> @callbacks = "customize_language_data" +> ``` + +```python +### functions.py {highlight="3,6"} +import spacy + +@spacy.registry.callbacks("customize_language_data") +def create_callback(): + def customize_language_data(lang_cls): + lang_cls.Defaults.suffixes = lang_cls.Defaults.suffixes + (r"-+$",) + return lang_cls + + return customize_language_data +``` + + + +Remember that a registered function should always be a function that spaCy +**calls to create something**. In this case, it **creates a callback** – it's +not the callback itself. + + + +Any registered function – in this case `create_callback` – can also take +**arguments** that can be **set by the config**. This lets you implement and +keep track of different configurations, without having to hack at your code. You +can choose any arguments that make sense for your use case. 
In this example, +we're adding the arguments `extra_stop_words` (a list of strings) and `debug` +(boolean) for printing additional info when the function runs. + +> #### config.cfg +> +> ```ini +> [nlp.before_creation] +> @callbacks = "customize_language_data" +> extra_stop_words = ["ooh", "aah"] +> debug = true +> ``` + +```python +### functions.py {highlight="5,8-10"} +from typing import List +import spacy + +@spacy.registry.callbacks("customize_language_data") +def create_callback(extra_stop_words: List[str] = [], debug: bool = False): + def customize_language_data(lang_cls): + lang_cls.Defaults.suffixes = lang_cls.Defaults.suffixes + (r"-+$",) + lang_cls.Defaults.stop_words.update(extra_stop_words) + if debug: + print("Updated stop words and tokenizer suffixes") + return lang_cls + + return customize_language_data +``` + + + +spaCy's configs are powered by our machine learning library Thinc's +[configuration system](https://thinc.ai/docs/usage-config), which supports +[type hints](https://docs.python.org/3/library/typing.html) and even +[advanced type annotations](https://thinc.ai/docs/usage-config#advanced-types) +using [`pydantic`](https://github.com/samuelcolvin/pydantic). If your registered +function provides type hints, the values that are passed in will be checked +against the expected types. For example, `debug: bool` in the example above will +ensure that the value received as the argument `debug` is a boolean. If the +value can't be coerced into a boolean, spaCy will raise an error. +`debug: pydantic.StrictBool` will force the value to be a boolean and raise an +error if it's not – for instance, if your config defines `1` instead of `true`. + + + +With your `functions.py` defining additional code and the updated `config.cfg`, +you can now run [`spacy train`](/api/cli#train) and point the argument `--code` +to your Python file. Before loading the config, spaCy will import the +`functions.py` module and your custom functions will be registered. 
+ +```bash +### Training with custom code {wrap="true"} +python -m spacy train train.spacy dev.spacy config.cfg --output ./output --code ./functions.py +``` + +#### Example: Custom batch size schedule {#custom-code-schedule} + For example, let's say you've implemented your own batch size schedule to use during training. The `@spacy.registry.schedules` decorator lets you register that function in the `schedules` [registry](/api/top-level#registry) and assign @@ -459,8 +553,6 @@ the functions need to be represented in the config. If your function defines **default argument values**, spaCy is able to auto-fill your config when you run [`init config`](/api/cli#init-config). - - ```ini ### config.cfg (excerpt) [training.batch_size] @@ -469,31 +561,9 @@ start = 2 factor = 1.005 ``` -You can now run [`spacy train`](/api/cli#train) with the `config.cfg` and your -custom `functions.py` as the argument `--code`. Before loading the config, spaCy -will import the `functions.py` module and your custom functions will be -registered. +#### Example: Custom data reading and batching {#custom-code-readers-batchers} -```bash -### Training with custom code {wrap="true"} -python -m spacy train train.spacy dev.spacy config.cfg --output ./output --code ./functions.py -``` - - - -spaCy's configs are powered by our machine learning library Thinc's -[configuration system](https://thinc.ai/docs/usage-config), which supports -[type hints](https://docs.python.org/3/library/typing.html) and even -[advanced type annotations](https://thinc.ai/docs/usage-config#advanced-types) -using [`pydantic`](https://github.com/samuelcolvin/pydantic). If your registered -function provides type hints, the values that are passed in will be checked -against the expected types. For example, `start: int` in the example above will -ensure that the value received as the argument `start` is an integer. If the -value can't be coerced into an integer, spaCy will raise an error. 
-`start: pydantic.StrictInt` will force the value to be an integer and raise an -error if it's not – for instance, if your config defines a float. - - + ### Wrapping PyTorch and TensorFlow {#custom-frameworks} @@ -511,6 +581,35 @@ mattis pretium. +## Transfer learning {#transfer-learning} + +### Using transformer models like BERT {#transformers} + +spaCy v3.0 lets you use almost any statistical model to power your pipeline. You +can use models implemented in a variety of frameworks. A transformer model is +just a statistical model, so the +[`spacy-transformers`](https://github.com/explosion/spacy-transformers) package +actually has very little work to do: it just has to provide a few functions that +do the required plumbing. It also provides a pipeline component, +[`Transformer`](/api/transformer), that lets you do multi-task learning and lets +you save the transformer outputs for later use. + + + +Try out a BERT-based model pipeline using this project template: swap in your +data, edit the settings and hyperparameters and train, evaluate, package and +visualize your model. + + + +For more details on how to integrate transformer models into your training +config and customize the implementations, see the usage guide on +[training transformers](/usage/transformers#training). 
+ +### Pretraining with spaCy {#pretraining} + + + ## Parallel Training with Ray {#parallel-training} diff --git a/website/src/styles/aside.module.sass b/website/src/styles/aside.module.sass index 7746451b4..0e73cc61a 100644 --- a/website/src/styles/aside.module.sass +++ b/website/src/styles/aside.module.sass @@ -24,10 +24,16 @@ $border-radius: 6px &:last-child margin: 0 + &:first-child h4 + margin-top: 0 !important + code padding: 0 margin: 0 + h4 + margin-left: 0 + p, ul, ol font: inherit margin-bottom: var(--spacing-sm) diff --git a/website/src/styles/layout.sass b/website/src/styles/layout.sass index 9a8640b63..9660363dd 100644 --- a/website/src/styles/layout.sass +++ b/website/src/styles/layout.sass @@ -373,7 +373,7 @@ body [id]:target margin-right: -1.5em margin-left: -1.5em padding-right: 1.5em - padding-left: 1.65em + padding-left: 1.1em &:empty:before // Fix issue where empty lines would disappear