---
title: Training Models
next: /usage/layers-architectures
menu:
  - ['Introduction', 'basics']
  - ['Quickstart', 'quickstart']
  - ['Config System', 'config']
  - ['Custom Functions', 'custom-functions']
  # - ['Parallel Training', 'parallel-training']
  - ['Internal API', 'api']
---

## Introduction to training models {#basics hidden="true"}

import Training101 from 'usage/101/\_training.md'

[![Prodigy: Radically efficient machine teaching](../images/prodigy.jpg)](https://prodi.gy)

If you need to label a lot of data, check out [Prodigy](https://prodi.gy), a
new, active learning-powered annotation tool we've developed. Prodigy is fast
and extensible, and comes with a modern **web application** that helps you
collect training data faster. It integrates seamlessly with spaCy, pre-selects
the **most relevant examples** for annotation, and lets you train and evaluate
ready-to-use spaCy models.

## Quickstart {#quickstart tag="new"}

The recommended way to train your spaCy models is via the
[`spacy train`](/api/cli#train) command on the command line. It only needs a
single [`config.cfg`](#config) **configuration file** that includes all settings
and hyperparameters. You can optionally [overwrite](#config-overrides) settings
on the command line, and load in a Python file to register
[custom functions](#custom-code) and architectures. This quickstart widget helps
you generate a starter config with the **recommended settings** for your
specific use case. It's also available in spaCy as the
[`init config`](/api/cli#init-config) command.

> #### Instructions: widget
>
> 1. Select your requirements and settings.
> 2. Use the buttons at the bottom to save the result to your clipboard or a
>    file `base_config.cfg`.
> 3. Run [`init fill-config`](/api/cli#init-fill-config) to create a full
>    config.
> 4. Run [`train`](/api/cli#train) with your config and data.
>
> #### Instructions: CLI
>
> 1. Run the [`init config`](/api/cli#init-config) command and specify your
>    requirements and settings as CLI arguments.
> 2. Run [`train`](/api/cli#train) with the exported config and data.

import QuickstartTraining from 'widgets/quickstart-training.js'

After you've saved the starter config to a file `base_config.cfg`, you can use
the [`init fill-config`](/api/cli#init-fill-config) command to fill in the
remaining defaults. Training configs should always be **complete and without
hidden defaults**, to keep your experiments reproducible.

```cli
$ python -m spacy init fill-config base_config.cfg config.cfg
```

> #### Tip: Debug your data
>
> The [`debug data` command](/api/cli#debug-data) lets you analyze and validate
> your training and development data, get useful stats, and find problems like
> invalid entity annotations, cyclic dependencies, low data labels and more.
>
> ```cli
> $ python -m spacy debug data config.cfg
> ```

Instead of exporting your starter config from the quickstart widget and
auto-filling it, you can also use the [`init config`](/api/cli#init-config)
command and specify your requirements and settings as CLI arguments. You can now
add your data and run [`train`](/api/cli#train) with your config. See the
[`convert`](/api/cli#convert) command for details on how to convert your data to
spaCy's binary `.spacy` format. You can either include the data paths in the
`[paths]` section of your config, or pass them in via the command line.

```cli
$ python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy
```

## Training config {#config}

Training config files include all **settings and hyperparameters** for training
your model. Instead of providing lots of arguments on the command line, you only
need to pass your `config.cfg` file to [`spacy train`](/api/cli#train).
Under the hood, the training config uses the
[configuration system](https://thinc.ai/docs/usage-config) provided by our
machine learning library [Thinc](https://thinc.ai). This also makes it easy to
integrate custom models and architectures, written in your framework of choice.
Some of the main advantages and features of spaCy's training config are:

- **Structured sections.** The config is grouped into sections, and nested
  sections are defined using the `.` notation. For example, `[components.ner]`
  defines the settings for the pipeline's named entity recognizer. The config
  can be loaded as a Python dict.
- **References to registered functions.** Sections can refer to registered
  functions like [model architectures](/api/architectures),
  [optimizers](https://thinc.ai/docs/api-optimizers) or
  [schedules](https://thinc.ai/docs/api-schedules) and define arguments that are
  passed into them. You can also [register your own functions](#custom-functions)
  to define custom architectures or methods, reference them in your config and
  tweak their parameters.
- **Interpolation.** If you have hyperparameters or other settings used by
  multiple components, define them once and reference them as
  [variables](#config-interpolation).
- **Reproducibility with no hidden defaults.** The config file is the "single
  source of truth" and includes all settings.
- **Automated checks and validation.** When you load a config, spaCy checks if
  the settings are complete and if all values have the correct types. This lets
  you catch potential mistakes early. In your custom architectures, you can use
  Python [type hints](https://docs.python.org/3/library/typing.html) to tell the
  config which types of data to expect.

```ini
https://github.com/explosion/spaCy/blob/develop/spacy/default_config.cfg
```

Under the hood, the config is parsed into a dictionary. It's divided into
sections and subsections, indicated by the square brackets and dot notation.
For example, `[training]` is a section and `[training.batch_size]` a subsection.
Subsections can define values, just like a dictionary, or use the `@` syntax to
refer to [registered functions](#config-functions). This allows the config to
not just define static settings, but also construct objects like architectures,
schedules, optimizers or any other custom components. The main top-level
sections of a config file are:

| Section | Description |
| --- | --- |
| `nlp` | Definition of the `nlp` object, its tokenizer and [processing pipeline](/usage/processing-pipelines) component names. |
| `components` | Definitions of the [pipeline components](/usage/processing-pipelines) and their models. |
| `paths` | Paths to data and other assets. Re-used across the config as variables, e.g. `${paths.train}`, and can be [overwritten](#config-overrides) on the CLI. |
| `system` | Settings related to system and hardware. Re-used across the config as variables, e.g. `${system.seed}`, and can be [overwritten](#config-overrides) on the CLI. |
| `training` | Settings and controls for the training and evaluation process. |
| `pretraining` | Optional settings and controls for the [language model pretraining](#pretraining). |

For a full overview of spaCy's config format and settings, see the
[data format documentation](/api/data-formats#config) and
[Thinc's config system docs](https://thinc.ai/usage/config). The settings
available for the different architectures are documented with the
[model architectures API](/api/architectures). See the Thinc documentation for
[optimizers](https://thinc.ai/docs/api-optimizers) and
[schedules](https://thinc.ai/docs/api-schedules).
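To make the section and dot notation more concrete, here is a minimal sketch
that turns a small config excerpt into a nested dictionary using only Python's
standard library. It is an illustration, not how spaCy or Thinc load configs:
the real system also resolves `@` references and variables. The helper name
`to_nested_dict` and the excerpt itself are assumptions made up for this
example, as is the choice to parse values as JSON.

```python
import configparser
import json

CONFIG_TEXT = """
[nlp]
lang = "en"
pipeline = ["parser", "ner"]

[training]
seed = 0

[training.batch_size]
start = 100
stop = 1000
"""


def to_nested_dict(text):
    """Parse INI-style text and nest dotted section names (hypothetical helper)."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    result = {}
    for section in parser.sections():
        node = result
        # "training.batch_size" becomes result["training"]["batch_size"]
        for part in section.split("."):
            node = node.setdefault(part, {})
        for key, value in parser.items(section):
            # Assumption for this sketch: every value is valid JSON
            node[key] = json.loads(value)
    return result


config = to_nested_dict(CONFIG_TEXT)
print(config["training"]["batch_size"]["start"])
```

The subsection `[training.batch_size]` ends up as a dictionary nested under
`config["training"]`, next to the plain `seed` value.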
### Overwriting config settings on the command line {#config-overrides}

The config system means that you can define all settings **in one place** and in
a consistent format. There are no command-line arguments that need to be set,
and no hidden defaults. However, there can still be scenarios where you may want
to override config settings when you run [`spacy train`](/api/cli#train). This
includes **file paths** to vectors or other resources that shouldn't be
hard-coded in a config file, or **system-dependent settings**. For cases like
this, you can set additional command-line options starting with `--` that
correspond to the config section and value to override. For example,
`--paths.train ./corpus/train.spacy` sets the `train` value in the `[paths]`
block.

```cli
$ python -m spacy train config.cfg --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy --training.batch_size 128
```

Only existing sections and values in the config can be overwritten. At the end
of the training, the final filled `config.cfg` is exported with your model, so
you'll always have a record of the settings that were used, including your
overrides. Note that overrides are applied before
[variables](#config-interpolation) are resolved, so if you need to use a value
in multiple places, reference it across your config and override it on the CLI
once.

### Defining pipeline components {#config-components}

When you train a model, you typically train a
[pipeline](/usage/processing-pipelines) of **one or more components**. The
`[components]` block in the config defines the available pipeline components and
how they should be created – either by a built-in or custom
[factory](/usage/processing-pipelines#built-in), or
[sourced](/usage/processing-pipelines#sourced-components) from an existing
pretrained model. For example, `[components.parser]` defines the component named
`"parser"` in the pipeline.
There are different ways you might want to treat your components during
training, and the most common scenarios are:

1. Train a **new component** from scratch on your data.
2. Update an existing **pretrained component** with more examples.
3. Include an existing pretrained component without updating it.
4. Include a non-trainable component, like a rule-based
   [`EntityRuler`](/api/entityruler) or [`Sentencizer`](/api/sentencizer), or a
   fully [custom component](/usage/processing-pipelines#custom-components).

If a component block defines a `factory`, spaCy will look it up in the
[built-in](/usage/processing-pipelines#built-in) or
[custom](/usage/processing-pipelines#custom-components) components and create a
new component from scratch. All settings defined in the config block will be
passed to the component factory as arguments. This lets you configure the model
settings and hyperparameters. If a component block defines a `source`, the
component will be copied over from an existing pretrained model, with its
existing weights. This lets you include an already trained component in your
model pipeline, or update a pretrained component with more data specific to your
use case.

```ini
### config.cfg (excerpt)
[components]

# "parser" and "ner" are sourced from a pretrained model
[components.parser]
source = "en_core_web_sm"

[components.ner]
source = "en_core_web_sm"

# "textcat" and "custom" are created blank from a built-in / custom factory
[components.textcat]
factory = "textcat"

[components.custom]
factory = "your_custom_factory"
your_custom_setting = true
```

The `pipeline` setting in the `[nlp]` block defines the pipeline components
added to the pipeline, in order. For example, `"parser"` here references
`[components.parser]`. By default, spaCy will **update all components that can
be updated**. Trainable components that are created from scratch are initialized
with random weights.
For sourced components, spaCy will keep the existing weights and
[resume training](/api/language#resume_training).

If you don't want a component to be updated, you can **freeze** it by adding it
to the `frozen_components` list in the `[training]` block. Frozen components are
**not updated** during training and are included in the final trained model
as-is.

> #### Note on frozen components
>
> Even though frozen components are not **updated** during training, they will
> still **run** during training and evaluation. This is very important, because
> they may still impact your model's performance – for instance, a sentence
> boundary detector can impact what the parser or entity recognizer considers a
> valid parse. So the evaluation results should always reflect what your model
> will produce at runtime.

```ini
[nlp]
lang = "en"
pipeline = ["parser", "ner", "textcat", "custom"]

[training]
frozen_components = ["parser", "custom"]
```

### Using registered functions {#config-functions}

The training configuration defined in the config file doesn't have to only
consist of static values. Some settings can also be **functions**. For instance,
the `batch_size` can be a number that doesn't change, or a schedule, like a
sequence of compounding values, which has been shown to be an effective trick
(see [Smith et al., 2017](https://arxiv.org/abs/1711.00489)).

```ini
### With static value
[training]
batch_size = 128
```

To refer to a function instead, you can make `[training.batch_size]` its own
section and use the `@` syntax to specify the function and its arguments – in
this case [`compounding.v1`](https://thinc.ai/docs/api-schedules#compounding)
defined in the [function registry](/api/top-level#registry). All other values
defined in the block are passed to the function as keyword arguments when it's
initialized. You can also use this mechanism to register
[custom implementations and architectures](#custom-functions) and reference them
from your configs.
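For intuition about what "a sequence of compounding values" looks like, here is
a minimal pure-Python sketch of such a schedule. It is not Thinc's
implementation of `compounding.v1`. It assumes an increasing schedule
(`start < stop` and `compound > 1`), and the growth factor of `1.5` is chosen
only so that the values change visibly within a few steps.

```python
from itertools import islice


def compounding(start, stop, compound):
    """Yield values that grow by a constant factor and are capped at `stop`.

    Sketch only: assumes start < stop and compound > 1.
    """
    value = start
    while True:
        yield value
        # Multiply by the growth factor, but never exceed the upper bound
        value = min(value * compound, stop)


# Take the first eight batch sizes from the (infinite) schedule
sizes = list(islice(compounding(start=100, stop=1000, compound=1.5), 8))
print(sizes)
```

Because the schedule is an infinite generator, the caller decides how many
values to consume, which is why `islice` is used here.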
> #### How the config is resolved
>
> The config file is parsed into a regular dictionary and is resolved and
> validated **bottom-up**. Arguments provided for registered functions are
> checked against the function's signature and type annotations. The return
> value of a registered function can also be passed into another function – for
> instance, a learning rate schedule can be provided as an argument of an
> optimizer.

```ini
### With registered function
[training.batch_size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
```

### Using variable interpolation {#config-interpolation}

Another very useful feature of the config system is that it supports variable
interpolation for both **values and sections**. This means that you only need to
define a setting once and can reference it across your config using the
`${section.value}` syntax. In this example, the value of `seed` is reused within
the `[training]` block, and the whole block of `[training.optimizer]` is reused
in `[pretraining]` and will become `pretraining.optimizer`.

```ini
### config.cfg (excerpt) {highlight="5,18"}
[system]
seed = 0

[training]
seed = ${system.seed}

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 1e-8

[pretraining]
optimizer = ${training.optimizer}
```

You can also use variables inside strings. In that case, it works just like
f-strings in Python. If the value of a variable is not a string, it's converted
to a string.

```ini
[paths]
version = 5
root = "/Users/you/data"
train = "${paths.root}/train_${paths.version}.spacy"
# Result: /Users/you/data/train_5.spacy
```

If you need to change certain values between training runs, you can define them
once, reference them as variables and then [override](#config-overrides) them on
the CLI.
For example, `--paths.root /other/root` will change the value of `root` in the
block `[paths]` and the change will be reflected across all other values that
reference this variable.

### Model architectures {#model-architectures}

> #### 💡 Model type annotations
>
> In the documentation and code base, you may come across type annotations and
> descriptions of [Thinc](https://thinc.ai) model types, like ~~Model[List[Doc],
> List[Floats2d]]~~. This so-called generic type describes the layer and its
> input and output type – in this case, it takes a list of `Doc` objects as the
> input and list of 2-dimensional arrays of floats as the output. You can read
> more about defining Thinc models [here](https://thinc.ai/docs/usage-models).
> Also see the [type checking](https://thinc.ai/docs/usage-type-checking) docs
> for how to enable linting in your editor to see live feedback if your inputs
> and outputs don't match.

A **model architecture** is a function that wires up a Thinc
[`Model`](https://thinc.ai/docs/api-model) instance, which you can then use in a
component or as a layer of a larger network. You can use Thinc as a thin
[wrapper around frameworks](https://thinc.ai/docs/usage-frameworks) such as
PyTorch, TensorFlow or MXNet, or you can implement your logic in Thinc
[directly](https://thinc.ai/docs/usage-models).

spaCy's built-in components will never construct their `Model` instances
themselves, so you won't have to subclass the component to change its model
architecture. You can just **update the config** so that it refers to a
different registered function. Once the component has been created, its `Model`
instance has already been assigned, so you cannot change its model architecture.
The architecture is like a recipe for the network, and you can't change the
recipe once the dish has already been prepared. You have to make a new one.
spaCy includes a variety of built-in [architectures](/api/architectures) for
different tasks.
For example:

| Architecture | Description |
| --- | --- |
| [HashEmbedCNN](/api/architectures#HashEmbedCNN) | Build spaCy's "standard" embedding layer, which uses hash embedding with subword features and a CNN with layer-normalized maxout. ~~Model[List[Doc], List[Floats2d]]~~ |
| [TransitionBasedParser](/api/architectures#TransitionBasedParser) | Build a [transition-based parser](https://explosion.ai/blog/parsing-english-in-python) model used in the default [`EntityRecognizer`](/api/entityrecognizer) and [`DependencyParser`](/api/dependencyparser). ~~Model[List[Doc], List[List[Floats2d]]]~~ |
| [TextCatEnsemble](/api/architectures#TextCatEnsemble) | Stacked ensemble of a bag-of-words model and a neural network model with an internal CNN embedding layer. Used in the default [`TextCategorizer`](/api/textcategorizer). ~~Model[List[Doc], Floats2d]~~ |

### Metrics, training output and weighted scores {#metrics}

When you train a model using the [`spacy train`](/api/cli#train) command, you'll
see a table showing the metrics after each pass over the data. The available
metrics **depend on the pipeline components**. Pipeline components also define
which scores are shown and how they should be **weighted in the final score**
that determines the best model.

The `training.score_weights` setting in your `config.cfg` lets you customize the
scores shown in the table and how they should be weighted. In this example, the
labeled dependency accuracy and NER F-score count towards the final score with
40% each and the tagging accuracy makes up the remaining 20%. The tokenization
accuracy and speed are both shown in the table, but not counted towards the
score.
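As a rough illustration of how such weights combine, the following sketch
computes a final score as a plain weighted sum, using the 40% / 40% / 20% split
described above. The function name `weighted_score` is hypothetical, the
per-metric scores are made-up numbers, and the assumption that the final score
is a simple weighted sum is a simplification for this example.

```python
def weighted_score(scores, weights):
    """Combine per-metric scores into one number (hypothetical helper).

    Metrics with a weight of 0.0 are still reported, but add nothing here.
    """
    return sum(scores.get(name, 0.0) * weight for name, weight in weights.items())


weights = {"dep_las": 0.4, "ents_f": 0.4, "tag_acc": 0.2, "token_acc": 0.0, "speed": 0.0}

# Made-up evaluation results, for illustration only
scores = {"dep_las": 0.80, "ents_f": 0.70, "tag_acc": 0.90, "token_acc": 1.0, "speed": 10000.0}

final = weighted_score(scores, weights)
print(round(final, 2))
```

Raising the weight of `ents_f` relative to the others would make a checkpoint
with better NER results win, even if its tagging accuracy is slightly lower.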
> #### Why do I need score weights?
>
> At the end of your training process, you typically want to select the **best
> model** – but what "best" means depends on the available components and your
> specific use case. For instance, you may prefer a model with higher NER and
> lower POS tagging accuracy over a model with lower NER and higher POS
> accuracy. You can express this preference in the score weights, e.g. by
> assigning `ents_f` (NER F-score) a higher weight.

```ini
[training.score_weights]
dep_las = 0.4
ents_f = 0.4
tag_acc = 0.2
token_acc = 0.0
speed = 0.0
```

The `score_weights` don't _have to_ sum to `1.0` – but it's recommended. When
you generate a config for a given pipeline, the score weights are generated by
combining and normalizing the default score weights of the pipeline components.
The default score weights are defined by each pipeline component via the
`default_score_weights` setting on the
[`@Language.component`](/api/language#component) or
[`@Language.factory`](/api/language#factory) decorator. By default, all pipeline
components are weighted equally.

| Name | Description |
| --- | --- |
| **Loss** | The training loss representing the amount of work left for the optimizer. Should decrease, but usually not to `0`. |
| **Precision** (P) | Percentage of predicted annotations that were correct. Should increase. |
| **Recall** (R) | Percentage of reference annotations recovered. Should increase. |
| **F-Score** (F) | Harmonic mean of precision and recall. Should increase. |
| **UAS** / **LAS** | Unlabeled and labeled attachment score for the dependency parser, i.e. the percentage of correct arcs. Should increase. |
| **Words per second** (WPS) | Prediction speed in words per second. Should stay stable. |

Note that if the development data has raw text, some of the gold-standard
entities might not align to the predicted tokenization. These tokenization
errors are **excluded from the NER evaluation**. If your tokenization makes it
impossible for the model to predict 50% of your entities, your NER F-score might
still look good.

## Custom Functions {#custom-functions}

Registered functions in the training config files can refer to built-in
implementations, but you can also plug in fully **custom implementations**. All
you need to do is register your function using the `@spacy.registry` decorator
with the name of the respective [registry](/api/top-level#registry), e.g.
`@spacy.registry.architectures`, and a string name to assign to your function.
Registering custom functions allows you to **plug in models** defined in PyTorch
or TensorFlow, make **custom modifications** to the `nlp` object, create custom
optimizers or schedules, or **stream in data** and preprocess it on the fly
while training.

Each custom function can have any number of arguments that are passed in via the
[config](#config), just like the built-in functions. If your function defines
**default argument values**, spaCy is able to auto-fill your config when you run
[`init fill-config`](/api/cli#init-fill-config). If you want to make sure that a
given parameter is always explicitly set in the config, avoid setting a default
value for it.

### Training with custom code {#custom-code}

> #### Example
>
> ```cli
> $ python -m spacy train config.cfg --code functions.py
> ```

The [`spacy train`](/api/cli#train) recipe lets you specify an optional argument
`--code` that points to a Python file. The file is imported before training and
allows you to add custom functions and architectures to the function registry
that can then be referenced from your `config.cfg`. This lets you train spaCy
models with custom components, without having to re-implement the whole training
workflow.
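The idea of a function registry can be sketched in a few lines of plain Python.
This is not spaCy's registry implementation: the `Registry` class, the
`constant_schedule.v1` name and the way the config block is resolved below are
all assumptions made up to illustrate the mechanism of "register under a string
name, then look up by that name and call with the remaining settings".

```python
from typing import Callable, Dict


class Registry:
    """Tiny stand-in for a function registry: maps string names to functions."""

    def __init__(self) -> None:
        self._functions: Dict[str, Callable] = {}

    def __call__(self, name: str) -> Callable:
        # Used as a decorator: @registry("some_name.v1")
        def decorator(func: Callable) -> Callable:
            self._functions[name] = func
            return func

        return decorator

    def get(self, name: str) -> Callable:
        return self._functions[name]


schedules = Registry()


@schedules("constant_schedule.v1")
def constant_schedule(value: int = 32):
    while True:
        yield value


# A config block as a dict: the "@" key names the function,
# all other keys are passed to it as keyword arguments
block = {"@schedules": "constant_schedule.v1", "value": 64}
kwargs = {key: val for key, val in block.items() if not key.startswith("@")}
schedule = schedules.get(block["@schedules"])(**kwargs)
first = next(schedule)
print(first)
```

Importing a file that contains such decorated functions is enough to make them
available by name, which is the role the `--code` argument plays for
`spacy train`.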
#### Example: Modifying the nlp object {#custom-code-nlp-callbacks}

For many use cases, you don't necessarily want to implement the whole `Language`
subclass and language data from scratch – it's often enough to make a few small
modifications, like adjusting the
[tokenization rules](/usage/linguistic-features#native-tokenizer-additions) or
[language defaults](/api/language#defaults) like stop words. The config lets you
provide three optional **callback functions** that give you access to the
language class and `nlp` object at different points of the lifecycle:

| Callback | Description |
| --- | --- |
| `before_creation` | Called before the `nlp` object is created and receives the language subclass like `English` (not the instance). Useful for writing to the [`Language.Defaults`](/api/language#defaults). |
| `after_creation` | Called right after the `nlp` object is created, but before the pipeline components are added to the pipeline and receives the `nlp` object. Useful for modifying the tokenizer. |
| `after_pipeline_creation` | Called right after the pipeline components are created and added and receives the `nlp` object. Useful for modifying pipeline components. |

The `@spacy.registry.callbacks` decorator lets you register your custom function
in the `callbacks` [registry](/api/top-level#registry) under a given name. You
can then reference the function in a config block using the `@callbacks` key. If
a block contains a key starting with an `@`, it's interpreted as a reference to
a function. Because you've registered the function, spaCy knows how to create it
when you reference `"customize_language_data"` in your config.
Here's an example of a callback that runs before the `nlp` object is created and
adds a few custom tokenization rules to the defaults:

> #### config.cfg
>
> ```ini
> [nlp.before_creation]
> @callbacks = "customize_language_data"
> ```

```python
### functions.py {highlight="3,6"}
import spacy

@spacy.registry.callbacks("customize_language_data")
def create_callback():
    def customize_language_data(lang_cls):
        lang_cls.Defaults.suffixes = lang_cls.Defaults.suffixes + (r"-+$",)
        return lang_cls

    return customize_language_data
```

Remember that a registered function should always be a function that spaCy
**calls to create something**. In this case, it **creates a callback** – it's
not the callback itself.

Any registered function – in this case `create_callback` – can also take
**arguments** that can be **set by the config**. This lets you implement and
keep track of different configurations, without having to hack at your code. You
can choose any arguments that make sense for your use case. In this example,
we're adding the arguments `extra_stop_words` (a list of strings) and `debug`
(boolean) for printing additional info when the function runs.
> #### config.cfg
>
> ```ini
> [nlp.before_creation]
> @callbacks = "customize_language_data"
> extra_stop_words = ["ooh", "aah"]
> debug = true
> ```

```python
### functions.py {highlight="5,8-10"}
from typing import List
import spacy

@spacy.registry.callbacks("customize_language_data")
def create_callback(extra_stop_words: List[str] = [], debug: bool = False):
    def customize_language_data(lang_cls):
        lang_cls.Defaults.suffixes = lang_cls.Defaults.suffixes + (r"-+$",)
        lang_cls.Defaults.stop_words.update(extra_stop_words)
        if debug:
            print("Updated stop words and tokenizer suffixes")
        return lang_cls

    return customize_language_data
```

spaCy's configs are powered by our machine learning library Thinc's
[configuration system](https://thinc.ai/docs/usage-config), which supports
[type hints](https://docs.python.org/3/library/typing.html) and even
[advanced type annotations](https://thinc.ai/docs/usage-config#advanced-types)
using [`pydantic`](https://github.com/samuelcolvin/pydantic). If your registered
function provides type hints, the values that are passed in will be checked
against the expected types. For example, `debug: bool` in the example above will
ensure that the value received as the argument `debug` is a boolean. If the
value can't be coerced into a boolean, spaCy will raise an error.
`debug: pydantic.StrictBool` will force the value to be a boolean and raise an
error if it's not – for instance, if your config defines `1` instead of `true`.

With your `functions.py` defining additional code and the updated `config.cfg`,
you can now run [`spacy train`](/api/cli#train) and point the argument `--code`
to your Python file. Before loading the config, spaCy will import the
`functions.py` module and your custom functions will be registered.
```cli
$ python -m spacy train config.cfg --output ./output --code ./functions.py
```

#### Example: Custom logging function {#custom-logging}

During training, the results of each step are passed to a logger function in a
dictionary providing the following information:

| Key | Value |
| --- | --- |
| `epoch` | How many passes over the data have been completed. ~~int~~ |
| `step` | How many steps have been completed. ~~int~~ |
| `score` | The main score from the last evaluation, measured on the dev set. ~~float~~ |
| `other_scores` | The other scores from the last evaluation, measured on the dev set. ~~Dict[str, Any]~~ |
| `losses` | The accumulated training losses. ~~Dict[str, float]~~ |
| `checkpoints` | A list of previous results, where each result is a (score, step, epoch) tuple. ~~List[Tuple]~~ |

By default, these results are written to the console with the
[`ConsoleLogger`](/api/top-level#ConsoleLogger).

#### Example: Custom batch size schedule {#custom-code-schedule}

For example, let's say you've implemented your own batch size schedule to use
during training. The `@spacy.registry.schedules` decorator lets you register
that function in the `schedules` [registry](/api/top-level#registry) and assign
it a string name:

> #### Why the version in the name?
>
> A big benefit of the config system is that it makes your experiments
> reproducible. We recommend versioning the functions you register, especially
> if you expect them to change (like a new model architecture). This way, you
> know that a config referencing `v1` means a different function than a config
> referencing `v2`.
```python
### functions.py
import spacy

@spacy.registry.schedules("my_custom_schedule.v1")
def my_custom_schedule(start: int = 1, factor: float = 1.001):
    while True:
        yield start
        start = start * factor
```

In your config, you can now reference the schedule in the
`[training.batch_size]` block via `@schedules`. If a block contains a key
starting with an `@`, it's interpreted as a reference to a function. All other
settings in the block will be passed to the function as keyword arguments. Keep
in mind that the config shouldn't have any hidden defaults and all arguments on
the functions need to be represented in the config.

```ini
### config.cfg (excerpt)
[training.batch_size]
@schedules = "my_custom_schedule.v1"
start = 2
factor = 1.005
```

#### Example: Custom data reading and batching {#custom-code-readers-batchers}

Some use-cases require **streaming in data** or manipulating datasets on the
fly, rather than generating all data beforehand and storing it to file. Instead
of using the built-in [`Corpus`](/api/corpus) reader, which uses static file
paths, you can create and register a custom function that generates
[`Example`](/api/example) objects. The resulting generator can be infinite. When
using this dataset for training, stopping criteria such as maximum number of
steps, or stopping when the loss does not decrease further, can be used.

In this example we assume a custom function `read_custom_data` which loads or
generates texts with relevant text classification annotations. Then, small
lexical variations of the input text are created before generating the final
[`Example`](/api/example) objects. The `@spacy.registry.readers` decorator lets
you register the function creating the custom reader in the `readers`
[registry](/api/top-level#registry) and assign it a string name, so it can be
used in your config. All arguments on the registered function become available
as **config settings** – in this case, `source`.
> #### config.cfg
>
> ```ini
> [training.train_corpus]
> @readers = "corpus_variants.v1"
> source = "s3://your_bucket/path/data.csv"
> ```

```python
### functions.py {highlight="7-8"}
from typing import Callable, Iterator, List
import spacy
from spacy.gold import Example
from spacy.language import Language
import random

@spacy.registry.readers("corpus_variants.v1")
def stream_data(source: str) -> Callable[[Language], Iterator[Example]]:
    def generate_stream(nlp):
        for text, cats in read_custom_data(source):
            # Create a random variant of the example text
            i = random.randint(0, len(text) - 1)
            variant = text[:i] + text[i].upper() + text[i + 1:]
            doc = nlp.make_doc(variant)
            example = Example.from_dict(doc, {"cats": cats})
            yield example

    return generate_stream
```

Remember that a registered function should always be a function that spaCy
**calls to create something**. In this case, it **creates the reader function**
– it's not the reader itself.

We can also customize the **batching strategy** by registering a new batcher
function in the `batchers` [registry](/api/top-level#registry). A batcher turns
a stream of items into a stream of batches. spaCy has several useful built-in
[batching strategies](/api/top-level#batchers) with customizable sizes, but it's
also easy to implement your own. For instance, the following function takes the
stream of generated [`Example`](/api/example) objects, and removes those which
have the same underlying raw text, to avoid duplicates within each batch. Note
that in a more realistic implementation, you'd also want to check whether the
annotations are the same.
> #### config.cfg
>
> ```ini
> [training.batcher]
> @batchers = "filtering_batch.v1"
> size = 150
> ```

```python
### functions.py
from typing import Callable, Iterable, Iterator, List
import spacy
from spacy.gold import Example

@spacy.registry.batchers("filtering_batch.v1")
def filter_batch(size: int) -> Callable[[Iterable[Example]], Iterator[List[Example]]]:
    def create_filtered_batches(examples):
        batch = []
        for eg in examples:
            # Remove duplicate examples with the same text from batch
            if eg.text not in [x.text for x in batch]:
                batch.append(eg)
            if len(batch) == size:
                yield batch
                batch = []
        # Yield the remaining examples so a finite stream doesn't lose its tail
        if batch:
            yield batch

    return create_filtered_batches
```

### Defining custom architectures {#custom-architectures}

## Internal training API {#api}

spaCy gives you full control over the training loop. However, for most use cases, it's recommended to train your models via the [`spacy train`](/api/cli#train) command with a [`config.cfg`](#config) to keep track of your settings and hyperparameters, instead of writing your own training scripts from scratch. [Custom registered functions](#custom-code) should typically give you everything you need to train fully custom models with [`spacy train`](/api/cli#train).

The [`Example`](/api/example) object contains annotated training data, also called the **gold standard**. It's initialized with a [`Doc`](/api/doc) object that will hold the predictions, and another `Doc` object that holds the gold-standard annotations. It also includes the **alignment** between those two documents if they differ in tokenization. The `Example` class ensures that spaCy can rely on one **standardized format** that's passed through the pipeline.
For instance, let's say we want to define gold-standard part-of-speech tags:

```python
import numpy
from spacy.gold import Example
from spacy.tokens import Doc
from spacy.vocab import Vocab

vocab = Vocab()
words = ["I", "like", "stuff"]
predicted = Doc(vocab, words=words)
# create the reference Doc with gold-standard TAG annotations
tags = ["NOUN", "VERB", "NOUN"]
tag_ids = [vocab.strings.add(tag) for tag in tags]
reference = Doc(vocab, words=words).from_array("TAG", numpy.array(tag_ids, dtype="uint64"))
example = Example(predicted, reference)
```

As this is quite verbose, there's an alternative way to create the reference `Doc` with the gold-standard annotations. The function `Example.from_dict` takes a dictionary with keyword arguments specifying the annotations, like `tags` or `entities`. Using the resulting `Example` object and its gold-standard annotations, the model can be updated to learn a sentence of three words with their assigned part-of-speech tags.

> #### About the tag map
>
> The tag map is part of the vocabulary and defines the annotation scheme. If
> you're training a new language model, this will let you map the tags present
> in the treebank you train on to spaCy's tag scheme:
>
> ```python
> tag_map = {"N": {"pos": "NOUN"}, "V": {"pos": "VERB"}}
> vocab = Vocab(tag_map=tag_map)
> ```

```python
words = ["I", "like", "stuff"]
tags = ["NOUN", "VERB", "NOUN"]
predicted = Doc(nlp.vocab, words=words)
example = Example.from_dict(predicted, {"tags": tags})
```

Here's another example that shows how to define gold-standard named entities. The letters added before the labels refer to the tags of the [BILUO scheme](/usage/linguistic-features#updating-biluo) – `O` is a token outside an entity, `U` a single entity unit, `B` the beginning of an entity, `I` a token inside an entity and `L` the last token of an entity.
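To make the tag letters concrete, the following spaCy-independent sketch derives BILUO tags from token-level entity spans. The `biluo_tags` function and its `(start, end, label)` span format (with an exclusive `end`) are hypothetical and only illustrate the scheme. They are not spaCy's own conversion helper.

```python
from typing import List, Tuple


def biluo_tags(n_tokens: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Build BILUO tags from (start, end, label) token spans, end exclusive."""
    tags = ["O"] * n_tokens  # every token starts outside an entity
    for start, end, label in spans:
        if end - start == 1:
            # Single-token entity
            tags[start] = f"U-{label}"
        else:
            tags[start] = f"B-{label}"
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"
            tags[end - 1] = f"L-{label}"
    return tags


words = ["Apple", "opened", "an", "office", "in", "New", "York", "City"]
tags = biluo_tags(len(words), [(0, 1, "ORG"), (5, 8, "GPE")])
print(tags)
# ['U-ORG', 'O', 'O', 'O', 'O', 'B-GPE', 'I-GPE', 'L-GPE']
```

In the `Example` itself, tags of this form are passed as the `entities` annotation: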
```python
doc = Doc(nlp.vocab, words=["Facebook", "released", "React", "in", "2014"])
example = Example.from_dict(doc, {"entities": ["U-ORG", "O", "U-TECHNOLOGY", "O", "U-DATE"]})
```

As of v3.0, the [`Example`](/api/example) object replaces the `GoldParse` class. It can be constructed in a very similar way, from a `Doc` and a dictionary of annotations. For more details, see the [migration guide](/usage/v3#migrating-training).

```diff
- gold = GoldParse(doc, entities=entities)
+ example = Example.from_dict(doc, {"entities": entities})
```

Of course, it's not enough to only show a model a single example once. Especially if you only have a few examples, you'll want to train for a **number of iterations**. At each iteration, the training data is **shuffled** to ensure the model doesn't make any generalizations based on the order of examples. Another technique to improve the learning results is to set a **dropout rate**, a rate at which to randomly "drop" individual features and representations. This makes it harder for the model to memorize the training data. For example, a `0.25` dropout means that each feature or internal representation has a 1/4 likelihood of being dropped.

> - [`nlp`](/api/language): The `nlp` object with the model.
> - [`nlp.begin_training`](/api/language#begin_training): Start the training and
>   return an optimizer to update the model's weights.
> - [`Optimizer`](https://thinc.ai/docs/api-optimizers): Function that holds
>   state between updates.
> - [`nlp.update`](/api/language#update): Update model with examples.
> - [`Example`](/api/example): object holding predictions and gold-standard
>   annotations.
> - [`nlp.to_disk`](/api/language#to_disk): Save the updated model to a
>   directory.
```python
### Example training loop
optimizer = nlp.begin_training()
for itn in range(100):
    random.shuffle(train_data)
    for raw_text, entity_offsets in train_data:
        doc = nlp.make_doc(raw_text)
        example = Example.from_dict(doc, {"entities": entity_offsets})
        nlp.update([example], sgd=optimizer)
nlp.to_disk("/model")
```

The [`nlp.update`](/api/language#update) method takes the following arguments:

| Name       | Description                                                                                                                                                             |
| ---------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `examples` | [`Example`](/api/example) objects. The `update` method takes a sequence of them, so you can batch up your training examples.                                           |
| `drop`     | Dropout rate. Makes it harder for the model to just memorize the data.                                                                                                  |
| `sgd`      | An [`Optimizer`](https://thinc.ai/docs/api-optimizers) object, which updates the model's weights. If not set, spaCy will create a new one and save it for further use. |

As of v3.0, the [`Example`](/api/example) object replaces the `GoldParse` class and the "simple training style" of calling `nlp.update` with a text and a dictionary of annotations. Updating your code to use the `Example` object should be very straightforward: you can call [`Example.from_dict`](/api/example#from_dict) with a [`Doc`](/api/doc) and the dictionary of annotations:

```diff
text = "Facebook released React in 2014"
annotations = {"entities": ["U-ORG", "O", "U-TECHNOLOGY", "O", "U-DATE"]}
+ example = Example.from_dict(nlp.make_doc(text), annotations)
- nlp.update([text], [annotations])
+ nlp.update([example])
```
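Since `nlp.update` accepts a sequence of `Example` objects, a training loop can also pass several examples per update instead of one at a time. The sketch below shows one way the grouping might be done. The `chunk_examples` helper is hypothetical and written in plain Python, and strings stand in for `Example` objects so the snippet runs without a model.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunk_examples(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Group items into lists of up to `size` items (hypothetical helper)."""
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        # Keep the final, smaller batch instead of dropping it
        yield batch


# Stand-ins for Example objects, to keep the sketch runnable without a model
examples = [f"example-{i}" for i in range(5)]
batches = list(chunk_examples(examples, size=2))
print([len(batch) for batch in batches])  # [2, 2, 1]

# In a real loop, each batch would then be passed on, for instance:
# nlp.update(batch, drop=0.25, sgd=optimizer)
```

For actual training, spaCy's built-in [batching strategies](/api/top-level#batchers) are likely a better starting point than a hand-written helper like this one.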