title | teaser |
---|---|
What's New in v3.0 | New features, backwards incompatibilities and migration guide |
Summary
spaCy v3.0 features all new transformer-based pipelines that bring spaCy's accuracy right up to the current state-of-the-art. You can use any pretrained transformer to train your own pipelines, and even share one transformer between multiple components with multi-task learning. Training is now fully configurable and extensible, and you can define your own custom models using PyTorch, TensorFlow and other frameworks. The new spaCy projects system lets you describe whole end-to-end workflows in a single file, giving you an easy path from prototype to production, and making it easy to clone and adapt best-practice projects for your own use cases.
- Summary
- New features
- Transformer-based pipelines
- Training & config system
- Custom models
- End-to-end project workflows
- Parallel training with Ray
- New built-in components
- New custom component API
- Dependency matching
- Python type hints
- New methods & attributes
- New & updated documentation
- Backwards incompatibilities
- Migrating from spaCy v2.x
New Features
This section contains an overview of the most important new features and improvements. The API docs include additional deprecation notes. New methods and functions that were introduced in this version are marked with the tag 3.
Transformer-based pipelines
Example
$ python -m spacy download en_core_web_trf
spaCy v3.0 features all new transformer-based pipelines that bring spaCy's accuracy right up to the current state-of-the-art. You can use any pretrained transformer to train your own pipelines, and even share one transformer between multiple components with multi-task learning. spaCy's transformer support interoperates with PyTorch and the HuggingFace transformers library, giving you access to thousands of pretrained models for your pipelines.
New trained transformer-based pipelines
Notes on model capabilities
The models are each trained with a single transformer shared across the pipeline, which requires it to be trained on a single corpus. For English and Chinese, we used the OntoNotes 5 corpus, which has annotations across several tasks. For French, Spanish and German, we didn't have a suitable corpus that had both syntactic and entity annotations, so the transformer models for those languages do not include NER.
Package | Language | Transformer | Tagger | Parser | NER |
---|---|---|---|---|---|
en_core_web_trf | English | roberta-base | 97.8 | 95.0 | 89.4 |
de_dep_news_trf | German | bert-base-german-cased | 99.0 | 95.8 | - |
es_dep_news_trf | Spanish | bert-base-spanish-wwm-cased | 98.2 | 94.6 | - |
fr_dep_news_trf | French | camembert-base | 95.7 | 94.9 | - |
zh_core_web_trf | Chinese | bert-base-chinese | 92.5 | 77.2 | 75.6 |
- **Usage:** Embeddings & Transformers, Training pipelines and models, Benchmarks
- **API:** Transformer, TransformerData, FullTransformerBatch
- **Architectures:** TransformerModel, TransformerListener, Tok2VecTransformer
- **Implementation:** spacy-transformers
New training workflow and config system
Example
[training]
accumulate_gradient = 3

[training.optimizer]
@optimizers = "Adam.v1"

[training.optimizer.learn_rate]
@schedules = "warmup_linear.v1"
warmup_steps = 250
total_steps = 20000
initial_rate = 0.01
spaCy v3.0 introduces a comprehensive and extensible system for configuring your training runs. A single configuration file describes every detail of your training run, with no hidden defaults, making it easy to rerun your experiments and track changes. You can use the quickstart widget or the init config command to get started. Instead of providing lots of arguments on the command line, you only need to pass your config.cfg file to spacy train.
Training config files include all settings and hyperparameters for training
your pipeline. Some settings can also be registered functions that you can
swap out and customize, making it easy to implement your own custom models and
architectures.
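As a rough illustration of how such a config is structured, the sketch below parses the example settings with Thinc's Config class and reads a few values back. The CONFIG_TEXT variable is just a copy of the example above; nothing is resolved or trained here, the block only shows that sections become nested dictionaries and that keys starting with @ name registered functions.

```python
from thinc.api import Config

# Copy of the example config from this section (illustrative only).
CONFIG_TEXT = """
[training]
accumulate_gradient = 3

[training.optimizer]
@optimizers = "Adam.v1"

[training.optimizer.learn_rate]
@schedules = "warmup_linear.v1"
warmup_steps = 250
total_steps = 20000
initial_rate = 0.01
"""

# Parse the text into a nested, dict-like Config object.
config = Config().from_str(CONFIG_TEXT)
optimizer_block = config["training"]["optimizer"]

# Plain values are parsed into Python types.
print(config["training"]["accumulate_gradient"])
# "@optimizers" holds the string name of a registered function.
print(optimizer_block["@optimizers"])
# Nested sections such as [training.optimizer.learn_rate] become nested dicts.
print(optimizer_block["learn_rate"]["warmup_steps"])
```

Because the registered functions are referenced by name, swapping the optimizer or schedule only requires changing the string and its arguments in the file.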
- **Usage:** Training pipelines and models
- **Thinc:** Thinc's config system, Config
- **CLI:** init config, init fill-config, train, pretrain, evaluate
- **API:** Config format, registry
Custom models using any framework
Example
from torch import nn
from thinc.api import PyTorchWrapper

torch_model = nn.Sequential(
    nn.Linear(32, 32),
    nn.ReLU(),
    nn.Softmax(dim=1)
)
model = PyTorchWrapper(torch_model)
spaCy's new configuration system makes it easy to customize the neural network models used by the different pipeline components. You can also implement your own architectures via spaCy's machine learning library Thinc, which provides various layers and utilities, as well as thin wrappers around frameworks like PyTorch, TensorFlow and MXNet. Component models all follow the same unified Model API, and each Model can also be used as a sublayer of a larger network, allowing you to freely combine implementations from different frameworks into a single model.
- **Usage:** Layers and architectures, Trainable component API, Trainable components and models
- **Thinc:** Wrapping PyTorch, TensorFlow & MXNet, Model API
- **API:** Model architectures, TrainablePipe
Manage end-to-end workflows with projects
Example
# Clone a project template
$ python -m spacy project clone pipelines/tagger_parser_ud
$ cd tagger_parser_ud
# Download data assets
$ python -m spacy project assets
# Run a workflow
$ python -m spacy project run all
spaCy projects let you manage and share end-to-end spaCy workflows for different use cases and domains, and orchestrate training, packaging and serving your custom pipelines. You can start off by cloning a pre-defined project template, adjust it to fit your needs, load in your data, train a pipeline, export it as a Python package, upload your outputs to a remote storage and share your results with your team.
spaCy projects also make it easy to integrate with other tools in the data science and machine learning ecosystem, including DVC for data version control, Prodigy for creating labelled data, Streamlit for building interactive apps, FastAPI for serving models in production, Ray for parallel training, Weights & Biases for experiment tracking, and more!
- **Usage:** spaCy projects, Training pipelines and models
- **CLI:** project, train
- **Templates:** projects
The easiest way to get started is to clone a project template and run it – for example, this end-to-end template that lets you train a part-of-speech tagger and dependency parser on a Universal Dependencies treebank.
Parallel and distributed training with Ray
Example
$ pip install -U spacy[ray]
# Check that the CLI is registered
$ python -m spacy ray --help
# Train a pipeline
$ python -m spacy ray train config.cfg --n-workers 2
Ray is a fast and simple framework for building and running distributed applications. You can use Ray to train spaCy on one or more remote machines, potentially speeding up your training process. The Ray integration is powered by a lightweight extension package, spacy-ray, that automatically adds the ray command to your spaCy CLI if it's installed in the same environment. You can then run spacy ray train for parallel training.
- **Usage:** Parallel and distributed training, spaCy Projects integration
- **CLI:** ray, ray train
- **Implementation:** spacy-ray
New built-in pipeline components
spaCy v3.0 includes several new trainable and rule-based components that you can add to your pipeline and customize for your use case:
Example
# pip install -U spacy[lookups]
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("lemmatizer")
Name | Description |
---|---|
SentenceRecognizer | Trainable component for sentence segmentation. |
Morphologizer | Trainable component to predict morphological features. |
Lemmatizer | Standalone component for rule-based and lookup lemmatization. |
AttributeRuler | Component for setting token attributes using match patterns. |
Transformer | Component for using transformer models in your pipeline, accessing outputs and aligning tokens. Provided via spacy-transformers. |
TrainablePipe | Base class for trainable pipeline components. |
- **Usage:** Processing pipelines
- **API:** Built-in pipeline components
- **Implementation:** spacy/pipeline
New and improved pipeline component APIs
Example
@Language.component("my_component")
def my_component(doc):
    return doc

nlp.add_pipe("my_component")
nlp.add_pipe("ner", source=other_nlp)
nlp.analyze_pipes(pretty=True)
Defining, configuring, reusing, training and analyzing pipeline components is now easier and more convenient. The @Language.component and @Language.factory decorators let you register your component and define its default configuration and metadata, like the attribute values it assigns and requires. Any custom component can be included during training, and sourcing components from existing trained pipelines lets you mix and match custom pipelines. The nlp.analyze_pipes method outputs structured information about the current pipeline and its components, including the attributes they assign, the scores they compute during training and whether any required attributes aren't set.
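To make the workflow above concrete, here is a minimal sketch that registers a stateless component, adds it to a blank English pipeline by name and inspects the result. The component name token_counter_sketch and the n_tokens key are made up for this illustration.

```python
import spacy
from spacy.language import Language


# Hypothetical component for illustration: it stores the token count
# in doc.user_data and returns the doc unchanged otherwise.
@Language.component("token_counter_sketch")
def token_counter_sketch(doc):
    doc.user_data["n_tokens"] = len(doc)
    return doc


nlp = spacy.blank("en")
# Components are added by their registered string name.
nlp.add_pipe("token_counter_sketch")

doc = nlp("This is a sentence.")
print(doc.user_data["n_tokens"])

# analyze_pipes returns structured information about each component.
analysis = nlp.analyze_pipes()
print(analysis["summary"]["token_counter_sketch"])
```

Since the component is registered under a string name, the pipeline can be serialized and loaded back as long as the module defining the component has been imported.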
- **Usage:** Custom components, Defining components for training
- **API:** @Language.component, @Language.factory, Language.add_pipe, Language.analyze_pipes
- **Implementation:** spacy/language.py
Dependency matching
Example
from spacy.matcher import DependencyMatcher

matcher = DependencyMatcher(nlp.vocab)
pattern = [
    {"RIGHT_ID": "anchor_founded", "RIGHT_ATTRS": {"ORTH": "founded"}},
    {"LEFT_ID": "anchor_founded", "REL_OP": ">", "RIGHT_ID": "subject", "RIGHT_ATTRS": {"DEP": "nsubj"}}
]
matcher.add("FOUNDED", [pattern])
The new DependencyMatcher lets you match patterns within the dependency parse using Semgrex operators. It follows the same API as the token-based Matcher. A pattern added to the dependency matcher consists of a list of dictionaries, with each dictionary describing a token to match and its relation to an existing token in the pattern.
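The following sketch applies the pattern from the example to a small document. To avoid depending on a trained parser, the dependency parse is written by hand when constructing the Doc; the sentence, heads and labels are assumptions chosen for illustration, not parser output.

```python
import spacy
from spacy.matcher import DependencyMatcher
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-annotated parse (assumed for this sketch): each head is the
# index of the token's syntactic parent, and the root points to itself.
words = ["Smith", "founded", "a", "company"]
heads = [1, 1, 3, 1]
deps = ["nsubj", "ROOT", "det", "dobj"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)

matcher = DependencyMatcher(nlp.vocab)
pattern = [
    # Anchor token: the verb "founded".
    {"RIGHT_ID": "anchor_founded", "RIGHT_ATTRS": {"ORTH": "founded"}},
    # A direct dependent of the anchor with the label nsubj.
    {"LEFT_ID": "anchor_founded", "REL_OP": ">", "RIGHT_ID": "subject",
     "RIGHT_ATTRS": {"DEP": "nsubj"}},
]
matcher.add("FOUNDED", [pattern])

matches = matcher(doc)
for match_id, token_ids in matches:
    # token_ids are listed in the same order as the pattern dictionaries.
    for node, token_id in zip(pattern, token_ids):
        print(node["RIGHT_ID"], "->", doc[token_id].text)
```

Each match pairs the pattern key with one token index per pattern dictionary, so the nodes can be mapped back to their names as shown in the loop.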
- **Usage:** Dependency matching
- **API:** DependencyMatcher
- **Implementation:** spacy/matcher/dependencymatcher.pyx
Type hints and type-based data validation
Example
from spacy.language import Language
from pydantic import StrictBool

@Language.factory("my_component")
def create_my_component(
    nlp: Language,
    name: str,
    custom: StrictBool
):
    ...
spaCy v3.0 officially drops support for Python 2 and now requires Python 3.6+. This also means that the code base can take full advantage of type hints. spaCy's user-facing API that's implemented in pure Python (as opposed to Cython) now comes with type hints. The new version of spaCy's machine learning library Thinc also features extensive type support, including custom types for models and arrays, and a custom mypy plugin that can be used to type-check model definitions.
For data validation, spaCy v3.0 adopts pydantic. It also powers the data validation of Thinc's config system, which lets you register custom functions with typed arguments, reference them in your config and see validation errors if the argument values don't match.
- **Usage:** Component type hints and validation, Training with custom code
- **Thinc:** Type checking in Thinc, Thinc's config system
New methods, attributes and commands
The following methods, attributes and commands are new in spaCy v3.0.
Name | Description |
---|---|
Token.lex | Access a token's Lexeme. |
Token.morph | Access a token's morphological analysis. |
Doc.has_annotation | Check whether a doc has annotation on a token attribute. |
Language.select_pipes | Context manager for enabling or disabling specific pipeline components for a block. |
Language.disable_pipe, Language.enable_pipe | Disable or enable a loaded pipeline component (but don't remove it). |
Language.analyze_pipes | Analyze components and their interdependencies. |
Language.resume_training | Experimental: continue training a trained pipeline and initialize "rehearsal" for components that implement a rehearse method to prevent catastrophic forgetting. |
@Language.factory, @Language.component | Decorators for registering pipeline component factories and simple stateless component functions. |
Language.has_factory | Check whether a component factory is registered on a language class. |
Language.get_factory_meta, Language.get_pipe_meta | Get the FactoryMeta with component metadata for a factory or instance name. |
Language.config | The config used to create the current nlp object. It is an instance of Config and can be saved to disk and used for training. |
Language.components, Language.component_names | All available components and component names, including disabled components that are not run as part of the pipeline. |
Language.disabled | Names of disabled components that are not run as part of the pipeline. |
TrainablePipe.score | Method on pipeline components that returns a dictionary of evaluation scores. |
registry | Function registry to map functions to string names that can be referenced in configs. |
util.load_meta, util.load_config | Updated helpers for loading a pipeline's meta.json and config.cfg. |
util.get_installed_models | Names of all pipeline packages installed in the environment. |
init config, init fill-config, debug config | CLI commands for initializing, auto-filling and debugging training configs. |
init vectors | Convert word vectors for use with spaCy. |
init labels | Generate JSON files for the labels in the data to speed up training. |
project | Suite of CLI commands for cloning, running and managing spaCy projects. |
ray | Suite of CLI commands for parallel training with Ray, provided by the spacy-ray extension package. |
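A few of these additions are easiest to understand together. The sketch below, which assumes nothing beyond a blank English pipeline with the built-in sentencizer, temporarily disables a component with Language.select_pipes and uses Doc.has_annotation to check the effect.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
text = "This is one sentence. This is another."

# Inside the block the sentencizer stays loaded but is not run.
with nlp.select_pipes(disable=["sentencizer"]):
    disabled_inside = list(nlp.disabled)
    doc_without = nlp(text)

# Outside the block the component is enabled again.
doc_with = nlp(text)

print(disabled_inside)
print(doc_without.has_annotation("SENT_START"))
print(doc_with.has_annotation("SENT_START"))
# component_names still lists every loaded component.
print(nlp.component_names)
```

Disabled components remain part of the nlp object, which is why they still show up in Language.component_names while being absent from the names of the components that are actually run.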
New and updated documentation
To help you get started with spaCy v3.0 and the new features, we've added several new or rewritten documentation pages, including a new usage guide on embeddings, transformers and transfer learning, a guide on training pipelines and models rewritten from scratch, a page explaining the new spaCy projects and updated usage documentation on custom pipeline components. We've also added a bunch of new illustrations and new API reference pages documenting spaCy's machine learning model architectures and the expected data formats. API pages about pipeline components now include more information, like the default config and implementation, and we've adopted a more detailed format for documenting argument and return types.
- **Usage:** Embeddings & Transformers, Training models, Layers & Architectures, Projects, Custom pipeline components, Custom tokenizers, Morphology, Lemmatization, Mapping & Exceptions, Dependency matching
- **API Reference:** Library architecture, Model architectures, Data formats
- **New Classes:** Example, Tok2Vec, Transformer, Lemmatizer, Morphologizer, AttributeRuler, SentenceRecognizer, DependencyMatcher, TrainablePipe, Corpus, SpanGroup
Backwards Incompatibilities
As always, we've tried to keep the breaking changes to a minimum and focus on changes that were necessary to support the new features, fix problems or improve usability. The following section lists the relevant changes to the user-facing API. For specific examples of how to rewrite your code, check out the migration guide.
Note that spaCy v3.0 now requires Python 3.6+.
API changes
- Pipeline package symlinks, the link command and shortcut names are now deprecated. There can be many different trained pipelines and not just one "English model", so you should always use the full package name like en_core_web_sm explicitly.
- A pipeline's meta.json is now only used to provide meta information like the package name, author, license and labels. It's not used to construct the processing pipeline anymore. This is all defined in the config.cfg, which also includes all settings used to train the pipeline.
- The train, pretrain and debug data commands now only take a config.cfg.
- Language.add_pipe now takes the string name of the component factory instead of the component function.
- Custom pipeline components now need to be decorated with the @Language.component or @Language.factory decorator.
- The Language.update, Language.evaluate and TrainablePipe.update methods now all take batches of Example objects instead of Doc and GoldParse objects, or raw text and a dictionary of annotations.
- The begin_training methods have been renamed to initialize and now take a function that returns a sequence of Example objects to initialize the model instead of a list of tuples.
- Matcher.add and PhraseMatcher.add now only accept a list of patterns as the second argument (instead of a variable number of arguments). The on_match callback becomes an optional keyword argument.
- The Doc flags like Doc.is_parsed or Doc.is_tagged have been replaced by Doc.has_annotation.
- The spacy.gold module has been renamed to spacy.training.
- The PRON_LEMMA symbol and -PRON- as an indicator for pronoun lemmas have been removed.
- The TAG_MAP and MORPH_RULES in the language data have been replaced by the more flexible AttributeRuler.
- The Lemmatizer is now a standalone pipeline component and doesn't provide lemmas by default or switch automatically between lookup and rule-based lemmas. You can now add it to your pipeline explicitly and set its mode on initialization.
- Various keyword arguments across functions and methods are now explicitly declared as keyword-only arguments. Those arguments are documented accordingly across the API reference.
Removed or renamed API
Removed | Replacement |
---|---|
Language.disable_pipes | Language.select_pipes, Language.disable_pipe, Language.enable_pipe |
Language.begin_training, Pipe.begin_training, ... | Language.initialize, Pipe.initialize, ... |
Doc.is_tagged, Doc.is_parsed, ... | Doc.has_annotation |
GoldParse | Example |
GoldCorpus | Corpus |
KnowledgeBase.load_bulk, KnowledgeBase.dump | KnowledgeBase.from_disk, KnowledgeBase.to_disk |
Matcher.pipe, PhraseMatcher.pipe | not needed |
gold.offsets_from_biluo_tags, gold.spans_from_biluo_tags, gold.biluo_tags_from_offsets | training.biluo_tags_to_offsets, training.biluo_tags_to_spans, training.offsets_to_biluo_tags |
spacy init-model | spacy init vectors |
spacy debug-data | spacy debug data |
spacy profile | spacy debug profile |
spacy link, util.set_data_path, util.get_data_path | not needed, symlinks are deprecated |
The following deprecated methods, attributes and arguments were removed in v3.0. Most of them have been deprecated for a while and many would previously raise errors. Many of them were also mostly internals. If you've been working with more recent versions of spaCy v2.x, it's unlikely that your code relied on them.
Removed | Replacement |
---|---|
Doc.tokens_from_list | Doc.__init__ |
Doc.merge, Span.merge | Doc.retokenize |
Token.string, Span.string, Span.upper, Span.lower | Span.text, Token.text |
Language.tagger, Language.parser, Language.entity | Language.get_pipe |
keyword-arguments like vocab=False on to_disk, from_disk, to_bytes, from_bytes | exclude=["vocab"] |
n_threads argument on Tokenizer, Matcher, PhraseMatcher | n_process |
verbose argument on Language.evaluate | logging (DEBUG) |
SentenceSegmenter hook, SimilarityHook | user hooks, Sentencizer, SentenceRecognizer |
Migrating from v2.x
Downloading and loading trained pipelines
Symlinks and shortcuts like en are now officially deprecated. There are many different trained pipelines with different capabilities and not just one "English model". In order to download and load a package, you should always use its full name, for instance en_core_web_sm.
- python -m spacy download en
+ python -m spacy download en_core_web_sm
- nlp = spacy.load("en")
+ nlp = spacy.load("en_core_web_sm")
Custom pipeline components and factories
Custom pipeline components now have to be registered explicitly using the @Language.component or @Language.factory decorator. For simple functions that take a Doc and return it, all you have to do is add the @Language.component decorator to it and assign it a name:
### Stateless function components
+ from spacy.language import Language
+ @Language.component("my_component")
  def my_component(doc):
      return doc
For class components that are initialized with settings and/or the shared nlp object, you can use the @Language.factory decorator. Also make sure that the method used to initialize the factory has two named arguments: nlp (the current nlp object) and name (the string name of the component instance).
### Stateful class components
+ from spacy.language import Language
+ @Language.factory("my_component")
  class MyComponent:
-     def __init__(self, nlp):
+     def __init__(self, nlp, name):
          self.nlp = nlp
      def __call__(self, doc):
          return doc
Instead of decorating your class, you could also add a factory function that takes the arguments nlp and name and returns an instance of your component:
### Stateful class components with factory function
+ from spacy.language import Language
+ @Language.factory("my_component")
+ def create_my_component(nlp, name):
+     return MyComponent(nlp)
  class MyComponent:
      def __init__(self, nlp):
          self.nlp = nlp
      def __call__(self, doc):
          return doc
The @Language.component and @Language.factory decorators now take care of adding an entry to the component factories, so spaCy knows how to load a component back in from its string name. You won't have to write to Language.factories manually anymore.
- Language.factories["my_component"] = lambda nlp, **cfg: MyComponent(nlp)
Adding components to the pipeline
The nlp.add_pipe method now takes the string name of the component factory instead of a callable component. This allows spaCy to track and serialize components that have been added and their settings.
+ @Language.component("my_component")
  def my_component(doc):
      return doc
- nlp.add_pipe(my_component)
+ nlp.add_pipe("my_component")
nlp.add_pipe now also returns the pipeline component itself, so you can access its attributes. The nlp.create_pipe method is now mostly internal and you typically shouldn't have to use it in your code.
- parser = nlp.create_pipe("parser")
- nlp.add_pipe(parser)
+ parser = nlp.add_pipe("parser")
If you need to add a component from an existing trained pipeline, you can now use the source argument on nlp.add_pipe. This will check that the component is compatible, and take care of porting over all config. During training, you can also reference existing trained components in your config and decide whether or not they should be updated with more data.
config.cfg (excerpt)
[components.ner]
source = "en_core_web_sm"
component = "ner"
source_nlp = spacy.load("en_core_web_sm")
nlp = spacy.blank("en")
- ner = source_nlp.get_pipe("ner")
- nlp.add_pipe(ner)
+ nlp.add_pipe("ner", source=source_nlp)
Configuring pipeline components with settings
Because pipeline components are now added using their string names, you won't have to instantiate the component classes directly anymore. To configure the component, you can now use the config argument on nlp.add_pipe.
config.cfg (excerpt)
[components.sentencizer]
factory = "sentencizer"
punct_chars = ["!", ".", "?"]
punct_chars = ["!", ".", "?"]
- sentencizer = Sentencizer(punct_chars=punct_chars)
+ sentencizer = nlp.add_pipe("sentencizer", config={"punct_chars": punct_chars})
The config corresponds to the component settings in the config.cfg and will overwrite the default config defined by the components.
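The relationship between the config argument and the pipeline's config can be checked directly. This sketch passes a custom punct_chars list to the built-in sentencizer (the shortened list and the sample text are assumptions made for illustration) and then reads the setting back from nlp.config.

```python
import spacy

nlp = spacy.blank("en")
# Only "!" and "?" end a sentence here; "." is deliberately left out.
punct_chars = ["!", "?"]
nlp.add_pipe("sentencizer", config={"punct_chars": punct_chars})

doc = nlp("Wait! Is a period. still a boundary? No")
sentences = [sent.text for sent in doc.sents]
print(sentences)

# The settings passed via config are recorded in the pipeline config,
# which is what ends up in config.cfg when the pipeline is saved.
saved = nlp.config["components"]["sentencizer"]
print(saved["factory"], saved["punct_chars"])
```

Because the settings live in the config rather than in a manually constructed object, spaCy can re-create the component with the same behavior when the pipeline is loaded again.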
Config values you pass to components need to be JSON-serializable and can't be arbitrary Python objects. Otherwise, the settings you provide can't be represented in the config.cfg and spaCy has no way of knowing how to re-create your component with the same settings when you load the pipeline back in. If you need to pass arbitrary objects to a component, use a registered function:
- config = {"model": MyTaggerModel()}
+ config = {"model": {"@architectures": "MyTaggerModel"}}
tagger = nlp.add_pipe("tagger", config=config)
Adding match patterns
The Matcher.add, PhraseMatcher.add and DependencyMatcher.add methods now only accept a list of patterns as the second argument (instead of a variable number of arguments). The on_match callback becomes an optional keyword argument.
matcher = Matcher(nlp.vocab)
patterns = [[{"TEXT": "Google"}, {"TEXT": "Now"}], [{"TEXT": "GoogleNow"}]]
- matcher.add("GoogleNow", on_match, *patterns)
+ matcher.add("GoogleNow", patterns, on_match=on_match)
matcher = PhraseMatcher(nlp.vocab)
patterns = [nlp("health care reform"), nlp("healthcare reform")]
- matcher.add("HEALTH", on_match, *patterns)
+ matcher.add("HEALTH", patterns, on_match=on_match)
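For reference, here is a small runnable version of the new Matcher.add call on a blank pipeline. The callback body and the sample sentence are assumptions added for illustration; the callback simply records the matched text.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
seen = []


def on_match(matcher, doc, i, matches):
    # Called once per match; i is the index of the current match.
    match_id, start, end = matches[i]
    seen.append(doc[start:end].text)


patterns = [[{"TEXT": "Google"}, {"TEXT": "Now"}], [{"TEXT": "GoogleNow"}]]
# v3 style: a list of patterns, with the callback as a keyword argument.
matcher.add("GoogleNow", patterns, on_match=on_match)

doc = nlp("I tried Google Now before GoogleNow existed")
matches = matcher(doc)
print(sorted(seen))
```

Passing the patterns as one list keeps the call signature the same regardless of how many patterns are registered under a key.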
Migrating attributes in tokenizer exceptions
Tokenizer exceptions are now only allowed to set ORTH and NORM values as part of the token attributes. Exceptions for other attributes such as TAG and LEMMA should be moved to an AttributeRuler component:
nlp = spacy.blank("en")
- nlp.tokenizer.add_special_case("don't", [{"ORTH": "do"}, {"ORTH": "n't", "LEMMA": "not"}])
+ nlp.tokenizer.add_special_case("don't", [{"ORTH": "do"}, {"ORTH": "n't"}])
+ ruler = nlp.add_pipe("attribute_ruler")
+ ruler.add(patterns=[[{"ORTH": "n't"}]], attrs={"LEMMA": "not"})
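Put together as a standalone script, the migrated version could look like the sketch below. The sample sentence is an assumption; in a blank pipeline no other component assigns lemmas, so only the token matched by the attribute ruler receives one.

```python
import spacy

nlp = spacy.blank("en")
# The tokenizer exception only controls how the string is split.
nlp.tokenizer.add_special_case("don't", [{"ORTH": "do"}, {"ORTH": "n't"}])

# The lemma is assigned afterwards by the attribute ruler.
ruler = nlp.add_pipe("attribute_ruler")
ruler.add(patterns=[[{"ORTH": "n't"}]], attrs={"LEMMA": "not"})

doc = nlp("I don't know")
print([(token.text, token.lemma_) for token in doc])
```

Splitting the two concerns means the tokenizer stays purely about segmentation, while attribute exceptions can use full match patterns instead of exact strings.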
Migrating tag maps and morph rules
Instead of defining a tag_map and morph_rules in the language data, spaCy v3.0 now manages mappings and exceptions with a separate and more flexible pipeline component, the AttributeRuler. See the usage guide for examples. If you have tag maps and morph rules in the v2.x format, you can load them into the attribute ruler before training using the [initialize] block of your config.
What does the initialization do?
The [initialize] block is used when nlp.initialize is called (usually right before training). It lets you define data resources for initializing the pipeline in your config.cfg. After training, the rules are saved to disk with the exported pipeline, so your runtime model doesn't depend on local data. For details see the config lifecycle and initialization docs.
### config.cfg (excerpt)
[initialize.components.attribute_ruler]
[initialize.components.attribute_ruler.tag_map]
@readers = "srsly.read_json.v1"
path = "./corpus/tag_map.json"
The AttributeRuler also provides two handy helper methods, load_from_tag_map and load_from_morph_rules, that let you load in your existing tag map or morph rules:
nlp = spacy.blank("en")
- nlp.vocab.morphology.load_tag_map(YOUR_TAG_MAP)
+ ruler = nlp.add_pipe("attribute_ruler")
+ ruler.load_from_tag_map(YOUR_TAG_MAP)
Migrating Doc flags
The Doc flags Doc.is_tagged, Doc.is_parsed, Doc.is_nered and Doc.is_sentenced are deprecated in v3.0 and replaced by the Doc.has_annotation method, which refers to the token attribute symbols (the same symbols used in Matcher patterns):
doc = nlp(text)
- doc.is_parsed
+ doc.has_annotation("DEP")
- doc.is_tagged
+ doc.has_annotation("TAG")
- doc.is_sentenced
+ doc.has_annotation("SENT_START")
- doc.is_nered
+ doc.has_annotation("ENT_IOB")
Training pipelines and models
To train your pipelines, you should now pretty much always use the spacy train CLI. You shouldn't have to put together your own training scripts anymore, unless you really want to. The training commands now use a flexible config file that describes all training settings and hyperparameters, as well as your pipeline, components and architectures to use. The --code argument lets you pass in code containing custom registered functions that you can reference in your config. To get started, check out the quickstart widget.
Binary .spacy training data format
spaCy v3.0 uses a new binary training data format created by serializing a DocBin, which represents a collection of Doc objects. This means that you can train spaCy pipelines using the same format it outputs: annotated Doc objects. The binary format is extremely efficient in storage, especially when packing multiple documents together. You can convert your existing JSON-formatted data using the spacy convert command, which outputs .spacy files:
$ python -m spacy convert ./training.json ./output
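If your annotations already live in Python, you can also build the binary data yourself. The sketch below is one possible way to do it, with an assumed example sentence: it annotates a Doc, packs it into a DocBin and restores it from bytes. Calling to_disk with a path ending in .spacy instead of to_bytes would write a file that spacy train can read.

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc = nlp.make_doc("Mark Zuckerberg is the CEO of Facebook")
# Annotate entities via character offsets.
doc.ents = [
    doc.char_span(0, 15, label="PERSON"),
    doc.char_span(30, 38, label="ORG"),
]

# Pack one or more annotated docs into a DocBin and serialize it.
doc_bin = DocBin(docs=[doc])
data = doc_bin.to_bytes()

# Restore the docs; the shared vocab is needed to rebuild them.
restored = list(DocBin().from_bytes(data).get_docs(nlp.vocab))
restored_ents = [(ent.text, ent.label_) for ent in restored[0].ents]
print(restored_ents)
```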
Training config
The easiest way to get started with a training config is to use the init config command or the quickstart widget. You can define your requirements, and it will auto-generate a starter config with the best-matching default settings.
$ python -m spacy init config ./config.cfg --lang en --pipeline tagger,parser
If you've exported a starter config from our quickstart widget, you can use the init fill-config command to fill it with all default values. You can then use the auto-generated config.cfg for training:
- python -m spacy train en ./output ./train.json ./dev.json --pipeline tagger,parser --cnn-window 1 --bilstm-depth 0
+ python -m spacy train ./config.cfg --output ./output
The easiest way to get started is to clone a project template and run it – for example, this end-to-end template that lets you train a part-of-speech tagger and dependency parser on a Universal Dependencies treebank.
Modifying tokenizer settings
If you were using a base model with spacy train to customize the tokenizer settings in v2, your modifications can be provided in the [initialize.before_init] callback.
Write a registered callback that modifies the tokenizer settings and specify this callback in your config:
config.cfg
[initialize]

[initialize.before_init]
@callbacks = "customize_tokenizer"
### functions.py
from spacy.util import registry, compile_suffix_regex

@registry.callbacks("customize_tokenizer")
def make_customize_tokenizer():
    def customize_tokenizer(nlp):
        # remove a suffix
        suffixes = list(nlp.Defaults.suffixes)
        suffixes.remove("\\[")
        suffix_regex = compile_suffix_regex(suffixes)
        nlp.tokenizer.suffix_search = suffix_regex.search
        # add a special case
        nlp.tokenizer.add_special_case("_SPECIAL_", [{"ORTH": "_SPECIAL_"}])
    return customize_tokenizer
When training, provide the function above with the --code option:
$ python -m spacy train config.cfg --code ./functions.py
The train step requires the --code option with your registered functions from the [initialize] block, but since those callbacks are only required during the initialization step, you don't need to provide them with the final pipeline package. However, to make it easier for others to replicate your training setup, you can choose to package the initialization callbacks with the pipeline package or to publish them separately.
Training via the Python API
For most use cases, you shouldn't have to write your own training scripts anymore. Instead, you can use spacy train with a config file and custom registered functions if needed. You can even register callbacks that can modify the nlp object at different stages of its lifecycle to fully customize it before training.
If you do decide to use the internal training API from Python, you should only need a few small modifications to convert your scripts from spaCy v2.x to v3.x. The Example.from_dict classmethod takes a reference Doc and a dictionary of annotations, similar to the "simple training style" in spaCy v2.x:
### Migrating Doc and GoldParse
doc = nlp.make_doc("Mark Zuckerberg is the CEO of Facebook")
entities = [(0, 15, "PERSON"), (30, 38, "ORG")]
- gold = GoldParse(doc, entities=entities)
+ example = Example.from_dict(doc, {"entities": entities})
### Migrating simple training style
text = "Mark Zuckerberg is the CEO of Facebook"
annotations = {"entities": [(0, 15, "PERSON"), (30, 38, "ORG")]}
+ doc = nlp.make_doc(text)
+ example = Example.from_dict(doc, annotations)
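To see what an Example holds, the following sketch builds one from the text and annotations used above and inspects both sides. It only assumes a blank English pipeline for tokenization.

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
text = "Mark Zuckerberg is the CEO of Facebook"
annotations = {"entities": [(0, 15, "PERSON"), (30, 38, "ORG")]}

example = Example.from_dict(nlp.make_doc(text), annotations)

# The reference doc carries the gold-standard annotations.
ref_ents = [(ent.text, ent.label_) for ent in example.reference.ents]
# The predicted doc is the unannotated input the pipeline will process.
pred_ents = list(example.predicted.ents)
print(ref_ents, pred_ents)
```

An Example therefore pairs two Doc objects, which is what lets the same structure serve both training updates and evaluation.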
The Language.update, Language.evaluate and TrainablePipe.update methods now all take batches of Example objects instead of Doc and GoldParse objects, or raw text and a dictionary of annotations.
### Training loop
import random
from spacy.training import Example
from spacy.util import minibatch

TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London.", {"entities": [(7, 13, "LOC")]}),
]
examples = []
for text, annots in TRAIN_DATA:
    examples.append(Example.from_dict(nlp.make_doc(text), annots))
nlp.initialize(lambda: examples)
for i in range(20):
    random.shuffle(examples)
    for batch in minibatch(examples, size=8):
        # Update on the current batch, not on the full list of examples
        nlp.update(batch)
Language.begin_training and TrainablePipe.begin_training have been renamed to Language.initialize and TrainablePipe.initialize, and the methods now take a function that returns a sequence of Example objects to initialize the model instead of a list of tuples. The data examples are used to initialize the models of trainable pipeline components, which includes validating the network, inferring missing shapes and setting up the label scheme.
- nlp.begin_training()
+ nlp.initialize(lambda: examples)
Packaging trained pipelines
The spacy package command now automatically builds the installable .tar.gz sdist of the Python package, so you don't have to run this step manually anymore. You can disable the behavior by setting the --no-sdist flag.
python -m spacy package ./output ./packages
- cd /output/en_pipeline-0.0.0
- python setup.py sdist
Data utilities and gold module
The spacy.gold module has been renamed to spacy.training and the conversion utilities now follow the naming format of x_to_y. This mostly affects internals, but if you've been using the span offset conversion utilities offsets_to_biluo_tags, biluo_tags_to_offsets or biluo_tags_to_spans, you'll have to change your names and imports:
- from spacy.gold import biluo_tags_from_offsets, offsets_from_biluo_tags, spans_from_biluo_tags
+ from spacy.training import offsets_to_biluo_tags, biluo_tags_to_offsets, biluo_tags_to_spans
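As a quick check that the renamed utilities behave as their names suggest, this sketch converts character offsets to BILUO tags and back. The sentence and the entity offsets are assumptions reused from the training examples in this guide.

```python
import spacy
from spacy.training import (
    offsets_to_biluo_tags,
    biluo_tags_to_offsets,
    biluo_tags_to_spans,
)

nlp = spacy.blank("en")
doc = nlp.make_doc("I like London.")

# Character offsets -> one BILUO tag per token.
tags = offsets_to_biluo_tags(doc, [(7, 13, "LOC")])
# BILUO tags -> character offsets again.
offsets = biluo_tags_to_offsets(doc, tags)
# BILUO tags -> Span objects on the doc.
spans = biluo_tags_to_spans(doc, tags)

print(tags)
print(offsets)
print([(span.text, span.label_) for span in spans])
```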
Migration notes for plugin maintainers
Thanks to everyone who's been contributing to the spaCy ecosystem by developing and maintaining one of the many awesome plugins and extensions. We've tried to make it as easy as possible for you to upgrade your packages for spaCy v3.0. The most common use case for plugins is providing pipeline components and extension attributes. When migrating your plugin, double-check the following:
- Use the @Language.factory decorator to register your component and assign it a name. This allows users to refer to your components by name and serialize pipelines referencing them. Remove all manual entries in Language.factories.
- Make sure your component factories take at least two named arguments: nlp (the current nlp object) and name (the instance name of the added component so you can identify multiple instances of the same component).
- Update all references to nlp.add_pipe in your docs to use string names instead of the component functions.
from spacy.language import Language

@Language.factory("my_component", default_config={"some_setting": False})
def create_component(nlp: Language, name: str, some_setting: bool):
    return MyCoolComponent(some_setting=some_setting)

class MyCoolComponent:
    def __init__(self, some_setting):
        self.some_setting = some_setting

    def __call__(self, doc):
        # Do something to the doc
        return doc
Result in config.cfg
[components.my_component]
factory = "my_component"
some_setting = true
import spacy
from your_plugin import MyCoolComponent
nlp = spacy.load("en_core_web_sm")
- component = MyCoolComponent(some_setting=True)
- nlp.add_pipe(component)
+ nlp.add_pipe("my_component", config={"some_setting": True})
The @Language.factory decorator takes care of letting spaCy know that a component of that name is available. This means that your users can add it to the pipeline using its string name. However, this requires the decorator to be executed, so users will still have to import your plugin. Alternatively, your plugin could expose an entry point, which spaCy can read from. This means that spaCy knows how to initialize my_component, even if your package isn't imported.