---
title: What's New in v3.0
teaser: New features, backwards incompatibilities and migration guide
menu:
  - ['Summary', 'summary']
  - ['New Features', 'features']
  - ['Backwards Incompatibilities', 'incompat']
  - ['Migrating from v2.x', 'migrating']
  - ['Migrating plugins', 'plugins']
---

## Summary {#summary}

## New Features {#features}

## Backwards Incompatibilities {#incompat}

### Removed or renamed objects, methods, attributes and arguments {#incompat-removed}

| Removed                                                  | Replacement                               |
| -------------------------------------------------------- | ----------------------------------------- |
| `GoldParse`                                              | [`Example`](/api/example)                 |
| `GoldCorpus`                                             | [`Corpus`](/api/corpus)                   |
| `spacy debug-data`                                       | [`spacy debug data`](/api/cli#debug-data) |
| `spacy link`, `util.set_data_path`, `util.get_data_path` | not needed, model symlinks are deprecated |

### Removed deprecated methods, attributes and arguments {#incompat-removed-deprecated}

The following deprecated methods, attributes and arguments were removed in v3.0.
Most of them have been **deprecated for a while** and many would previously
raise errors. Many of them were also mostly internal. If you've been working
with more recent versions of spaCy v2.x, it's **unlikely** that your code relied
on them.

| Removed                                                                                                                 | Replacement                                                                                                                                                |
| ----------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `Doc.tokens_from_list`                                                                                                  | [`Doc.__init__`](/api/doc#init)                                                                                                                            |
| `Doc.merge`, `Span.merge`                                                                                               | [`Doc.retokenize`](/api/doc#retokenize)                                                                                                                    |
| `Token.string`, `Span.string`, `Span.upper`, `Span.lower`                                                               | [`Span.text`](/api/span#attributes), [`Token.text`](/api/token#attributes)                                                                                 |
| `Language.tagger`, `Language.parser`, `Language.entity`                                                                 | [`Language.get_pipe`](/api/language#get_pipe)                                                                                                              |
| keyword-arguments like `vocab=False` on `to_disk`, `from_disk`, `to_bytes`, `from_bytes`                                | `exclude=["vocab"]`                                                                                                                                        |
| `n_threads` argument on [`Tokenizer`](/api/tokenizer), [`Matcher`](/api/matcher), [`PhraseMatcher`](/api/phrasematcher) | `n_process`                                                                                                                                                |
| `SentenceSegmenter` hook, `SimilarityHook`                                                                              | [user hooks](/usage/processing-pipelines#custom-components-user-hooks), [`Sentencizer`](/api/sentencizer), [`SentenceRecognizer`](/api/sentencerecognizer) |
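
A few of the replacements in the table can be sketched together. This is only a
minimal illustration, assuming spaCy v3 is installed. It uses a blank English
pipeline so no model download is needed, and the example words are made up.

```python
import spacy
from spacy.tokens import Doc

# A blank pipeline is enough here and needs no model download.
nlp = spacy.blank("en")

# Doc.tokens_from_list -> Doc.__init__ with a list of words
doc = Doc(nlp.vocab, words=["New", "York", "is", "big"])

# Doc.merge / Span.merge -> the Doc.retokenize context manager
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[0:2])
tokens = [token.text for token in doc]
print(tokens)

# Token.string / Span.string -> Token.text / Span.text
first_text = doc[0].text

# vocab=False -> exclude=["vocab"] on the serialization methods
nlp_bytes = nlp.to_bytes(exclude=["vocab"])
nlp_restored = spacy.blank("en").from_bytes(nlp_bytes, exclude=["vocab"])
```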

## Migrating from v2.x {#migrating}

### Downloading and loading models {#migrating-downloading-models}

Model symlinks and shortcuts like `en` are now officially deprecated. There are
[many different models](/models) with different capabilities and not just one
"English model". In order to download and load a model, you should always use
its full name – for instance, `en_core_web_sm`.

```diff
- python -m spacy download en
+ python -m spacy download en_core_web_sm
```

```diff
- nlp = spacy.load("en")
+ nlp = spacy.load("en_core_web_sm")
```

### Custom pipeline components and factories {#migrating-pipeline-components}

Custom pipeline components now have to be registered explicitly using the
[`@Language.component`](/api/language#component) or
[`@Language.factory`](/api/language#factory) decorator. For simple functions
that take a `Doc` and return it, all you have to do is add the
`@Language.component` decorator to it and assign it a name:

```diff
### Stateless function components
+ from spacy.language import Language

+ @Language.component("my_component")
def my_component(doc):
    return doc
```

For class components that are initialized with settings and/or the shared `nlp`
object, you can use the `@Language.factory` decorator. Also make sure that the
method used to initialize the factory has **two named arguments**: `nlp` (the
current `nlp` object) and `name` (the string name of the component instance).

```diff
### Stateful class components
+ from spacy.language import Language

+ @Language.factory("my_component")
class MyComponent:
-   def __init__(self, nlp):
+   def __init__(self, nlp, name):
        self.nlp = nlp

    def __call__(self, doc):
        return doc
```

Instead of decorating your class, you could also add a factory function that
takes the arguments `nlp` and `name` and returns an instance of your component:

```diff
### Stateful class components with factory function
+ from spacy.language import Language

+ @Language.factory("my_component")
+ def create_my_component(nlp, name):
+     return MyComponent(nlp)

class MyComponent:
    def __init__(self, nlp):
        self.nlp = nlp

    def __call__(self, doc):
        return doc
```

The `@Language.component` and `@Language.factory` decorators now take care of
adding an entry to the component factories, so spaCy knows how to load a
component back in from its string name. You won't have to write to
`Language.factories` manually anymore.

```diff
- Language.factories["my_component"] = lambda nlp, **cfg: MyComponent(nlp)
```
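
To see the registration in action, here is a small runnable sketch, assuming
spaCy v3 is installed. The component names `sketch_passthrough` and
`sketch_token_counter` and the `TokenCounter` class are hypothetical and only
serve as an illustration.

```python
import spacy
from spacy.language import Language


# Stateless component: registered under a (hypothetical) string name.
@Language.component("sketch_passthrough")
def passthrough(doc):
    return doc


class TokenCounter:
    """Stateful component that counts the tokens it has seen."""

    def __init__(self, nlp, name):
        self.name = name
        self.n_tokens = 0

    def __call__(self, doc):
        self.n_tokens += len(doc)
        return doc


# Stateful component: the factory function receives nlp and name.
@Language.factory("sketch_token_counter")
def create_token_counter(nlp, name):
    return TokenCounter(nlp, name)


# Both names are now known to spaCy without writing to Language.factories.
registered = [
    Language.has_factory("sketch_passthrough"),
    Language.has_factory("sketch_token_counter"),
]
print(registered)

nlp = spacy.blank("en")
nlp.add_pipe("sketch_passthrough")
counter = nlp.add_pipe("sketch_token_counter")
nlp("This is a test")
print(counter.n_tokens)
```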

#### Adding components to the pipeline {#migrating-add-pipe}

The [`nlp.add_pipe`](/api/language#add_pipe) method now takes the **string
name** of the component factory instead of a callable component. This allows
spaCy to track and serialize components that have been added and their settings.

```diff
+ @Language.component("my_component")
def my_component(doc):
    return doc

- nlp.add_pipe(my_component)
+ nlp.add_pipe("my_component")
```

[`nlp.add_pipe`](/api/language#add_pipe) now also returns the pipeline component
itself, so you can access its attributes. The
[`nlp.create_pipe`](/api/language#create_pipe) method is now mostly internal and
you typically shouldn't have to use it in your code.

```diff
- parser = nlp.create_pipe("parser")
- nlp.add_pipe(parser)
+ parser = nlp.add_pipe("parser")
```
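
The following sketch shows the returned component being used directly, assuming
spaCy v3 is installed. It relies on the built-in `sentencizer` so that no
trained model is required. The `sketch_noop` component and the instance name
`first_pass` are hypothetical.

```python
import spacy
from spacy.language import Language


@Language.component("sketch_noop")
def noop(doc):
    return doc


nlp = spacy.blank("en")

# add_pipe returns the component instance that was created and added.
sentencizer = nlp.add_pipe("sentencizer")
same_object = nlp.get_pipe("sentencizer") is sentencizer

# A custom instance name and a position can be given alongside the factory name.
nlp.add_pipe("sketch_noop", name="first_pass", before="sentencizer")
pipe_names = nlp.pipe_names
print(pipe_names)

doc = nlp("Hello world. This is spaCy.")
n_sents = len(list(doc.sents))
print(n_sents)
```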

### Training models {#migrating-training}

To train your models, you should now pretty much always use the
[`spacy train`](/api/cli#train) CLI. You shouldn't have to put together your own
training scripts anymore, unless you _really_ want to. The training commands now
use a [flexible config file](/usage/training#config) that describes all training
settings and hyperparameters, as well as your pipeline, model components and
architectures to use. The `--code` argument lets you pass in code containing
[custom registered functions](/usage/training#custom-code) that you can
reference in your config.

#### Binary .spacy training data format {#migrating-training-format}

spaCy now uses a new
[binary training data format](/api/data-formats#binary-training), which is much
smaller and consists of `Doc` objects, serialized via the
[`DocBin`](/api/docbin). You can convert your existing JSON-formatted data using
the [`spacy convert`](/api/cli#convert) command, which outputs `.spacy` files:

```bash
$ python -m spacy convert ./training.json ./output
```
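
If your data is already available as `Doc` objects, a `.spacy` file can also be
written from Python. The sketch below assumes spaCy v3 is installed. The texts
and the file name `train.spacy` are placeholders, and a temporary directory is
used so that nothing is left on disk.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
texts = ["This is a sentence.", "Here is another one."]

# Collect Doc objects in a DocBin, the container behind the .spacy format.
doc_bin = DocBin()
for text in texts:
    doc_bin.add(nlp.make_doc(text))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "train.spacy"
    doc_bin.to_disk(path)
    # Read the file back in to check the round trip.
    loaded = DocBin().from_disk(path)

docs = list(loaded.get_docs(nlp.vocab))
loaded_texts = [doc.text for doc in docs]
print(loaded_texts)
```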

#### Training config {#migrating-training-config}

<!-- TODO: update once we have recommended "getting started with a new config" workflow -->

```diff
### {wrap="true"}
- python -m spacy train en ./output ./train.json ./dev.json --pipeline tagger,parser --cnn-window 1 --bilstm-depth 0
+ python -m spacy train ./config.cfg --output ./output
```

<Project id="some_example_project">

The easiest way to get started with an end-to-end training process is to clone a
[project](/usage/projects) template. Projects let you manage multi-step
workflows, from data preprocessing to training and packaging your model.

</Project>

#### Migrating training scripts to CLI command and config {#migrating-training-scripts}

<!-- TODO: write -->

#### Training via the Python API {#migrating-training-python}

<!-- TODO: this should explain the GoldParse -> Example stuff -->
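
As a rough sketch of the `GoldParse` to `Example` change listed in the table of
removed objects: an `Example` holds a predicted `Doc` and a reference `Doc` with
the gold-standard annotations. The import path `spacy.training` is an
assumption based on the final v3.0 package layout, and the text and entity
offsets are made up for illustration.

```python
import spacy
from spacy.training import Example  # assumed import path, see note above

nlp = spacy.blank("en")

# The predicted Doc is just the tokenized text, without annotations.
predicted = nlp.make_doc("Apple is opening an office in Paris")

# Gold-standard annotations are passed as a dict, here character offsets
# and labels for two entities.
annotations = {"entities": [(0, 5, "ORG"), (30, 35, "GPE")]}
example = Example.from_dict(predicted, annotations)

gold_ents = [(ent.text, ent.label_) for ent in example.reference.ents]
print(gold_ents)
```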

#### Packaging models {#migrating-training-packaging}

The [`spacy package`](/api/cli#package) command now automatically builds the
installable `.tar.gz` sdist of the Python package, so you don't have to run this
step manually anymore. You can disable the behavior by setting the `--no-sdist`
flag.

```diff
python -m spacy package ./model ./packages
- cd ./packages/en_model-0.0.0
- python setup.py sdist
```

## Migration notes for plugin maintainers {#plugins}

Thanks to everyone who's been contributing to the spaCy ecosystem by developing
and maintaining one of the many awesome [plugins and extensions](/universe).
We've tried to keep breaking changes to a minimum and make it as easy as
possible for you to upgrade your packages for spaCy v3.

### Custom pipeline components

The most common use case for plugins is providing pipeline components and
extension attributes.

- Use the [`@Language.factory`](/api/language#factory) decorator to register
  your component and assign it a name. This allows users to refer to your
  components by name and serialize pipelines referencing them. Remove all manual
  entries in `Language.factories`.
- Make sure your component factories take at least two **named arguments**:
  `nlp` (the current `nlp` object) and `name` (the instance name of the added
  component so you can identify multiple instances of the same component).
- Update all references to [`nlp.add_pipe`](/api/language#add_pipe) in your docs
  to use **string names** instead of the component functions.

```python
### {highlight="1-5"}
from spacy.language import Language

@Language.factory("my_component", default_config={"some_setting": False})
def create_component(nlp: Language, name: str, some_setting: bool):
    return MyCoolComponent(some_setting=some_setting)


class MyCoolComponent:
    def __init__(self, some_setting):
        self.some_setting = some_setting

    def __call__(self, doc):
        # Do something to the doc
        return doc
```

> #### Result in config.cfg
>
> ```ini
> [components.my_component]
> factory = "my_component"
> some_setting = true
> ```

```diff
import spacy
from your_plugin import MyCoolComponent

nlp = spacy.load("en_core_web_sm")
- component = MyCoolComponent(some_setting=True)
- nlp.add_pipe(component)
+ nlp.add_pipe("my_component", config={"some_setting": True})
```
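
Because each instance gets its own `name`, the same factory can be added more
than once with different settings. The sketch below assumes spaCy v3 is
installed. The factory name `sketch_flagger`, the `Flagger` class and the
instance name `sketch_flagger_strict` are hypothetical.

```python
import spacy
from spacy.language import Language


class Flagger:
    def __init__(self, name, some_setting):
        self.name = name
        self.some_setting = some_setting

    def __call__(self, doc):
        # Do something to the doc
        return doc


@Language.factory("sketch_flagger", default_config={"some_setting": False})
def create_flagger(nlp: Language, name: str, some_setting: bool):
    return Flagger(name, some_setting)


nlp = spacy.blank("en")

# First instance: uses the default_config and the factory name as its name.
default = nlp.add_pipe("sketch_flagger")

# Second instance: custom name and an overridden setting.
strict = nlp.add_pipe(
    "sketch_flagger",
    name="sketch_flagger_strict",
    config={"some_setting": True},
)

print(nlp.pipe_names)
print(default.some_setting, strict.some_setting)
```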

<Infobox title="Important note on registering factories" variant="warning">

The [`@Language.factory`](/api/language#factory) decorator takes care of letting
spaCy know that a component of that name is available. This means that your
users can add it to the pipeline using its **string name**. However, this
requires the decorator to be executed – so users will still have to **import
your plugin**. Alternatively, your plugin could expose an
[entry point](/usage/saving-loading#entry-points), which spaCy can read from.
This means that spaCy knows how to initialize `my_component`, even if your
package isn't imported.

</Infobox>