spaCy/website/docs/usage/v2-3.md

---
title: What's New in v2.3
teaser: New features, backwards incompatibilities and migration guide
menu:
  - ['New Features', 'features']
  - ['Backwards Incompatibilities', 'incompat']
  - ['Migrating from v2.2', 'migrating']
---

## New Features {#features hidden="true"}

spaCy v2.3 features new pretrained models for five languages, word vectors for
all language models, and decreased model size and loading times for models with
vectors. We've added pretrained models for **Chinese, Danish, Japanese, Polish
and Romanian** and updated the training data and vectors for most languages.
Model packages with vectors are about **2&times;** smaller on disk and load
**2-4&times;** faster. For the full changelog, see the [release notes on
GitHub](https://github.com/explosion/spaCy/releases/tag/v2.3.0). For more
details and a behind-the-scenes look at the new release, [see our blog
post](https://explosion.ai/blog/spacy-v2-3).

### Expanded model families with vectors {#models}

> #### Example
>
> ```bash
> python -m spacy download da_core_news_sm
> python -m spacy download ja_core_news_sm
> python -m spacy download pl_core_news_sm
> python -m spacy download ro_core_news_sm
> python -m spacy download zh_core_web_sm
> ```

With new model families for Chinese, Danish, Japanese, Polish and Romanian plus
`md` and `lg` models with word vectors for all languages, this release provides
a total of 46 model packages. For models trained using [Universal
Dependencies](https://universaldependencies.org) corpora, the training data has
been updated to UD v2.5 (v2.6 for Japanese, v2.3 for Polish) and Dutch has been
extended to include both UD Dutch Alpino and LassySmall.

<Infobox>

**Models:** [Models directory](/models) **Benchmarks:**
[Release notes](https://github.com/explosion/spaCy/releases/tag/v2.3.0)

</Infobox>

### Chinese {#chinese}

> #### Example
>
> ```python
> from spacy.lang.zh import Chinese
>
> # Load with "default" model provided by pkuseg
> cfg = {"pkuseg_model": "default", "require_pkuseg": True}
> nlp = Chinese(meta={"tokenizer": {"config": cfg}})
>
> # Append words to user dict
> nlp.tokenizer.pkuseg_update_user_dict(["中国", "ABC"])
> ```

This release adds support for
[pkuseg](https://github.com/lancopku/pkuseg-python) for word segmentation and
the new Chinese models ship with a custom pkuseg model trained on OntoNotes.
The Chinese tokenizer can be initialized with both `pkuseg` and custom models
and the `pkuseg` user dictionary is easy to customize.

<Infobox>

**Chinese:** [Chinese tokenizer usage](/usage/models#chinese)

</Infobox>

### Japanese {#japanese}

The updated Japanese language class switches to
[SudachiPy](https://github.com/WorksApplications/SudachiPy) for word
segmentation and part-of-speech tagging. Using `sudachipy` greatly simplifies
installing spaCy for Japanese, which is now possible with a single command:
`pip install spacy[ja]`.

<Infobox>

**Japanese:** [Japanese tokenizer usage](/usage/models#japanese)

</Infobox>

### Small CLI updates

- `spacy debug-data` provides the coverage of the vectors in a base model with
  `spacy debug-data lang train dev -b base_model`
- `spacy evaluate` supports `blank:lg` (e.g. `spacy evaluate blank:en
  dev.json`) to evaluate the tokenization accuracy without loading a model
- `spacy train` on GPU restricts the CPU timing evaluation to the first
  iteration

## Backwards incompatibilities {#incompat}

<Infobox title="Important note on models" variant="warning">

If you've been training **your own models**, you'll need to **retrain** them
with the new version. Also don't forget to upgrade all models to the latest
versions. Models for earlier v2 releases (v2.0, v2.1, v2.2) aren't compatible
with spaCy v2.3. To check if all of your models are up to date, you can
run the [`spacy validate`](/api/cli#validate) command.

</Infobox>

> #### Install with lookups data
>
> ```bash
> $ pip install spacy[lookups]
> ```
>
> You can also install
> [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data)
> directly.

- If you're training new models, you'll want to install the package
  [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data),
  which now includes both the lemmatization tables (as in v2.2) and the
  normalization tables (new in v2.3). If you're using pretrained models,
  **nothing changes**, because the relevant tables are included in the model
  packages.
- Due to the updated Universal Dependencies training data, the fine-grained
  part-of-speech tags will change for many provided language models. The
  coarse-grained part-of-speech tagset remains the same, but the mapping from
  particular fine-grained to coarse-grained tags may show minor differences.
- For French, Italian, Portuguese and Spanish, the fine-grained part-of-speech
  tagsets contain new merged tags related to contracted forms, such as
  `ADP_DET` for French `"au"`, which maps to UPOS `ADP` based on the head
  `"à"`. This increases the accuracy of the models by improving the alignment
  between spaCy's tokenization and Universal Dependencies multi-word tokens
  used for contractions.
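
The merged-tag idea from the last item can be sketched in plain Python. This is
not spaCy's implementation: the `Part` record, the `merge_contraction` helper
and the explicit `is_head` flag are hypothetical names introduced only to show
how a fine-grained tag like `ADP_DET` and its coarse-grained tag could be
derived from the words inside a Universal Dependencies multi-word token.

```python
from collections import namedtuple

# Hypothetical record for one syntactic word inside a UD multi-word token.
Part = namedtuple("Part", ["text", "upos", "is_head"])


def merge_contraction(surface, parts):
    """Return (surface form, merged fine-grained tag, coarse-grained tag)."""
    # The merged fine-grained tag joins the UPOS tags of all parts in order.
    fine = "_".join(part.upos for part in parts)
    # The coarse-grained tag is taken from the part marked as the head.
    coarse = next(part.upos for part in parts if part.is_head)
    return surface, fine, coarse


# French "au" corresponds to the UD words "à" (ADP, head) and "le" (DET).
au = merge_contraction("au", [Part("à", "ADP", True), Part("le", "DET", False)])
print(au)
```

Keeping `"au"` as a single token with a merged tag means the token boundaries
line up with the surface text, while the coarse-grained tag still follows the
head word.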

### Migrating from spaCy 2.2 {#migrating}

#### Tokenizer settings

In spaCy v2.2.2-v2.2.4, there was a change to the precedence of `token_match`
that gave prefixes and suffixes priority over `token_match`, which caused
problems for many custom tokenizer configurations. This has been reverted in
v2.3 so that `token_match` has priority over prefixes and suffixes as in v2.2.1
and earlier versions.

A new tokenizer setting `url_match` has been introduced in v2.3.0 to handle
cases like URLs where the tokenizer should remove prefixes and suffixes (e.g.,
a comma at the end of a URL) before applying the match. See the full [tokenizer
documentation](/usage/linguistic-features#tokenization) and try out
[`nlp.tokenizer.explain()`](/usage/linguistic-features#tokenizer-debug) when
debugging your tokenizer configuration.
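
The difference between the two settings can be illustrated with a toy tokenizer
in plain Python. It is a deliberately simplified sketch, not spaCy's tokenizer:
the affix and infix rules, the `URL` pattern and the `toy_tokenize` function
are assumptions made for illustration. The only behavior it models is the one
described above: `token_match` is checked before affixes are split off, and
`url_match` only after they have been removed.

```python
import re

# Toy affix and infix rules, far simpler than spaCy's defaults.
PREFIX = re.compile(r"^[(\[]")
SUFFIX = re.compile(r"[)\],.!?]$")
INFIX = re.compile(r"([/:])")
# Hypothetical URL pattern; \S+ deliberately also matches a trailing comma.
URL = re.compile(r"^https?://\S+$")


def toy_tokenize(text, token_match=None, url_match=None):
    tokens = []
    for chunk in text.split():
        trailing = []
        while PREFIX.search(chunk) or SUFFIX.search(chunk):
            # token_match is checked before any affix is split off.
            if token_match and token_match(chunk):
                break
            match = PREFIX.search(chunk)
            if match:
                tokens.append(match.group())
                chunk = chunk[match.end():]
                continue
            match = SUFFIX.search(chunk)
            trailing.insert(0, match.group())
            chunk = chunk[:match.start()]
        if chunk:
            # url_match is only consulted once no affixes are left.
            keep_whole = (token_match and token_match(chunk)) or (
                url_match and url_match(chunk)
            )
            if keep_whole:
                tokens.append(chunk)
            else:
                tokens.extend(part for part in INFIX.split(chunk) if part)
        tokens.extend(trailing)
    return tokens


text = "(see https://example.com,"
as_token_match = toy_tokenize(text, token_match=URL.match)
as_url_match = toy_tokenize(text, url_match=URL.match)
print(as_token_match)  # the comma stays attached to the URL
print(as_url_match)    # the comma is split off and the URL is kept whole
```

In this sketch, passing the same pattern as `token_match` swallows the trailing
comma, while passing it as `url_match` lets the suffix rule remove the comma
first.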

#### Warnings configuration

spaCy's custom warnings have been replaced with native Python
[`warnings`](https://docs.python.org/3/library/warnings.html). Instead of
setting `SPACY_WARNING_IGNORE`, use the [warnings
filters](https://docs.python.org/3/library/warnings.html#the-warnings-filter)
to manage warnings.
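
As a minimal sketch of the filter-based approach, the snippet below silences one
warning by its code using only the standard library. It assumes that the
warning text starts with a bracketed code such as `[W008]` and that the warning
is raised as a `UserWarning`; the `noisy_call` function and both messages are
simulated stand-ins, not real spaCy output.

```python
import warnings


def noisy_call():
    # Simulated warnings standing in for library warnings (hypothetical text).
    warnings.warn("[W008] simulated warning to silence", UserWarning)
    warnings.warn("[W999] simulated warning to keep", UserWarning)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # `message` is a regex matched against the start of the warning text.
    warnings.filterwarnings("ignore", message=r"\[W008\]", category=UserWarning)
    noisy_call()

remaining = [str(w.message) for w in caught]
print(remaining)
```

Because `filterwarnings` inserts its rule at the front of the filter list, the
`"ignore"` rule takes precedence over the earlier `"always"` rule for the
matching message only.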

#### Normalization tables

The normalization tables have moved from the language data in
[`spacy/lang`](https://github.com/explosion/spaCy/tree/master/spacy/lang) to
the package
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data). If
you're adding data for a new language, the normalization table should be added
to `spacy-lookups-data`. See [adding norm
exceptions](/usage/adding-languages#norm-exceptions).

#### Probability and cluster features

> #### Load and save extra prob lookups table
>
> ```python
> from spacy.lang.en import English
> nlp = English()
> doc = nlp("the")
> print(doc[0].prob) # lazily loads extra prob table
> nlp.to_disk("/path/to/model") # includes prob table
> ```

The `Token.prob` and `Token.cluster` features, which are no longer used by the
core pipeline components as of spaCy v2, are no longer provided in the
pretrained models to reduce the model size. To keep these features available
for users relying on them, the `prob` and `cluster` features for the most
frequent 1M tokens have been moved to
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) as
`extra` features for the relevant languages (English, German, Greek and
Spanish).

The extra tables are loaded lazily, so if you have `spacy-lookups-data`
installed and your code accesses `Token.prob`, the full table is loaded into
the model vocab, which will take a few seconds on initial loading. When you
save this model after loading the `prob` table, the full `prob` table will be
saved as part of the model vocab.

If you'd like to include custom `cluster`, `prob`, or `sentiment` tables as
part of a new model, add the data to
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) under
the entry point `lg_extra`, e.g. `en_extra` for English. Alternatively, you can
initialize your [`Vocab`](/api/vocab) with the `lookups_extra` argument set to a
[`Lookups`](/api/lookups) object that includes the tables `lexeme_cluster`,
`lexeme_prob`, `lexeme_sentiment` or `lexeme_settings`. `lexeme_settings` is
currently only used to provide a custom `oov_prob`. See examples in the [`data`
directory](https://github.com/explosion/spacy-lookups-data/tree/master/spacy_lookups_data/data)
in `spacy-lookups-data`.
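
To show how these tables relate to each other, here is a sketch that uses plain
dictionaries in place of a `Lookups` object. The table names come from the
paragraph above, but the values, the `log_prob` helper and the fallback default
are hypothetical. It only illustrates the assumed role of `oov_prob` as the
value returned for words missing from `lexeme_prob`.

```python
# Hypothetical table contents keyed by the table names described above.
tables = {
    "lexeme_prob": {"the": -3.5, "cat": -11.0},
    "lexeme_cluster": {"the": 11, "cat": 170},
    "lexeme_settings": {"oov_prob": -20.0},
}


def log_prob(word, tables):
    """Look up a word's log probability, falling back to oov_prob."""
    # Assumed default for the case where no lexeme_settings table is given.
    oov_prob = tables.get("lexeme_settings", {}).get("oov_prob", -20.0)
    return tables.get("lexeme_prob", {}).get(word, oov_prob)


known = log_prob("the", tables)
unknown = log_prob("zyzzyva", tables)
print(known, unknown)
```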

#### Initializing new models without extra lookups tables

When you initialize a new model with [`spacy init-model`](/api/cli#init-model),
the `prob` table from `spacy-lookups-data` may be loaded as part of the
initialization. If you'd like to omit this extra data as in spaCy's provided
v2.3 models, use the new flag `--omit-extra-lookups`.