title | teaser | menu | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
What's New in v2.3 | New features, backwards incompatibilities and migration guide |
|
New Features
spaCy v2.3 features new pretrained models for five languages, word vectors for all language models, and decreased model size and loading times for models with vectors. We've added pretrained models for Chinese, Danish, Japanese, Polish and Romanian and updated the training data and vectors for most languages. Model packages with vectors are about 2× smaller on disk and load 2-4× faster. For the full changelog, see the release notes on GitHub. For more details and a behind-the-scenes look at the new release, see our blog post.
Expanded model families with vectors
Example
python -m spacy download da_core_news_sm
python -m spacy download ja_core_news_sm
python -m spacy download pl_core_news_sm
python -m spacy download ro_core_news_sm
python -m spacy download zh_core_web_sm
With new model families for Chinese, Danish, Japanese, Polish and Romanian, plus md and lg models with word vectors for all languages, this release provides a total of 46 model packages. For models trained using Universal Dependencies corpora, the training data has been updated to UD v2.5 (v2.6 for Japanese, v2.3 for Polish), and Dutch has been extended to include both UD Dutch Alpino and LassySmall.
Models: Models directory Benchmarks: Release notes
Chinese
Example
from spacy.lang.zh import Chinese

# Load with "default" model provided by pkuseg
cfg = {"pkuseg_model": "default", "require_pkuseg": True}
nlp = Chinese(meta={"tokenizer": {"config": cfg}})

# Append words to user dict
nlp.tokenizer.pkuseg_update_user_dict(["中国", "ABC"])
This release adds support for pkuseg for word segmentation, and the new Chinese models ship with a custom pkuseg model trained on OntoNotes. The Chinese tokenizer can be initialized with both pkuseg and custom models, and the pkuseg user dictionary is easy to customize. Note that pkuseg doesn't yet ship with pre-compiled wheels for Python 3.8. See the usage documentation for details on how to install it on Python 3.8.
Models: Chinese models Usage: Chinese tokenizer usage
Japanese
The updated Japanese language class switches to SudachiPy for word segmentation and part-of-speech tagging. Using SudachiPy greatly simplifies installing spaCy for Japanese, which is now possible with a single command: pip install spacy[ja].
Models: Japanese models Usage: Japanese tokenizer usage
Small CLI updates
- spacy debug-data provides the coverage of the vectors in a base model with spacy debug-data lang train dev -b base_model
- spacy evaluate supports blank:lg (e.g. spacy evaluate blank:en dev.json) to evaluate the tokenization accuracy without loading a model
- spacy train on GPU restricts the CPU timing evaluation to the first iteration
Backwards incompatibilities
If you've been training your own models, you'll need to retrain them with the new version. Also don't forget to upgrade all models to the latest versions. Models for earlier v2 releases (v2.0, v2.1, v2.2) aren't compatible with v2.3. To check if all of your models are up to date, you can run the spacy validate command.
Install with lookups data
$ pip install spacy[lookups]
You can also install spacy-lookups-data directly.
- If you're training new models, you'll want to install the package spacy-lookups-data, which now includes both the lemmatization tables (as in v2.2) and the normalization tables (new in v2.3). If you're using pretrained models, nothing changes, because the relevant tables are included in the model packages.
- Due to the updated Universal Dependencies training data, the fine-grained part-of-speech tags will change for many provided language models. The coarse-grained part-of-speech tagset remains the same, but the mapping from particular fine-grained to coarse-grained tags may show minor differences.
- For French, Italian, Portuguese and Spanish, the fine-grained part-of-speech tagsets contain new merged tags related to contracted forms, such as ADP_DET for French "au", which maps to UPOS ADP based on the head "à". This increases the accuracy of the models by improving the alignment between spaCy's tokenization and Universal Dependencies multi-word tokens used for contractions.
Migrating from spaCy 2.2
Tokenizer settings
In spaCy v2.2.2-v2.2.4, there was a change to the precedence of token_match that gave prefixes and suffixes priority over token_match, which caused problems for many custom tokenizer configurations. This has been reverted in v2.3 so that token_match has priority over prefixes and suffixes, as in v2.2.1 and earlier versions.
A new tokenizer setting url_match has been introduced in v2.3.0 to handle cases like URLs where the tokenizer should remove prefixes and suffixes (e.g., a comma at the end of a URL) before applying the match. See the full tokenizer documentation and try out nlp.tokenizer.explain() when debugging your tokenizer configuration.
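The difference between the two settings can be checked on a blank English pipeline. The following is a minimal sketch, assuming spaCy v2.3 or later is installed; the bracket pattern and the example sentences are made up for illustration and are not part of spaCy's defaults.

```python
import re

from spacy.lang.en import English

# Default English tokenizer: brackets are split off as prefix and suffix.
baseline = English()
print([t.text for t in baseline("call (foo) now")])

# Hypothetical token_match pattern that keeps a bracketed word together.
# In v2.3 it is checked before prefixes and suffixes are split off.
custom = English()
custom.tokenizer.token_match = re.compile(r"^\(\w+\)$").match
bracket_tokens = [t.text for t in custom("call (foo) now")]
print(bracket_tokens)

# url_match: the trailing comma is split off as a suffix first,
# then the URL pattern is applied to what remains.
url_tokens = [t.text for t in baseline("Docs at https://example.com, thanks")]
print(url_tokens)

# explain() reports which rule or pattern produced each token.
for rule, text in custom.tokenizer.explain("call (foo) now"):
    print(rule, text)
```

If a custom token_match pattern stops matching after an upgrade, comparing the explain() output between the two versions is usually the quickest way to see which rule took over.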
Warnings configuration
spaCy's custom warnings have been replaced with native Python warnings. Instead of setting SPACY_WARNING_IGNORE, use the warnings filters to manage warnings.
import spacy
+ import warnings
- spacy.errors.SPACY_WARNING_IGNORE.append('W007')
+ warnings.filterwarnings("ignore", message=r"\[W007\]", category=UserWarning)
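To see how such a filter behaves, here is a small sketch that uses only the standard library. The two warnings are stand-ins with placeholder text; the only assumption carried over from spaCy is that its warning messages begin with a bracketed code such as [W007]. The message argument is a regular expression matched against the start of the warning text, which is why the brackets are escaped.

```python
import warnings


def emit_demo_warnings():
    # Stand-ins for spaCy warnings: placeholder text after a bracketed code.
    warnings.warn("[W007] placeholder text for the ignored warning", UserWarning)
    warnings.warn("[W008] placeholder text for another warning", UserWarning)


with warnings.catch_warnings(record=True) as caught:
    # Record everything by default...
    warnings.simplefilter("always")
    # ...except warnings whose message starts with "[W007]".
    warnings.filterwarnings("ignore", message=r"\[W007\]", category=UserWarning)
    emit_demo_warnings()

remaining = [str(w.message) for w in caught]
print(remaining)
```

Only the second warning is recorded, because the most recently added filter is consulted first.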
Normalization tables
The normalization tables have moved from the language data in spacy/lang to the package spacy-lookups-data.
If you're adding data for a new language, the normalization table should be added to spacy-lookups-data. See adding norm exceptions.
Probability and cluster features
Load and save extra prob lookups table
from spacy.lang.en import English

nlp = English()
doc = nlp("the")
print(doc[0].prob)  # lazily loads extra prob table
nlp.to_disk("/path/to/model")  # includes prob table
The Token.prob and Token.cluster features, which are no longer used by the core pipeline components as of spaCy v2, are no longer provided in the pretrained models to reduce the model size. To keep these features available for users relying on them, the prob and cluster features for the most frequent 1M tokens have been moved to spacy-lookups-data as extra features for the relevant languages (English, German, Greek and Spanish).
The extra tables are loaded lazily, so if you have spacy-lookups-data installed and your code accesses Token.prob, the full table is loaded into the model vocab, which will take a few seconds on initial loading. When you save this model after loading the prob table, the full prob table will be saved as part of the model vocab.
If you'd like to include custom cluster, prob, or sentiment tables as part of a new model, add the data to spacy-lookups-data under the entry point lg_extra, e.g. en_extra for English. Alternatively, you can initialize your Vocab with the lookups_extra argument with a Lookups object that includes the tables lexeme_cluster, lexeme_prob, lexeme_sentiment or lexeme_settings. lexeme_settings is currently only used to provide a custom oov_prob. See examples in the data directory in spacy-lookups-data.
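As a rough sketch of the second option, the tables can be built with the Lookups API before the vocab is created. The words and values below are invented for illustration, and the assumed table layout (word string mapped to a log probability) should be checked against the examples in spacy-lookups-data.

```python
from spacy.lookups import Lookups

lookups = Lookups()

# Hypothetical log probabilities, for illustration only.
lookups.add_table("lexeme_prob", {"the": -3.5, "tokenizer": -12.0})

# lexeme_settings carries the custom out-of-vocabulary probability.
lookups.add_table("lexeme_settings", {"oov_prob": -20.0})

prob_table = lookups.get_table("lexeme_prob")
settings_table = lookups.get_table("lexeme_settings")
print(prob_table["the"])
print(settings_table["oov_prob"])
```

An object built this way is what the lookups_extra argument of Vocab expects in v2.3.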
Initializing new models without extra lookups tables
When you initialize a new model with spacy init-model, the prob table from spacy-lookups-data may be loaded as part of the initialization. If you'd like to omit this extra data as in spaCy's provided v2.3 models, use the new flag --omit-extra-lookups.