spaCy/website/docs/usage/v2-3.md

What's New in v2.3

New features, backwards incompatibilities and migration guide

New Features

spaCy v2.3 features new pretrained models for five languages, word vectors for all language models, and decreased model size and loading times for models with vectors. We've added pretrained models for Chinese, Danish, Japanese, Polish and Romanian and updated the training data and vectors for most languages. Model packages with vectors are about 2× smaller on disk and load 2-4× faster. For the full changelog, see the release notes on GitHub. For more details and a behind-the-scenes look at the new release, see our blog post.

Expanded model families with vectors

Example

python -m spacy download da_core_news_sm
python -m spacy download ja_core_news_sm
python -m spacy download pl_core_news_sm
python -m spacy download ro_core_news_sm
python -m spacy download zh_core_web_sm

With new model families for Chinese, Danish, Japanese, Polish and Romanian plus md and lg models with word vectors for all languages, this release provides a total of 46 model packages. For models trained using Universal Dependencies corpora, the training data has been updated to UD v2.5 (v2.6 for Japanese, v2.3 for Polish) and Dutch has been extended to include both UD Dutch Alpino and LassySmall.

Models: Models directory Benchmarks: Release notes

Chinese

Example

from spacy.lang.zh import Chinese

# Load with "default" model provided by pkuseg
cfg = {"pkuseg_model": "default", "require_pkuseg": True}
nlp = Chinese(meta={"tokenizer": {"config": cfg}})

# Append words to user dict
nlp.tokenizer.pkuseg_update_user_dict(["中国", "ABC"])

This release adds support for pkuseg for word segmentation and the new Chinese models ship with a custom pkuseg model trained on OntoNotes. The Chinese tokenizer can be initialized with both pkuseg and custom models and the pkuseg user dictionary is easy to customize. Note that pkuseg doesn't yet ship with pre-compiled wheels for Python 3.8. See the usage documentation for details on how to install it on Python 3.8.

Models: Chinese models Usage: Chinese tokenizer usage

Japanese

The updated Japanese language class switches to SudachiPy for word segmentation and part-of-speech tagging. Using SudachiPy greatly simplifies installing spaCy for Japanese, which is now possible with a single command: pip install spacy[ja].

Models: Japanese models Usage: Japanese tokenizer usage

Small CLI updates

  • spacy debug-data provides the coverage of the vectors in a base model with spacy debug-data lang train dev -b base_model
  • spacy evaluate supports blank:lg (e.g. spacy evaluate blank:en dev.json) to evaluate the tokenization accuracy without loading a model
  • spacy train on GPU restricts the CPU timing evaluation to the first iteration

Backwards incompatibilities

If you've been training your own models, you'll need to retrain them with the new version. Also don't forget to upgrade all pretrained models to the latest versions. Models trained for earlier v2 releases (v2.0, v2.1, v2.2) aren't compatible with spaCy v2.3. To check if all of your models are up to date, you can run the spacy validate command.

Install with lookups data

$ pip install spacy[lookups]

You can also install spacy-lookups-data directly.

  • If you're training new models, you'll want to install the package spacy-lookups-data, which now includes both the lemmatization tables (as in v2.2) and the normalization tables (new in v2.3). If you're using pretrained models, nothing changes, because the relevant tables are included in the model packages.
  • Due to the updated Universal Dependencies training data, the fine-grained part-of-speech tags will change for many provided language models. The coarse-grained part-of-speech tagset remains the same, but the mapping from particular fine-grained to coarse-grained tags may show minor differences.
  • For French, Italian, Portuguese and Spanish, the fine-grained part-of-speech tagsets contain new merged tags related to contracted forms, such as ADP_DET for French "au", which maps to UPOS ADP based on the head "à". This increases the accuracy of the models by improving the alignment between spaCy's tokenization and Universal Dependencies multi-word tokens used for contractions.

Migrating from spaCy 2.2

Tokenizer settings

In spaCy v2.2.2-v2.2.4, there was a change to the precedence of token_match that gave prefixes and suffixes priority over token_match, which caused problems for many custom tokenizer configurations. This has been reverted in v2.3 so that token_match has priority over prefixes and suffixes as in v2.2.1 and earlier versions.

A new tokenizer setting url_match has been introduced in v2.3.0 to handle cases like URLs where the tokenizer should remove prefixes and suffixes (e.g., a comma at the end of a URL) before applying the match. See the full tokenizer documentation and try out nlp.tokenizer.explain() when debugging your tokenizer configuration.
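The difference between the two settings can be illustrated with a small toy model. This is only a sketch of the precedence described above, not spaCy's tokenizer implementation: the function `tokenize_chunk`, the affix sets and the infix pattern are all made up for illustration, and special cases are left out entirely.

```python
import re

# Toy affix sets and infix pattern: assumptions for illustration only
PREFIXES = "(\"'"
SUFFIXES = ",.)\"'"
INFIX = re.compile(r"(-)")


def tokenize_chunk(chunk, token_match=None, url_match=None):
    """Split one whitespace-delimited chunk, mimicking the v2.3 precedence."""
    prefixes, suffixes = [], []
    while chunk:
        # token_match is tried first, before any affix is stripped
        if token_match and token_match(chunk):
            return prefixes + [chunk] + suffixes[::-1]
        if chunk[0] in PREFIXES:
            prefixes.append(chunk[0])
            chunk = chunk[1:]
        elif chunk[-1] in SUFFIXES:
            suffixes.append(chunk[-1])
            chunk = chunk[:-1]
        else:
            break
    if not chunk:
        middle = []
    elif url_match and url_match(chunk):
        # url_match only sees what is left after affixes are removed
        middle = [chunk]
    else:
        middle = [part for part in INFIX.split(chunk) if part]
    return prefixes + middle + suffixes[::-1]


is_url = re.compile(r"^https?://\S+$").match
text = "https://example.com/a-b,"

as_token_match = tokenize_chunk(text, token_match=is_url)
as_url_match = tokenize_chunk(text, url_match=is_url)
no_match = tokenize_chunk(text)

print(as_token_match)  # ['https://example.com/a-b,']
print(as_url_match)    # ['https://example.com/a-b', ',']
print(no_match)        # ['https://example.com/a', '-', 'b', ',']
```

With the same pattern, token_match keeps the trailing comma attached because it sees the chunk before any suffix is removed, while url_match sees the chunk only after the comma has been split off.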

Warnings configuration

spaCy's custom warnings have been replaced with native Python warnings. Instead of setting SPACY_WARNING_IGNORE, use the warnings filters to manage warnings.

import spacy
+ import warnings

- spacy.errors.SPACY_WARNING_IGNORE.append('W007')
+ warnings.filterwarnings("ignore", message=r"\[W007\]", category=UserWarning)
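The message argument of warnings.filterwarnings is a regular expression matched against the start of the warning text, which is why the square brackets around the code are escaped. The following sketch uses only the standard library to show the effect; the two warning messages are placeholders, not spaCy's real warning texts.

```python
import warnings

with warnings.catch_warnings(record=True) as caught:
    # Record everything by default, then ignore messages starting with "[W007]"
    warnings.simplefilter("always")
    warnings.filterwarnings("ignore", message=r"\[W007\]", category=UserWarning)

    # Placeholder messages standing in for spaCy warnings
    warnings.warn("[W007] placeholder text", UserWarning)
    warnings.warn("[W008] placeholder text", UserWarning)

remaining = [str(w.message) for w in caught]
print(remaining)  # ['[W008] placeholder text']
```

Only the warning whose message starts with the filtered code is suppressed. Other warnings are still reported.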

Normalization tables

The normalization tables have moved from the language data in spacy/lang to the package spacy-lookups-data. If you're adding data for a new language, the normalization table should be added to spacy-lookups-data. See adding norm exceptions.

No preloaded vocab for models with vectors

To reduce the initial loading time, the lexemes in nlp.vocab are no longer loaded on initialization for models with vectors. As you process texts, the lexemes will be added to the vocab automatically, just as in small models without vectors.

To see the number of unique vectors and the number of words with vectors, check nlp.meta['vectors']. For example, en_core_web_md has 20000 unique vectors and 684830 words with vectors:

{
    'width': 300,
    'vectors': 20000,
    'keys': 684830,
    'name': 'en_core_web_md.vectors'
}
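Since there are far more keys than unique vectors, many words necessarily share a vector. As a quick back-of-the-envelope check, using a plain dict that copies the values shown above (the variable names are only illustrative):

```python
# Values copied from the en_core_web_md example above
vectors_meta = {
    "width": 300,
    "vectors": 20000,
    "keys": 684830,
    "name": "en_core_web_md.vectors",
}

# Average number of words that map to each unique vector
words_per_vector = vectors_meta["keys"] / vectors_meta["vectors"]
print(round(words_per_vector, 1))  # 34.2
```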

If required, for instance if you are working directly with word vectors rather than processing texts, you can load all lexemes for words with vectors at once:

for orth in nlp.vocab.vectors:
    _ = nlp.vocab[orth]

If your workflow previously iterated over nlp.vocab, a similar alternative is to iterate over words with vectors instead:

- lexemes = [w for w in nlp.vocab]
+ lexemes = [nlp.vocab[orth] for orth in nlp.vocab.vectors]

Be aware that the set of preloaded lexemes in a v2.2 model is not equivalent to the set of words with vectors. For English, v2.2 md/lg models have 1.3M provided lexemes but only 685K words with vectors. The vectors have been updated for most languages in v2.3, but the English models contain the same vectors for both v2.2 and v2.3.

Lexeme.is_oov and Token.is_oov

Due to a bug, the values for is_oov are reversed in v2.3.0, but this will be fixed in the next patch release v2.3.1.

In v2.3, Lexeme.is_oov and Token.is_oov are True if the lexeme does not have a word vector. This is equivalent to token.orth not in nlp.vocab.vectors.

Previously in v2.2, is_oov corresponded to whether a lexeme had stored probability and cluster features. The probability and cluster features are no longer included in the provided medium and large models (see the next section).

Probability and cluster features

Load and save extra prob lookups table

from spacy.lang.en import English
nlp = English()
doc = nlp("the")
print(doc[0].prob) # lazily loads extra prob table
nlp.to_disk("/path/to/model") # includes prob table

The Token.prob and Token.cluster features, which are no longer used by the core pipeline components as of spaCy v2, are no longer provided in the pretrained models to reduce the model size. To keep these features available for users relying on them, the prob and cluster features for the most frequent 1M tokens have been moved to spacy-lookups-data as extra features for the relevant languages (English, German, Greek and Spanish).

The extra tables are loaded lazily, so if you have spacy-lookups-data installed and your code accesses Token.prob, the full table is loaded into the model vocab, which will take a few seconds on initial loading. When you save this model after loading the prob table, the full prob table will be saved as part of the model vocab.

To load the probability table into a provided model, first make sure you have spacy-lookups-data installed. To load the table, remove the empty provided lexeme_prob table and then access Lexeme.prob for any word to load the table from spacy-lookups-data:

+ # prerequisite: pip install spacy-lookups-data
import spacy

nlp = spacy.load("en_core_web_md")

# remove the empty placeholder prob table
+ if nlp.vocab.lookups_extra.has_table("lexeme_prob"):
+     nlp.vocab.lookups_extra.remove_table("lexeme_prob")

# access any `.prob` to load the full table into the model
assert nlp.vocab["a"].prob == -3.9297883511

# if desired, save this model with the probability table included
nlp.to_disk("/path/to/model")

If you'd like to include custom cluster, prob, or sentiment tables as part of a new model, add the data to spacy-lookups-data under the entry point lg_extra, e.g. en_extra for English. Alternatively, you can initialize your Vocab with the lookups_extra argument with a Lookups object that includes the tables lexeme_cluster, lexeme_prob, lexeme_sentiment or lexeme_settings. lexeme_settings is currently only used to provide a custom oov_prob. See examples in the data directory in spacy-lookups-data.

Initializing new models without extra lookups tables

When you initialize a new model with spacy init-model, the prob table from spacy-lookups-data may be loaded as part of the initialization. If you'd like to omit this extra data as in spaCy's provided v2.3 models, use the new flag --omit-extra-lookups.

Tag maps in provided models vs. blank models

The tag maps in the provided models may differ from the tag maps in the spaCy library. You can access the tag map in a loaded model under nlp.vocab.morphology.tag_map.

The tag map from spacy.lang.lg.tag_map is still used when a blank model is initialized. If you want to provide an alternate tag map, update nlp.vocab.morphology.tag_map after initializing the model. If you're using the train CLI, you can use the new --tag-map-path option to provide the tag map as a JSON dict.

If you want to export a tag map from a provided model for use with the train CLI, you can save it as a JSON dict. To only use string keys as required by JSON and to make it easier to read and edit, any internal integer IDs need to be converted back to strings:

import spacy
import srsly

nlp = spacy.load("en_core_web_sm")
tag_map = {}

# convert any integer IDs to strings for JSON
for tag, morph in nlp.vocab.morphology.tag_map.items():
    tag_map[tag] = {}
    for feat, val in morph.items():
        feat = nlp.vocab.strings.as_string(feat)
        if not isinstance(val, bool):
            val = nlp.vocab.strings.as_string(val)
        tag_map[tag][feat] = val

srsly.write_json("tag_map.json", tag_map)
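To see why the conversion step matters, here is a standard-library sketch of what happens to integer IDs when a dict is serialized without it. The `ID_TO_STRING` mapping and its IDs are hypothetical stand-ins for the model's string store, and the tag map entry is illustrative rather than taken from a real model.

```python
import json

# Hypothetical stand-in for the string store: made-up IDs, not spaCy's real ones
ID_TO_STRING = {1001: "POS", 1002: "NOUN"}


def as_string(value):
    """Map an integer ID to its string; leave strings unchanged."""
    return ID_TO_STRING[value] if isinstance(value, int) else value


# Illustrative entry mixing integer IDs and plain strings
raw_tag_map = {"NN": {1001: 1002, "Number": "sing"}}

# Without conversion, json turns the integer key into "1001" and keeps the
# integer value, which is hard to read and edit by hand
unconverted = json.dumps(raw_tag_map)

converted_map = {}
for tag, morph in raw_tag_map.items():
    converted_map[tag] = {}
    for feat, val in morph.items():
        # bool is a subclass of int in Python, so booleans are left as-is
        if not isinstance(val, bool):
            val = as_string(val)
        converted_map[tag][as_string(feat)] = val
converted = json.dumps(converted_map)

print(unconverted)  # {"NN": {"1001": 1002, "Number": "sing"}}
print(converted)    # {"NN": {"POS": "NOUN", "Number": "sing"}}
```

The explicit bool check mirrors the one in the export code above: because booleans are integers in Python, they would otherwise be looked up as IDs.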