From c4d02094726a7e92325f9fc0911fcfad7f43db75 Mon Sep 17 00:00:00 2001
From: Adriane Boyd
Date: Fri, 26 Jun 2020 14:12:29 +0200
Subject: [PATCH] Extend v2.3 migration guide (#5653)

* Extend preloaded vocab section

* Add section on tag maps
---
 website/docs/usage/v2-3.md | 78 ++++++++++++++++++++++++++++++++++++--
 1 file changed, 75 insertions(+), 3 deletions(-)

diff --git a/website/docs/usage/v2-3.md b/website/docs/usage/v2-3.md
index e6b88c779..b6c4d7dfb 100644
--- a/website/docs/usage/v2-3.md
+++ b/website/docs/usage/v2-3.md
@@ -182,12 +182,12 @@ If you're adding data for a new language, the normalization table should be
 added to `spacy-lookups-data`. See
 [adding norm exceptions](/usage/adding-languages#norm-exceptions).
 
-#### No preloaded lexemes/vocab for models with vectors
+#### No preloaded vocab for models with vectors
 
 To reduce the initial loading time, the lexemes in `nlp.vocab` are no longer
 loaded on initialization for models with vectors. As you process texts, the
-lexemes will be added to the vocab automatically, just as in models without
-vectors.
+lexemes will be added to the vocab automatically, just as in small models
+without vectors.
 
 To see the number of unique vectors and number of words with vectors, see
 `nlp.meta['vectors']`, for example for `en_core_web_md` there are `20000`
@@ -210,6 +210,20 @@ for orth in nlp.vocab.vectors:
     _ = nlp.vocab[orth]
 ```
 
+If your workflow previously iterated over `nlp.vocab`, a similar alternative
+is to iterate over words with vectors instead:
+
+```diff
+- lexemes = [w for w in nlp.vocab]
++ lexemes = [nlp.vocab[orth] for orth in nlp.vocab.vectors]
+```
+
+Be aware that the set of preloaded lexemes in a v2.2 model is not equivalent to
+the set of words with vectors. For English, v2.2 `md/lg` models have 1.3M
+provided lexemes but only 685K words with vectors. The vectors have been
+updated for most languages in v2.2, but the English models contain the same
+vectors for both v2.2 and v2.3.
+
 #### Lexeme.is_oov and Token.is_oov
 
@@ -254,6 +268,28 @@ model vocab, which will take a few seconds on initial loading. When you save
 this model after loading the `prob` table, the full `prob` table will be saved
 as part of the model vocab.
 
+To load the probability table into a provided model, first make sure you have
+`spacy-lookups-data` installed. To load the table, remove the empty provided
+`lexeme_prob` table and then access `Lexeme.prob` for any word to load the
+table from `spacy-lookups-data`:
+
+```diff
++ # prerequisite: pip install spacy-lookups-data
+import spacy
+
+nlp = spacy.load("en_core_web_md")
+
+# remove the empty placeholder prob table
++ if nlp.vocab.lookups_extra.has_table("lexeme_prob"):
++     nlp.vocab.lookups_extra.remove_table("lexeme_prob")
+
+# access any `.prob` to load the full table into the model
+assert nlp.vocab["a"].prob == -3.9297883511
+
+# if desired, save this model with the probability table included
+nlp.to_disk("/path/to/model")
+```
+
 If you'd like to include custom `cluster`, `prob`, or `sentiment` tables as part
 of a new model, add the data to
 [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) under
@@ -271,3 +307,39 @@ When you initialize a new model with [`spacy init-model`](/api/cli#init-model),
 the `prob` table from `spacy-lookups-data` may be loaded as part of the
 initialization. If you'd like to omit this extra data as in spaCy's provided
 v2.3 models, use the new flag `--omit-extra-lookups`.
+
+#### Tag maps in provided models vs. blank models
+
+The tag maps in the provided models may differ from the tag maps in the spaCy
+library. You can access the tag map in a loaded model under
+`nlp.vocab.morphology.tag_map`.
+
+The tag map from `spacy.lang.lg.tag_map` is still used when a blank model is
+initialized. If you want to provide an alternate tag map, update
+`nlp.vocab.morphology.tag_map` after initializing the model or, if you're using
+the [train CLI](/api/cli#train), use the new `--tag-map-path` option to
+provide the tag map as a JSON dict.
+
+If you want to export a tag map from a provided model for use with the train
+CLI, you can save it as a JSON dict. To only use string keys as required by
+JSON and to make it easier to read and edit, any internal integer IDs need to
+be converted back to strings:
+
+```python
+import spacy
+import srsly
+
+nlp = spacy.load("en_core_web_sm")
+tag_map = {}
+
+# convert any integer IDs to strings for JSON
+for tag, morph in nlp.vocab.morphology.tag_map.items():
+    tag_map[tag] = {}
+    for feat, val in morph.items():
+        feat = nlp.vocab.strings.as_string(feat)
+        if not isinstance(val, bool):
+            val = nlp.vocab.strings.as_string(val)
+        tag_map[tag][feat] = val
+
+srsly.write_json("tag_map.json", tag_map)
+```
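
Note on the tag map export in the last hunk: the conversion loop can be tried without loading a model. The sketch below is only an illustration under stated assumptions. `stringify_tag_map`, `toy_as_string`, and the integer IDs in `TOY_IDS` are hypothetical stand-ins (they are not part of spaCy), with `toy_as_string` playing the role of `nlp.vocab.strings.as_string`. The `bool` check comes first because `bool` is a subclass of `int` in Python, so boolean feature values would otherwise be treated as IDs.

```python
import json

# Hypothetical stand-in for the model's string store: arbitrary integer IDs
# mapped to their string forms. A real model would use nlp.vocab.strings.
TOY_IDS = {74: "POS", 92: "NOUN", 100: "VERB"}


def toy_as_string(key):
    """Return strings unchanged and look up integer IDs in TOY_IDS."""
    return key if isinstance(key, str) else TOY_IDS[key]


def stringify_tag_map(tag_map, as_string):
    """Copy tag_map, converting feature names and non-bool values to strings."""
    result = {}
    for tag, morph in tag_map.items():
        result[tag] = {}
        for feat, val in morph.items():
            feat = as_string(feat)
            # bool is a subclass of int, so keep booleans as they are
            if not isinstance(val, bool):
                val = as_string(val)
            result[tag][feat] = val
    return result


# Made-up tag map that mixes integer IDs, strings, and a boolean value
raw_tag_map = {
    "NN": {74: 92, "Number": "sing"},
    "VB": {74: 100, "VerbForm": "inf"},
    "PRP$": {"Poss": True},
}

exported = stringify_tag_map(raw_tag_map, toy_as_string)

# With string keys throughout, a JSON round trip returns the same dict
round_tripped = json.loads(json.dumps(exported))
```

After the conversion, every feature name is a readable string, so the exported file can be edited by hand before it is passed to the train CLI.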