Extend v2.3 migration guide (#5653)

* Extend preloaded vocab section
* Add section on tag maps
2020-06-26 14:12:29 +02:00 · d777d9cc38
parent a2660bd9c6
commit d777d9cc38
1 changed file with 75 additions and 3 deletions
--- a/website/docs/usage/v2-3.md
+++ b/website/docs/usage/v2-3.md
@@ -182,12 +182,12 @@ If you're adding data for a new language, the normalization table should be
 added to `spacy-lookups-data`. See
 [adding norm exceptions](/usage/adding-languages#norm-exceptions).
-#### No preloaded lexemes/vocab for models with vectors
+#### No preloaded vocab for models with vectors
 To reduce the initial loading time, the lexemes in `nlp.vocab` are no longer
 loaded on initialization for models with vectors. As you process texts, the
-lexemes will be added to the vocab automatically, just as in models without
+lexemes will be added to the vocab automatically, just as in small models
-vectors.
+without vectors.
 To see the number of unique vectors and number of words with vectors, see
 `nlp.meta['vectors']`, for example for `en_core_web_md` there are `20000`
@@ -210,6 +210,20 @@ for orth in nlp.vocab.vectors:
    _ = nlp.vocab[orth]
 ```
 If your workflow previously iterated over `nlp.vocab`, a similar alternative
 is to iterate over words with vectors instead:
 ```diff
 - lexemes = [w for w in nlp.vocab]
 + lexemes = [nlp.vocab[orth] for orth in nlp.vocab.vectors]
 ```
 Be aware that the set of preloaded lexemes in a v2.2 model is not equivalent to
 the set of words with vectors. For English, v2.2 `md/lg` models have 1.3M
 provided lexemes but only 685K words with vectors. The vectors have been
 updated for most languages in v2.2, but the English models contain the same
 vectors for both v2.2 and v2.3.
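If a workflow relied on the larger v2.2 lexeme set, one way to see what the vector-based iteration would miss is to compare an explicit word list against the words with vectors. The sketch below uses plain Python sets as stand-ins; `old_lexemes` and `vector_words` are hypothetical sample data, not values read from a model.

```python
# Hypothetical stand-ins: in practice these would come from a word list saved
# from a v2.2 model and from the words with vectors in a v2.3 model.
old_lexemes = {"apple", "banana", "cherry", "zzyzx"}
vector_words = {"apple", "banana", "cherry"}

# Words that were preloaded before but have no vector
missing = sorted(old_lexemes - vector_words)

# Fraction of the old lexeme set that the vector-based iteration still covers
coverage = len(old_lexemes & vector_words) / len(old_lexemes)

print(missing)
print(coverage)
```

Any words reported as missing would have to be added to the vocab explicitly if the workflow depends on them.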
 #### Lexeme.is_oov and Token.is_oov
 <Infobox title="Important note" variant="warning">
@@ -254,6 +268,28 @@ model vocab, which will take a few seconds on initial loading. When you save
 this model after loading the `prob` table, the full `prob` table will be saved
 as part of the model vocab.
 To load the probability table into a provided model, first make sure you have
 `spacy-lookups-data` installed. To load the table, remove the empty provided
 `lexeme_prob` table and then access `Lexeme.prob` for any word to load the
 table from `spacy-lookups-data`:
 ```diff
 + # prerequisite: pip install spacy-lookups-data
 import spacy
 nlp = spacy.load("en_core_web_md")
 # remove the empty placeholder prob table
 + if nlp.vocab.lookups_extra.has_table("lexeme_prob"):
 +     nlp.vocab.lookups_extra.remove_table("lexeme_prob")
 # access any `.prob` to load the full table into the model
 assert nlp.vocab["a"].prob == -3.9297883511
 # if desired, save this model with the probability table included
 nlp.to_disk("/path/to/model")
 ```
 If you'd like to include custom `cluster`, `prob`, or `sentiment` tables as part
 of a new model, add the data to
 [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) under
@@ -271,3 +307,39 @@ When you initialize a new model with [`spacy init-model`](/api/cli#init-model),
 the `prob` table from `spacy-lookups-data` may be loaded as part of the
 initialization. If you'd like to omit this extra data as in spaCy's provided
 v2.3 models, use the new flag `--omit-extra-lookups`.
 #### Tag maps in provided models vs. blank models
 The tag maps in the provided models may differ from the tag maps in the spaCy
 library. You can access the tag map in a loaded model under
 `nlp.vocab.morphology.tag_map`.
 The tag map from `spacy.lang.lg.tag_map` is still used when a blank model is
 initialized. If you want to provide an alternate tag map, update
 `nlp.vocab.morphology.tag_map` after initializing the model or, if you're using
 the [train CLI](/api/cli#train), use the new `--tag-map-path` option to
 provide the tag map as a JSON dict.
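As a minimal sketch of preparing such a file by hand, the example below writes a small tag map to JSON with the standard library and reads it back. The tags, the `"POS"` key, and the feature names are illustrative assumptions about the layout, not the contents of any provided model, and the temporary path is hypothetical.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical custom tag map: each fine-grained tag maps to a dict of
# features. The key and feature names here are assumptions for illustration.
custom_tag_map = {
    "NN": {"POS": "NOUN", "Number": "sing"},
    "NNS": {"POS": "NOUN", "Number": "plur"},
    "VB": {"POS": "VERB", "VerbForm": "inf"},
}

# Write the tag map as a JSON dict (a temp directory is used for the sketch)
tag_map_path = Path(tempfile.mkdtemp()) / "tag_map.json"
with tag_map_path.open("w", encoding="utf8") as f:
    json.dump(custom_tag_map, f, indent=2)

# Read it back to confirm the file is valid JSON with string keys only
with tag_map_path.open(encoding="utf8") as f:
    loaded = json.load(f)

print(sorted(loaded))
```

The path of the resulting file is what would be passed to `--tag-map-path`.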
 If you want to export a tag map from a provided model for use with the train
 CLI, you can save it as a JSON dict. To only use string keys as required by
 JSON and to make it easier to read and edit, any internal integer IDs need to
 be converted back to strings:
 ```python
 import spacy
 import srsly
 nlp = spacy.load("en_core_web_sm")
 tag_map = {}
 # convert any integer IDs to strings for JSON
 for tag, morph in nlp.vocab.morphology.tag_map.items():
    tag_map[tag] = {}
    for feat, val in morph.items():
        feat = nlp.vocab.strings.as_string(feat)
        if not isinstance(val, bool):
            val = nlp.vocab.strings.as_string(val)
        tag_map[tag][feat] = val
 srsly.write_json("tag_map.json", tag_map)
 ```
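The conversion logic can also be tried without loading a model. The sketch below factors the same loop into a function that takes the string lookup as a parameter; `stringify_tag_map`, `FAKE_STRINGS`, and `fake_as_string` are hypothetical names, and the integer IDs are made up for illustration rather than taken from spaCy's string store.

```python
def stringify_tag_map(tag_map, as_string):
    """Return a copy of tag_map with feature names and non-boolean values
    converted to strings via the as_string callable."""
    result = {}
    for tag, morph in tag_map.items():
        result[tag] = {}
        for feat, val in morph.items():
            feat = as_string(feat)
            # Boolean flags are valid JSON values and are kept as-is
            if not isinstance(val, bool):
                val = as_string(val)
            result[tag][feat] = val
    return result


# Made-up ID table standing in for a model's string store
FAKE_STRINGS = {74: "POS", 92: "NOUN", 1001: "Number_sing"}


def fake_as_string(value):
    # Strings pass through unchanged; integer IDs are looked up in the table
    return value if isinstance(value, str) else FAKE_STRINGS[value]


raw = {"NN": {74: 92, 1001: True}}
converted = stringify_tag_map(raw, fake_as_string)
print(converted)
```

With a loaded model, `nlp.vocab.strings.as_string` would take the place of `fake_as_string`.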