Extend v2.3 migration guide (#5653)

* Extend preloaded vocab section
* Add section on tag maps
2020-06-26 14:12:29 +02:00 · c4d0209472
parent 90c7eb0e2f
commit c4d0209472
1 changed files with 75 additions and 3 deletions
--- a/website/docs/usage/v2-3.md
+++ b/website/docs/usage/v2-3.md
@@ -182,12 +182,12 @@ If you're adding data for a new language, the normalization table should be
 added to `spacy-lookups-data`. See
 [adding norm exceptions](/usage/adding-languages#norm-exceptions).

-#### No preloaded lexemes/vocab for models with vectors
+#### No preloaded vocab for models with vectors

 To reduce the initial loading time, the lexemes in `nlp.vocab` are no longer
 loaded on initialization for models with vectors. As you process texts, the
-lexemes will be added to the vocab automatically, just as in models without
-vectors.
+lexemes will be added to the vocab automatically, just as in small models
+without vectors.

 To see the number of unique vectors and number of words with vectors, see
 `nlp.meta['vectors']`, for example for `en_core_web_md` there are `20000`
@@ -210,6 +210,20 @@ for orth in nlp.vocab.vectors:
    _ = nlp.vocab[orth]
 ```

+If your workflow previously iterated over `nlp.vocab`, a similar alternative
+is to iterate over words with vectors instead:
+
+```diff
+- lexemes = [w for w in nlp.vocab]
++ lexemes = [nlp.vocab[orth] for orth in nlp.vocab.vectors]
+```
+
+Be aware that the set of preloaded lexemes in a v2.2 model is not equivalent to
+the set of words with vectors. For English, v2.2 `md/lg` models have 1.3M
+provided lexemes but only 685K words with vectors. The vectors have been
+updated for most languages in v2.2, but the English models contain the same
+vectors for both v2.2 and v2.3.
+
 #### Lexeme.is_oov and Token.is_oov

 <Infobox title="Important note" variant="warning">
@@ -254,6 +268,28 @@ model vocab, which will take a few seconds on initial loading. When you save
 this model after loading the `prob` table, the full `prob` table will be saved
 as part of the model vocab.

+To load the probability table into a provided model, first make sure you have
+`spacy-lookups-data` installed. Then remove the empty provided `lexeme_prob`
+table and access `Lexeme.prob` for any word, which loads the full table from
+`spacy-lookups-data`:
+
+```diff
++ # prerequisite: pip install spacy-lookups-data
+import spacy
+
+nlp = spacy.load("en_core_web_md")
+
+# remove the empty placeholder prob table
++ if nlp.vocab.lookups_extra.has_table("lexeme_prob"):
++     nlp.vocab.lookups_extra.remove_table("lexeme_prob")
+
+# access any `.prob` to load the full table into the model
+assert nlp.vocab["a"].prob == -3.9297883511
+
+# if desired, save this model with the probability table included
+nlp.to_disk("/path/to/model")
+```
+
 If you'd like to include custom `cluster`, `prob`, or `sentiment` tables as part
 of a new model, add the data to
 [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) under
@@ -271,3 +307,39 @@ When you initialize a new model with [`spacy init-model`](/api/cli#init-model),
 the `prob` table from `spacy-lookups-data` may be loaded as part of the
 initialization. If you'd like to omit this extra data as in spaCy's provided
 v2.3 models, use the new flag `--omit-extra-lookups`.
+
+#### Tag maps in provided models vs. blank models
+
+The tag maps in the provided models may differ from the tag maps in the spaCy
+library. You can access the tag map in a loaded model under
+`nlp.vocab.morphology.tag_map`.
+
+The tag map from `spacy.lang.lg.tag_map` is still used when a blank model is
+initialized. If you want to provide an alternate tag map, update
+`nlp.vocab.morphology.tag_map` after initializing the model. If you're using
+the [train CLI](/api/cli#train), you can use the new `--tag-map-path` option to
+provide the tag map as a JSON dict.
+
+If you want to export a tag map from a provided model for use with the train
+CLI, you can save it as a JSON dict. To only use string keys as required by
+JSON and to make it easier to read and edit, any internal integer IDs need to
+be converted back to strings:
+
+```python
+import spacy
+import srsly
+
+nlp = spacy.load("en_core_web_sm")
+tag_map = {}
+
+# convert any integer IDs to strings for JSON
+for tag, morph in nlp.vocab.morphology.tag_map.items():
+    tag_map[tag] = {}
+    for feat, val in morph.items():
+        feat = nlp.vocab.strings.as_string(feat)
+        if not isinstance(val, bool):
+            val = nlp.vocab.strings.as_string(val)
+        tag_map[tag][feat] = val
+
+srsly.write_json("tag_map.json", tag_map)
+```
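
The conversion step in the export snippet above can be tried without loading a model. The following is a minimal sketch, not spaCy's own code: `ID_TO_STRING`, `as_string` and `to_json_tag_map` are hypothetical stand-ins (the real lookup is `nlp.vocab.strings.as_string`), and the integer IDs and tag map entries are made up for illustration. It only shows why integer keys and values are mapped to strings while booleans are kept, and that the result survives a JSON round trip.

```python
import json

# Hypothetical stand-in for nlp.vocab.strings: maps made-up integer IDs to strings.
ID_TO_STRING = {74: "POS", 92: "NOUN", 100: "VERB"}


def as_string(key):
    """Return key unchanged if it is already a string, else look up its ID."""
    return key if isinstance(key, str) else ID_TO_STRING[key]


def to_json_tag_map(tag_map):
    """Convert integer feature names and values to strings; keep booleans."""
    converted = {}
    for tag, morph in tag_map.items():
        converted[tag] = {}
        for feat, val in morph.items():
            feat = as_string(feat)
            if not isinstance(val, bool):
                val = as_string(val)
            converted[tag][feat] = val
    return converted


# Made-up internal tag map mixing integer IDs and strings, for illustration only.
internal = {"NN": {74: 92, "Number_sing": True}, "VB": {74: 100}}
exported = to_json_tag_map(internal)

# Round-trip through JSON to confirm the exported dict is serializable as-is.
restored = json.loads(json.dumps(exported))
```

With a real model, the same loop would read from `nlp.vocab.morphology.tag_map`, and the written file could then be passed to the train CLI via `--tag-map-path`, as described in the section above.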