mirror of https://github.com/explosion/spaCy.git
Extend v2.3 migration guide (#5653)
* Extend preloaded vocab section
* Add section on tag maps
This commit is contained in:
parent a2660bd9c6
commit d777d9cc38
@@ -182,12 +182,12 @@ If you're adding data for a new language, the normalization table should be
 added to `spacy-lookups-data`. See
 [adding norm exceptions](/usage/adding-languages#norm-exceptions).
 
-#### No preloaded lexemes/vocab for models with vectors
+#### No preloaded vocab for models with vectors
 
 To reduce the initial loading time, the lexemes in `nlp.vocab` are no longer
 loaded on initialization for models with vectors. As you process texts, the
-lexemes will be added to the vocab automatically, just as in models without
-vectors.
+lexemes will be added to the vocab automatically, just as in small models
+without vectors.
 
 To see the number of unique vectors and number of words with vectors, see
 `nlp.meta['vectors']`, for example for `en_core_web_md` there are `20000`
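To see the lazy loading described above in practice, here is a minimal sketch (assuming a v2.3 `en_core_web_md` is installed; the printed counts are illustrative): it shows the `nlp.meta['vectors']` entry and the vocab growing as texts are processed.

```python
import spacy

nlp = spacy.load("en_core_web_md")

# number of unique vectors and number of words with vectors
print(nlp.meta["vectors"])

# the vocab is no longer preloaded, so it starts small and fills up lazily
print(len(nlp.vocab))
doc = nlp("The quick brown fox jumps over the lazy dog.")
print(len(nlp.vocab))
```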
@@ -210,6 +210,20 @@ for orth in nlp.vocab.vectors:
     _ = nlp.vocab[orth]
 ```
 
+If your workflow previously iterated over `nlp.vocab`, a similar alternative
+is to iterate over words with vectors instead:
+
+```diff
+- lexemes = [w for w in nlp.vocab]
++ lexemes = [nlp.vocab[orth] for orth in nlp.vocab.vectors]
+```
+
+Be aware that the set of preloaded lexemes in a v2.2 model is not equivalent to
+the set of words with vectors. For English, v2.2 `md/lg` models have 1.3M
+provided lexemes but only 685K words with vectors. The vectors have been
+updated for most languages in v2.2, but the English models contain the same
+vectors for both v2.2 and v2.3.
+
 #### Lexeme.is_oov and Token.is_oov
 
 <Infobox title="Important note" variant="warning">
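The one-line change added above can be checked end to end with a short script (a sketch, again assuming `en_core_web_md`; the counts depend on the model version):

```python
import spacy

nlp = spacy.load("en_core_web_md")

# old pattern: only covers whatever happens to be in the (no longer preloaded) vocab
lexemes_in_vocab = [w for w in nlp.vocab]

# new pattern: covers every word that actually has a vector
lexemes_with_vectors = [nlp.vocab[orth] for orth in nlp.vocab.vectors]

print(len(lexemes_in_vocab), len(lexemes_with_vectors))
```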
@@ -254,6 +268,28 @@ model vocab, which will take a few seconds on initial loading. When you save
 this model after loading the `prob` table, the full `prob` table will be saved
 as part of the model vocab.
 
+To load the probability table into a provided model, first make sure you have
+`spacy-lookups-data` installed. To load the table, remove the empty provided
+`lexeme_prob` table and then access `Lexeme.prob` for any word to load the
+table from `spacy-lookups-data`:
+
+```diff
++ # prerequisite: pip install spacy-lookups-data
+import spacy
+
+nlp = spacy.load("en_core_web_md")
+
+# remove the empty placeholder prob table
++ if nlp.vocab.lookups_extra.has_table("lexeme_prob"):
++     nlp.vocab.lookups_extra.remove_table("lexeme_prob")
+
+# access any `.prob` to load the full table into the model
+assert nlp.vocab["a"].prob == -3.9297883511
+
+# if desired, save this model with the probability table included
+nlp.to_disk("/path/to/model")
+```
+
 If you'd like to include custom `cluster`, `prob`, or `sentiment` tables as part
 of a new model, add the data to
 [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) under
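As a quick sanity check that the table really is persisted (a sketch reusing the placeholder path from the snippet above):

```python
import spacy

# reload the model saved above; the full prob table is now part of its vocab
nlp = spacy.load("/path/to/model")
assert nlp.vocab["a"].prob == -3.9297883511
```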
@@ -271,3 +307,39 @@ When you initialize a new model with [`spacy init-model`](/api/cli#init-model),
 the `prob` table from `spacy-lookups-data` may be loaded as part of the
 initialization. If you'd like to omit this extra data as in spaCy's provided
 v2.3 models, use the new flag `--omit-extra-lookups`.
+
+#### Tag maps in provided models vs. blank models
+
+The tag maps in the provided models may differ from the tag maps in the spaCy
+library. You can access the tag map in a loaded model under
+`nlp.vocab.morphology.tag_map`.
+
+The tag map from `spacy.lang.lg.tag_map` is still used when a blank model is
+initialized. If you want to provide an alternate tag map, update
+`nlp.vocab.morphology.tag_map` after initializing the model or, if you're using
+the [train CLI](/api/cli#train), you can use the new `--tag-map-path` option to
+provide the tag map as a JSON dict.
+
+If you want to export a tag map from a provided model for use with the train
+CLI, you can save it as a JSON dict. To only use string keys as required by
+JSON and to make it easier to read and edit, any internal integer IDs need to
+be converted back to strings:
+
+```python
+import spacy
+import srsly
+
+nlp = spacy.load("en_core_web_sm")
+tag_map = {}
+
+# convert any integer IDs to strings for JSON
+for tag, morph in nlp.vocab.morphology.tag_map.items():
+    tag_map[tag] = {}
+    for feat, val in morph.items():
+        feat = nlp.vocab.strings.as_string(feat)
+        if not isinstance(val, bool):
+            val = nlp.vocab.strings.as_string(val)
+        tag_map[tag][feat] = val
+
+srsly.write_json("tag_map.json", tag_map)
+```
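Before passing the exported file to the train CLI via `--tag-map-path`, it can be sanity-checked with `srsly` (a sketch; the `NN` tag is just an example of an entry an English tag map typically contains):

```python
import srsly

tag_map = srsly.read_json("tag_map.json")
print(len(tag_map), "tags")

# every key and feature name is now a plain string, as required by JSON
print(tag_map.get("NN"))
```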