mirror of https://github.com/explosion/spaCy.git
Extend what's new in v2.3 with vocab / is_oov (#5635)
This commit is contained in: parent d94e961f14, commit 7ce451c211

@@ -182,6 +182,51 @@
If you're adding data for a new language, the normalization table should be
added to `spacy-lookups-data`. See
[adding norm exceptions](/usage/adding-languages#norm-exceptions).
#### No preloaded lexemes/vocab for models with vectors

To reduce the initial loading time, the lexemes in `nlp.vocab` are no longer
loaded on initialization for models with vectors. As you process texts, the
lexemes will be added to the vocab automatically, just as in models without
vectors.

To see the number of unique vectors and the number of words with vectors,
check `nlp.meta['vectors']`. For example, `en_core_web_md` has `20000` unique
vectors and `684830` words with vectors:

```python
{
    'width': 300,
    'vectors': 20000,
    'keys': 684830,
    'name': 'en_core_web_md.vectors'
}
```
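The `keys` count above being much larger than `vectors` reflects how the medium models ship a pruned vector table, where many words map to the same vector row. A quick plain-Python check on the example numbers (no spaCy required; `meta_vectors` simply copies the dict above):

```python
# Numbers copied from the example `nlp.meta['vectors']` above
meta_vectors = {
    "width": 300,
    "vectors": 20000,
    "keys": 684830,
    "name": "en_core_web_md.vectors",
}

# More keys than rows means many words share a single (pruned) vector row
words_per_vector = meta_vectors["keys"] / meta_vectors["vectors"]
print(f"{words_per_vector:.1f} words per unique vector on average")
```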
If required, for instance if you are working directly with word vectors rather
than processing texts, you can load all lexemes for words with vectors at once:

```python
for orth in nlp.vocab.vectors:
    _ = nlp.vocab[orth]
```
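The effect of that loop can be sketched without loading a model: a minimal, hypothetical `LazyVocab` stand-in (not spaCy's actual `Vocab` class) that creates entries only on first access, the way lexemes are now added on demand:

```python
class LazyVocab:
    """Minimal stand-in for on-demand lexeme creation (not spaCy's Vocab)."""

    def __init__(self):
        self._lexemes = {}

    def __getitem__(self, orth):
        # Entries are created lazily on first access, like `nlp.vocab[orth]`
        if orth not in self._lexemes:
            self._lexemes[orth] = {"orth": orth}
        return self._lexemes[orth]

    def __len__(self):
        return len(self._lexemes)


vocab = LazyVocab()
vector_keys = [1001, 1002, 1003]  # stand-in for the keys in nlp.vocab.vectors

assert len(vocab) == 0        # nothing is preloaded
for orth in vector_keys:
    _ = vocab[orth]           # touching an entry adds it to the vocab
assert len(vocab) == 3        # all vector keys are now loaded
```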
#### Lexeme.is_oov and Token.is_oov
<Infobox title="Important note" variant="warning">

Due to a bug, the values for `is_oov` are reversed in v2.3.0, but this will be
fixed in the next patch release v2.3.1.

</Infobox>
In v2.3, `Lexeme.is_oov` and `Token.is_oov` are `True` if the lexeme does not
have a word vector. This is equivalent to `token.orth not in
nlp.vocab.vectors`.
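That equivalence can be sketched with a plain set standing in for the keys of `nlp.vocab.vectors`; the `is_oov` helper here is hypothetical, not spaCy's implementation:

```python
# Stand-in for the keys stored in `nlp.vocab.vectors`
vector_keys = {101, 202, 303}


def is_oov(orth, vectors):
    # v2.3 semantics: OOV means "no word vector for this lexeme"
    return orth not in vectors


assert is_oov(999, vector_keys) is True    # no vector -> out of vocabulary
assert is_oov(101, vector_keys) is False   # has a vector -> in vocabulary
```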
Previously in v2.2, `is_oov` corresponded to whether a lexeme had stored
probability and cluster features. The probability and cluster features are no
longer included in the provided medium and large models (see the next section).
#### Probability and cluster features

> #### Load and save extra prob lookups table