mirror of https://github.com/explosion/spaCy.git
Extend what's new in v2.3 with vocab / is_oov (#5635)
parent d94e961f14 · commit 7ce451c211
If you're adding data for a new language, the normalization table should be
added to `spacy-lookups-data`. See
[adding norm exceptions](/usage/adding-languages#norm-exceptions).

#### No preloaded lexemes/vocab for models with vectors

To reduce the initial loading time, the lexemes in `nlp.vocab` are no longer
loaded on initialization for models with vectors. As you process texts, the
lexemes will be added to the vocab automatically, just as in models without
vectors.

To see the number of unique vectors and the number of words with vectors, check
`nlp.meta['vectors']`. For example, `en_core_web_md` has `20000` unique vectors
and `684830` words with vectors:
```python
{
    'width': 300,
    'vectors': 20000,
    'keys': 684830,
    'name': 'en_core_web_md.vectors'
}
```
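
The reason `keys` can be much larger than `vectors` is that the vector table is pruned: many similar words are mapped to the same vector row. As an illustrative sketch (plain Python with hypothetical data, not spaCy's actual data structures):

```python
# Stand-in for the key -> vector-row mapping in a pruned vector table
# (hypothetical words and row indices for illustration only).
key_to_row = {
    "apple": 0,
    "apples": 0,   # pruned: similar words share a vector row
    "banana": 1,
    "bananas": 1,
}

n_keys = len(key_to_row)                    # analogous to meta['keys']
n_unique = len(set(key_to_row.values()))    # analogous to meta['vectors']
print(n_keys, n_unique)  # 4 keys, 2 unique rows
```

This mirrors how `en_core_web_md` can cover `684830` words with only `20000` unique vectors.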

If required, for instance if you are working directly with word vectors rather
than processing texts, you can load all lexemes for words with vectors at once:

```python
for orth in nlp.vocab.vectors:
    _ = nlp.vocab[orth]
```
#### Lexeme.is_oov and Token.is_oov

<Infobox title="Important note" variant="warning">

Due to a bug, the values for `is_oov` are reversed in v2.3.0, but this will be
fixed in the next patch release v2.3.1.

</Infobox>

In v2.3, `Lexeme.is_oov` and `Token.is_oov` are `True` if the lexeme does not
have a word vector. This is equivalent to
`token.orth not in nlp.vocab.vectors`.
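
The equivalence can be sketched with a stand-in vector table (hypothetical data; in real code the membership test is against `nlp.vocab.vectors`):

```python
# Stand-in for nlp.vocab.vectors key membership (hypothetical words).
vector_keys = {"apple", "banana"}

def is_oov(orth):
    # v2.3 semantics (as fixed in v2.3.1): OOV means "has no word vector"
    return orth not in vector_keys

print(is_oov("apple"))  # False: has a vector
print(is_oov("xyzzy"))  # True: no vector
```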

Previously in v2.2, `is_oov` corresponded to whether a lexeme had stored
probability and cluster features. The probability and cluster features are no
longer included in the provided medium and large models (see the next section).

#### Probability and cluster features

> #### Load and save extra prob lookups table