From 4f73ced9140e9bd4d1628e123d7c087b16a88b3c Mon Sep 17 00:00:00 2001
From: Adriane Boyd
Date: Tue, 23 Jun 2020 16:48:59 +0200
Subject: [PATCH] Extend what's new in v2.3 with vocab / is_oov (#5635)

---
 website/docs/usage/v2-3.md | 45 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/website/docs/usage/v2-3.md b/website/docs/usage/v2-3.md
index 378b1ec34..c56b44267 100644
--- a/website/docs/usage/v2-3.md
+++ b/website/docs/usage/v2-3.md
@@ -182,6 +182,51 @@ If you're adding data for a new language, the normalization table should be
 added to `spacy-lookups-data`. See
 [adding norm exceptions](/usage/adding-languages#norm-exceptions).
 
+#### No preloaded lexemes/vocab for models with vectors
+
+To reduce the initial loading time, the lexemes in `nlp.vocab` are no longer
+loaded on initialization for models with vectors. As you process texts, the
+lexemes will be added to the vocab automatically, just as in models without
+vectors.
+
+To see the number of unique vectors and the number of words with vectors,
+check `nlp.meta['vectors']`. For example, `en_core_web_md` has `20000` unique
+vectors and `684830` words with vectors:
+
+```python
+{
+    'width': 300,
+    'vectors': 20000,
+    'keys': 684830,
+    'name': 'en_core_web_md.vectors'
+}
+```
+
+If required, for instance if you are working directly with word vectors rather
+than processing texts, you can load all lexemes for words with vectors at once:
+
+```python
+for orth in nlp.vocab.vectors:
+    _ = nlp.vocab[orth]
+```
+
+#### Lexeme.is_oov and Token.is_oov
+
+
+
+Due to a bug, the values for `is_oov` are reversed in v2.3.0, but this will be
+fixed in the next patch release, v2.3.1.
+
+
+
+In v2.3, `Lexeme.is_oov` and `Token.is_oov` are `True` if the lexeme does not
+have a word vector. This is equivalent to `token.orth not in
+nlp.vocab.vectors`.
+
+Previously, in v2.2, `is_oov` corresponded to whether a lexeme had stored
+probability and cluster features. The probability and cluster features are no
+longer included in the provided medium and large models (see the next section).
+
 #### Probability and cluster features
 
 > #### Load and save extra prob lookups table
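
To make the vocab change in the patch concrete, here is a minimal sketch, assuming spaCy v2.3 and the `en_core_web_md` package are installed (the exact figures in `nlp.meta['vectors']` and the vocab sizes depend on the model release). It prints the vector metadata, shows lexemes being added lazily as text is processed, and then preloads the lexemes for all words with vectors using the loop from the added docs:

```python
import spacy

# A model with vectors; in v2.3 its lexemes are no longer preloaded.
nlp = spacy.load("en_core_web_md")

# Vector metadata from the model package, e.g.
# {'width': 300, 'vectors': 20000, 'keys': 684830, 'name': 'en_core_web_md.vectors'}
print(nlp.meta["vectors"])
print("lexemes on load:", len(nlp.vocab))

# Lexemes are added to the vocab on the fly as texts are processed.
doc = nlp("The quick brown fox jumps over the lazy dog.")
print("lexemes after processing one doc:", len(nlp.vocab))

# Optionally preload a lexeme for every word that has a vector.
for orth in nlp.vocab.vectors:
    _ = nlp.vocab[orth]
print("lexemes after preloading:", len(nlp.vocab))
```

The preloading loop is only worth it if you need many lexemes up front, for example when working with the vectors table directly; for ordinary pipeline use the lazy behaviour described in the patch is the intended default.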
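Similarly, a small sketch of the new `is_oov` behaviour, again assuming `en_core_web_md` is installed and using a made-up sample sentence. Note that on v2.3.0 itself the printed `is_oov` values will be reversed because of the bug mentioned in the patch, and only match the described behaviour from v2.3.1 on:

```python
import spacy

nlp = spacy.load("en_core_web_md")
doc = nlp("Autonomous debuggification gizmos are coming")

for token in doc:
    # In v2.3, is_oov is meant to be True exactly when the lexeme has no
    # word vector, i.e. when token.orth is not a key in nlp.vocab.vectors.
    # (On v2.3.0 the values are reversed due to the bug noted above.)
    has_vector_key = token.orth in nlp.vocab.vectors
    print(f"{token.text:20} is_oov={token.is_oov} in_vectors={has_vector_key}")
```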