spaCy

adrianeboyd a5cd203284 Reduce stored lexemes data, move feats to lookups (#5238)

* Reduce stored lexemes data, move feats to lookups
* Move non-derivable lexemes features (`norm / cluster / prob`) to `spacy-lookups-data` as lookups
* Get/set `norm` in both lookups and `LexemeC`, serialize in lookups
* Remove `cluster` and `prob` from `LexemesC`, get/set/serialize in lookups only
* Remove serialization of lexemes data as `vocab/lexemes.bin`
* Remove `SerializedLexemeC`
* Remove `Lexeme.to_bytes/from_bytes`
* Modify normalization exception loading:
  * Always create `Vocab.lookups` table `lexeme_norm` for normalization exceptions
  * Load base exceptions from `lang.norm_exceptions`, but load language-specific exceptions from lookups
  * Set `lex_attr_getter[NORM]` including new lookups table in `BaseDefaults.create_vocab()` and when deserializing `Vocab`
* Remove all cached lexemes when deserializing vocab to override existing normalizations with the new normalizations (as a replacement for the previous step that replaced all lexemes data with the deserialized data)
* Skip English normalization test

  Skip English normalization test because the data is now in `spacy-lookups-data`.
* Remove norm exceptions

  Moved to spacy-lookups-data.
* Move norm exceptions test to spacy-lookups-data
* Load extra lookups from spacy-lookups-data lazily

  Load extra lookups (currently for cluster and prob) lazily from the entry point `lg_extra` as `Vocab.lookups_extra`.
* Skip creating lexeme cache on load

  To improve model loading times, do not create the full lexeme cache when loading. The lexemes will be created on demand when processing.
* Identify numeric values in Lexeme.set_attrs()

  With the removal of a special case for `PROB`, also identify `float` to avoid trying to convert it with the `StringStore`.
* Skip lexeme cache init in from_bytes
* Unskip and update lookups tests for python3.6+
* Update vocab pickle to include lookups_extra
* Update vocab serialization tests

  Check strings rather than lexemes since lexemes aren't initialized automatically, account for addition of "_SP".
* Re-skip lookups test because of python3.5
* Skip PROB/float values in Lexeme.set_attrs
* Convert is_oov from lexeme flag to lex in vectors

  Instead of storing `is_oov` as a lexeme flag, `is_oov` reports whether the lexeme has a vector.

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-05-19 15:59:14 +02:00
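The behaviour described in the commit message (norms served from a `lexeme_norm` table, extra lookups loaded lazily, `is_oov` derived from the vectors) can be illustrated with a small standalone sketch. This is not spaCy code: `LazyLookups`, `ToyVocab`, the loader callable, the lowercase fallback for norms, and the default `prob` value are all hypothetical names and assumptions chosen for illustration.

```python
from typing import Callable, Dict, List, Optional

Tables = Dict[str, Dict[str, float]]


class LazyLookups:
    """Hypothetical container that defers loading its tables until first access."""

    def __init__(self, loader: Callable[[], Tables]) -> None:
        self._loader = loader
        self._tables: Optional[Tables] = None
        self.load_count = 0  # exposed only so the laziness can be observed

    def get_table(self, name: str) -> Dict[str, float]:
        # Load all tables on the first request, then reuse them.
        if self._tables is None:
            self._tables = self._loader()
            self.load_count += 1
        return self._tables.get(name, {})


class ToyVocab:
    """Toy model of a vocab whose lexeme features live in lookup tables."""

    def __init__(
        self,
        norm_exceptions: Dict[str, str],
        extra_loader: Callable[[], Tables],
        vectors: Dict[str, List[float]],
    ) -> None:
        # The norm table always exists, even when it is empty.
        self.lookups = {"lexeme_norm": dict(norm_exceptions)}
        # Extra tables (here only "lexeme_prob") are loaded on demand.
        self.lookups_extra = LazyLookups(extra_loader)
        self.vectors = vectors

    def norm(self, word: str) -> str:
        # Assumption: fall back to the lowercased form when no exception is listed.
        return self.lookups["lexeme_norm"].get(word, word.lower())

    def prob(self, word: str, default: float = -20.0) -> float:
        # Assumption: the default value is arbitrary and chosen for this sketch.
        return self.lookups_extra.get_table("lexeme_prob").get(word, default)

    def is_oov(self, word: str) -> bool:
        # A word is out-of-vocabulary when it has no vector.
        return word not in self.vectors


vocab = ToyVocab(
    norm_exceptions={"cos": "because"},
    extra_loader=lambda: {"lexeme_prob": {"the": -3.5}},
    vectors={"the": [0.1, 0.2]},
)
```

In this sketch, constructing `ToyVocab` does not call the loader; the first `prob` request does, and later requests reuse the loaded tables. That mirrors the stated goal of keeping load time low by doing work on demand.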
cli	Reduce stored lexemes data, move feats to lookups (#5238)	2020-05-19 15:59:14 +02:00
data	…
displacy	Add missing import	2020-04-28 13:48:37 +02:00
lang	Reduce stored lexemes data, move feats to lookups (#5238)	2020-05-19 15:59:14 +02:00
matcher	Normalize TokenC.sent_start values for Matcher (#5346)	2020-04-29 12:57:30 +02:00
ml	Replace function registries with catalogue (#4584)	2019-11-07 11:45:22 +01:00
pipeline	Simplify warnings	2020-04-28 13:37:37 +02:00
syntax	prevent updating cfg if the Model was already defined (#5078)	2020-03-03 13:58:56 +01:00
tests	Reduce stored lexemes data, move feats to lookups (#5238)	2020-05-19 15:59:14 +02:00
tokens	Reduce stored lexemes data, move feats to lookups (#5238)	2020-05-19 15:59:14 +02:00
__init__.pxd	…
__init__.py	Simplify warnings	2020-04-28 13:37:37 +02:00
__main__.py	Use latest wasabi	2019-11-04 02:38:45 +01:00
_ml.py	Skip duplicate lexeme rank setting (#5401)	2020-05-14 18:26:12 +02:00
about.py	Set version to v2.2.4	2020-03-12 11:30:41 +01:00
analysis.py	Simplify warnings	2020-04-28 13:37:37 +02:00
attrs.pxd	Reduce stored lexemes data, move feats to lookups (#5238)	2020-05-19 15:59:14 +02:00
attrs.pyx	Reduce stored lexemes data, move feats to lookups (#5238)	2020-05-19 15:59:14 +02:00
compat.py	Replace function registries with catalogue (#4584)	2019-11-07 11:45:22 +01:00
errors.py	fixup! Fix ErrorsWithCodes().__class__ return value	2020-05-14 15:45:58 +02:00
glossary.py	Update tag maps and docs for English and German (#4501)	2019-10-24 12:56:05 +02:00
gold.pxd	Merge changes from master	2019-08-21 14:18:52 +02:00
gold.pyx	prevent None in gold fields (#5425)	2020-05-13 22:08:50 +02:00
kb.pxd	rename entity frequency	2019-07-19 17:40:28 +02:00
kb.pyx	Simplify warnings	2020-04-28 13:37:37 +02:00
language.py	Reduce stored lexemes data, move feats to lookups (#5238)	2020-05-19 15:59:14 +02:00
lemmatizer.py	Remove duplicated branch in if/else-if statement (#5234)	2020-04-02 14:47:42 +02:00
lexeme.pxd	Reduce stored lexemes data, move feats to lookups (#5238)	2020-05-19 15:59:14 +02:00
lexeme.pyx	Reduce stored lexemes data, move feats to lookups (#5238)	2020-05-19 15:59:14 +02:00
lookups.py	Reduce stored lexemes data, move feats to lookups (#5238)	2020-05-19 15:59:14 +02:00
morphology.pxd	…
morphology.pyx	Improve Morphology errors (#4314)	2019-09-21 14:37:06 +02:00
parts_of_speech.pxd	…
parts_of_speech.pyx	…
scorer.py	Fix GoldParse init when token count differs (#5191)	2020-03-26 10:46:23 +01:00
strings.pxd	…
strings.pyx	…
structs.pxd	Reduce stored lexemes data, move feats to lookups (#5238)	2020-05-19 15:59:14 +02:00
symbols.pxd	Reduce stored lexemes data, move feats to lookups (#5238)	2020-05-19 15:59:14 +02:00
symbols.pyx	Reduce stored lexemes data, move feats to lookups (#5238)	2020-05-19 15:59:14 +02:00
tokenizer.pxd	Flush tokenizer cache when necessary (#4258)	2019-09-08 20:52:46 +02:00
tokenizer.pyx	Simplify warnings	2020-04-28 13:37:37 +02:00
typedefs.pxd	…
typedefs.pyx	…
util.py	Fix passing of component configuration (#5374)	2020-04-29 12:56:17 +02:00
vectors.pyx	Check that row is within bounds when adding vector (#5430)	2020-05-13 22:08:28 +02:00
vocab.pxd	Reduce stored lexemes data, move feats to lookups (#5238)	2020-05-19 15:59:14 +02:00
vocab.pyx	Reduce stored lexemes data, move feats to lookups (#5238)	2020-05-19 15:59:14 +02:00