spaCy

Commit Graph

Author	SHA1	Message	Date
Ines Montani	f91e9e8c84	Remove F841 [ci skip]	2020-06-20 14:47:17 +02:00
Ines Montani	8283df80e9	Tidy up and auto-format	2020-06-20 14:15:04 +02:00
Sofie Van Landeghem	c0f4a1e43b	train is from-config by default (#5575) * verbose and tag_map options * adding init_tok2vec option and only changing the tok2vec that is specified * adding omit_extra_lookups and verifying textcat config * wip * pretrain bugfix * add replace and resume options * train_textcat fix * raw text functionality * improve UX when KeyError or when input data can't be parsed * avoid unnecessary access to goldparse in TextCat pipe * save performance information in nlp.meta * add noise_level to config * move nn_parser's defaults to config file * multitask in config - doesn't work yet * scorer offering both F and AUC options, need to be specified in config * add textcat verification code from old train script * small fixes to config files * clean up * set default config for ner/parser to allow create_pipe to work as before * two more test fixes * small fixes * cleanup * fix NER pickling + additional unit test * create_pipe as before	2020-06-12 02:02:07 +02:00
Matthew Honnibal	8411d4f4e6	Merge pull request #5543 from svlandeg/feature/pretrain-config pretrain from config	2020-06-04 19:07:12 +02:00
svlandeg	eac12cbb77	make dropout in embed layers configurable	2020-06-03 11:50:16 +02:00
svlandeg	e0f9f448f1	remove Tensorizer	2020-06-01 23:38:48 +02:00
Adriane Boyd	17ee9ab53a	Fix _SP/POS=SPACE in strings serialization tests	2020-05-21 19:49:08 +02:00
Ines Montani	24f72c669c	Merge branch 'develop' into master-tmp	2020-05-21 18:39:06 +02:00
Ines Montani	d8f3190c0a	Tidy up and auto-format	2020-05-21 14:14:01 +02:00
Sofie Van Landeghem	f00de445dd	default models defined in component decorator (#5452) * move defaults to pipeline and use in component decorator * black formatting * relative import	2020-05-19 16:20:03 +02:00
adrianeboyd	a5cd203284	Reduce stored lexemes data, move feats to lookups (#5238) * Reduce stored lexemes data, move feats to lookups * Move non-derivable lexemes features (`norm / cluster / prob`) to `spacy-lookups-data` as lookups * Get/set `norm` in both lookups and `LexemeC`, serialize in lookups * Remove `cluster` and `prob` from `LexemesC`, get/set/serialize in lookups only * Remove serialization of lexemes data as `vocab/lexemes.bin` * Remove `SerializedLexemeC` * Remove `Lexeme.to_bytes/from_bytes` * Modify normalization exception loading: * Always create `Vocab.lookups` table `lexeme_norm` for normalization exceptions * Load base exceptions from `lang.norm_exceptions`, but load language-specific exceptions from lookups * Set `lex_attr_getter[NORM]` including new lookups table in `BaseDefaults.create_vocab()` and when deserializing `Vocab` * Remove all cached lexemes when deserializing vocab to override existing normalizations with the new normalizations (as a replacement for the previous step that replaced all lexemes data with the deserialized data) * Skip English normalization test Skip English normalization test because the data is now in `spacy-lookups-data`. * Remove norm exceptions Moved to spacy-lookups-data. * Move norm exceptions test to spacy-lookups-data * Load extra lookups from spacy-lookups-data lazily Load extra lookups (currently for cluster and prob) lazily from the entry point `lg_extra` as `Vocab.lookups_extra`. * Skip creating lexeme cache on load To improve model loading times, do not create the full lexeme cache when loading. The lexemes will be created on demand when processing. * Identify numeric values in Lexeme.set_attrs() With the removal of a special case for `PROB`, also identify `float` to avoid trying to convert it with the `StringStore`. * Skip lexeme cache init in from_bytes * Unskip and update lookups tests for python3.6+ * Update vocab pickle to include lookups_extra * Update vocab serialization tests Check strings rather than lexemes since lexemes aren't initialized automatically, account for addition of "_SP". * Re-skip lookups test because of python3.5 * Skip PROB/float values in Lexeme.set_attrs * Convert is_oov from lexeme flag to lex in vectors Instead of storing `is_oov` as a lexeme flag, `is_oov` reports whether the lexeme has a vector. Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-05-19 15:59:14 +02:00
Matthew Honnibal	333b1a308b	Adapt parser and NER for transformers (#5449) * Draft layer for BILUO actions * Fixes to biluo layer * WIP on BILUO layer * Add tests for BILUO layer * Format * Fix transitions * Update test * Link in the simple_ner * Update BILUO tagger * Update __init__ * Import simple_ner * Update test * Import * Add files * Add config * Fix label passing for BILUO and tagger * Fix label handling for simple_ner component * Update simple NER test * Update config * Hack train script * Update BILUO layer * Fix SimpleNER component * Update train_from_config * Add biluo_to_iob helper * Add IOB layer * Add IOBTagger model * Update biluo layer * Update SimpleNER tagger * Update BILUO * Read random seed in train-from-config * Update use of normal_init * Fix normalization of gradient in SimpleNER * Update IOBTagger * Remove print * Tweak masking in BILUO * Add dropout in SimpleNER * Update thinc * Tidy up simple_ner * Fix biluo model * Unhack train-from-config * Update setup.cfg and requirements * Add tb_framework.py for parser model * Try to avoid memory leak in BILUO * Move ParserModel into spacy.ml, avoid need for subclass. * Use updated parser model * Remove incorrect call to model.initializre in PrecomputableAffine * Update parser model * Avoid divide by zero in tagger * Add extra dropout layer in tagger * Refine minibatch_by_words function to avoid oom * Fix parser model after refactor * Try to avoid div-by-zero in SimpleNER * Fix infinite loop in minibatch_by_words * Use SequenceCategoricalCrossentropy in Tagger * Fix parser model when hidden layer * Remove extra dropout from tagger * Add extra nan check in tagger * Fix thinc version * Update tests and imports * Fix test * Update test * Update tests * Fix tests * Fix test Co-authored-by: Ines Montani <ines@ines.io>	2020-05-18 22:23:33 +02:00
Ines Montani	46568f40a7	Merge branch 'master' into tmp/sync	2020-03-26 13:38:14 +01:00
Ines Montani	828acffc12	Tidy up and auto-format	2020-03-25 12:28:12 +01:00
adrianeboyd	c95ce96c44	Update sentence recognizer (#5109) * Update sentence recognizer * rename `sentrec` to `senter` * use `spacy.HashEmbedCNN.v1` by default * update to follow `Tagger` modifications * remove component methods that can be inherited from `Tagger` * add simple initialization and overfitting pipeline tests * Update serialization test for senter	2020-03-06 14:45:02 +01:00
adrianeboyd	2281c4708c	Restore empty tokenizer properties (#5026) * Restore empty tokenizer properties * Check for types in tokenizer.from_bytes() * Add test for setting empty tokenizer rules	2020-03-02 11:55:02 +01:00
Ines Montani	5da3ad682a	Tidy up and auto-format	2020-02-28 11:57:41 +01:00
Sofie Van Landeghem	06f0a8daa0	Default settings to configurations (#4995) * fix grad_clip naming * cleaning up pretrained_vectors out of cfg * further refactoring Model init's * move Model building out of pipes * further refactor to require a model config when creating a pipe * small fixes * making cfg in nn_parser more consistent * fixing nr_class for parser * fixing nn_parser's nO * fix printing of loss * architectures in own file per type, consistent naming * convenience methods default_tagger_config and default_tok2vec_config * let create_pipe access default config if available for that component * default_parser_config * move defaults to separate folder * allow reading nlp from package or dir with argument 'name' * architecture spacy.VocabVectors.v1 to read static vectors from file * cleanup * default configs for nel, textcat, morphologizer, tensorizer * fix imports * fixing unit tests * fixes and clean up * fixing defaults, nO, fix unit tests * restore parser IO * fix IO * 'fix' serialization test * add .cfg to manifest fix example configs with additional arguments * replace Morpohologizer with Tagger * add IO bit when testing overfitting of tagger (currently failing) * fix IO - don't initialize when reading from disk * expand overfitting tests to also check IO goes OK * remove dropout from HashEmbed to fix Tagger performance * add defaults for sentrec * update thinc * always pass a Model instance to a Pipe * fix piped_added statement * remove obsolete W029 * remove obsolete errors * restore byte checking tests (work again) * clean up test * further test cleanup * convert from config to Model in create_pipe * bring back error when component is not initialized * cleanup * remove calls for nlp2.begin_training * use thinc.api in imports * allow setting charembed's nM and nC * fix for hardcoded nM/nC + unit test * formatting fixes * trigger build	2020-02-27 18:42:27 +01:00
Ines Montani	de11ea753a	Merge branch 'master' into develop	2020-02-18 14:47:23 +01:00
Sofie Van Landeghem	a1b22e90cd	serialize ENT_ID (#4852) * expand serialization test for custom token attribute * add failing test for issue 4849 * define ENT_ID as attr and use in doc serialization * fix few typos	2020-01-06 14:57:34 +01:00
Ines Montani	db55577c45	Drop Python 2.7 and 3.5 (#4828) * Remove unicode declarations * Remove Python 3.5 and 2.7 from CI * Don't require pathlib * Replace compat helpers * Remove OrderedDict * Use f-strings * Set Cython compiler language level * Fix typo * Re-add OrderedDict for Table * Update setup.cfg * Revert CONTRIBUTING.md * Revert lookups.md * Revert top-level.md * Small adjustments and docs [ci skip]	2019-12-22 01:53:56 +01:00
Ines Montani	158b98a3ef	Merge branch 'master' into develop	2019-12-21 18:55:03 +01:00
adrianeboyd	676e75838f	Include Doc.cats in serialization of Doc and DocBin (#4774) * Include Doc.cats in to_bytes() * Include Doc.cats in DocBin serialization * Add tests for serialization of cats Test serialization of cats for Doc and DocBin.	2019-12-06 14:07:39 +01:00
adrianeboyd	b841d3fe75	Add a tagger-based SentenceRecognizer (#4713) * Add sent_starts to GoldParse * Add SentTagger pipeline component Add `SentTagger` pipeline component as a subclass of `Tagger`. * Model reduces default parameters from `Tagger` to be small and fast * Hard-coded set of two labels: * S (1): token at beginning of sentence * I (0): all other sentence positions * Sets `token.sent_start` values * Add sentence segmentation to Scorer Report `sent_p/r/f` for sentence boundaries, which may be provided by various pipeline components. * Add sentence segmentation to CLI evaluate * Add senttagger metrics/scoring to train CLI * Rename SentTagger to SentenceRecognizer * Add SentenceRecognizer to spacy.pipes imports * Add SentenceRecognizer serialization test * Shorten component name to sentrec * Remove duplicates from train CLI output metrics	2019-11-28 11:10:07 +01:00
Ines Montani	181c01f629	Tidy up and auto-format	2019-10-18 11:27:38 +02:00
Sofie Van Landeghem	4e7259c6cf	Bugfix initializing DocBin with attributes (#4368) * docbin init fix + documentation fix + unit tests * newline * try with zlib instead of gzip (python 2 incompatibilities)	2019-10-03 14:48:45 +02:00
Ines Montani	0226b3bf0e	Fix test imports	2019-09-29 17:34:56 +02:00
Ines Montani	3d8fd4b461	Revert #4334	2019-09-29 17:32:12 +02:00
Ines Montani	c9cd516d96	Move tests out of package (#4334) * Move tests out of package * Fix typo	2019-09-28 18:05:00 +02:00
Ines Montani	669a7d37ce	Exclude vocab when testing to_bytes	2019-09-10 19:45:16 +02:00
Ines Montani	3e8f136ba7	💫 WIP: Basic lookup class scaffolding and JSON for all lemmatizer data (#4178) * Improve load_language_data helper * WIP: Add Lookups implementation * Start moving lemma data over to JSON * WIP: move data over for more languages * Convert more languages * Fix lemmatizer fixtures in tests * Finish conversion * Auto-format JSON files * Fix test for now * Make sure tables are stored on instance * Update docstrings * Update docstrings and errors * Update test * Add Lookups.__len__ * Add serialization methods * Add Lookups.remove_table * Use msgpack for serialization to disk * Fix file exists check * Try using OrderedDict for everything * Update .flake8 [ci skip] * Try fixing serialization * Update test_lookups.py * Update test_serialize_vocab_strings.py * Fix serialization for lookups * Fix lookups * Fix lookups * Fix lookups * Try to fix serialization * Try to fix serialization * Try to fix serialization * Try to fix serialization * Give up on serialization test * Xfail more serialization tests for 3.5 * Fix lookups for 2.7	2019-09-09 19:17:55 +02:00
Ines Montani	f580302673	Tidy up and auto-format	2019-08-20 17:36:34 +02:00
Sofie Van Landeghem	0ba1b5eebc	CLI scripts for entity linking (wikipedia & generic) (#4091) * document token ent_kb_id * document span kb_id * update pipeline documentation * prior and context weights as bool's instead * entitylinker api documentation * drop for both models * finish entitylinker documentation * small fixes * documentation for KB * candidate documentation * links to api pages in code * small fix * frequency examples as counts for consistency * consistent documentation about tensors returned by predict * add entity linking to usage 101 * add entity linking infobox and KB section to 101 * entity-linking in linguistic features * small typo corrections * training example and docs for entity_linker * predefined nlp and kb * revert back to similarity encodings for simplicity (for now) * set prior probabilities to 0 when excluded * code clean up * bugfix: deleting kb ID from tokens when entities were removed * refactor train el example to use either model or vocab * pretrain_kb example for example kb generation * add to training docs for KB + EL example scripts * small fixes * error numbering * ensure the language of vocab and nlp stay consistent across serialization * equality with = * avoid conflict in errors file * add error 151 * final adjustements to the train scripts - consistency * update of goldparse documentation * small corrections * push commit * turn kb_creator into CLI script (wip) * proper parameters for training entity vectors * wikidata pipeline split up into two executable scripts * remove context_width * move wikidata scripts in bin directory, remove old dummy script * refine KB script with logs and preprocessing options * small edits * small improvements to logging of EL CLI script	2019-08-13 15:38:59 +02:00
svlandeg	dae8a21282	rename entity frequency	2019-07-19 17:40:28 +02:00
svlandeg	ddc73b11a9	fix unicode literals	2019-06-24 12:58:18 +02:00
svlandeg	b76a43bee4	unicode strings	2019-06-19 13:26:33 +02:00
svlandeg	0b0959b363	UTF8 encoding	2019-06-19 13:11:39 +02:00
svlandeg	5c723c32c3	entity vectors in the KB + serialization of them	2019-06-05 18:29:18 +02:00
svlandeg	19e8f339cb	deduce entity freq from WP corpus and serialize vocab in WP test	2019-04-29 17:37:29 +02:00
svlandeg	387263d618	simplify chains	2019-04-29 13:58:07 +02:00
svlandeg	54d0cea062	unit test for KB serialization	2019-04-24 23:52:34 +02:00
Ines Montani	7ba3a5d95c	💫 Make serialization methods consistent (#3385) * Make serialization methods consistent exclude keyword argument instead of random named keyword arguments and deprecation handling * Update docs and add section on serialization fields	2019-03-10 19:16:45 +01:00
Matthew Honnibal	27dd820753	Fix vocab deserialization when loading already present lexemes (#3383) * Fix vocab deserialization bug. Closes #2153 * Un-xfail test for #2153	2019-03-10 17:21:19 +01:00
Matthew Honnibal	61e5ce02a4	Add xfailing test for #2153	2019-03-10 16:36:29 +01:00
Ines Montani	323fc26880	Tidy up and format remaining files	2018-11-30 17:43:08 +01:00
Ines Montani	b6e991440c	💫 Tidy up and auto-format tests (#2967) * Auto-format tests with black * Add flake8 config * Tidy up and remove unused imports * Fix redefinitions of test functions * Replace orths_and_spaces with words and spaces * Fix compatibility with pytest 4.0 * xfail test for now Test was previously overwritten by following test due to naming conflict, so failure wasn't reported * Unfail passing test * Only use fixture via arguments Fixes pytest 4.0 compatibility	2018-11-27 01:09:36 +01:00
Ines Montani	75f3234404	💫 Refactor test suite (#2568) ## Description Related issues: #2379 (should be fixed by separating model tests) * total execution time down from > 300 seconds to under 60 seconds 🎉 * removed all model-specific tests that could only really be run manually anyway – those will now live in a separate test suite in the [`spacy-models`](https://github.com/explosion/spacy-models) repository and are already integrated into our new model training infrastructure * changed all relative imports to absolute imports to prepare for moving the test suite from `/spacy/tests` to `/tests` (it'll now always test against the installed version) * merged old regression tests into collections, e.g. `test_issue1001-1500.py` (about 90% of the regression tests are very short anyways) * tidied up and rewrote existing tests wherever possible ### Todo - [ ] move tests to `/tests` and adjust CI commands accordingly - [x] move model test suite from internal repo to `spacy-models` - [x] ~~investigate why `pipeline/test_textcat.py` is flakey~~ - [x] review old regression tests (leftover files) and see if they can be merged, simplified or deleted - [ ] update documentation on how to run tests ### Types of change enhancement, tests ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.	2018-07-24 23:38:44 +02:00
ines	38e07ade4c	Add test for custom tokenizer serialization (resolves #2494)	2018-07-06 12:40:51 +02:00
ines	c2581f9172	Tidy up tokenizer test	2018-07-06 12:40:28 +02:00
ines	526be40823	Add test for `46d8a66`	2018-06-29 14:33:12 +02:00

100 Commits