spaCy

Commit Graph

Author	SHA1	Message	Date
Ines Montani	828acffc12	Tidy up and auto-format	2020-03-25 12:28:12 +01:00
adrianeboyd	2281c4708c	Restore empty tokenizer properties (#5026 ) * Restore empty tokenizer properties * Check for types in tokenizer.from_bytes() * Add test for setting empty tokenizer rules	2020-03-02 11:55:02 +01:00
Sofie Van Landeghem	a1b22e90cd	serialize ENT_ID (#4852 ) * expand serialization test for custom token attribute * add failing test for issue 4849 * define ENT_ID as attr and use in doc serialization * fix few typos	2020-01-06 14:57:34 +01:00
adrianeboyd	676e75838f	Include Doc.cats in serialization of Doc and DocBin (#4774 ) * Include Doc.cats in to_bytes() * Include Doc.cats in DocBin serialization * Add tests for serialization of cats Test serialization of cats for Doc and DocBin.	2019-12-06 14:07:39 +01:00
Ines Montani	181c01f629	Tidy up and auto-format	2019-10-18 11:27:38 +02:00
Sofie Van Landeghem	4e7259c6cf	Bugfix initializing DocBin with attributes (#4368 ) * docbin init fix + documentation fix + unit tests * newline * try with zlib instead of gzip (python 2 incompatibilities)	2019-10-03 14:48:45 +02:00
Ines Montani	0226b3bf0e	Fix test imports	2019-09-29 17:34:56 +02:00
Ines Montani	3d8fd4b461	Revert #4334	2019-09-29 17:32:12 +02:00
Ines Montani	c9cd516d96	Move tests out of package (#4334 ) * Move tests out of package * Fix typo	2019-09-28 18:05:00 +02:00
Ines Montani	669a7d37ce	Exclude vocab when testing to_bytes	2019-09-10 19:45:16 +02:00
Ines Montani	3e8f136ba7	💫 WIP: Basic lookup class scaffolding and JSON for all lemmatizer data (#4178 ) * Improve load_language_data helper * WIP: Add Lookups implementation * Start moving lemma data over to JSON * WIP: move data over for more languages * Convert more languages * Fix lemmatizer fixtures in tests * Finish conversion * Auto-format JSON files * Fix test for now * Make sure tables are stored on instance * Update docstrings * Update docstrings and errors * Update test * Add Lookups.__len__ * Add serialization methods * Add Lookups.remove_table * Use msgpack for serialization to disk * Fix file exists check * Try using OrderedDict for everything * Update .flake8 [ci skip] * Try fixing serialization * Update test_lookups.py * Update test_serialize_vocab_strings.py * Fix serialization for lookups * Fix lookups * Fix lookups * Fix lookups * Try to fix serialization * Try to fix serialization * Try to fix serialization * Try to fix serialization * Give up on serialization test * Xfail more serialization tests for 3.5 * Fix lookups for 2.7	2019-09-09 19:17:55 +02:00
Ines Montani	f580302673	Tidy up and auto-format	2019-08-20 17:36:34 +02:00
Sofie Van Landeghem	0ba1b5eebc	CLI scripts for entity linking (wikipedia & generic) (#4091 ) * document token ent_kb_id * document span kb_id * update pipeline documentation * prior and context weights as bool's instead * entitylinker api documentation * drop for both models * finish entitylinker documentation * small fixes * documentation for KB * candidate documentation * links to api pages in code * small fix * frequency examples as counts for consistency * consistent documentation about tensors returned by predict * add entity linking to usage 101 * add entity linking infobox and KB section to 101 * entity-linking in linguistic features * small typo corrections * training example and docs for entity_linker * predefined nlp and kb * revert back to similarity encodings for simplicity (for now) * set prior probabilities to 0 when excluded * code clean up * bugfix: deleting kb ID from tokens when entities were removed * refactor train el example to use either model or vocab * pretrain_kb example for example kb generation * add to training docs for KB + EL example scripts * small fixes * error numbering * ensure the language of vocab and nlp stay consistent across serialization * equality with = * avoid conflict in errors file * add error 151 * final adjustements to the train scripts - consistency * update of goldparse documentation * small corrections * push commit * turn kb_creator into CLI script (wip) * proper parameters for training entity vectors * wikidata pipeline split up into two executable scripts * remove context_width * move wikidata scripts in bin directory, remove old dummy script * refine KB script with logs and preprocessing options * small edits * small improvements to logging of EL CLI script	2019-08-13 15:38:59 +02:00
svlandeg	dae8a21282	rename entity frequency	2019-07-19 17:40:28 +02:00
svlandeg	ddc73b11a9	fix unicode literals	2019-06-24 12:58:18 +02:00
svlandeg	b76a43bee4	unicode strings	2019-06-19 13:26:33 +02:00
svlandeg	0b0959b363	UTF8 encoding	2019-06-19 13:11:39 +02:00
svlandeg	5c723c32c3	entity vectors in the KB + serialization of them	2019-06-05 18:29:18 +02:00
svlandeg	19e8f339cb	deduce entity freq from WP corpus and serialize vocab in WP test	2019-04-29 17:37:29 +02:00
svlandeg	387263d618	simplify chains	2019-04-29 13:58:07 +02:00
svlandeg	54d0cea062	unit test for KB serialization	2019-04-24 23:52:34 +02:00
Ines Montani	7ba3a5d95c	💫 Make serialization methods consistent (#3385 ) * Make serialization methods consistent exclude keyword argument instead of random named keyword arguments and deprecation handling * Update docs and add section on serialization fields	2019-03-10 19:16:45 +01:00
Matthew Honnibal	27dd820753	Fix vocab deserialization when loading already present lexemes (#3383 ) * Fix vocab deserialization bug. Closes #2153 * Un-xfail test for #2153	2019-03-10 17:21:19 +01:00
Matthew Honnibal	61e5ce02a4	Add xfailing test for #2153	2019-03-10 16:36:29 +01:00
Ines Montani	323fc26880	Tidy up and format remaining files	2018-11-30 17:43:08 +01:00
Ines Montani	b6e991440c	💫 Tidy up and auto-format tests (#2967 ) * Auto-format tests with black * Add flake8 config * Tidy up and remove unused imports * Fix redefinitions of test functions * Replace orths_and_spaces with words and spaces * Fix compatibility with pytest 4.0 * xfail test for now Test was previously overwritten by following test due to naming conflict, so failure wasn't reported * Unfail passing test * Only use fixture via arguments Fixes pytest 4.0 compatibility	2018-11-27 01:09:36 +01:00
Ines Montani	75f3234404	💫 Refactor test suite (#2568 ) ## Description Related issues: #2379 (should be fixed by separating model tests) * total execution time down from > 300 seconds to under 60 seconds 🎉 * removed all model-specific tests that could only really be run manually anyway – those will now live in a separate test suite in the [`spacy-models`](https://github.com/explosion/spacy-models) repository and are already integrated into our new model training infrastructure * changed all relative imports to absolute imports to prepare for moving the test suite from `/spacy/tests` to `/tests` (it'll now always test against the installed version) * merged old regression tests into collections, e.g. `test_issue1001-1500.py` (about 90% of the regression tests are very short anyways) * tidied up and rewrote existing tests wherever possible ### Todo - [ ] move tests to `/tests` and adjust CI commands accordingly - [x] move model test suite from internal repo to `spacy-models` - [x] ~~investigate why `pipeline/test_textcat.py` is flakey~~ - [x] review old regression tests (leftover files) and see if they can be merged, simplified or deleted - [ ] update documentation on how to run tests ### Types of change enhancement, tests ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.	2018-07-24 23:38:44 +02:00
ines	38e07ade4c	Add test for custom tokenizer serialization (resolves #2494 )	2018-07-06 12:40:51 +02:00
ines	c2581f9172	Tidy up tokenizer test	2018-07-06 12:40:28 +02:00
ines	526be40823	Add test for `46d8a66`	2018-06-29 14:33:12 +02:00
Matthew Honnibal	cf5fcf0546	Update serialization test	2018-03-28 20:12:53 +02:00
Claudiu-Vlad Ursache	e28de12cbd	Ensure files opened in `from_disk` are closed Fixes [issue 1706](https://github.com/explosion/spaCy/issues/1706).	2018-02-13 20:49:43 +01:00
ines	1c218397f6	Ensure path in Doc.to_disk/from_disk (resolves ##1521) Also add Doc serialization tests with both Path and string path options	2017-11-09 02:29:03 +01:00
Matthew Honnibal	4194bc5744	Xfail flakey serialization test	2017-11-08 13:55:13 +01:00
Matthew Honnibal	b0f3ea2200	Fix names of pipeline components NeuralDependencyParser --> DependencyParser NeuralEntityRecognizer --> EntityRecognizer TokenVectorEncoder --> Tensorizer NeuralLabeller --> MultitaskObjective	2017-10-26 12:38:23 +02:00
ines	bf415fd778	Add test for serializing extension attrs (see #1085 )	2017-10-19 00:53:08 +02:00
Matthew Honnibal	f0f2739ae3	Add test for serialization issue raised in #1105	2017-10-10 03:57:58 +02:00
Matthew Honnibal	74f08e1ad5	Update test	2017-09-26 06:45:56 -05:00
Matthew Honnibal	42d47c1e5c	Fix tagger serialization	2017-08-19 04:16:32 +02:00
Matthew Honnibal	a7309a217d	Update tagger serialization	2017-08-18 23:12:05 +02:00
ines	a66cf24ee8	xfail tokenizer serialization tests for now Tests pass locally, but not on Travis – needs more investigation	2017-06-04 13:58:20 +02:00
ines	3152ee5ca2	Update serialization tests for tokenizer	2017-06-03 17:05:28 +02:00
ines	de974f7bef	Add serializer tests for tokenizer	2017-06-03 13:26:34 +02:00
ines	d21459f87d	Update serializer tests	2017-06-02 21:42:26 +02:00
ines	d86e7cde93	Add entity recognizer to parser serialization tests	2017-06-02 18:40:06 +02:00
ines	0051c05964	Add tests for serializing parser	2017-06-02 18:37:19 +02:00
ines	cef547a9f0	Add serialization tests for tensorizer	2017-06-02 18:18:30 +02:00
ines	f74a45c1fe	Remove unnecessary argument	2017-06-02 18:17:46 +02:00
ines	43b4d63f85	Add serialization tests for tagger	2017-06-02 17:29:34 +02:00
ines	acd65c00f6	Add serialization tests for StringStore and Vocab	2017-06-02 10:57:42 +02:00

1 2

80 Commits