spaCy

Commit Graph

Author	SHA1	Message	Date
adrianeboyd	90c52128dc	Improve train CLI with base model (#4911 ) Improve train CLI with a provided base model so that you can: * add a new component * extend an existing component * replace an existing component When the final model and best model are saved, reenable any disabled components and merge the meta information to include the full pipeline and accuracy information for all components in the base model plus the newly added components if needed.	2020-01-16 01:58:51 +01:00
svlandeg	ee828d5a9a	bugfix typo conv_window	2020-01-14 09:02:58 +01:00
adrianeboyd	d24bca62f6	Add CJK to character classes (#4884 ) * Add CJK character class as uncased * Incorporate Chinese URL test case Un-xfail Chinese URL test instance	2020-01-08 16:50:19 +01:00
adrianeboyd	aef83e8070	Mark most Hungarian tokenizer test cases as slow (#4883 ) * Mark most Hungarian tokenizer test cases as slow Mark most Hungarian tokenizer test cases as slow to reduce the runtime of the test suite in ordinary usage: * for normal tests: run default tests plus 10% of the detailed tests * for slow tests: run all tests * Rework to mark individual tests as slow	2020-01-08 12:34:06 +01:00
Sofie Van Landeghem	7b96a5e10f	Reduce mem usage in training Entity Linker (#4811 ) * move nlp processing for el pipe to batch training instead of preprocessing * adding dev eval back in, and limit in articles instead of entities * use pipe whenever possible * few more small doc changes * access dev data through generator * tqdm description * small fixes * update documentation	2020-01-06 14:59:50 +01:00
Sofie Van Landeghem	6e9b61b49d	add warning in debug_data for punctuation in entities (#4853 )	2020-01-06 14:59:28 +01:00
adrianeboyd	d652ff215d	Add trailing whitespace to multiline test text (#4877 )	2020-01-06 14:58:59 +01:00
adrianeboyd	de69bc6509	Fix and improve URL pattern (#4882 ) * match domains longer than `hostname.domain.tld` like `www.foo.co.uk` * expand allowed characters in domain names while only matching lowercase TLDs so that "this.That" isn't matched as a URL and can be split on the period as an infix (relevant for at least English, German, and Tatar)	2020-01-06 14:58:30 +01:00
Sofie Van Landeghem	a1b22e90cd	serialize ENT_ID (#4852 ) * expand serialization test for custom token attribute * add failing test for issue 4849 * define ENT_ID as attr and use in doc serialization * fix few typos	2020-01-06 14:57:34 +01:00
Al Johri	1aa2d4dac9	stop rendering mathjax by default in displacy (#4840 ) * stop rendering mathjax by default in displacy * Replace f-string and add comment Co-authored-by: Ines Montani <ines@ines.io>	2020-01-01 13:15:05 +01:00
Anastasiia Iurshina	1830a12578	Fixes typos (#4843 ) * Fixes typos * Fixes typo * Contributor agreement	2019-12-29 14:24:13 +01:00
Ivan Echevarria	ef13e0c038	Add n_process to Language.pipe documentation (#4842 ) [ci skip] * Add n_process to documentation * Auto-format and add default [ci skip] Co-authored-by: Ines Montani <ines@ines.io>	2019-12-29 14:23:33 +01:00
Ines Montani	3431ac42de	Fix typo	2019-12-21 21:17:45 +01:00
Ines Montani	7c69d30de5	Tidy up and expect warning	2019-12-21 21:14:52 +01:00
Sofie Van Landeghem	732142bf28	facilitate larger training files (#4827 ) * add warning for large file and change start var to long * type for file_length	2019-12-21 21:12:19 +01:00
Ines Montani	cb4145adc7	Tidy up and auto-format	2019-12-21 19:04:17 +01:00
Olamilekan Wahab	a741de7cf6	Adding support for Yoruba Language (#4614 ) * Adding Support for Yoruba * test text * Updated test string. * Fixing encoding declaration. * Adding encoding to stop_words.py * Added contributor agreement and removed iranlowo. * Added removed test files and removed iranlowo to keep project bare. * Returned CONTRIBUTING.md to default state. * Added delted conftest entries * Tidy up and auto-format * Revert CONTRIBUTING.md Co-authored-by: Ines Montani <ines@ines.io>	2019-12-21 14:11:50 +01:00
Ines Montani	0750d59e5a	Allow setting ner_missing_tag on docs_to_json	2019-12-21 13:47:21 +01:00
Sofie Van Landeghem	8ebbb85117	Documentation for PhraseMatcher constructor (#4826 ) * add max_length as argument for init PhraseMatcher * improve error message too	2019-12-20 23:00:04 +01:00
Sofie Van Landeghem	12158c1e3a	Restore tqdm imports (#4804 ) * set 4.38.0 to minimal version with color bug fix * set imports back to proper place * add upper range for tqdm	2019-12-16 13:12:19 +01:00
Sofie Van Landeghem	557dcf5659	NEL requires sentences to be set (#4801 )	2019-12-13 15:55:18 +01:00
tamuhey	1707e77c5e	add char_span to Span (#4793 )	2019-12-13 15:54:58 +01:00
Sofie Van Landeghem	f9b541f9ef	More robust set entities method in KB (#4794 ) * add unit test for setting entities with duplicate identifiers * count the number of actual unique identifiers and throw duplicate warning	2019-12-13 10:45:29 +01:00
Sofie Van Landeghem	5355b0038f	Update EL example (#4789 ) * update EL example script after sentence-central refactor * version bump * set incl_prior to False for quick demo purposes * clean up	2019-12-11 18:19:42 +01:00
adrianeboyd	38e1bc19f4	Add destructors for states in TransitionSystem (#4686 )	2019-12-10 13:23:27 +01:00
adrianeboyd	c208eb6e4d	Fix int value handling in Matcher (#4749 ) Add `int` values (for `LENGTH`) in _get_attr_values() instead of treating `int` like `dict`.	2019-12-06 19:22:57 +01:00
Sofie Van Landeghem	780d43aac7	fix bug in EL predict (#4779 )	2019-12-06 19:18:14 +01:00
adrianeboyd	676e75838f	Include Doc.cats in serialization of Doc and DocBin (#4774 ) * Include Doc.cats in to_bytes() * Include Doc.cats in DocBin serialization * Add tests for serialization of cats Test serialization of cats for Doc and DocBin.	2019-12-06 14:07:39 +01:00
Antti Ajanki	e626a011cc	Improvements to the Finnish language data (#4738 ) * Enable lex_attrs on Finnish * Copy the Danish tokenizer rules to Finnish Specifically, don't break hyphenated compound words * Contributor agreement * A new file for Finnish tokenizer rules instead of including the Danish ones	2019-12-03 12:55:28 +01:00
Christoph Purschke	a7ee4b6f17	new tests & tokenization fixes (#4734 ) - added some tests for tokenization issues - fixed some issues with tokenization of words with hyphen infix - rewrote the "tokenizer_exceptions.py" file (stemming from the German version)	2019-12-01 23:08:21 +01:00
adrianeboyd	48ea2e8d0f	Restructure Sentencizer to follow Pipe API (#4721 ) * Restructure Sentencizer to follow Pipe API Restructure Sentencizer to follow Pipe API so that it can be scored with `nlp.evaluate()`. * Add Sentencizer pipe() test	2019-11-27 16:33:34 +01:00
Jari Bakken	16cb19e960	update nb tag_map (#4711 )	2019-11-25 21:26:26 +01:00
Ines Montani	5b36dec7eb	Auto-exclude disabled when calling from_disk during load (#4708 )	2019-11-25 16:01:22 +01:00
Ines Montani	2160ecfc92	Fix typo [ci skip]	2019-11-25 13:08:19 +01:00
adrianeboyd	2d8c6e1124	Iterate over lr_edges until sents are correct (#4702 ) Iterate over lr_edges until all heads are within the current sentence. Instead of iterating over them for a fixed number of iterations, check whether the sentence boundaries are correct for the heads and stop when all are correct. Stop after a maximum of 10 iterations, providing a warning in this case since the sentence boundaries may not be correct.	2019-11-25 13:06:36 +01:00
Matt Maybeno	c9f1e99787	Agnostic vocab array fix (#4680 ) * Use get_array_module instead of numpy * add contributor agreement	2019-11-23 14:59:52 +01:00
adrianeboyd	46250f60ac	Add missing tags to el/es/pt tag maps (#4696 ) * Add missing tags to pt tag map * Add missing tags to es tag map * Add missing tags to el tag map * Add missing symbol in el tag map	2019-11-23 14:57:21 +01:00
Paul O'Leary McCann	f0e3e606a6	Replace python-mecab3 with fugashi for Japanese (#4621 ) * Switch from mecab-python3 to fugashi mecab-python3 has been the best MeCab binding for a long time but it's not very actively maintained, and since it's based on old SWIG code distributed with MeCab there's a limit to how effectively it can be maintained. Fugashi is a new Cython-based MeCab wrapper I wrote. Since it's not based on the old SWIG code it's easier to keep it current and make small deviations from the MeCab C/C++ API where that makes sense. * Change mecab-python3 to fugashi in setup.cfg * Change "mecab tags" to "unidic tags" The tags come from MeCab, but the tag schema is specified by Unidic, so it's more proper to refer to it that way. * Update conftest * Add fugashi link to external deps list for Japanese	2019-11-23 14:31:04 +01:00
Ines Montani	a0fb1acb10	Update version [ci skip]	2019-11-21 18:19:37 +01:00
Ines Montani	b570d5d2ed	Increment version [ci skip]	2019-11-21 17:02:32 +01:00
Matthew Honnibal	50f89cb85d	Make vectors.find() return keys in correct order (#4691 ) * Make vectors.find() return keys in correct order * Update spacy/vectors.pyx	2019-11-21 16:58:32 +01:00
Ines Montani	5d4eede1e4	Fix test util imports	2019-11-21 16:28:29 +01:00
GuiGel	8f7ab70870	Bugfix/fix entity ruler from disk (#4670 ) * fix EntityRuler from_disk bug * add contributor file * Test EntityRuler PhraseMatcher deserialization (#4651) * newline at end of file * fix copy paste error * serializing the EntityRuler by itself * Add unicode declarations for Python 2 and auto-format	2019-11-21 16:26:37 +01:00
adrianeboyd	054df5d90a	Add error for non-string labels (#4690 ) Add error when attempting to add non-string labels to `Tagger` or `TextCategorizer`.	2019-11-21 16:24:10 +01:00
adrianeboyd	d7f32b285c	Detect more empty matches in tokenizer.explain() (#4675 ) * Detect more empty matches in tokenizer.explain() * Include a few languages in explain non-slow tests Mark a few languages in tokenizer.explain() tests as not slow so they're run by default.	2019-11-20 16:31:29 +01:00
Ines Montani	5bf9ab5b03	Tidy up and auto-format	2019-11-20 13:16:33 +01:00
Ines Montani	7f3b00164a	Re-add slow marker	2019-11-20 13:15:59 +01:00
Ines Montani	6e303de717	Auto-format	2019-11-20 13:15:24 +01:00
Ines Montani	2e7c896fe5	Update Tokenizer.explain tests	2019-11-20 13:14:11 +01:00
adrianeboyd	2c876eb672	Add tokenizer explain() debugging method (#4596 ) * Expose tokenizer rules as a property Expose the tokenizer rules property in the same way as the other core properties. (The cache resetting is overkill, but consistent with `from_bytes` for now.) Add tests and update Tokenizer API docs. * Update Hungarian punctuation to remove empty string Update Hungarian punctuation definitions so that `_units` does not match an empty string. * Use _load_special_tokenization consistently Use `_load_special_tokenization()` and have it to handle `None` checks. * Fix precedence of `token_match` vs. special cases Remove `token_match` check from `_split_affixes()` so that special cases have precedence over `token_match`. `token_match` is checked only before infixes are split. * Add `make_debug_doc()` to the Tokenizer Add `make_debug_doc()` to the Tokenizer as a working implementation of the pseudo-code in the docs. Add a test (marked as slow) that checks that `nlp.tokenizer()` and `nlp.tokenizer.make_debug_doc()` return the same non-whitespace tokens for all languages that have `examples.sentences` that can be imported. * Update tokenization usage docs Update pseudo-code and algorithm description to correspond to `nlp.tokenizer.make_debug_doc()` with example debugging usage. Add more examples for customizing tokenizers while preserving the existing defaults. Minor edits / clarifications. * Revert "Update Hungarian punctuation to remove empty string" This reverts commit `f0a577f7a5`. * Rework `make_debug_doc()` as `explain()` Rework `make_debug_doc()` as `explain()`, which returns a list of `(pattern_string, token_string)` tuples rather than a non-standard `Doc`. Update docs and tests accordingly, leaving the visualization for future work. * Handle cases with bad tokenizer patterns Detect when tokenizer patterns match empty prefixes and suffixes so that `explain()` does not hang on bad patterns. * Remove unused displacy image * Add tokenizer.explain() to usage docs	2019-11-20 13:07:25 +01:00
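
Note on de69bc6509 (Fix and improve URL pattern): the message describes matching domains with more than three parts, such as `www.foo.co.uk`, while requiring a lowercase TLD so that "this.That" is not treated as a URL. The sketch below only illustrates that idea. The regular expression `SIMPLE_DOMAIN` and the helper `looks_like_domain` are hypothetical names made up for this note, and the pattern is a deliberate simplification, not the pattern that the commit added to spaCy.

```python
import re

# Hypothetical, simplified pattern for illustration only; this is NOT
# spaCy's actual URL pattern. It accepts one or more dot-separated
# labels (mixed case allowed) followed by a lowercase-only TLD.
SIMPLE_DOMAIN = re.compile(
    r"^(?:[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?\.)+"  # hostname labels
    r"[a-z]{2,}$"  # TLD restricted to lowercase letters
)


def looks_like_domain(text):
    """Return True if text resembles a bare domain under the assumptions above."""
    return SIMPLE_DOMAIN.match(text) is not None


print(looks_like_domain("www.foo.co.uk"))  # True: any number of labels is allowed
print(looks_like_domain("this.That"))      # False: "That" is not a lowercase TLD
```

Because "this.That" is rejected here, a tokenizer using a rule of this shape would remain free to split it on the period as an infix, which is the behaviour the commit message describes.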


6609 Commits