spaCy

Commit Graph

Author	SHA1	Message	Date
Matthew Honnibal	bb911e5f4e	Fix #3830 : 'subtok' label being added even if learn_tokens=False (#4188 ) * Prevent subtok label if not learning tokens The parser introduces the subtok label to mark tokens that should be merged during post-processing. Previously this happened even if we did not have the --learn-tokens flag set. This patch passes the config through to the parser, to prevent the problem. * Make merge_subtokens a parser post-process if learn_subtokens * Fix train script * Add test for 3830: subtok problem * Fix handling of non-subtok in parser training	2019-08-23 17:54:00 +02:00
Sofie Van Landeghem	c417c380e3	Matcher ID fixes (#4179 ) * allow phrasematcher to link one match to multiple original patterns * small fix for defining ent_id in the matcher (anti-ghost prevention) * cleanup * formatting	2019-08-22 17:17:07 +02:00
Ines Montani	f5d3afb1a3	Fix typo in docstrings [ci skip]	2019-08-22 16:24:15 +02:00
Ines Montani	5ca7dd0f94	💫 WIP: Basic lookup class scaffolding and JSON for all lemmati… (#4167 ) * Improve load_language_data helper * WIP: Add Lookups implementation * Start moving lemma data over to JSON * WIP: move data over for more languages * Convert more languages * Fix lemmatizer fixtures in tests * Finish conversion * Auto-format JSON files * Fix test for now * Make sure tables are stored on instance	2019-08-22 14:21:32 +02:00
Sofie Van Landeghem	73b38c33e4	Small retokenizer fix (#4174 )	2019-08-22 12:23:54 +02:00
Ines Montani	a8752a569d	Auto-format [ci skip]	2019-08-22 11:44:39 +02:00
Pavle Vidanović	60e10a9f93	Serbian language improvement (#4169 ) * Serbian stopwords added. (cyrillic alphabet) * spaCy Contribution agreement included. * Test initialize updated * Serbian language code update. --bugfix * Tokenizer exceptions added. Init file updated. * Norm exceptions and lexical attributes added. * Examples added. * Tests added. * sr_lang examples update. * Tokenizer exceptions updated. (Serbian)	2019-08-22 11:43:07 +02:00
Sofie Van Landeghem	de272f8b82	adding double match for optional operator at the end (#4166 )	2019-08-21 22:46:56 +02:00
Sofie Van Landeghem	01c5980187	Serialize POS attribute when doc.is_tagged (#4092 ) * fix and unit test for issue 3959 * additional unit test for manifestation of the same (resolved) bug	2019-08-21 21:59:30 +02:00
Sofie Van Landeghem	7539a4f3a8	use states[q] in while retry loop (#4162 )	2019-08-21 21:58:04 +02:00
adrianeboyd	2d17b047e2	Check for is_tagged/is_parsed for Matcher attrs (#4163 ) Check for relevant components in the pipeline when Matcher is called, similar to the checks for PhraseMatcher in #4105. * keep track of attributes seen in patterns * when Matcher is called on a Doc, check for is_tagged for LEMMA, TAG, POS and for is_parsed for DEP	2019-08-21 20:52:36 +02:00
Pavle Vidanović	4fe9329bfb	Serbian language code update "rs" -> "sr" (#4159 ) * Serbian stopwords added. (cyrillic alphabet) * spaCy Contribution agreement included. * Test initialize updated * Serbian language code update. --bugfix	2019-08-21 19:57:37 +02:00
adrianeboyd	8fe7bdd0fa	Improve token pattern checking without validation (#4105 ) * Fix typo in rule-based matching docs * Improve token pattern checking without validation Add more detailed token pattern checks without full JSON pattern validation and provide more detailed error messages. Addresses #4070 (also related: #4063, #4100). * Check whether top-level attributes in patterns and attr for PhraseMatcher are in token pattern schema * Check whether attribute value types are supported in general (as opposed to per attribute with full validation) * Report various internal error types (OverflowError, AttributeError, KeyError) as ValueError with standard error messages * Check for tagger/parser in PhraseMatcher pipeline for attributes TAG, POS, LEMMA, and DEP * Add error messages with relevant details on how to use validate=True or nlp() instead of nlp.make_doc() * Support attr=TEXT for PhraseMatcher * Add NORM to schema * Expand tests for pattern validation, Matcher, PhraseMatcher, and EntityRuler * Remove unnecessary .keys() * Rephrase error messages * Add another type check to Matcher Add another type check to Matcher for more understandable error messages in some rare cases. * Support phrase_matcher_attr=TEXT for EntityRuler * Don't use spacy.errors in examples and bin scripts * Fix error code * Auto-format Also try get Azure pipelines to finally start a build :( * Update errors.py Co-authored-by: Ines Montani <ines@ines.io> Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2019-08-21 14:00:37 +02:00
Ines Montani	f580302673	Tidy up and auto-format	2019-08-20 17:36:34 +02:00
Ines Montani	364aaf5bc2	Simplify test	2019-08-20 16:41:58 +02:00
Sofie Van Landeghem	68ee0384fd	Unit test for Issue 3879 (#4153 ) * failing unit test for Issue #3879 * mark test as failing	2019-08-20 16:40:25 +02:00
Ines Montani	86cd7f0efd	Add regression test for #4120	2019-08-20 16:33:09 +02:00
Ines Montani	104125edd2	Tidy up errors	2019-08-20 16:03:45 +02:00
Ines Montani	cc76a26fe8	Raise error for negative arc indices (closes #3917 )	2019-08-20 15:51:37 +02:00
Ines Montani	69e70ffae1	Merge branch 'master' of https://github.com/explosion/spaCy	2019-08-20 15:09:52 +02:00
Ines Montani	f65e36925d	Fix absolute imports and avoid importing from cli	2019-08-20 15:08:59 +02:00
Ines Montani	7e8be44218	Auto-format	2019-08-20 15:06:31 +02:00
Paul O'Leary McCann	756b66b7c0	Reduce size of language data (#4141 ) * Move Turkish lemmas to a json file Rather than a large dict in Python source, the data is now a big json file. This includes a method for loading the json file, falling back to a compressed file, and an update to MANIFEST.in that excludes json in the spacy/lang directory. This focuses on Turkish specifically because it has the most language data in core. * Transition all lemmatizer.py files to json This covers all lemmatizer.py files of a significant size (>500k or so). Small files were left alone. None of the affected files have logic, so this was pretty straightforward. One unusual thing is that the lemma data for Urdu doesn't seem to be used anywhere. That may require further investigation. * Move large lang data to json for fr/nb/nl/sv These are the languages that use a lemmatizer directory (rather than a single file) and are larger than English. For most of these languages there were many language data files, in which case only the large ones (>500k or so) were converted to json. It may or may not be a good idea to migrate the remaining Python files to json in the future. * Fix id lemmas.json The contents of this file were originally just copied from the Python source, but that used single quotes, so it had to be properly converted to json first. * Add .json.gz to gitignore This covers the json.gz files built as part of distribution. * Add language data gzip to build process Currently this gzip data on every build; it works, but it should be changed to only gzip when the source file has been updated. * Remove Danish lemmatizer.py Missed this when I added the json. * Update to match latest explosion/srsly#9 The way gzipped json is loaded/saved in srsly changed a bit. * Only compress language data if necessary If a .json.gz file exists and is newer than the corresponding json file, it's not recompressed. * Move en/el language data to json This only affected files >500kb, which was nouns for both languages and the generic lookup table for English. * Remove empty files in Norwegian tokenizer It's unclear why, but the Norwegian (nb) tokenizer had empty files for adj/adv/noun/verb lemmas. This may have been a result of copying the structure of the English lemmatizer. This removed the files, but still creates the empty sets in the lemmatizer. That may not actually be necessary. * Remove dubious entries in English lookup.json " furthest" and " skilled" - both prefixed with a space - were in the English lookup table. That seems obviously wrong so I have removed them. * Fix small issues with en/fr lemmatizers The en tokenizer was including the removed _nouns.py file, so that's removed. The fr tokenizer is unusual in that it has a lemmatizer directory with both __init__.py and lemmatizer.py. lemmatizer.py had not been converted to load the json language data, so that was fixed. * Auto-format * Auto-format * Update srsly pin * Consistently use pathlib paths	2019-08-20 14:54:11 +02:00
Ivan Šarić	434f6fa6c1	Issue #1107 - adds examples.py for Croatian language (#4143 ) * adds contributor agreement for isaric * adds examples.py for croatian language	2019-08-18 23:04:41 +02:00
Paul O'Leary McCann	7f82a1fe1b	Make the emoticon list a raw string (#4139 ) While working on an unrelated task I got warnings about an unsupported escape sequence (`"\("`) in the tokenizer exceptions. Making the tokenizer exceptions a raw string makes this warning go away. The specific string that triggered this is `¯\(ツ)/¯`.	2019-08-18 15:17:13 +02:00
Ines Montani	009280fbc5	Tidy up and auto-format	2019-08-18 15:09:16 +02:00
Ines Montani	89f2b87266	Open file as utf-8 (closes #4138 )	2019-08-18 13:55:34 +02:00
Ines Montani	f35a8221d8	Move generation of parses out of with blocks	2019-08-18 13:54:26 +02:00
yanaiela	ec0beccaf1	Custom entity render (#4117 ) * customizable template for entities display, allowing to pass additional parameters along each entity * contributor agreement * simpler naming for the additional parameters given to the span entities renderer Co-Authored-By: Ines Montani <ines@ines.io> * change of default parameter, as suggested Co-Authored-By: Ines Montani <ines@ines.io>	2019-08-16 18:39:25 +02:00
Ines Montani	e5c7e19e82	Fix typo and auto-format [ci skip]	2019-08-16 10:53:38 +02:00
adrianeboyd	a58cb023d7	WIP: Extending debug-data (#4114 ) * Extending debug-data with dependency checks, etc. * Modify debug-data to load with GoldCorpus to iterate over .json/.jsonl files within directories * Add GoldCorpus iterator train_docs_without_preprocessing to load original train docs without shuffling and projectivizing * Report number of misaligned tokens * Add more dependency checks and messages * Update spacy/cli/debug_data.py Co-Authored-By: Ines Montani <ines@ines.io> * Fixed conflict * Move counts to _compile_gold() * Move all dependency nonproj/sent/head/cycle counting to _compile_gold() * Unclobber previous merges * Update variable names * Update more variable names, fix misspelling * Don't clobber loading error messages * Only warn about misaligned tokens if present	2019-08-16 10:52:46 +02:00
Ziming He	eea7d4f4a8	biluo_tags_from_offsets throw exception for overlapping entities (#4021 ) * Check whether two entities overlap - biluo_gold_biluo_overlap now throw exception when entities passed in have overlaps - added unit test * SCA agreement	2019-08-15 18:13:32 +02:00
adrianeboyd	2f9b28c218	Provide more info in cycle error message E069 (#4123 ) Provide the tokens in the cycle and the first 50 tokens from document in the error message so it's easier to track down the location of the cycle in the data. Addresses feature request in #3698.	2019-08-15 18:08:28 +02:00
AJ Rader	2f3648700c	Correction of default lemmatizer lookup in English (Issue # 4104) (#4110 ) * pytest file for issue4104 established * edited default lookup english lemmatizer for spun; fixes issue 4102 * eliminated parameterization and sorted dictionary dependency in issue 4104 test * added contributor agreement	2019-08-15 11:39:10 +02:00
Ines Montani	1711b5eb62	💫 Support displaCy user colors via entry point (#4113 )	2019-08-13 15:59:55 +02:00
Sofie Van Landeghem	0ba1b5eebc	CLI scripts for entity linking (wikipedia & generic) (#4091 ) * document token ent_kb_id * document span kb_id * update pipeline documentation * prior and context weights as bool's instead * entitylinker api documentation * drop for both models * finish entitylinker documentation * small fixes * documentation for KB * candidate documentation * links to api pages in code * small fix * frequency examples as counts for consistency * consistent documentation about tensors returned by predict * add entity linking to usage 101 * add entity linking infobox and KB section to 101 * entity-linking in linguistic features * small typo corrections * training example and docs for entity_linker * predefined nlp and kb * revert back to similarity encodings for simplicity (for now) * set prior probabilities to 0 when excluded * code clean up * bugfix: deleting kb ID from tokens when entities were removed * refactor train el example to use either model or vocab * pretrain_kb example for example kb generation * add to training docs for KB + EL example scripts * small fixes * error numbering * ensure the language of vocab and nlp stay consistent across serialization * equality with = * avoid conflict in errors file * add error 151 * final adjustements to the train scripts - consistency * update of goldparse documentation * small corrections * push commit * turn kb_creator into CLI script (wip) * proper parameters for training entity vectors * wikidata pipeline split up into two executable scripts * remove context_width * move wikidata scripts in bin directory, remove old dummy script * refine KB script with logs and preprocessing options * small edits * small improvements to logging of EL CLI script	2019-08-13 15:38:59 +02:00
黎谢鹏	250a54414b	update lang/zh (#4103 ) * update lang/zh * update lang/zh	2019-08-12 10:37:48 +02:00
Sofie Van Landeghem	963ea5e8d0	Update lemma and vector information after splitting a token (#4097 ) * fixing vector and lemma attributes after retokenizer.split * fixing unit test with mockup tensor * xp instead of numpy	2019-08-08 15:09:44 +02:00
Matthew Honnibal	04113a844d	Set version to v2.1.8	2019-08-07 13:53:58 +02:00
Ines Montani	6bec24cdd0	Require downloaded model in pkg_resources (#4090 )	2019-08-07 13:18:11 +02:00
adrianeboyd	69aca7d839	Add validate option to EntityRuler (#4089 ) * Add validate option to EntityRuler * Add validate to EntityRuler, passed to Matcher and PhraseMatcher * Add validate to usage and API docs * Update website/docs/usage/rule-based-matching.md Co-Authored-By: Ines Montani <ines@ines.io> * Update website/docs/usage/rule-based-matching.md Co-Authored-By: Ines Montani <ines@ines.io>	2019-08-07 00:40:53 +02:00
Jeno	15be09ceb0	Raise error if annotation dict in simple training style has unexpected keys #4074 (#4079 ) * adding enhancement #4074. * modified behavior to strictly require top level dictionary keys - issue #4074 * pass expected keys to error message and add links as expected top level key	2019-08-06 11:01:25 +02:00
Sofie Van Landeghem	ad09b0d6f3	fetch norm from lex if necessary for matching (#4080 )	2019-08-05 23:51:04 +02:00
Pavle Vidanović	e1a935d71c	Stopwords for Serbian language. (#4078 ) * Serbian stopwords added. (cyrillic alphabet) * spaCy Contribution agreement included. * Test initialize updated	2019-08-05 10:22:27 +02:00
veer-bains	874bd8c8dd	Fixed syntax error in lang/ko when using python 2 (#4082 ) (closes #4068 ) * fixed syntax error in declaring variables with python 2.7 in spacy/lang/ko/__init__.py * fixed syntax error in declaring variables with python 2.7 in spacy/lang/ko/__init__.py * Update __init__.py * Create veer-bains.md * Update __init__.py fixed syntax errors in variable datatype assignment when calling spacy.blank("ko") with python 2.7	2019-08-05 10:19:32 +02:00
Ines Montani	87ddbdc33e	Fix handling of kwargs in Language.evaluate Makes it consistent with other methods	2019-08-04 13:44:21 +02:00
Muhammad Irfan	d1d30b0442	added missing punctuation following conventions. (#4066 )	2019-08-04 13:41:18 +02:00
Anastassia	33b14724a5	Update gold corpus code to properly ingest a directory of jsonl… (#4067 ) * Update gold corpus code to properly ingest a directory of jsonlines files In response to: https://github.com/explosion/spaCy/issues/3975 * Update spacy/gold.pyx Co-Authored-By: Ines Montani <ines@ines.io>	2019-08-02 09:58:51 +02:00
Matthew Honnibal	944a66c326	Add span.tensor and token.tensor attributes	2019-08-01 18:30:50 +02:00
Matthew Honnibal	d3071ecdbc	Set version to v2.1.7	2019-08-01 18:09:19 +02:00


6168 Commits