spaCy

Commit Graph

Author	SHA1	Message	Date
Matthew Honnibal	f8d740bfb1	Fix --gold-preproc train cli command (#4392) * Fix get labels for textcat * Fix char_embed for gpu * Revert "Fix char_embed for gpu" This reverts commit `055b9a9e85`. * Fix passing of cats in gold.pyx * Revert "Match pop with append for training format (#4516)" This reverts commit `8e7414dace`. * Fix popping gold parses * Fix handling of cats in gold tuples * Fix name * Fix ner_multitask_objective script * Add test for 4402	2019-10-27 21:58:50 +01:00
Sofie Van Landeghem	8e7414dace	Match pop with append for training format (#4516) * trying to fix script - not succesful yet * match pop() with extend() to avoid changing the data * few more pop-extend fixes * reinsert deleted print statement * fix print statement * add last tested version * append instead of extend * add in few comments * quick fix for 4402 + unit test * fixing number of docs (not counting cats) * more fixes * fix len * print tmp file instead of using data from examples dir * print tmp file instead of using data from examples dir (2)	2019-10-27 16:01:32 +01:00
Ines Montani	a9c6104047	Component decorator and component analysis (#4517) * Add work in progress * Update analysis helpers and component decorator * Fix porting of docstrings for Python 2 * Fix docstring stuff on Python 2 * Support meta factories when loading model * Put auto pipeline analysis behind flag for now * Analyse pipes on remove_pipe and replace_pipe * Move analysis to root for now Try to find a better place for it, but it needs to go for now to avoid circular imports * Simplify decorator Don't return a wrapped class and instead just write to the object * Update existing components and factories * Add condition in factory for classes vs. functions * Add missing from_nlp classmethods * Add "retokenizes" to printed overview * Update assigns/requires declarations of builtins * Only return data if no_print is enabled * Use multiline table for overview * Don't support Span * Rewrite errors/warnings and move them to spacy.errors	2019-10-27 13:35:49 +01:00
Ines Montani	cfffdba7b1	Implement new API for {Phrase}Matcher.add (backwards-compatible) (#4522) * Implement new API for {Phrase}Matcher.add (backwards-compatible) * Update docs * Also update DependencyMatcher.add * Update internals * Rewrite tests to use new API * Add basic check for common mistake Raise error with suggestion if user likely passed in a pattern instead of a list of patterns * Fix typo [ci skip]	2019-10-25 22:21:08 +02:00
Ines Montani	d2da117114	Also support passing list to Language.disable_pipes (#4521) * Also support passing list to Language.disable_pipes * Adjust internals	2019-10-25 16:19:08 +02:00
Ines Montani	e91366a216	Adjust formatting [ci skip]	2019-10-25 11:25:44 +02:00
Ines Montani	f31876154d	Adjust formatting [ci skip]	2019-10-25 11:19:46 +02:00
Kabir Khan	93640373c7	Make entity_ruler ent_id resolution 2x faster and add docs for… (#4513) * Update entityruler.py * Making ent_id resolution 2x faster and adding docs * Fixing newlines in docstrings * Fixing newlines in docstrings	2019-10-25 11:16:42 +02:00
Zhuoru Lin	10d88b09bb	Bugfix/fix wikidata train entity linker (#4509) * Fix labels_discard Nonetype iteration error * Contributor agreement for Zhuoru Lin * Enhance EntityLinker.predict() to handle labels_discard is None case.	2019-10-24 12:52:59 +02:00
Ines Montani	181c01f629	Tidy up and auto-format	2019-10-18 11:27:38 +02:00
Sofie Van Landeghem	2d249a9502	KB extensions and better parsing of WikiData (#4375) * fix overflow error on windows * more documentation & logging fixes * md fix * 3 different limit parameters to play with execution time * bug fixes directory locations * small fixes * exclude dev test articles from prior probabilities stats * small fixes * filtering wikidata entities, removing numeric and meta items * adding aliases from wikidata also to the KB * fix adding WD aliases * adding also new aliases to previously added entities * fixing comma's * small doc fixes * adding subclassof filtering * append alias functionality in KB * prevent appending the same entity-alias pair * fix for appending WD aliases * remove date filter * remove unnecessary import * small corrections and reformatting * remove WD aliases for now (too slow) * removing numeric entities from training and evaluation * small fixes * shortcut during prediction if there is only one candidate * add counts and fscore logging, remove FP NER from evaluation * fix entity_linker.predict to take docs instead of single sentences * remove enumeration sentences from the WP dataset * entity_linker.update to process full doc instead of single sentence * spelling corrections and dump locations in readme * NLP IO fix * reading KB is unnecessary at the end of the pipeline * small logging fix * remove empty files	2019-10-14 12:28:53 +02:00
Matthew Honnibal	29f9fec267	Improve spacy pretrain (#4393) * Support bilstm_depth arg in spacy pretrain * Add option to ignore zero vectors in get_cossim_loss * Use cosine loss in Cloze multitask	2019-10-07 23:34:58 +02:00
Sofie Van Landeghem	9d3ce7cba2	Ensure training doesn't crash with empty batches (#4360) * unit test for previously resolved unflatten issue * prevent batch of empty docs to cause problems	2019-10-02 12:50:47 +02:00
Ines Montani	b6670bf0c2	Use consistent spelling	2019-10-02 10:37:39 +02:00
Ines Montani	3297a19545	Warn in Tagger.begin_training if no lemma tables are available (#4351)	2019-10-01 15:13:55 +02:00
Sofie Van Landeghem	22b9e12159	Ensure the NER remains consistent after resizing (#4330) * test and fix for second bug of issue 4042 * fix for first bug in 4042 * crashing test for Issue 4313 * forgot one instance of resize * remove prints * undo uncomment * delete test for 4313 (uses third party lib) * add fix for Issue 4313 * unit test for 4313	2019-09-27 20:57:13 +02:00
Matthew Honnibal	46c02d25b1	Merge changes to test_ner	2019-09-18 21:41:24 +02:00
tamuhey	875f3e5d8c	remove redundant __call__ method in pipes.TextCategorizer (#4305) * remove redundant __call__ method in pipes.TextCategorizer Because the parent __call__ method behaves in the same way. * fix: Pipe.__call__ arg * fix: invalid arg in Pipe.__call__ * modified: spacy/tests/regression/test_issue4278.py (#4278) * deleted: Pipfile	2019-09-18 21:31:27 +02:00
adrianeboyd	b5d999e510	Add textcat to train CLI (#4226) * Add doc.cats to spacy.gold at the paragraph level Support `doc.cats` as `"cats": [{"label": string, "value": number}]` in the spacy JSON training format at the paragraph level. * `spacy.gold.docs_to_json()` writes `docs.cats` * `GoldCorpus` reads in cats in each `GoldParse` * Update instances of gold_tuples to handle cats Update iteration over gold_tuples / gold_parses to handle addition of cats at the paragraph level. * Add textcat to train CLI * Add textcat options to train CLI * Add textcat labels in `TextCategorizer.begin_training()` * Add textcat evaluation to `Scorer`: * For binary exclusive classes with provided label: F1 for label * For 2+ exclusive classes: F1 macro average * For multilabel (not exclusive): ROC AUC macro average (currently relying on sklearn) * Provide user info on textcat evaluation settings, potential incompatibilities * Provide pipeline to Scorer in `Language.evaluate` for textcat config * Customize train CLI output to include only metrics relevant to current pipeline * Add textcat evaluation to evaluate CLI * Fix handling of unset arguments and config params Fix handling of unset arguments and model confiug parameters in Scorer initialization. * Temporarily add sklearn requirement * Remove sklearn version number * Improve Scorer handling of models without textcats * Fixing Scorer handling of models without textcats * Update Scorer output for python 2.7 * Modify inf in Scorer for python 2.7 * Auto-format Also make small adjustments to make auto-formatting with black easier and produce nicer results * Move error message to Errors * Update documentation * Add cats to annotation JSON format [ci skip] * Fix tpl flag and docs [ci skip] * Switch to internal roc_auc_score Switch to internal `roc_auc_score()` adapted from scikit-learn. * Add AUCROCScore tests and improve errors/warnings * Add tests for AUCROCScore and roc_auc_score * Add missing error for only positive/negative values * Remove unnecessary warnings and errors * Make reduced roc_auc_score functions private Because most of the checks and warnings have been stripped for the internal functions and access is only intended through `ROCAUCScore`, make the functions for roc_auc_score adapted from scikit-learn private. * Check that data corresponds with multilabel flag Check that the training instances correspond with the multilabel flag, adding the multilabel flag if required. * Add textcat score to early stopping check * Add more checks to debug-data for textcat * Add example training data for textcat * Add more checks to textcat train CLI * Check configuration when extending base model * Fix typos * Update textcat example data * Provide licensing details and licenses for data * Remove two labels with no positive instances from jigsaw-toxic-comment data. Co-authored-by: Ines Montani <ines@ines.io>	2019-09-15 22:31:31 +02:00
Ines Montani	16c2522791	Merge branch 'master' into develop	2019-09-14 16:42:01 +02:00
adrianeboyd	6942a6a69b	Extend default punct for sentencizer (#4290) Most of these characters are for languages / writing systems that aren't supported by spacy, but I don't think it causes problems to include them. In the UD evals, Hindi and Urdu improve a lot as expected (from 0-10% to 70-80%) and Persian improves a little (90% to 96%). Tamil improves in combination with #4288. The punctuation list is converted to a set internally because of its increased length. Sentence final punctuation generated with: ``` unichars -gas '[\p{Sentence_Break=STerm}\p{Sentence_Break=ATerm}]' '\p{Terminal_Punctuation}' ``` See: https://stackoverflow.com/a/9508766/461847 Fixes #4269.	2019-09-14 15:25:48 +02:00
Ines Montani	27106d6528	Merge branch 'master' into develop	2019-09-13 17:07:17 +02:00
Sofie Van Landeghem	2ae5db580e	dim bugfix when incl_prior is False (#4285)	2019-09-13 16:30:05 +02:00
Ines Montani	3c3658ef9f	Merge branch 'master' into develop	2019-09-12 18:03:01 +02:00
Ines Montani	228bbf506d	Improve label properties on pipes	2019-09-12 18:02:44 +02:00
Ines Montani	655b434553	Merge branch 'master' into develop	2019-09-12 11:39:18 +02:00
tamuhey	71909cdf22	Fix iss4278 (#4279) * fix: len(tuple) == 2 * (#4278) add fail test * add contributor's aggreement	2019-09-12 10:44:49 +02:00
Ines Montani	8ebc3711dc	Fix bug in Parser.labels and add test (#4275)	2019-09-11 18:29:35 +02:00
Matthew Honnibal	c308cf3e3e	Merge branch 'master' into feature/lemmatizer	2019-08-25 13:52:27 +02:00
Matthew Honnibal	bb911e5f4e	Fix #3830: 'subtok' label being added even if learn_tokens=False (#4188) * Prevent subtok label if not learning tokens The parser introduces the subtok label to mark tokens that should be merged during post-processing. Previously this happened even if we did not have the --learn-tokens flag set. This patch passes the config through to the parser, to prevent the problem. * Make merge_subtokens a parser post-process if learn_subtokens * Fix train script * Add test for 3830: subtok problem * Fix handlign of non-subtok in parser training	2019-08-23 17:54:00 +02:00
Ines Montani	f5d3afb1a3	Fix typo in docstrings [ci skip]	2019-08-22 16:24:15 +02:00
Matthew Honnibal	bcd08f20af	Merge changes from master	2019-08-21 14:18:52 +02:00
adrianeboyd	8fe7bdd0fa	Improve token pattern checking without validation (#4105) * Fix typo in rule-based matching docs * Improve token pattern checking without validation Add more detailed token pattern checks without full JSON pattern validation and provide more detailed error messages. Addresses #4070 (also related: #4063, #4100). * Check whether top-level attributes in patterns and attr for PhraseMatcher are in token pattern schema * Check whether attribute value types are supported in general (as opposed to per attribute with full validation) * Report various internal error types (OverflowError, AttributeError, KeyError) as ValueError with standard error messages * Check for tagger/parser in PhraseMatcher pipeline for attributes TAG, POS, LEMMA, and DEP * Add error messages with relevant details on how to use validate=True or nlp() instead of nlp.make_doc() * Support attr=TEXT for PhraseMatcher * Add NORM to schema * Expand tests for pattern validation, Matcher, PhraseMatcher, and EntityRuler * Remove unnecessary .keys() * Rephrase error messages * Add another type check to Matcher Add another type check to Matcher for more understandable error messages in some rare cases. * Support phrase_matcher_attr=TEXT for EntityRuler * Don't use spacy.errors in examples and bin scripts * Fix error code * Auto-format Also try get Azure pipelines to finally start a build :( * Update errors.py Co-authored-by: Ines Montani <ines@ines.io> Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2019-08-21 14:00:37 +02:00
Ines Montani	f65e36925d	Fix absolute imports and avoid importing from cli	2019-08-20 15:08:59 +02:00
Sofie Van Landeghem	0ba1b5eebc	CLI scripts for entity linking (wikipedia & generic) (#4091) * document token ent_kb_id * document span kb_id * update pipeline documentation * prior and context weights as bool's instead * entitylinker api documentation * drop for both models * finish entitylinker documentation * small fixes * documentation for KB * candidate documentation * links to api pages in code * small fix * frequency examples as counts for consistency * consistent documentation about tensors returned by predict * add entity linking to usage 101 * add entity linking infobox and KB section to 101 * entity-linking in linguistic features * small typo corrections * training example and docs for entity_linker * predefined nlp and kb * revert back to similarity encodings for simplicity (for now) * set prior probabilities to 0 when excluded * code clean up * bugfix: deleting kb ID from tokens when entities were removed * refactor train el example to use either model or vocab * pretrain_kb example for example kb generation * add to training docs for KB + EL example scripts * small fixes * error numbering * ensure the language of vocab and nlp stay consistent across serialization * equality with = * avoid conflict in errors file * add error 151 * final adjustements to the train scripts - consistency * update of goldparse documentation * small corrections * push commit * turn kb_creator into CLI script (wip) * proper parameters for training entity vectors * wikidata pipeline split up into two executable scripts * remove context_width * move wikidata scripts in bin directory, remove old dummy script * refine KB script with logs and preprocessing options * small edits * small improvements to logging of EL CLI script	2019-08-13 15:38:59 +02:00
adrianeboyd	69aca7d839	Add validate option to EntityRuler (#4089) * Add validate option to EntityRuler * Add validate to EntityRuler, passed to Matcher and PhraseMatcher * Add validate to usage and API docs * Update website/docs/usage/rule-based-matching.md Co-Authored-By: Ines Montani <ines@ines.io> * Update website/docs/usage/rule-based-matching.md Co-Authored-By: Ines Montani <ines@ines.io>	2019-08-07 00:40:53 +02:00
Matthew Honnibal	4632c597e7	Fix Pipe base class	2019-08-01 17:29:01 +02:00
Sofie Van Landeghem	7de3b129ab	Resolve edge case when calling textcat.predict with empty doc (#4035) * resolve edge case where no doc has tokens when calling textcat.predict * more explicit value test	2019-07-30 14:58:01 +02:00
Matthew Honnibal	06eb428ed1	Make pipe base class a bit less presumptuous	2019-07-28 17:56:11 +02:00
Matthew Honnibal	16b5144095	Don't raise NotImplemented in Pipe.update	2019-07-28 17:54:11 +02:00
Matthew Honnibal	73e095923f	💫 Improve error message when model.from_bytes() dies (#4014) * Improve error message when model.from_bytes() dies When Thinc's model.from_bytes() is called with a mismatched model, often we get a particularly ungraceful error, e.g. "AttributeError: FunctionLayer has no attribute G" This is because we're trying to load the parameters for something like a LayerNorm layer, and the model architecture has some other layer there instead. This is obviously terrible, especially since the error type is wrong. I've changed it to raise a ValueError. The error message is still probably a bit terse, but it's hard to be sure exactly what's gone wrong. * Update spacy/pipeline/pipes.pyx * Update spacy/pipeline/pipes.pyx * Update spacy/pipeline/pipes.pyx * Update spacy/syntax/nn_parser.pyx * Update spacy/syntax/nn_parser.pyx * Update spacy/pipeline/pipes.pyx Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com> * Update spacy/pipeline/pipes.pyx Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com> Co-authored-by: Ines Montani <ines@ines.io>	2019-07-24 11:27:34 +02:00
svlandeg	4e7ec1ed31	return fix	2019-07-23 14:23:58 +02:00
svlandeg	400ff342cf	replace assert's with custom error messages	2019-07-23 11:52:48 +02:00
svlandeg	20389e4553	format and bugfix	2019-07-22 15:08:17 +02:00
svlandeg	41fb5204ba	output tensors as part of predict	2019-07-19 14:47:36 +02:00
svlandeg	21176517a7	have gold.links correspond exactly to doc.ents	2019-07-19 12:36:15 +02:00
svlandeg	e1213eaf6a	use original gold object in get_loss function	2019-07-18 13:35:10 +02:00
svlandeg	ec55d2fccd	filter training data beforehand (+black formatting)	2019-07-18 10:22:24 +02:00
svlandeg	a63d15a142	code cleanup	2019-07-15 17:36:43 +02:00
svlandeg	60f299374f	set default context width	2019-07-15 12:03:09 +02:00

127 Commits