spaCy

Commit Graph

Author	SHA1	Message	Date
Ines Montani	e257e66ab9	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-07-29 11:36:45 +02:00
Ines Montani	e0ffe36e79	Update docstrings, docs and types	2020-07-29 11:36:42 +02:00
Sofie Van Landeghem	40c995b1be	Option for returning only greedy matches (#5771 ) * add "greedy" option for match pattern * distinction between greedy FIRST or LONGEST * check for proper values, throw custom warning otherwise * unxfail one more test * add comment in docstring * add test that LONGEST also prefers first match if equal length * use c arrays for more efficient processing * rename 'greediness' to 'greedy'	2020-07-29 11:04:43 +02:00
Adriane Boyd	191a12d75f	Fix score_weights typo in train CLI (#5835 )	2020-07-29 11:04:12 +02:00
Adriane Boyd	0cddb0dbe9	Move timing into Language.evaluate (#5836 ) Move timing into `Language.evaluate` so that only the processing is timing, not processing + scoring. `Language.evaluate` returns `scores["speed"]` as words per second, which should be identical to how the speed was added to the scores previously. Also add the speed to the evaluate CLI output.	2020-07-29 11:02:31 +02:00
Ines Montani	7adffc5361	Remove unused schema	2020-07-28 23:12:47 +02:00
Ines Montani	e5d9eaf79c	Tidy up docstrings and arguments	2020-07-28 23:12:42 +02:00
Ines Montani	2c7a32cf12	Remove unused methods	2020-07-28 16:50:02 +02:00
Ines Montani	ba22111ff4	Move error to Errors	2020-07-28 16:24:14 +02:00
Ines Montani	2748249217	Re-add meta["pipeline"] for now	2020-07-28 16:14:23 +02:00
Ines Montani	b83ead5bf5	Merge pull request #5824 from svlandeg/fix/textcat-v3	2020-07-28 15:04:25 +02:00
Ines Montani	06a97a8766	Support --opt=value format in CLI config overrides	2020-07-28 13:43:15 +02:00
Ines Montani	ae4d8a6ffd	Update docstrings, docs and pipe consistency	2020-07-28 13:37:31 +02:00
Ines Montani	0094cb0d04	Remove scores list from config and document	2020-07-28 11:22:24 +02:00
Ines Montani	894e20c466	Merge branch 'develop' into feature/component-scores	2020-07-27 18:14:39 +02:00
Ines Montani	d8b519c23c	API docs, docstrings and argument consistency	2020-07-27 18:11:45 +02:00
svlandeg	85b2dcfd67	cleanup	2020-07-27 17:54:44 +02:00
svlandeg	61068e0fb1	util function dot_to_object and corresponding unit test	2020-07-27 17:50:12 +02:00
Ines Montani	10b84e1e27	Add flag to toggle sdist creation on package [ci skip]	2020-07-27 16:52:23 +02:00
Adriane Boyd	34c92dfe63	Add missing Scorer imports	2020-07-27 15:08:51 +02:00
Adriane Boyd	8bb0507777	Add and update score methods and score weights Add and update `score` methods, provided `scores`, and default weights `default_score_weights` for pipeline components. * `scores` provides all top-level keys returned by `score` (merely informative, similar to `assigns`). * `default_score_weights` provides the default weights for a default config. * The keys from `default_score_weights` determine which values will be shown in the `spacy train` output, so keys with weight `0.0` will be displayed but not counted toward the overall score.	2020-07-27 14:44:53 +02:00
Adriane Boyd	baf19fd652	Update cats scoring to provide overall score * Provide top-level score as `attr_score` * Provide a description of the score as `attr_score_desc` * Provide all potential scores keys, setting unused keys to `None` * Update CLI evaluate accordingly	2020-07-27 12:26:10 +02:00
Adriane Boyd	f8cf378be9	Combine weights from multiple components Combine weights from multiple components for the same score.	2020-07-27 10:21:31 +02:00
Ines Montani	3d56a3f286	Make more args keyword-only	2020-07-27 00:27:53 +02:00
Matthew Honnibal	80271ac0ba	Update default config	2020-07-26 15:27:39 +02:00
Ines Montani	ed61fb10fc	Rename default textcat arch to TextCatEnsemble	2020-07-26 15:11:43 +02:00
Ines Montani	53d37da29a	Make sure @factories is removed from config	2020-07-26 15:11:24 +02:00
Ines Montani	4060c2d5a6	Fix test	2020-07-26 13:40:19 +02:00
Ines Montani	2470486543	Allow pipeline components to set default scores and weights	2020-07-26 13:18:43 +02:00
Ines Montani	787d066e22	Remove pipes.pyx Probably accidentally re-added in a merge?	2020-07-26 13:08:52 +02:00
Matthew Honnibal	520d25cb50	Add smart_open dependency to fetch project assets (#5812 ) * Use smart_open for project assets * Fix assets.py * Update pyproject.toml	2020-07-26 12:15:00 +02:00
Ines Montani	e92df281ce	Tidy up, autoformat, add types	2020-07-25 15:01:15 +02:00
Matthew Honnibal	71242327b2	Set version to v3.0.0a5	2020-07-25 14:06:01 +02:00
Ines Montani	cdbd6ba912	Merge pull request #5798 from explosion/feature/language-data-config	2020-07-25 13:34:49 +02:00
Ines Montani	49f27a2a7b	Tidy up [ci skip]	2020-07-25 13:00:49 +02:00
Ines Montani	4a0a692875	Add missing lex_attr_getters (resolves #5806 )	2020-07-25 12:55:18 +02:00
Adriane Boyd	2bcceb80c4	Refactor the Scorer to improve flexibility (#5731 ) * Refactor the Scorer to improve flexibility Refactor the `Scorer` to improve flexibility for arbitrary pipeline components. * Individual pipeline components provide their own `evaluate` methods that score a list of `Example`s and return a dictionary of scores * `Scorer` is initialized either: * with a provided pipeline containing components to be scored * with a default pipeline containing the built-in statistical components (senter, tagger, morphologizer, parser, ner) * `Scorer.score` evaluates a list of `Example`s and returns a dictionary of scores referring to the scores provided by the components in the pipeline Significant differences: * `tags_acc` is renamed to `tag_acc` to be consistent with `token_acc` and the new `morph_acc`, `pos_acc`, and `lemma_acc` * Scoring is no longer cumulative: `Scorer.score` scores a list of examples rather than a single example and does not retain any state about previously scored examples * PRF values in the returned scores are no longer multiplied by 100 * Add kwargs to Morphologizer.evaluate * Create generalized scoring methods in Scorer * Generalized static scoring methods are added to `Scorer` * Methods require an attribute (either on Token or Doc) that is used to key the returned scores Naming differences: * `uas`, `las`, and `las_per_type` in the scores dict are renamed to `dep_uas`, `dep_las`, and `dep_las_per_type` Scoring differences: * `Doc.sents` is now scored as spans rather than on sentence-initial token positions so that `Doc.sents` and `Doc.ents` can be scored with the same method (this lowers scores since a single incorrect sentence start results in two incorrect spans) * Simplify / extend hasattr check for eval method * Add hasattr check to tokenizer scoring * Simplify to hasattr check for component scoring * Reset Example alignment if docs are set Reset the Example alignment if either doc is set in case the tokenization has changed. * Add PRF tokenization scoring for tokens as spans Add PRF scores for tokens as character spans. The scores are: * token_acc: # correct tokens / # gold tokens * token_p/r/f: PRF for (token.idx, token.idx + len(token)) * Add docstring to Scorer.score_tokenization * Rename component.evaluate() to component.score() * Update Scorer API docs * Update scoring for positive_label in textcat * Fix TextCategorizer.score kwargs * Update Language.evaluate docs * Update score names in default config	2020-07-25 12:53:02 +02:00
Ines Montani	c003d26b94	Tidy up	2020-07-25 12:21:37 +02:00
Ines Montani	a063a82c40	Tidy up __init__.py	2020-07-25 12:14:37 +02:00
Ines Montani	8d9d28eb8b	Re-add setting for vocab data and tidy up	2020-07-25 12:14:28 +02:00
Ines Montani	b9aaa4e457	Improve vocab data integration and warning	2020-07-25 11:51:30 +02:00
Ines Montani	38f6ea7a78	Simplify language data and revert detailed configs	2020-07-24 14:50:26 +02:00
Adriane Boyd	656574a01a	Update Japanese tests (#5807 ) * Update POS tests to reflect current behavior (it is not entirely clear whether the AUX/VERB mapping is indeed the desired behavior?) * Switch to `from_config` initialization in subtoken test	2020-07-24 12:45:14 +02:00
Adriane Boyd	fdb8815ef5	Minor refactor for Morphology and MorphAnalysis (#5804 ) * `MorphAnalysis.get` returns only the field values * Move `_normalize_props` inside `Morphology` as `Morphology.normalize_attrs` and simplify * Simplify POS field detection/conversion * Convert all non-POS features to strings * `Morphology` returns an empty string for a missing morph to align with the FEATS string returned for an existing morph * Remove unused `list_to_feats`	2020-07-24 09:28:06 +02:00
Ines Montani	87737a5a60	Tidy up	2020-07-23 00:16:23 +02:00
Ines Montani	a624ae0675	Remove POS, TAG and LEMMA from tokenizer exceptions	2020-07-22 23:09:01 +02:00
Ines Montani	14d7d46f89	Merge branch 'develop' into feature/language-data-config	2020-07-22 22:18:53 +02:00
Ines Montani	b507f61629	Tidy up and move noun_chunks, token_match, url_match	2020-07-22 22:18:46 +02:00
Ines Montani	7fc4dadd22	Fix typo	2020-07-22 20:27:22 +02:00
Ines Montani	d0c6d1efc5	@factories -> factory (#5801 )	2020-07-22 17:29:31 +02:00

1 2 3 4 5 ...

7336 Commits