spaCy

Commit Graph

Author	SHA1	Message	Date
Adriane Boyd	1c4df8fd09	Replace pytokenizations with internal alignment (#6293 ) * Replace pytokenizations with internal alignment Replace pytokenizations with internal alignment algorithm that is restricted to only allow differences in whitespace and capitalization. * Rename `spacy.training.align` to `spacy.training.alignment` to contain the `Alignment` dataclass * Implement `get_alignments` in `spacy.training.align` * Refactor trailing whitespace handling * Remove unnecessary exception for empty docs Allow a non-empty whitespace-only doc to be aligned with an empty doc * Remove empty docs exceptions completely	2020-11-03 16:24:38 +01:00
Sofie Van Landeghem	75a202ce65	TextCat updates and fixes (#6263 ) * small fix in example imports * throw error when train_corpus or dev_corpus is not a string * small fix in custom logger example * limit macro_auc to labels with 2 annotations * fix typo * also create parents of output_dir if need be * update documentation of textcat scores * refactor TextCatEnsemble * fix tests for new AUC definition * bump to 3.0.0a42 * update docs * rename to spacy.TextCatEnsemble.v2 * spacy.TextCatEnsemble.v1 in legacy * cleanup * small fix * update to 3.0.0rc2 * fix import that got lost in merge * cursed IDE * fix two typos	2020-10-18 14:50:41 +02:00
Ines Montani	bfa3931c9d	Revert added_strings change (#6236 )	2020-10-10 18:55:07 +02:00
Ines Montani	8ac5f22253	Adjust error message	2020-10-09 18:00:16 +02:00
svlandeg	06b9d213fd	formatting	2020-10-09 12:19:47 +02:00
svlandeg	2cafba5f50	shorten error message for clarity	2020-10-09 12:17:35 +02:00
svlandeg	18dfb27985	Add custom error when evaluation throws a KeyError	2020-10-09 12:05:33 +02:00
Sofie Van Landeghem	d093d6343b	TrainablePipe (#6213 ) * rename Pipe to TrainablePipe * split functionality between Pipe and TrainablePipe * remove unnecessary methods from certain components * cleanup * hasattr(component, "pipe") should be sufficient again * remove serialization and vocab/cfg from Pipe * unify _ensure_examples and validate_examples * small fixes * hasattr checks for self.cfg and self.vocab * make is_resizable and is_trainable properties * serialize strings.json instead of vocab * fix KB IO + tests * fix typos * more typos * _added_strings as a set * few more tests specifically for _added_strings field * bump to 3.0.0a36	2020-10-08 21:33:49 +02:00
Ines Montani	be99f1e4de	Remove output dirs before training (#6204 ) * Remove output dirs before training * Re-raise error if cleaning fails	2020-10-05 20:11:16 +02:00
svlandeg	fd2d48556c	fix E902 and E903 numbering	2020-10-05 13:43:32 +02:00
Ines Montani	d38dc466c5	Adjust error [ci skip]	2020-10-04 15:26:01 +02:00
Ines Montani	bcd52e5486	Tidy up errors and warnings	2020-10-04 11:16:31 +02:00
Ines Montani	d3b3663942	Adjust error message and add test	2020-10-04 10:11:27 +02:00
Ines Montani	cc08c88a89	Merge pull request #6187 from svlandeg/fix/begin_training_pipe	2020-10-04 10:01:02 +02:00
svlandeg	3f657ed3a1	implement warning in __init_subclass__ instead	2020-10-03 22:34:10 +02:00
Ines Montani	dd542ec6a4	Fix label initialization of textcat component (#6190 )	2020-10-03 17:07:38 +02:00
svlandeg	fb48de349c	bwd compat for pipe.begin_training	2020-10-02 20:31:14 +02:00
Sofie Van Landeghem	09dcb75076	small UX fix for DocBin (#6167 ) * add informative warning when messing up store_user_data DocBin flags * add informative warning when messing up store_user_data DocBin flags * cleanup test * rename to patterns_path	2020-10-02 15:43:32 +02:00
Ines Montani	f0b30aedad	Make lemmatizers use initialize logic (#6182 ) * Make lemmatizer use initialize logic and tidy up * Fix typo * Raise for uninitialized tables	2020-10-02 15:42:36 +02:00
Ines Montani	01c1538c72	Integrate file readers	2020-10-02 01:36:06 +02:00
Adriane Boyd	86c3ec9c2b	Refactor Token morph setting (#6175 ) * Refactor Token morph setting * Remove `Token.morph_` * Add `Token.set_morph()` * `0` resets `token.c.morph` to unset * Any other values are passed to `Morphology.add` * Add token.morph setter to set from MorphAnalysis	2020-10-01 22:21:46 +02:00
Ines Montani	381258b75b	Merge pull request #6165 from explosion/feature/update-tokenizers-initialize	2020-10-01 09:49:47 +02:00
Ines Montani	6f29f68f69	Update errors and make Tokenizer.initialize args less strict	2020-09-30 23:48:47 +02:00
Ines Montani	a103ab5f1a	Update augmenter lookups and docs	2020-09-30 23:03:47 +02:00
Adriane Boyd	6b7bb32834	Refactor Chinese initialization	2020-09-30 11:46:45 +02:00
Ines Montani	1aeef3bfbb	Make corpus paths default to None and improve errors	2020-09-29 22:33:46 +02:00
Ines Montani	78021089f9	Merge pull request #6160 from explosion/feature/prepare	2020-09-29 20:55:13 +02:00
Ines Montani	ff9a63bfbd	begin_training -> initialize	2020-09-28 21:35:09 +02:00
Adriane Boyd	11e195d3ed	Update ChineseTokenizer * Allow `pkuseg_model` to be set to `None` on initialization * Don't save config within tokenizer * Force convert pkuseg_model to use pickle protocol 4 by reencoding with `pickle5` on serialization * Update pkuseg serialization test	2020-09-27 14:00:18 +02:00
Sofie Van Landeghem	009ba14aaf	Fix pretraining in train script (#6143 ) * update pretraining API in train CLI * bump thinc to 8.0.0a35 * bump to 3.0.0a26 * doc fixes * small doc fix	2020-09-25 15:47:10 +02:00
Adriane Boyd	59340606b7	Add option to disable Matcher errors (#6125 ) * Add option to disable Matcher errors * Add option to disable Matcher errors when a doc doesn't contain a particular type of annotation Minor additional change: * Update `AttributeRuler.load_from_morph_rules` to allow direct `MORPH` values * Rename suppress_errors to allow_missing Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com> * Refactor annotation checks in Matcher and PhraseMatcher Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-09-24 16:54:39 +02:00
Sofie Van Landeghem	c7eedd3534	updates to NEL functionality (#6132 ) * NEL: read sentences and ents from reference * fiddling with sent_start annotations * add KB serialization test * KB write additional file with strings.json * score_links function to calculate NEL P/R/F * formatting * documentation	2020-09-24 16:53:59 +02:00
Ines Montani	58dde293ce	Merge pull request #6089 from adrianeboyd/feature/doc-ents-v3-2	2020-09-24 14:44:42 +02:00
Ines Montani	92f8b6959a	Fix typo	2020-09-24 13:48:41 +02:00
Adriane Boyd	5c13e0cf1b	Remove unused error	2020-09-24 13:41:55 +02:00
Ines Montani	be56c0994b	Add [training.before_to_disk] callback	2020-09-24 12:40:25 +02:00
Ines Montani	f69fea8b25	Improve error handling around non-number scores	2020-09-24 11:29:07 +02:00
Ines Montani	4eb39b5c43	Fix logging	2020-09-24 11:04:35 +02:00
svlandeg	25b34bba94	throw custom error when state_type is invalid	2020-09-23 16:57:14 +02:00
Adriane Boyd	b1a7d6c528	Refactor seen token detection	2020-09-22 14:42:51 +02:00
Adriane Boyd	535842e483	Merge branch 'develop' into feature/doc-ents-v3-2	2020-09-22 13:45:50 +02:00
svlandeg	b556a10808	rename converts in_to_out	2020-09-22 11:50:19 +02:00
Ines Montani	49e80dbcac	Merge pull request #6103 from explosion/chore/tidy-up-tests-docs-get-doc	2020-09-22 09:45:04 +02:00
Ines Montani	81606b29bd	Merge pull request #6104 from svlandeg/fix/debug_model [ci skip]	2020-09-22 09:31:23 +02:00
Ines Montani	67fbcb3da5	Tidy up tests and docs	2020-09-21 20:43:54 +02:00
Adriane Boyd	177df15d89	Implement Doc.set_ents	2020-09-21 15:54:05 +02:00
svlandeg	eb9b447960	Merge remote-tracking branch 'upstream/develop' into fix/debug_model # Conflicts: # spacy/cli/debug_model.py	2020-09-21 14:05:16 +02:00
Adriane Boyd	bc02e86494	Extend Doc.__init__ with additional annotation Mostly copying from `spacy.tests.util.get_doc`, add additional kwargs to `Doc.__init__` to initialize the most common doc/token values.	2020-09-21 13:36:24 +02:00
svlandeg	73ff52b9ec	hack for tok2vec listener	2020-09-18 16:43:15 +02:00
Adriane Boyd	a88106e852	Remove W106: HEAD and SENT_START in doc.from_array (#6086 ) * Remove W106: HEAD and SENT_START in doc.from_array This warning was hacky and being triggered too often. * Fix test	2020-09-18 03:01:29 +02:00

1 2 3 4 5 ...

273 Commits