273 Commits

Author SHA1 Message Date
Adriane Boyd 1c4df8fd09
Replace pytokenizations with internal alignment (#6293)
* Replace pytokenizations with internal alignment

Replace pytokenizations with internal alignment algorithm that is
restricted to only allow differences in whitespace and capitalization.

* Rename `spacy.training.align` to `spacy.training.alignment` to contain
the `Alignment` dataclass
* Implement `get_alignments` in `spacy.training.align`

* Refactor trailing whitespace handling

* Remove unnecessary exception for empty docs

Allow a non-empty whitespace-only doc to be aligned with an empty doc

* Remove empty docs exceptions completely
2020-11-03 16:24:38 +01:00
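The alignment described in this commit only tolerates differences in whitespace and capitalization between two tokenizations. The following is a minimal sketch of that idea using only the standard library; it is not the code in `spacy.training.align`, and the names `align` and `_char_owners` are hypothetical.

```python
from typing import List, Tuple


def _char_owners(tokens: List[str]) -> Tuple[str, List[int]]:
    """Join lowercased non-whitespace characters and record each one's token index."""
    chars: List[str] = []
    owners: List[int] = []
    for index, token in enumerate(tokens):
        for char in token:
            if char.isspace():
                continue
            # str.lower() may yield more than one character, so extend per result.
            for lowered in char.lower():
                chars.append(lowered)
                owners.append(index)
    return "".join(chars), owners


def align(tokens_a: List[str], tokens_b: List[str]) -> Tuple[List[List[int]], List[List[int]]]:
    """Map each token in A to the tokens in B that share characters, and vice versa."""
    text_a, owners_a = _char_owners(tokens_a)
    text_b, owners_b = _char_owners(tokens_b)
    if text_a != text_b:
        raise ValueError("Tokenizations differ in more than whitespace and capitalization")
    a2b: List[List[int]] = [[] for _ in tokens_a]
    b2a: List[List[int]] = [[] for _ in tokens_b]
    # Owner indices only ever increase, so checking the last entry avoids duplicates.
    for i, j in zip(owners_a, owners_b):
        if not a2b[i] or a2b[i][-1] != j:
            a2b[i].append(j)
        if not b2a[j] or b2a[j][-1] != i:
            b2a[j].append(i)
    return a2b, b2a
```

Comparing the two texts after stripping whitespace and lowercasing is what enforces the restriction: any other difference raises an error instead of producing a partial alignment.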
Sofie Van Landeghem 75a202ce65
TextCat updates and fixes (#6263)
* small fix in example imports

* throw error when train_corpus or dev_corpus is not a string

* small fix in custom logger example

* limit macro_auc to labels with 2 annotations

* fix typo

* also create parents of output_dir if need be

* update documentation of textcat scores

* refactor TextCatEnsemble

* fix tests for new AUC definition

* bump to 3.0.0a42

* update docs

* rename to spacy.TextCatEnsemble.v2

* spacy.TextCatEnsemble.v1 in legacy

* cleanup

* small fix

* update to 3.0.0rc2

* fix import that got lost in merge

* cursed IDE

* fix two typos
2020-10-18 14:50:41 +02:00
Ines Montani bfa3931c9d
Revert added_strings change (#6236) 2020-10-10 18:55:07 +02:00
Ines Montani 8ac5f22253 Adjust error message 2020-10-09 18:00:16 +02:00
svlandeg 06b9d213fd formatting 2020-10-09 12:19:47 +02:00
svlandeg 2cafba5f50 shorten error message for clarity 2020-10-09 12:17:35 +02:00
svlandeg 18dfb27985 Add custom error when evaluation throws a KeyError 2020-10-09 12:05:33 +02:00
Sofie Van Landeghem d093d6343b
TrainablePipe (#6213)
* rename Pipe to TrainablePipe

* split functionality between Pipe and TrainablePipe

* remove unnecessary methods from certain components

* cleanup

* hasattr(component, "pipe") should be sufficient again

* remove serialization and vocab/cfg from Pipe

* unify _ensure_examples and validate_examples

* small fixes

* hasattr checks for self.cfg and self.vocab

* make is_resizable and is_trainable properties

* serialize strings.json instead of vocab

* fix KB IO + tests

* fix typos

* more typos

* _added_strings as a set

* few more tests specifically for _added_strings field

* bump to 3.0.0a36
2020-10-08 21:33:49 +02:00
Ines Montani be99f1e4de
Remove output dirs before training (#6204)
* Remove output dirs before training

* Re-raise error if cleaning fails
2020-10-05 20:11:16 +02:00
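A rough sketch of the behaviour this commit describes: stale output subdirectories are removed before training, and a failure to remove them is re-raised. The function name and the subdirectory names `model-best` and `model-last` are assumptions made for illustration, not taken from the commit.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Optional

# Assumed names of the subdirectories written during training.
CHECKPOINT_DIRS = ("model-best", "model-last")


def clean_output_dir(output_dir: Optional[Path]) -> None:
    """Delete stale checkpoint subdirectories, re-raising if deletion fails."""
    if output_dir is None or not output_dir.exists():
        return
    for name in CHECKPOINT_DIRS:
        subdir = output_dir / name
        if subdir.exists():
            try:
                shutil.rmtree(subdir)
            except OSError as err:
                # Re-raise with context so the run does not continue on stale data.
                raise OSError(f"Could not remove stale output directory: {subdir}") from err


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp)
        (out / "model-best").mkdir()
        clean_output_dir(out)
        print(sorted(p.name for p in out.iterdir()))
```

Only the known subdirectories are deleted, so other files a user keeps in the output directory are left alone.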
svlandeg fd2d48556c fix E902 and E903 numbering 2020-10-05 13:43:32 +02:00
Ines Montani d38dc466c5 Adjust error [ci skip] 2020-10-04 15:26:01 +02:00
Ines Montani bcd52e5486 Tidy up errors and warnings 2020-10-04 11:16:31 +02:00
Ines Montani d3b3663942 Adjust error message and add test 2020-10-04 10:11:27 +02:00
Ines Montani cc08c88a89
Merge pull request #6187 from svlandeg/fix/begin_training_pipe 2020-10-04 10:01:02 +02:00
svlandeg 3f657ed3a1 implement warning in __init_subclass__ instead 2020-10-03 22:34:10 +02:00
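The technique named in this commit, warning from `__init_subclass__`, can be sketched as follows. This is an illustration under stated assumptions: the base class here is a stand-in, and the warning text and category are hypothetical rather than spaCy's actual ones.

```python
import warnings


class TrainablePipe:
    """Stand-in base class; not the real spaCy implementation."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Runs once when a subclass is defined, so the warning appears at
        # import time instead of on every call to the old method.
        if "begin_training" in cls.__dict__:
            warnings.warn(
                f"{cls.__name__} defines 'begin_training', "
                "which has been renamed to 'initialize'.",
                DeprecationWarning,
                stacklevel=2,
            )

    def initialize(self, get_examples):
        raise NotImplementedError
```

Checking `cls.__dict__` instead of `hasattr` only flags subclasses that define the old method themselves, which matters if the base class keeps its own backwards-compatible `begin_training`.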
Ines Montani dd542ec6a4
Fix label initialization of textcat component (#6190) 2020-10-03 17:07:38 +02:00
svlandeg fb48de349c bwd compat for pipe.begin_training 2020-10-02 20:31:14 +02:00
Sofie Van Landeghem 09dcb75076
small UX fix for DocBin (#6167)
* add informative warning when messing up store_user_data DocBin flags

* cleanup test

* rename to patterns_path
2020-10-02 15:43:32 +02:00
Ines Montani f0b30aedad
Make lemmatizers use initialize logic (#6182)
* Make lemmatizer use initialize logic and tidy up

* Fix typo

* Raise for uninitialized tables
2020-10-02 15:42:36 +02:00
Ines Montani 01c1538c72 Integrate file readers 2020-10-02 01:36:06 +02:00
Adriane Boyd 86c3ec9c2b
Refactor Token morph setting (#6175)
* Refactor Token morph setting

* Remove `Token.morph_`
* Add `Token.set_morph()`
  * `0` resets `token.c.morph` to unset
  * Any other values are passed to `Morphology.add`

* Add token.morph setter to set from MorphAnalysis
2020-10-01 22:21:46 +02:00
Ines Montani 381258b75b
Merge pull request #6165 from explosion/feature/update-tokenizers-initialize 2020-10-01 09:49:47 +02:00
Ines Montani 6f29f68f69 Update errors and make Tokenizer.initialize args less strict 2020-09-30 23:48:47 +02:00
Ines Montani a103ab5f1a Update augmenter lookups and docs 2020-09-30 23:03:47 +02:00
Adriane Boyd 6b7bb32834 Refactor Chinese initialization 2020-09-30 11:46:45 +02:00
Ines Montani 1aeef3bfbb Make corpus paths default to None and improve errors 2020-09-29 22:33:46 +02:00
Ines Montani 78021089f9
Merge pull request #6160 from explosion/feature/prepare 2020-09-29 20:55:13 +02:00
Ines Montani ff9a63bfbd begin_training -> initialize 2020-09-28 21:35:09 +02:00
Adriane Boyd 11e195d3ed Update ChineseTokenizer
* Allow `pkuseg_model` to be set to `None` on initialization
* Don't save config within tokenizer
* Force convert pkuseg_model to use pickle protocol 4 by reencoding with
`pickle5` on serialization
* Update pkuseg serialization test
2020-09-27 14:00:18 +02:00
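Re-encoding a pickled payload to protocol 4 can be sketched with the standard library alone. The commit uses the `pickle5` backport; the version below assumes Python 3.8 or later, where the built-in `pickle` module already reads protocol 5, and the helper name `reencode_pickle` is hypothetical.

```python
import pickle


def reencode_pickle(data: bytes, protocol: int = 4) -> bytes:
    """Load a pickled payload and dump it again with the given protocol."""
    # Only unpickle data from a trusted source: loading can run arbitrary code.
    obj = pickle.loads(data)
    return pickle.dumps(obj, protocol=protocol)
```

Protocol 4 can be read by Python 3.4 and later, so a payload re-encoded this way stays loadable on interpreters that do not support protocol 5.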
Sofie Van Landeghem 009ba14aaf
Fix pretraining in train script (#6143)
* update pretraining API in train CLI

* bump thinc to 8.0.0a35

* bump to 3.0.0a26

* doc fixes

* small doc fix
2020-09-25 15:47:10 +02:00
Adriane Boyd 59340606b7
Add option to disable Matcher errors (#6125)
* Add option to disable Matcher errors

* Add option to disable Matcher errors when a doc doesn't contain a
particular type of annotation

Minor additional change:

* Update `AttributeRuler.load_from_morph_rules` to allow direct `MORPH`
values

* Rename suppress_errors to allow_missing

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>

* Refactor annotation checks in Matcher and PhraseMatcher

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-09-24 16:54:39 +02:00
Sofie Van Landeghem c7eedd3534
updates to NEL functionality (#6132)
* NEL: read sentences and ents from reference

* fiddling with sent_start annotations

* add KB serialization test

* KB write additional file with strings.json

* score_links function to calculate NEL P/R/F

* formatting

* documentation
2020-09-24 16:53:59 +02:00
Ines Montani 58dde293ce
Merge pull request #6089 from adrianeboyd/feature/doc-ents-v3-2 2020-09-24 14:44:42 +02:00
Ines Montani 92f8b6959a Fix typo 2020-09-24 13:48:41 +02:00
Adriane Boyd 5c13e0cf1b Remove unused error 2020-09-24 13:41:55 +02:00
Ines Montani be56c0994b Add [training.before_to_disk] callback 2020-09-24 12:40:25 +02:00
Ines Montani f69fea8b25 Improve error handling around non-number scores 2020-09-24 11:29:07 +02:00
Ines Montani 4eb39b5c43 Fix logging 2020-09-24 11:04:35 +02:00
svlandeg 25b34bba94 throw custom error when state_type is invalid 2020-09-23 16:57:14 +02:00
Adriane Boyd b1a7d6c528 Refactor seen token detection 2020-09-22 14:42:51 +02:00
Adriane Boyd 535842e483
Merge branch 'develop' into feature/doc-ents-v3-2 2020-09-22 13:45:50 +02:00
svlandeg b556a10808 rename converts in_to_out 2020-09-22 11:50:19 +02:00
Ines Montani 49e80dbcac
Merge pull request #6103 from explosion/chore/tidy-up-tests-docs-get-doc 2020-09-22 09:45:04 +02:00
Ines Montani 81606b29bd
Merge pull request #6104 from svlandeg/fix/debug_model [ci skip] 2020-09-22 09:31:23 +02:00
Ines Montani 67fbcb3da5 Tidy up tests and docs 2020-09-21 20:43:54 +02:00
Adriane Boyd 177df15d89 Implement Doc.set_ents 2020-09-21 15:54:05 +02:00
svlandeg eb9b447960 Merge remote-tracking branch 'upstream/develop' into fix/debug_model
# Conflicts:
#	spacy/cli/debug_model.py
2020-09-21 14:05:16 +02:00
Adriane Boyd bc02e86494 Extend Doc.__init__ with additional annotation
Mostly copying from `spacy.tests.util.get_doc`, add additional kwargs to
`Doc.__init__` to initialize the most common doc/token values.
2020-09-21 13:36:24 +02:00
svlandeg 73ff52b9ec hack for tok2vec listener 2020-09-18 16:43:15 +02:00
Adriane Boyd a88106e852
Remove W106: HEAD and SENT_START in doc.from_array (#6086)
* Remove W106: HEAD and SENT_START in doc.from_array

This warning was hacky and being triggered too often.

* Fix test
2020-09-18 03:01:29 +02:00