Commit Graph

1966 Commits

Author SHA1 Message Date
Adriane Boyd 86c3ec9c2b
Refactor Token morph setting (#6175)
* Refactor Token morph setting

* Remove `Token.morph_`
* Add `Token.set_morph()`
  * `0` resets `token.c.morph` to unset
  * Any other values are passed to `Morphology.add`

* Add token.morph setter to set from MorphAnalysis
2020-10-01 22:21:46 +02:00
Ines Montani d48ddd6c9a Remove default initialize lookups 2020-10-01 21:54:33 +02:00
Adriane Boyd 73538782a0
Switch Doc.__init__(ents=) to IOB tags (#6173)
* Switch Doc.__init__(ents=) to IOB tags

* Fix check for "-"

* Allow "" or None as missing IOB tag
2020-10-01 16:22:18 +02:00
Ines Montani 381258b75b
Merge pull request #6165 from explosion/feature/update-tokenizers-initialize 2020-10-01 09:49:47 +02:00
Ines Montani a103ab5f1a Update augmenter lookups and docs 2020-09-30 23:03:47 +02:00
Ines Montani 23c63eefaf Tidy up env vars [ci skip] 2020-09-30 15:15:11 +02:00
Adriane Boyd 6b7bb32834 Refactor Chinese initialization 2020-09-30 11:46:45 +02:00
Ines Montani 34f9c26c62 Add lexeme norm defaults 2020-09-30 10:20:14 +02:00
Ines Montani 1aeef3bfbb Make corpus paths default to None and improve errors 2020-09-29 22:33:46 +02:00
Ines Montani fa47f87924 Tidy up and auto-format 2020-09-29 21:39:28 +02:00
Ines Montani 6467a560e3 WIP: Test updating Chinese tokenizer 2020-09-29 21:10:22 +02:00
Ines Montani 78021089f9
Merge pull request #6160 from explosion/feature/prepare 2020-09-29 20:55:13 +02:00
Ines Montani c3f8c09d7d
Merge pull request #6154 from adrianeboyd/bugfix/chinese-tokenizer-pickle 2020-09-29 20:54:59 +02:00
Ines Montani d3c63b7965 Merge branch 'develop' into feature/prepare 2020-09-29 20:53:05 +02:00
Ines Montani 2be80379ec Fix small issues, resolve_dot_names and debug model 2020-09-29 20:38:35 +02:00
Ines Montani 7851020653 Update tests 2020-09-29 18:14:15 +02:00
Ines Montani f2352eb701 Test with default value 2020-09-29 17:00:40 +02:00
Ines Montani 63d1598137 Simplify config use in Language.initialize 2020-09-29 16:05:48 +02:00
Ines Montani 56f8bc73ef Add more tests 2020-09-29 15:23:34 +02:00
Ines Montani 591038b1a4 Add test 2020-09-29 12:54:52 +02:00
Matthew Honnibal e1fdf2b7c5 Upd tests 2020-09-29 12:05:38 +02:00
Ines Montani ff9a63bfbd begin_training -> initialize 2020-09-28 21:35:09 +02:00
Ines Montani 2e9c9e74af Fix config resolution and interpolation
TODO: auto-interpolate in Thinc if config is dict (i.e. likely subsection)
2020-09-28 15:34:00 +02:00
Ines Montani 822ea4ef61 Refactor CLI 2020-09-28 15:09:59 +02:00
Matthew Honnibal a976da168c
Support data augmentation in Corpus (#6155)
* Support data augmentation in Corpus

* Note initial docs for data augmentation

* Add augmenter to quickstart

* Fix flake8

* Format

* Fix test

* Update spacy/tests/training/test_training.py

* Improve data augmentation arguments

* Update templates

* Move randomization out into caller

* Refactor

* Update spacy/training/augment.py

* Update spacy/tests/training/test_training.py

* Fix augment

* Fix test
2020-09-28 03:03:27 +02:00
Ines Montani 9016d23cc5 Fix exclude and add test 2020-09-27 23:34:03 +02:00
Ines Montani 7e938ed63e Update config resolution to use new Thinc 2020-09-27 22:21:31 +02:00
Adriane Boyd 8393dbedad Minor fixes
* Put `cfg` back in serialization
* Add `pickle5` to pytest conf
2020-09-27 15:15:53 +02:00
Adriane Boyd 11e195d3ed Update ChineseTokenizer
* Allow `pkuseg_model` to be set to `None` on initialization
* Don't save config within tokenizer
* Force convert pkuseg_model to use pickle protocol 4 by reencoding with
`pickle5` on serialization
* Update pkuseg serialization test
2020-09-27 14:00:18 +02:00
Ines Montani ca3c997062 Improve CLI config validation with latest Thinc 2020-09-26 13:13:57 +02:00
Adriane Boyd 3c062b3911
Add MORPH handling to Matcher (#6107)
* Add MORPH handling to Matcher

* Add `MORPH` to `Matcher` schema
* Rename `_SetMemberPredicate` to `_SetPredicate`
* Add `ISSUBSET` and `ISSUPERSET` operators to `_SetPredicate`
  * Add special handling for normalization and conversion of morph
    values into sets
  * For other attrs, `ISSUBSET` acts like `IN` and `ISSUPERSET` only
    matches for 0 or 1 values

* Update test

* Rename to IS_SUBSET and IS_SUPERSET
2020-09-24 16:55:09 +02:00
Adriane Boyd 59340606b7
Add option to disable Matcher errors (#6125)
* Add option to disable Matcher errors

* Add option to disable Matcher errors when a doc doesn't contain a
particular type of annotation

Minor additional change:

* Update `AttributeRuler.load_from_morph_rules` to allow direct `MORPH`
values

* Rename suppress_errors to allow_missing

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>

* Refactor annotation checks in Matcher and PhraseMatcher

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-09-24 16:54:39 +02:00
Sofie Van Landeghem c7eedd3534
updates to NEL functionality (#6132)
* NEL: read sentences and ents from reference

* fiddling with sent_start annotations

* add KB serialization test

* KB write additional file with strings.json

* score_links function to calculate NEL P/R/F

* formatting

* documentation
2020-09-24 16:53:59 +02:00
Ines Montani d0ef4a4cf5 Prevent division by zero in score weights 2020-09-24 16:42:13 +02:00
Ines Montani 58dde293ce
Merge pull request #6089 from adrianeboyd/feature/doc-ents-v3-2 2020-09-24 14:44:42 +02:00
Adriane Boyd 8eaacaae97 Refactor Doc.ents setter to use Doc.set_ents
Additional changes:

* Entity spans with missing labels are ignored
* Fix ent_kb_id setting in `Doc.set_ents`
2020-09-24 12:36:51 +02:00
Ines Montani c6c67b606e
Merge pull request #6133 from explosion/fix/score_weights 2020-09-24 12:00:57 +02:00
Ines Montani 4bbe41f017 Fix combined scores and update test 2020-09-24 10:42:47 +02:00
Sofie Van Landeghem c645c4e7ce
fix micro PRF for textcat (#6130)
* fix micro PRF for textcat

* small fix
2020-09-24 10:31:17 +02:00
Ines Montani ae51f580c1 Fix handling of score_weights 2020-09-24 10:27:33 +02:00
svlandeg b816ace4bb format 2020-09-23 17:33:13 +02:00
svlandeg 5a9fdbc8ad state_type as Literal 2020-09-23 17:32:14 +02:00
svlandeg dd2292793f 'parser' instead of 'deps' for state_type 2020-09-23 16:53:49 +02:00
svlandeg 6c85fab316 state_type and extra_state_tokens instead of nr_feature_tokens 2020-09-23 13:35:09 +02:00
Ines Montani 60a317520a
Merge pull request #6109 from svlandeg/feature/2rename 2020-09-23 09:47:12 +02:00
Sofie Van Landeghem 86a08f819d
tok2vec.update instead of predict (#6113) 2020-09-22 21:54:52 +02:00
Sofie Van Landeghem e0e793be4d
fix KB IO (#6118) 2020-09-22 21:53:06 +02:00
Sofie Van Landeghem d53c84b6d6
avoid None callback (#6100) 2020-09-22 13:54:44 +02:00
Adriane Boyd 535842e483
Merge branch 'develop' into feature/doc-ents-v3-2 2020-09-22 13:45:50 +02:00
Ines Montani 5e3b796b12 Validate section refs in debug config 2020-09-22 12:24:39 +02:00