Commit Graph

27 Commits

Author SHA1 Message Date
Adriane Boyd 4448680750
Fix alignment for 1-to-1 tokens and lowercasing (#6476)
* When checking for token alignments, check not only that the tokens are
identical but that the character positions are both at the start of a
token.

  It's possible for the tokens to be identical even though the two
tokens aren't aligned one-to-one in a case like `["a'", "''"]` vs.
`["a", "''", "'"]`, where the middle tokens are identical but should not
be aligned on the token level at character position 2 since it's the
start of one token but the middle of another.

* Use the lowercased version of the token texts to create the
character-to-token alignment because lowercasing can change the string
length (e.g., for `İ`, see the not-a-bug bug report:
https://bugs.python.org/issue34723)
2020-12-08 14:25:16 +08:00
Sofie Van Landeghem d6c616a125
Fixes in test suite (#6457)
* fix slow test for textcat readers

* cleanup test_issue5551

* add explicit score weight

* cleanup
2020-12-02 12:57:08 +01:00
Adriane Boyd 1c4df8fd09
Replace pytokenizations with internal alignment (#6293)
* Replace pytokenizations with internal alignment

Replace pytokenizations with internal alignment algorithm that is
restricted to only allow differences in whitespace and capitalization.

* Rename `spacy.training.align` to `spacy.training.alignment` to contain
the `Alignment` dataclass
* Implement `get_alignments` in `spacy.training.align`

* Refactor trailing whitespace handling

* Remove unnecessary exception for empty docs

Allow a non-empty whitespace-only doc to be aligned with an empty doc

* Remove empty docs exceptions completely
2020-11-03 16:24:38 +01:00
Sofie Van Landeghem 75a202ce65
TextCat updates and fixes (#6263)
* small fix in example imports

* throw error when train_corpus or dev_corpus is not a string

* small fix in custom logger example

* limit macro_auc to labels with 2 annotations

* fix typo

* also create parents of output_dir if need be

* update documentation of textcat scores

* refactor TextCatEnsemble

* fix tests for new AUC definition

* bump to 3.0.0a42

* update docs

* rename to spacy.TextCatEnsemble.v2

* spacy.TextCatEnsemble.v1 in legacy

* cleanup

* small fix

* update to 3.0.0rc2

* fix import that got lost in merge

* cursed IDE

* fix two typos
2020-10-18 14:50:41 +02:00
Ines Montani 3c36a57e84
Update data augmenters (#6196)
* Draft lower-case augmenter

* Make warning a debug log

* Update lowercase augmenter, docs and tests

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-10-04 17:46:29 +02:00
Ines Montani c41a4332e4 Add test for custom data augmentation 2020-10-02 11:37:56 +02:00
Ines Montani 01c1538c72 Integrate file readers 2020-10-02 01:36:06 +02:00
Adriane Boyd 86c3ec9c2b
Refactor Token morph setting (#6175)
* Refactor Token morph setting

* Remove `Token.morph_`
* Add `Token.set_morph()`
  * `0` resets `token.c.morph` to unset
  * Any other values are passed to `Morphology.add`

* Add token.morph setter to set from MorphAnalysis
2020-10-01 22:21:46 +02:00
Adriane Boyd 73538782a0
Switch Doc.__init__(ents=) to IOB tags (#6173)
* Switch Doc.__init__(ents=) to IOB tags

* Fix check for "-"

* Allow "" or None as missing IOB tag
2020-10-01 16:22:18 +02:00
Ines Montani a103ab5f1a Update augmenter lookups and docs 2020-09-30 23:03:47 +02:00
Ines Montani fa47f87924 Tidy up and auto-format 2020-09-29 21:39:28 +02:00
Ines Montani d3c63b7965 Merge branch 'develop' into feature/prepare 2020-09-29 20:53:05 +02:00
Ines Montani 2be80379ec Fix small issues, resolve_dot_names and debug model 2020-09-29 20:38:35 +02:00
Ines Montani ff9a63bfbd begin_training -> initialize 2020-09-28 21:35:09 +02:00
Ines Montani 2e9c9e74af Fix config resolution and interpolation
TODO: auto-interpolate in Thinc if config is dict (i.e. likely subsection)
2020-09-28 15:34:00 +02:00
Ines Montani 822ea4ef61 Refactor CLI 2020-09-28 15:09:59 +02:00
Matthew Honnibal a976da168c
Support data augmentation in Corpus (#6155)
* Support data augmentation in Corpus

* Note initial docs for data augmentation

* Add augmenter to quickstart

* Fix flake8

* Format

* Fix test

* Update spacy/tests/training/test_training.py

* Improve data augmentation arguments

* Update templates

* Move randomization out into caller

* Refactor

* Update spacy/training/augment.py

* Update spacy/tests/training/test_training.py

* Fix augment

* Fix test
2020-09-28 03:03:27 +02:00
Ines Montani 7e938ed63e Update config resolution to use new Thinc 2020-09-27 22:21:31 +02:00
Sofie Van Landeghem c7eedd3534
updates to NEL functionality (#6132)
* NEL: read sentences and ents from reference

* fiddling with sent_start annotations

* add KB serialization test

* KB write additional file with strings.json

* score_links function to calculate NEL P/R/F

* formatting

* documentation
2020-09-24 16:53:59 +02:00
svlandeg b556a10808 rename converts in_to_out 2020-09-22 11:50:19 +02:00
Ines Montani 67fbcb3da5 Tidy up tests and docs 2020-09-21 20:43:54 +02:00
Ines Montani 1114219ae3 Tidy up and auto-format 2020-09-21 10:59:07 +02:00
svlandeg 427dbecdd6 cleanup and formatting 2020-09-17 11:48:04 +02:00
svlandeg 0c35885751 generalize corpora, dot notation for dev and train corpus 2020-09-17 11:38:59 +02:00
svlandeg 781fae678b Merge remote-tracking branch 'upstream/develop' into fix/corpus 2020-09-17 09:24:36 +02:00
svlandeg 1040e250d8 actual commit with test for custom readers with ml_datasets >= 0.2 2020-09-16 16:41:28 +02:00
svlandeg bd87e8686e move tests to correct subdir 2020-09-15 21:40:38 +02:00