
10 Commits

Author SHA1 Message Date
Sofie Van Landeghem eaeca5eb6a
account for NER labels with a hyphen in the name (#10960)
* account for NER labels with a hyphen in the name

* cleanup

* fix docstring

* add return type to helper method

* shorter method and few more occurrences

* use helper method across repo

* fix circular import

* partial revert to avoid circular import
2022-06-17 20:02:37 +01:00
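
The hyphen problem this commit addresses can be illustrated with a small sketch. It assumes tags of the form `PREFIX-LABEL` (for example `B-PERSON`), and the function name `split_tag` is hypothetical, not necessarily the helper the commit adds. Splitting on every hyphen breaks a label such as `my-label`, so the sketch splits on the first hyphen only.

```python
from typing import Tuple


def split_tag(tag: str) -> Tuple[str, str]:
    """Split a tag like "B-my-label" into (prefix, label).

    Hypothetical helper for illustration. Only the first hyphen
    separates the prefix from the label, so hyphens inside the
    label are preserved.
    """
    prefix, _, label = tag.partition("-")
    return prefix, label


# A naive split on every hyphen loses the label boundary:
naive = "B-my-label".split("-")   # three pieces instead of two

# Splitting on the first hyphen keeps the label intact:
prefix, label = split_tag("B-my-label")
```

A tag without a hyphen, such as `O`, yields an empty label in this sketch.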
Adriane Boyd d98d525bc8 Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.1-3 2021-10-14 09:41:46 +02:00
Lj Miranda 6425b9a1c4
Include JsonlCorpus from the imports (#9431) 2021-10-12 15:39:14 +02:00
Elia Robyn Lake (Robyn Speer) 5b0b0ca809
Move WandB loggers into spacy-loggers (#9223)
* factor out the WandB logger into spacy-loggers

Signed-off-by: Elia Robyn Speer <gh@arborelia.net>

* depend on spacy-loggers so they are available

Signed-off-by: Elia Robyn Speer <gh@arborelia.net>

* remove docs of spacy.WandbLogger.v2 (moved to spacy-loggers)

Signed-off-by: Elia Robyn Speer <elia@explosion.ai>

* Version number suggestions from code review

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* update references to WandbLogger

Signed-off-by: Elia Robyn Speer <elia@explosion.ai>

* make order of deps more consistent

Signed-off-by: Elia Robyn Speer <elia@explosion.ai>

Co-authored-by: Elia Robyn Speer <elia@explosion.ai>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-09-29 11:12:50 +02:00
Adriane Boyd bdb485cc80
Add callback to copy vocab/tokenizer from model (#7750)
* Add callback to copy vocab/tokenizer from model

Add callback `spacy.copy_from_base_model.v1` to copy the tokenizer
settings and/or vocab (including vectors) from a base model.

* Move spacy.copy_from_base_model.v1 to spacy.training.callbacks

* Add documentation

* Modify to specify model as tokenizer and vocab params
2021-04-22 12:36:50 +02:00
Adriane Boyd 1c4df8fd09
Replace pytokenizations with internal alignment (#6293)
* Replace pytokenizations with internal alignment

Replace pytokenizations with internal alignment algorithm that is
restricted to only allow differences in whitespace and capitalization.

* Rename `spacy.training.align` to `spacy.training.alignment` to contain
the `Alignment` dataclass
* Implement `get_alignments` in `spacy.training.align`

* Refactor trailing whitespace handling

* Remove unnecessary exception for empty docs

Allow a non-empty whitespace-only doc to be aligned with an empty doc

* Remove empty docs exceptions completely
2020-11-03 16:24:38 +01:00
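
As a rough illustration of an alignment that tolerates only whitespace and capitalization differences, here is a minimal sketch. It is not the implementation from this commit: the names `align`, `_normalize`, and `_char_to_token` are hypothetical, and the approach (mapping each non-whitespace character to its token index, then pairing the two maps) is just one simple way to get the same kind of result.

```python
from typing import List


def _normalize(tokens: List[str]) -> str:
    # Drop all whitespace and lowercase, so that only whitespace and
    # capitalization differences are ignored.
    return "".join("".join(tokens).split()).lower()


def _char_to_token(tokens: List[str]) -> List[int]:
    # For every non-whitespace character, record the index of its token.
    return [i for i, tok in enumerate(tokens) for ch in tok if not ch.isspace()]


def align(tokens_a: List[str], tokens_b: List[str]) -> List[List[int]]:
    """For each token in tokens_a, list the overlapping token indices in tokens_b."""
    a_map = _char_to_token(tokens_a)
    b_map = _char_to_token(tokens_b)
    if _normalize(tokens_a) != _normalize(tokens_b) or len(a_map) != len(b_map):
        raise ValueError("texts differ by more than whitespace and capitalization")
    a2b: List[List[int]] = [[] for _ in tokens_a]
    for ia, ib in zip(a_map, b_map):
        # Characters at the same position belong to overlapping tokens.
        if not a2b[ia] or a2b[ia][-1] != ib:
            a2b[ia].append(ib)
    return a2b


a2b = align(["New", "York", "-", "based"], ["new york", "-based"])
```

Here the first two tokens of the first tokenization map to the first token of the second, and the last two map to its second token. Any other textual difference raises a `ValueError` in this sketch.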
Sofie Van Landeghem d093d6343b
TrainablePipe (#6213)
* rename Pipe to TrainablePipe

* split functionality between Pipe and TrainablePipe

* remove unnecessary methods from certain components

* cleanup

* hasattr(component, "pipe") should be sufficient again

* remove serialization and vocab/cfg from Pipe

* unify _ensure_examples and validate_examples

* small fixes

* hasattr checks for self.cfg and self.vocab

* make is_resizable and is_trainable properties

* serialize strings.json instead of vocab

* fix KB IO + tests

* fix typos

* more typos

* _added_strings as a set

* few more tests specifically for _added_strings field

* bump to 3.0.0a36
2020-10-08 21:33:49 +02:00
Matthew Honnibal a976da168c
Support data augmentation in Corpus (#6155)
* Support data augmentation in Corpus

* Note initial docs for data augmentation

* Add augmenter to quickstart

* Fix flake8

* Format

* Fix test

* Update spacy/tests/training/test_training.py

* Improve data augmentation arguments

* Update templates

* Move randomization out into caller

* Refactor

* Update spacy/training/augment.py

* Update spacy/tests/training/test_training.py

* Fix augment

* Fix test
2020-09-28 03:03:27 +02:00
svlandeg b556a10808 rename converts in_to_out 2020-09-22 11:50:19 +02:00
Sofie Van Landeghem 8e7557656f
Renaming gold & annotation_setter (#6042)
* version bump to 3.0.0a16

* rename "gold" folder to "training"

* rename 'annotation_setter' to 'set_extra_annotations'

* formatting
2020-09-09 10:31:03 +02:00