Commit Graph

435 Commits

Author SHA1 Message Date
Ines Montani d12be459f6 Raise RegistryError 2021-01-16 12:57:13 +11:00
Ines Montani a203e3dbb8 Support spacy-legacy via the registry 2021-01-15 21:42:40 +11:00
Ines Montani 991669c934 Tidy up and auto-format 2021-01-05 13:41:53 +11:00
Thomas Bird cbb8c66da3 prevent the root logger from inialising 2020-12-15 19:50:34 +00:00
Ines Montani 1980203229 Merge branch 'master' into pr/6444 2020-12-09 11:09:40 +11:00
Koichi Yasuoka 0afb54ac93
JapaneseTokenizer.pipe added (#6515)
* JapaneseTokenizer.pipe added

For [spacymoji](https://spacy.io/universe/project/spacymoji)  with `Japanese()`.

* DummyTokenizer.pipe added instead
2020-12-08 20:02:23 +01:00
Ines Montani d25b1606d6 Allow reading config from sdtin in spacy train 2020-12-08 18:01:40 +11:00
svlandeg 1f465bea18 if-else 2020-10-13 09:27:19 +02:00
Ines Montani bfa3931c9d
Revert added_strings change (#6236) 2020-10-10 18:55:07 +02:00
svlandeg 040c7c0541 fix get_dim calls in build_simple_cnn_text_classifier 2020-10-09 15:40:58 +02:00
Florijan Stamenković 18f5c309dc Fix Issue 6207 (#6208)
* Regression test for issue 6207

* Fix issue 6207

* Sign contributor agreement

* Minor adjustments to test

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2020-10-09 10:14:40 +02:00
Sofie Van Landeghem d093d6343b
TrainablePipe (#6213)
* rename Pipe to TrainablePipe

* split functionality between Pipe and TrainablePipe

* remove unnecessary methods from certain components

* cleanup

* hasattr(component, "pipe") should be sufficient again

* remove serialization and vocab/cfg from Pipe

* unify _ensure_examples and validate_examples

* small fixes

* hasattr checks for self.cfg and self.vocab

* make is_resizable and is_trainable properties

* serialize strings.json instead of vocab

* fix KB IO + tests

* fix typos

* more typos

* _added_strings as a set

* few more tests specifically for _added_strings field

* bump to 3.0.0a36
2020-10-08 21:33:49 +02:00
Florijan Stamenković 9db670b996
Fix Issue 6207 (#6208)
* Regression test for issue 6207

* Fix issue 6207

* Sign contributor agreement

* Minor adjustments to test

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2020-10-06 11:17:37 +02:00
Ines Montani 0135f6ed95 Enable commit check via env var 2020-10-05 20:51:15 +02:00
Ines Montani 6958510bda Include spaCy version check in project CLI 2020-10-05 13:53:07 +02:00
Ines Montani f758804401 Save one line of code 2020-10-03 11:41:28 +02:00
svlandeg 02247cccaf Merge remote-tracking branch 'upstream/develop' into feature/small-fixes 2020-10-02 20:48:11 +02:00
svlandeg acc391c2a8 remove redundant str() call 2020-10-02 11:05:59 +02:00
Ines Montani 01c1538c72 Integrate file readers 2020-10-02 01:36:06 +02:00
Ines Montani 23c63eefaf Tidy up env vars [ci skip] 2020-09-30 15:15:11 +02:00
Ines Montani a5debb356d Tidy up and adjust logging [ci skip] 2020-09-30 01:22:08 +02:00
Ines Montani da30bae8a6 Use __pyx_vtable__ instead of __reduce_cython__ 2020-09-29 22:04:17 +02:00
Ines Montani fa47f87924 Tidy up and auto-format 2020-09-29 21:39:28 +02:00
Ines Montani d3c63b7965 Merge branch 'develop' into feature/prepare 2020-09-29 20:53:05 +02:00
Ines Montani 2be80379ec Fix small issues, resolve_dot_names and debug model 2020-09-29 20:38:35 +02:00
Ines Montani dba26186ef Handle None default args in Cython methods 2020-09-29 18:08:02 +02:00
Ines Montani 9353a82076 Auto-format 2020-09-29 18:07:48 +02:00
Matthew Honnibal 4ad26f4a2f Move reader 2020-09-29 16:54:53 +02:00
Ines Montani 2e9c9e74af Fix config resolution and interpolation
TODO: auto-interpolate in Thinc if config is dict (i.e. likely subsection)
2020-09-28 15:34:00 +02:00
Ines Montani 02838a1d47 Fix resolve_dot_names 2020-09-28 15:27:10 +02:00
Ines Montani 822ea4ef61 Refactor CLI 2020-09-28 15:09:59 +02:00
Ines Montani e44a7519cd Update CLI and add [initialize] block 2020-09-28 11:56:14 +02:00
Ines Montani d5155376fd Update vocab init 2020-09-28 11:30:18 +02:00
Matthew Honnibal 65448b2e34 Remove schema=None until Optional 2020-09-28 03:42:58 +02:00
Matthew Honnibal a023cf3ecc Add (untested) resolve_dot_names util 2020-09-28 03:06:12 +02:00
Matthew Honnibal a976da168c
Support data augmentation in Corpus (#6155)
* Support data augmentation in Corpus

* Note initial docs for data augmentation

* Add augmenter to quickstart

* Fix flake8

* Format

* Fix test

* Update spacy/tests/training/test_training.py

* Improve data augmentation arguments

* Update templates

* Move randomization out into caller

* Refactor

* Update spacy/training/augment.py

* Update spacy/tests/training/test_training.py

* Fix augment

* Fix test
2020-09-28 03:03:27 +02:00
Ines Montani 9016d23cc5 Fix exclude and add test 2020-09-27 23:34:03 +02:00
Ines Montani 7e938ed63e Update config resolution to use new Thinc 2020-09-27 22:21:31 +02:00
Ines Montani 26e28ed413 Fix combined scores if multiple components report it 2020-09-24 17:11:13 +02:00
Ines Montani d0ef4a4cf5 Prevent division by zero in score weights 2020-09-24 16:42:13 +02:00
Ines Montani 4bbe41f017 Fix combined scores and update test 2020-09-24 10:42:47 +02:00
Ines Montani ae51f580c1 Fix handling of score_weights 2020-09-24 10:27:33 +02:00
Ines Montani f25f05c503 Adjust sort order [ci skip] 2020-09-23 20:03:04 +02:00
Matthew Honnibal 8fb59d958c Format 2020-09-20 16:31:48 +02:00
Matthew Honnibal 889128e5c5 Improve error handling in run_command 2020-09-20 16:20:57 +02:00
Adriane Boyd 47080fba98 Minor renaming / refactoring
* Rename loader to `spacy.LookupsDataLoader.v1`, add debugging message
* Make `Vocab.lookups` a property
2020-09-18 19:43:19 +02:00
Adriane Boyd eed4b785f5 Load vocab lookups tables at beginning of training
Similar to how vectors are handled, move the vocab lookups to be loaded
at the start of training rather than when the vocab is initialized,
since the vocab doesn't have access to the full config when it's
created.

The option moves from `nlp.load_vocab_data` to `training.lookups`.

Typically these tables will come from `spacy-lookups-data`, but any
`Lookups` object can be provided.

The loading from `spacy-lookups-data` is now strict, so configs for each
language should specify the exact tables required. This also makes it
easier to control whether the larger clusters and probs tables are
included.

To load `lexeme_norm` from `spacy-lookups-data`:

```
[training.lookups]
@misc = "spacy.LoadLookupsData.v1"
lang = ${nlp.lang}
tables = ["lexeme_norm"]
```
2020-09-18 15:59:16 +02:00
Ines Montani c052017025 Fix sparse checkout and error handling 2020-09-14 14:12:58 +02:00
Ines Montani 416deb412f Prevent duplicate traceback on CalledProcessError [ci skip] 2020-09-13 19:28:54 +02:00
Ines Montani f8846c198d Update types and docstrings 2020-09-13 10:52:02 +02:00