Commit Graph

530 Commits

Author SHA1 Message Date
Matthew Honnibal 8ce9f44433 Merge branch 'feature/prepare' of https://github.com/explosion/spaCy into feature/prepare 2020-09-29 16:57:38 +02:00
Matthew Honnibal ca72608059 Fix language 2020-09-29 16:48:33 +02:00
Ines Montani fd594cfb9b Tighten up format 2020-09-29 16:47:55 +02:00
Ines Montani 63d1598137 Simplify config use in Language.initialize 2020-09-29 16:05:48 +02:00
Ines Montani adca08a12f Pass nlp forward 2020-09-29 12:21:52 +02:00
Ines Montani 42f0e4c946 Clean up 2020-09-29 12:14:08 +02:00
Matthew Honnibal 9c8b2524fe Upd initialize args 2020-09-29 12:08:37 +02:00
Matthew Honnibal f2d1b7feb5 Clean up sgd 2020-09-29 12:00:08 +02:00
Ines Montani 78396d137f Integrate initialize settings 2020-09-29 11:57:08 +02:00
Ines Montani dec984a9c1 Update Language.initialize and support components/tokenizer settings 2020-09-29 11:52:45 +02:00
Matthew Honnibal 5276db6f3f Remove 'device' argument from Language, clean up 'sgd' arg 2020-09-29 11:42:19 +02:00
Ines Montani ff9a63bfbd begin_training -> initialize 2020-09-28 21:35:09 +02:00
Ines Montani 658fad428a Fix base schema integration 2020-09-27 22:50:36 +02:00
Ines Montani 7e938ed63e Update config resolution to use new Thinc 2020-09-27 22:21:31 +02:00
Ines Montani 4bbe41f017 Fix combined scores and update test 2020-09-24 10:42:47 +02:00
Ines Montani ae51f580c1 Fix handling of score_weights 2020-09-24 10:27:33 +02:00
Ines Montani 1114219ae3 Tidy up and auto-format 2020-09-21 10:59:07 +02:00
Adriane Boyd 47080fba98 Minor renaming / refactoring
* Rename loader to `spacy.LookupsDataLoader.v1`, add debugging message
* Make `Vocab.lookups` a property
2020-09-18 19:43:19 +02:00
Adriane Boyd eed4b785f5 Load vocab lookups tables at beginning of training
Similar to how vectors are handled, move the vocab lookups to be loaded
at the start of training rather than when the vocab is initialized,
since the vocab doesn't have access to the full config when it's
created.

The option moves from `nlp.load_vocab_data` to `training.lookups`.

Typically these tables will come from `spacy-lookups-data`, but any
`Lookups` object can be provided.

The loading from `spacy-lookups-data` is now strict, so configs for each
language should specify the exact tables required. This also makes it
easier to control whether the larger clusters and probs tables are
included.

To load `lexeme_norm` from `spacy-lookups-data`:

```
[training.lookups]
@misc = "spacy.LoadLookupsData.v1"
lang = ${nlp.lang}
tables = ["lexeme_norm"]
```
2020-09-18 15:59:16 +02:00
Matthew Honnibal c776594ab1 Fix 2020-09-16 18:15:14 +02:00
Matthew Honnibal 4a573d18b3 Add comment 2020-09-16 17:51:29 +02:00
Matthew Honnibal d31afc8334 Fix Language.link_components when model is None 2020-09-16 17:49:48 +02:00
Ines Montani aaf01689a1 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2020-09-15 14:24:42 +02:00
Ines Montani 91a6637f74 Remove extra pipe config values before merging 2020-09-15 14:24:17 +02:00
Ines Montani d3d7f92f05 Fix lang check and error handling in Language.from_config 2020-09-15 14:24:06 +02:00
Ines Montani 253ba5ef14 Raise for bad Vocab values 2020-09-15 13:25:34 +02:00
Ines Montani 7dfc4bc062 Allow overriding meta from spacy.blank 2020-09-15 11:12:12 +02:00
Matthew Honnibal b693d2d224 Fix speed report in table 2020-09-13 17:39:31 +02:00
Ines Montani febb99916d Tidy up and auto-format [ci skip] 2020-09-13 10:55:36 +02:00
Sofie Van Landeghem e92e850c72
Raise if empty examples (#6052)
* raise error if no valid Example objects were found during initialization

* fix max_length parameter

* remove commit from other branch

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-09-12 21:01:53 +02:00
svlandeg 711166a75a prevent overwriting score_weights 2020-09-11 15:12:05 +02:00
Sofie Van Landeghem 8e7557656f
Renaming gold & annotation_setter (#6042)
* version bump to 3.0.0a16

* rename "gold" folder to "training"

* rename 'annotation_setter' to 'set_extra_annotations'

* formatting
2020-09-09 10:31:03 +02:00
Sofie Van Landeghem 60f22e1800
Pipe API (#6034)
* ensure Language passes on valid examples for initialization

* fix tagger model initialization

* check for valid get_examples across components

* assume labels were added before begin_training

* fix senter initialization

* fix morphologizer initialization

* use methods to check arguments

* test textcat init, requires thinc>=8.0.0a31

* fix tok2vec init

* fix entity linker init

* use islice

* fix simple NER

* cleanup debug model

* fix assert statements

* fix tests

* throw error when adding a label if the output layer can't be resized anymore

* fix test

* add failing test for simple_ner

* UX improvements

* morphologizer UX

* assume begin_training gets a representative set and processes the labels

* remove assumptions for output of untrained NER model

* restore test for original purpose
2020-09-08 22:44:25 +02:00
Ines Montani f06eed800e
Merge pull request #6029 from explosion/master-tmp 2020-09-04 15:11:55 +02:00
Ines Montani f9550b4493 Fix components in meta.json and website [ci skip] 2020-09-04 14:42:12 +02:00
Ines Montani 90043a6f9b Tidy up and auto-format 2020-09-04 13:42:33 +02:00
Ines Montani ba600f91c5 Tidy up imports 2020-09-04 13:15:44 +02:00
Ines Montani ab1bb421ed Update docs links in codebase 2020-09-04 12:58:50 +02:00
Ines Montani 896caf45e3
Merge pull request #6023 from explosion/ux/model-terminology-consistency [ci skip] 2020-09-03 17:13:44 +02:00
Ines Montani b5a0657fd6 "model" terminology consistency in docs 2020-09-03 13:13:03 +02:00
Matthew Honnibal ef0d0630a4 Let Langugae.use_params work with falsey inputs
The Language.use_params method was failing if you passed in None, which
meant we had to use awkward conditionals for the parameter averaging.
This solves the problem.
2020-09-03 12:51:04 +02:00
Matthew Honnibal 046c38bd26
Remove 'cleanup' of strings (#6007)
A long time ago we went to some trouble to try to clean up "unused"
strings, to avoid the `StringStore` growing in long-running processes.

This never really worked reliably, and I think it was a really wrong
approach. It's much better to let the user reload the `nlp` object as
necessary, now that the string encoding is stable (in v1, the string IDs
were sequential integers, making reloading the NLP object really
annoying.)

The extra book-keeping does make some performance difference, and the
feature is unsed, so it's past time we killed it.
2020-09-01 16:12:15 +02:00
Ines Montani 45f46a5c85
Merge pull request #5993 from explosion/feature/disabled-components 2020-08-29 15:58:41 +02:00
Ines Montani 34146750d4 Use frozen list with custom errors
We don't want to break backwards compatibility too much but we also want to provide the best possible UX
2020-08-29 15:20:11 +02:00
Ines Montani 6520d1a1df Work around set order in Language.disabled 2020-08-29 12:58:22 +02:00
Ines Montani e0b4984aa4 Make deprecated disable_pipes call into select_pipes 2020-08-29 12:08:46 +02:00
Ines Montani 15d73f4dc3 Make user-facing Language.disabled return list
More consistent with all the other properties
2020-08-29 12:08:33 +02:00
Ines Montani 0687d7148e Rename user-facing API 2020-08-28 21:04:02 +02:00
Ines Montani 6a999c9303 Remove outdated component attr check 2020-08-28 20:59:19 +02:00
Ines Montani 10da74382f Raise if disabled components are removed before DisabledPipes.restore 2020-08-28 20:35:26 +02:00