392 Commits

Author SHA1 Message Date
Matthew Honnibal 8fb59d958c Format 2020-09-20 16:31:48 +02:00
Matthew Honnibal 889128e5c5 Improve error handling in run_command 2020-09-20 16:20:57 +02:00
Adriane Boyd 47080fba98 Minor renaming / refactoring
* Rename loader to `spacy.LookupsDataLoader.v1`, add debugging message
* Make `Vocab.lookups` a property
2020-09-18 19:43:19 +02:00
Adriane Boyd eed4b785f5 Load vocab lookups tables at beginning of training
Similar to how vectors are handled, move the vocab lookups to be loaded
at the start of training rather than when the vocab is initialized,
since the vocab doesn't have access to the full config when it's
created.

The option moves from `nlp.load_vocab_data` to `training.lookups`.

Typically these tables will come from `spacy-lookups-data`, but any
`Lookups` object can be provided.

The loading from `spacy-lookups-data` is now strict, so configs for each
language should specify the exact tables required. This also makes it
easier to control whether the larger clusters and probs tables are
included.

To load `lexeme_norm` from `spacy-lookups-data`:

```
[training.lookups]
@misc = "spacy.LoadLookupsData.v1"
lang = ${nlp.lang}
tables = ["lexeme_norm"]
```
2020-09-18 15:59:16 +02:00
Ines Montani c052017025 Fix sparse checkout and error handling 2020-09-14 14:12:58 +02:00
Ines Montani 416deb412f Prevent duplicate traceback on CalledProcessError [ci skip] 2020-09-13 19:28:54 +02:00
Ines Montani f8846c198d Update types and docstrings 2020-09-13 10:52:02 +02:00
Ines Montani 3e83a509bb WIP: fix project clone compatibility 2020-09-10 15:49:13 +02:00
Matthew Honnibal b470062153 Add CLI registry (#6037) 2020-09-08 15:23:34 +02:00
Ines Montani 5afe6447cd registry.assets -> registry.misc 2020-09-03 17:31:14 +02:00
Ines Montani 45f46a5c85 Merge pull request #5993 from explosion/feature/disabled-components 2020-08-29 15:58:41 +02:00
Ines Montani 34146750d4 Use frozen list with custom errors
We don't want to break backwards compatibility too much but we also want to provide the best possible UX
2020-08-29 15:20:11 +02:00
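
The "frozen list with custom errors" idea from the commit above can be illustrated with a minimal sketch. The class name `FrozenList` and the error text are hypothetical and are not taken from the commit; the sketch only shows how a list can stay readable while every mutating method raises a descriptive error.

```python
class FrozenList(list):
    """Hypothetical read-only list that raises a custom error on mutation."""

    def __init__(self, *args, error="This list is read-only."):
        super().__init__(*args)
        # Message shown to the user whenever a mutating method is called.
        self.error = error

    def _blocked(self, *args, **kwargs):
        raise NotImplementedError(self.error)

    # Route every mutating method to the same error.
    append = extend = insert = pop = remove = clear = sort = reverse = _blocked
    __setitem__ = __delitem__ = __iadd__ = __imul__ = _blocked


names = FrozenList(
    ["tagger", "parser"],
    error="This list is read-only; use the dedicated method instead.",
)
first = names[0]  # reading still works like a normal list
```

Because the object is still a `list`, existing code that only reads from it keeps working, which is the backwards-compatibility concern the commit message mentions.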
Ines Montani 5de3f8604d Update spacy/util.py
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-08-29 13:17:06 +02:00
Ines Montani cad988da7f Allow component decorators to re-run with same function 2020-08-28 16:27:22 +02:00
Ines Montani 3ce5be4b76 Allow loaded but disabled components 2020-08-28 15:20:14 +02:00
Sofie Van Landeghem 79d460e3a2 Weights & Biases logger for train CLI (#5971)
* quick test as part of train script

* train_logger in config, default ConsoleLogger in loggers catalogue

* entitiy typo

* add wandb_logger

* cleanup

* Update spacy/cli/train_logger.py

Co-authored-by: Ines Montani <ines@ines.io>

* move loggers to gold.loggers

Co-authored-by: Ines Montani <ines@ines.io>
2020-08-26 15:24:33 +02:00
Matthew Honnibal 77852d2428 Fix run_command for python 3.6 2020-08-26 05:02:43 +02:00
Matthew Honnibal 884cac5fb5 Make run_command backwards compatible 2020-08-26 04:33:42 +02:00
Matthew Honnibal 2771e4f2b3 Fix the git "sparse checkout" functionality (#5973)
* Fix the git sparse checkout functionality

* Format
2020-08-26 04:00:14 +02:00
Matthew Honnibal e559867605 Allow spacy project to push and pull to/from remote storage (#5949)
* Add utils for working with remote storage

* WIP add remote_cache for project

* WIP add push and pull commands

* Use pathy in remote_cache

* Update util

* Update remote_cache

* Update util

* Update project assets

* Update pull script

* Update push script

* Fix type annotation in util

* Work on remote storage

* Remove site and env hash

* Fix imports

* Fix type annotation

* Require pathy

* Require pathy

* Fix import

* Add a util to handle project variable substitution

* Import push and pull commands

* Fix pull command

* Fix push command

* Fix tarfile in remote_storage

* Improve printing

* Fiddle with status messages

* Set version to v3.0.0a9

* Draft docs for spacy project remote storages

* Update docs [ci skip]

* Use Thinc config to simplify and unify template variables

* Auto-format

* Don't import Pathy globally for now

Causes slow and annoying Google Cloud warning

* Tidy up test

* Tidy up and update tests

* Update to latest Thinc

* Update docs

* variables -> vars

* Update docs [ci skip]

* Update docs [ci skip]

Co-authored-by: Ines Montani <ines@ines.io>
2020-08-23 18:32:09 +02:00
Ines Montani 1c3bcfb488 Update docs and util consistency 2020-08-18 01:22:59 +02:00
Ines Montani 3ae5e02f4f Update docs, types and API consistency 2020-08-17 16:45:24 +02:00
Ines Montani 45f13cbf64 Merge pull request #5916 from explosion/feature/new-thinc-config 2020-08-16 15:24:12 +02:00
Ines Montani 8128e5eb35 Replace lexeme_norm warning with logging 2020-08-14 15:00:52 +02:00
Ines Montani 37814b608d Remove env_opt and simplify default Optimizer 2020-08-14 14:59:54 +02:00
Ines Montani 67cc39af7f Update Thinc and include section order 2020-08-14 14:06:22 +02:00
Ines Montani 88b0a96801 Update for new Thinc and adjust config 2020-08-13 17:38:30 +02:00
Ines Montani 913d21f0a3 Merge pull request #5882 from explosion/feature/raise-from
Use "raise ... from" in custom errors for better tracebacks
2020-08-06 00:35:26 +02:00
Ines Montani d92954ac1d Merge pull request #5881 from explosion/feature/better-error-model-shortcuts 2020-08-06 00:13:35 +02:00
Ines Montani 56c17973aa Use "raise ... from" in custom errors for better tracebacks 2020-08-05 23:53:21 +02:00
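
As a brief illustration of the `raise ... from` pattern named in this commit, the sketch below chains a custom error to the exception that caused it. `ConfigError` and `load_setting` are hypothetical names used only for this example and do not come from the repository.

```python
class ConfigError(ValueError):
    """Hypothetical custom error type, for illustration only."""


def load_setting(settings, key):
    """Return settings[key], re-raising a missing key as a ConfigError."""
    try:
        return settings[key]
    except KeyError as err:
        # "from err" stores the original exception as __cause__, so the
        # traceback shows both errors and marks the first as the direct cause.
        raise ConfigError(f"Missing setting: {key}") from err


try:
    load_setting({"lang": "en"}, "pipeline")
except ConfigError as e:
    cause = e.__cause__  # the original KeyError
```

Without the `from` clause, Python would still print both exceptions, but with the less precise note that another exception occurred during handling of the first.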
Ines Montani 5cc0d89fad Simplify config overrides in CLI and deserialization (#5880) 2020-08-05 23:35:09 +02:00
Ines Montani 2a1fa86a0d Add better error for failed model shortcut loading 2020-08-05 23:10:29 +02:00
Ines Montani 823e533dc1 Add config callbacks for modifying nlp object before and after init (#5866)
* WIP: Concept for modifying nlp object before and after init

* Make callbacks return nlp object

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>

* Raise if callbacks don't return correct type

* Rename, update types, add after_pipeline_creation

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-08-05 19:47:54 +02:00
Ines Montani e68459296d Tidy up and auto-format 2020-08-05 16:00:59 +02:00
Ines Montani b795f02fbd Allow adding pipeline components from source model (#5857)
* Allow adding pipeline components from source model

* Config: name -> component

* Improve error messages

* Fix error and test

* Add frozen components and exclude logic

* Remove exclude from Language.evaluate

* Init sourced components with current vocab

* Fix error codes
2020-08-04 23:39:19 +02:00
Matthew Honnibal ecb3c4e8f4 Create corpus iterator and batcher from registry during training (#5865)
* Move batchers into their own module (and registry)

* Update CLI

* Update Corpus and batcher

* Update tests

* Update one config

* Merge 'evaluation' block back under [training]

* Import batchers in gold __init__

* Fix batchers

* Update config

* Update schema

* Update util

* Don't assume train and dev are actually paths

* Update onto-joint config

* Fix missing import

* Format

* Format

* Update spacy/gold/corpus.py

Co-authored-by: Ines Montani <ines@ines.io>

* Fix name

* Update default config

* Fix get_length option in batchers

* Update test

* Add comment

* Pass path into Corpus

* Update docstring

* Update schema and configs

* Update config

* Fix test

* Fix paths

* Fix print

* Fix create_train_batches

* [training.read_train] -> [training.train_corpus]

* Update onto-joint config

Co-authored-by: Ines Montani <ines@ines.io>
2020-08-04 15:09:37 +02:00
Ines Montani e9e8fa2466 Update docs and types 2020-07-31 17:02:54 +02:00
Matthew Honnibal 1784c95827 Clean up link_vectors_to_models unused stuff 2020-07-29 14:01:11 +02:00
Matthew Honnibal 0c17ea4c85 Format 2020-07-29 14:00:13 +02:00
Matthew Honnibal 7852a68a75 Fix load_vectors_into_model function 2020-07-29 14:00:13 +02:00
Matthew Honnibal df95e2af64 Add load_vectors_into_model util 2020-07-29 14:00:12 +02:00
Matthew Honnibal acc64e138a Add import 2020-07-29 14:00:11 +02:00
Matthew Honnibal cb9654e98c WIP on new StaticVectors 2020-07-29 14:00:09 +02:00
Ines Montani ba22111ff4 Move error to Errors 2020-07-28 16:24:14 +02:00
Ines Montani b83ead5bf5 Merge pull request #5824 from svlandeg/fix/textcat-v3 2020-07-28 15:04:25 +02:00
Ines Montani ae4d8a6ffd Update docstrings, docs and pipe consistency 2020-07-28 13:37:31 +02:00
svlandeg 61068e0fb1 util function dot_to_object and corresponding unit test 2020-07-27 17:50:12 +02:00
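
The commit above names a `dot_to_object` utility but does not show it. The following is a rough sketch of what such a helper might look like, assuming it resolves a dot-separated path such as `training.lookups` against a nested dict-like config; the signature, error type and message are assumptions, and the actual implementation may differ.

```python
def dot_to_object(config, section):
    """Walk a nested mapping using dot notation (assumed behavior)."""
    obj = config
    for part in section.split("."):
        try:
            obj = obj[part]
        except (KeyError, TypeError):
            # TypeError covers the case where the path continues past a leaf
            # value, such as a string, that cannot be indexed by a key.
            raise KeyError(f"Can't resolve '{section}': '{part}' not found") from None
    return obj


# Hypothetical config used only for this example.
config = {"training": {"lookups": {"lang": "en", "tables": ["lexeme_norm"]}}}
tables = dot_to_object(config, "training.lookups.tables")
```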
Adriane Boyd 8bb0507777 Add and update score methods and score weights
Add and update `score` methods, provided `scores`, and default weights
`default_score_weights` for pipeline components.

* `scores` provides all top-level keys returned by `score` (merely informative, similar to `assigns`).
* `default_score_weights` provides the default weights for a default config.
* The keys from `default_score_weights` determine which values will be
shown in the `spacy train` output, so keys with weight `0.0` will be
displayed but not counted toward the overall score.
2020-07-27 14:44:53 +02:00
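
The role of zero weights described in this commit can be sketched as follows. The function names, score keys and values are hypothetical, and the weighted sum is an assumption about how the overall score is formed rather than the project's actual code.

```python
def overall_score(scores, weights):
    """Weighted sum of scores; a key with weight 0.0 contributes nothing."""
    return sum(scores.get(key, 0.0) * weight for key, weight in weights.items())


def displayed_columns(weights):
    """Every key in the weights is displayed, including zero-weight keys."""
    return list(weights)


# Hypothetical weights and scores, for illustration only.
weights = {"tag_acc": 0.5, "dep_las": 0.5, "speed": 0.0}
scores = {"tag_acc": 0.9, "dep_las": 0.8, "speed": 12000.0}

final = overall_score(scores, weights)
columns = displayed_columns(weights)
```

Here `speed` appears among the displayed columns but leaves the overall score unchanged, matching the commit's description of keys with weight `0.0`.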
Adriane Boyd f8cf378be9 Combine weights from multiple components
Combine weights from multiple components for the same score.
2020-07-27 10:21:31 +02:00
Ines Montani 2470486543 Allow pipeline components to set default scores and weights 2020-07-26 13:18:43 +02:00