Commit Graph

26 Commits

Author SHA1 Message Date
Ines Montani febb99916d Tidy up and auto-format [ci skip] 2020-09-13 10:55:36 +02:00
svlandeg 115147804a string_to_list to parse comma-separated string into a list 2020-09-12 14:43:22 +02:00
Sofie Van Landeghem 60f22e1800
Pipe API (#6034)
* ensure Language passes on valid examples for initialization

* fix tagger model initialization

* check for valid get_examples across components

* assume labels were added before begin_training

* fix senter initialization

* fix morphologizer initialization

* use methods to check arguments

* test textcat init, requires thinc>=8.0.0a31

* fix tok2vec init

* fix entity linker init

* use islice

* fix simple NER

* cleanup debug model

* fix assert statements

* fix tests

* throw error when adding a label if the output layer can't be resized anymore

* fix test

* add failing test for simple_ner

* UX improvements

* morphologizer UX

* assume begin_training gets a representative set and processes the labels

* remove assumptions for output of untrained NER model

* restore test for original purpose
2020-09-08 22:44:25 +02:00
Ines Montani ab1bb421ed Update docs links in codebase 2020-09-04 12:58:50 +02:00
Ines Montani 0e7f99da58
Fix handling of optional [pretraining] block (#5954)
* Fix handling of optional [pretraining] block

* Remote pretraining from default config

* Fix test

* Add schema option for empty pretrain block
2020-08-24 15:56:03 +02:00
Ines Montani 6ae83bde0c Fix CLI consistency [ci skip] 2020-08-16 15:46:29 +02:00
Ines Montani 67cc39af7f Update Thinc and include section order 2020-08-14 14:06:22 +02:00
Ines Montani 88b0a96801 Update for new Thinc and adjust config 2020-08-13 17:38:30 +02:00
Ines Montani 06e80d95cd
Sync develop with nightly docs state (#5883)
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
2020-08-06 00:28:14 +02:00
Ines Montani 5cc0d89fad
Simplify config overrides in CLI and deserialization (#5880) 2020-08-05 23:35:09 +02:00
Ines Montani e68459296d Tidy up and auto-format 2020-08-05 16:00:59 +02:00
Ines Montani 934447a611
Merge pull request #5855 from svlandeg/fix/cli-debug 2020-08-03 13:09:20 +02:00
Ines Montani 4c055f0aa7
Add init CLI and init config (#5854)
* Add init CLI and init config draft

* Improve config validation

* Auto-format

* Don't export anything in debug config

* Update docs
2020-08-02 15:18:30 +02:00
svlandeg 9b719dfb1a use divider inbetween steps 2020-07-31 18:06:48 +02:00
svlandeg 51ffc4a166 rename pipe_name to component 2020-07-31 17:58:55 +02:00
svlandeg 878327d38e printing final predictions by default to False 2020-07-31 17:36:32 +02:00
svlandeg cc2f58a1b0 use data_validation context manager 2020-07-31 16:49:42 +02:00
svlandeg 5fa3235d06 set DATA_VALIDATION to False for debug_model (upgrade thinc) 2020-07-31 15:21:01 +02:00
svlandeg 0b23594953 pipe_name instead of section in debug_model 2020-07-30 20:06:28 +02:00
svlandeg 61068e0fb1 util function dot_to_object and corresponding unit test 2020-07-27 17:50:12 +02:00
Ines Montani 2c5bb59909 Use consistent --gpu-id option name 2020-07-22 16:53:41 +02:00
Ines Montani 43b960c01b
Refactor pipeline components, config and language data (#5759)
* Update with WIP

* Update with WIP

* Update with pipeline serialization

* Update types and pipe factories

* Add deep merge, tidy up and add tests

* Fix pipe creation from config

* Don't validate default configs on load

* Update spacy/language.py

Co-authored-by: Ines Montani <ines@ines.io>

* Adjust factory/component meta error

* Clean up factory args and remove defaults

* Add test for failing empty dict defaults

* Update pipeline handling and methods

* provide KB as registry function instead of as object

* small change in test to make functionality more clear

* update example script for EL configuration

* Fix typo

* Simplify test

* Simplify test

* splitting pipes.pyx into separate files

* moving default configs to each component file

* fix batch_size type

* removing default values from component constructors where possible (TODO: test 4725)

* skip instead of xfail

* Add test for config -> nlp with multiple instances

* pipeline.pipes -> pipeline.pipe

* Tidy up, document, remove kwargs

* small cleanup/generalization for Tok2VecListener

* use DEFAULT_UPSTREAM field

* revert to avoid circular imports

* Fix tests

* Replace deprecated arg

* Make model dirs require config

* fix pickling of keyword-only arguments in constructor

* WIP: clean up and integrate full config

* Add helper to handle function args more reliably

Now also includes keyword-only args

* Fix config composition and serialization

* Improve config debugging and add visual diff

* Remove unused defaults and fix type

* Remove pipeline and factories from meta

* Update spacy/default_config.cfg

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/default_config.cfg

* small UX edits

* avoid printing stack trace for debug CLI commands

* Add support for language-specific factories

* specify the section of the config which holds the model to debug

* WIP: add Language.from_config

* Update with language data refactor WIP

* Auto-format

* Add backwards-compat handling for Language.factories

* Update morphologizer.pyx

* Fix morphologizer

* Update and simplify lemmatizers

* Fix Japanese tests

* Port over tagger changes

* Fix Chinese and tests

* Update to latest Thinc

* WIP: xfail first Russian lemmatizer test

* Fix component-specific overrides

* fix nO for output layers in debug_model

* Fix default value

* Fix tests and don't pass objects in config

* Fix deep merging

* Fix lemma lookup data registry

Only load the lookups if an entry is available in the registry (and if spacy-lookups-data is installed)

* Add types

* Add Vocab.from_config

* Fix typo

* Fix tests

* Make config copying more elegant

* Fix pipe analysis

* Fix lemmatizers and is_base_form

* WIP: move language defaults to config

* Fix morphology type

* Fix vocab

* Remove comment

* Update to latest Thinc

* Add morph rules to config

* Tidy up

* Remove set_morphology option from tagger factory

* Hack use_gpu

* Move [pipeline] to top-level block and make [nlp.pipeline] list

Allows separating component blocks from component order – otherwise, ordering the config would mean a changed component order, which is bad. Also allows initial config to define more components and not use all of them

* Fix use_gpu and resume in CLI

* Auto-format

* Remove resume from config

* Fix formatting and error

* [pipeline] -> [components]

* Fix types

* Fix tagger test: requires set_morphology?

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-07-22 13:42:59 +02:00
Ines Montani 0ab483037c Make debug commands subcommands of spacy debug
Also handle backwards-compatibility so the old commands don't break
2020-07-12 13:53:41 +02:00
Ines Montani 3a8632c3fb Hide command from public --help for now
Not sure we want this to be officially documented yet?
2020-07-11 19:21:22 +02:00
Ines Montani bfa8e11ffa Update and auto-format 2020-07-10 20:52:00 +02:00
Sofie Van Landeghem de6a32315c
debug-model script (#5749)
* adding debug-model to print the internals for debugging purposes

* expend debug-model script with 4 stages: before, init, train, predict

* avoid enforcing to have a seed in the train script

* small fixes
2020-07-10 19:47:53 +02:00