11445 Commits

Author SHA1 Message Date
Matthw Honnibal d1c2e88d0f Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2020-05-30 19:23:12 +02:00
Ines Montani 368182776e Tidy up dependencies 2020-05-30 15:19:53 +02:00
Matthw Honnibal 58750b06f8 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2020-05-27 22:18:36 +02:00
Matthew Honnibal a44d51a3d8
Merge pull request #5496 from explosion/docs/unicode-str
unicode -> str consistency
2020-05-26 10:30:37 +02:00
Ines Montani 1a15896ba9 unicode -> str consistency [ci skip] 2020-05-24 18:51:10 +02:00
Ines Montani 262d306eaa unicode -> str consistency 2020-05-24 17:23:00 +02:00
Ines Montani 5d3806e059 unicode -> str consistency 2020-05-24 17:20:58 +02:00
Ines Montani cf156ed2f4
Merge pull request #5495 from explosion/fix/simplify-is-package 2020-05-24 15:42:55 +02:00
Ines Montani 387c7aba15 Update test 2020-05-24 14:55:16 +02:00
Ines Montani f9786d765e Simplify is_package check 2020-05-24 14:48:56 +02:00
Ines Montani 15d3a0ac3a
Merge pull request #5491 from explosion/chore/rename-pipe-analysis 2020-05-23 12:41:54 +02:00
Matthw Honnibal 2d9de8684d Support use_pytorch_for_gpu_memory config 2020-05-22 23:10:40 +02:00
Ines Montani 4465cad6c5 Rename spacy.analysis to spacy.pipe_analysis 2020-05-22 17:42:06 +02:00
Ines Montani 25d6ed3fb8
Merge pull request #5489 from explosion/feature/connected-components 2020-05-22 17:40:11 +02:00
Ines Montani 841c05b47b
Merge pull request #5490 from explosion/fix/remove-jsonschema 2020-05-22 17:39:54 +02:00
Ines Montani 569a65b60e Auto-format 2020-05-22 16:55:42 +02:00
Ines Montani d844528c5f Add test for is_compatible_model 2020-05-22 16:55:15 +02:00
Ines Montani 12b7be1d98 Remove jsonschema from dependencies 2020-05-22 16:49:26 +02:00
Matthew Honnibal 7a73a9dcf6
Merge pull request #5488 from explosion/feature/better-model-compat
Better model compatibility and validation
2020-05-22 16:44:29 +02:00
Matthew Honnibal f7f6df7275 Move to spacy.analysis 2020-05-22 16:43:18 +02:00
Matthew Honnibal 78d79d94ce Guess set_annotations=True in nlp.update
During `nlp.update`, components can be passed a boolean `set_annotations`
to indicate whether they should assign annotations to the `Doc`. This
needs to be set if downstream components expect to use the
annotations during training, e.g. if we wanted to use tagger features in
the parser.

Components can specify their assignments and requirements, so we can
figure out which components have these inter-dependencies. After
figuring this out, we can guess whether to pass set_annotations=True.

We could also call set_annotations=True always, or even just have this
as the only behaviour. The downside of this is that it would require the
`Doc` objects to be created afresh to avoid problematic modifications.
One approach would be to make a fresh copy of the `Doc` objects within
`nlp.update()`, so that we can write to the objects without any
problems. If we do that, we can drop this logic and also drop the
`set_annotations` mechanism. I would be fine with that approach,
although it runs the risk of introducing some performance overhead, and
we'll have to take care to copy all extension attributes etc.
2020-05-22 15:55:45 +02:00
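
The dependency check described in the commit message for 78d79d94ce can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the `Component` class, the `guess_set_annotations` function, and the attribute strings such as `"token.tag"` are hypothetical names chosen for the example. It assumes each component declares what it assigns and what it requires, and it guesses `set_annotations=True` for a component only when a later component requires something that it assigns.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Component:
    # Hypothetical stand-in for a pipeline component (not spaCy's class).
    name: str
    assigns: List[str] = field(default_factory=list)
    requires: List[str] = field(default_factory=list)


def guess_set_annotations(pipeline: List[Component]) -> Dict[str, bool]:
    """Guess, per component, whether its annotations are needed downstream."""
    guesses = {}
    for i, comp in enumerate(pipeline):
        # Collect everything that components later in the pipeline require.
        needed_downstream = set()
        for later in pipeline[i + 1:]:
            needed_downstream.update(later.requires)
        # Annotations must be written only if a later component uses them.
        guesses[comp.name] = bool(set(comp.assigns) & needed_downstream)
    return guesses


pipeline = [
    Component("tagger", assigns=["token.tag"]),
    Component("parser", assigns=["token.dep", "token.head"], requires=["token.tag"]),
    Component("ner", assigns=["doc.ents"]),
]
guesses = guess_set_annotations(pipeline)
print(guesses)
```

In this toy pipeline only the tagger would be asked to set annotations, because the parser is the only component that declares a requirement. The alternative raised in the message, always setting annotations on fresh copies of the `Doc` objects, would make this guess unnecessary.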
Ines Montani 6e6db6afb6 Better model compatibility and validation 2020-05-22 15:42:46 +02:00
Matthw Honnibal 25b51f4fc8 Set version to v3.0.0.dev9 2020-05-21 20:47:52 +02:00
Matthw Honnibal bc94fdabd0 Fix begin_training 2020-05-21 20:46:21 +02:00
Matthw Honnibal d507ac28d8 Fix shape inference 2020-05-21 20:46:10 +02:00
Matthw Honnibal df87c32a40 Pass smaller doc sample into model initialize 2020-05-21 20:17:24 +02:00
Matthw Honnibal 3b5cfec1fc Tweak memory management in train_from_config 2020-05-21 19:32:04 +02:00
Matthw Honnibal f075655deb Fix shape inference in begin_training 2020-05-21 19:26:29 +02:00
Matthw Honnibal 1729165e90 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2020-05-21 19:11:08 +02:00
Matthew Honnibal e6c4c1a507
Merge pull request #5468 from adrianeboyd/feature/cli-conllu-misc-ner
Improve handling of NER in CoNLL-U MISC
2020-05-21 16:39:46 +02:00
Adriane Boyd 4b229bfc22 Improve handling of NER in CoNLL-U MISC 2020-05-20 18:48:51 +02:00
Matthew Honnibal 609c0ba557
Fix accidentally quadratic runtime in Example.split_sents (#5464)
* Tidy up train-from-config a bit

* Fix accidentally quadratic perf in TokenAnnotation.brackets

When we're reading in the gold data, we had a nested loop where
we looped over the brackets for each token, looking for brackets
that start on that word. This is accidentally quadratic, because
we have one bracket per word (for the POS tags). So we had
an O(N**2) behaviour here that ended up being pretty slow.

To solve this I'm indexing the brackets by their starting word
on the TokenAnnotations object, and having a property to provide
the previous view.

* Fixes
2020-05-20 18:48:18 +02:00
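
The fix described in the commit message for 609c0ba557, indexing brackets by their starting word and exposing the old flat view through a property, might look something like the sketch below. The `BracketIndex` class, its method names, and the `(start, end, label)` tuple layout are assumptions made for illustration and are not taken from the commit.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Bracket = Tuple[int, int, str]  # assumed layout: (start, end, label)


def starting_at_quadratic(n_words: int, brackets: List[Bracket]) -> List[List[Bracket]]:
    """Old pattern: rescan every bracket for every token, O(N**2) overall."""
    return [[b for b in brackets if b[0] == i] for i in range(n_words)]


class BracketIndex:
    """Hypothetical container that groups brackets by their start word."""

    def __init__(self, brackets: List[Bracket]) -> None:
        self.by_start: Dict[int, List[Tuple[int, str]]] = defaultdict(list)
        for start, end, label in brackets:
            self.by_start[start].append((end, label))

    def starting_at(self, i: int) -> List[Bracket]:
        # Dictionary lookup instead of a scan over all brackets.
        return [(i, end, label) for end, label in self.by_start.get(i, [])]

    @property
    def brackets(self) -> List[Bracket]:
        # Rebuild the flat list for callers that expect the previous view.
        return [
            (start, end, label)
            for start, items in self.by_start.items()
            for end, label in items
        ]


brackets = [(0, 0, "DT"), (1, 1, "NN"), (0, 1, "NP"), (2, 2, "VBZ")]
index = BracketIndex(brackets)
per_token = [index.starting_at(i) for i in range(3)]
print(per_token)
```

Building the index is a single pass over the brackets, and each per-token lookup then costs time proportional to the number of brackets that start on that token, so reading a sentence becomes linear in the number of brackets rather than quadratic.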
Matthw Honnibal 60e8da4813 Tidy up train-from-config a bit 2020-05-20 12:56:27 +02:00
Matthw Honnibal fda7355508 Fix train-from-config 2020-05-20 12:30:21 +02:00
Matthw Honnibal 24efd54a42 Merge from develop 2020-05-20 12:27:31 +02:00
Sofie Van Landeghem 7f5715a081
Various fixes to NEL functionality, Example class etc (#5460)
* setting KB in the EL constructor, similar to how the model is passed on

* removing wikipedia example files - moved to projects

* throw an error when nlp.update is called with 2 positional arguments

* rewriting the config logic in create pipe to accommodate other objects (e.g. KB) in the config

* update config files with new parameters

* avoid training pipeline components that don't have a model (like sentencizer)

* various small fixes + UX improvements

* small fixes

* set thinc to 8.0.0a9 everywhere

* remove outdated comment
2020-05-20 11:41:12 +02:00
Matthew Honnibal 664a3603b0 Set version to v3.0.0.dev8 2020-05-19 17:15:39 +02:00
Matthew Honnibal a2830c3ef5 Use thinc 8.0.0a9 2020-05-19 16:23:11 +02:00
Sofie Van Landeghem f00de445dd
default models defined in component decorator (#5452)
* move defaults to pipeline and use in component decorator

* black formatting

* relative import
2020-05-19 16:20:03 +02:00
Sofie Van Landeghem 0d94737857
Feature toggle_pipes (#5378)
* make disable_pipes deprecated in favour of the new toggle_pipes

* rewrite disable_pipes statements

* update documentation

* remove bin/wiki_entity_linking folder

* one more fix

* remove deprecated link to documentation

* few more doc fixes

* add note about name change to the docs

* restore original disable_pipes

* small fixes

* fix typo

* fix error number to W096

* rename to select_pipes

* also make changes to the documentation

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-05-18 22:27:10 +02:00
Matthew Honnibal 333b1a308b
Adapt parser and NER for transformers (#5449)
* Draft layer for BILUO actions

* Fixes to biluo layer

* WIP on BILUO layer

* Add tests for BILUO layer

* Format

* Fix transitions

* Update test

* Link in the simple_ner

* Update BILUO tagger

* Update __init__

* Import simple_ner

* Update test

* Import

* Add files

* Add config

* Fix label passing for BILUO and tagger

* Fix label handling for simple_ner component

* Update simple NER test

* Update config

* Hack train script

* Update BILUO layer

* Fix SimpleNER component

* Update train_from_config

* Add biluo_to_iob helper

* Add IOB layer

* Add IOBTagger model

* Update biluo layer

* Update SimpleNER tagger

* Update BILUO

* Read random seed in train-from-config

* Update use of normal_init

* Fix normalization of gradient in SimpleNER

* Update IOBTagger

* Remove print

* Tweak masking in BILUO

* Add dropout in SimpleNER

* Update thinc

* Tidy up simple_ner

* Fix biluo model

* Unhack train-from-config

* Update setup.cfg and requirements

* Add tb_framework.py for parser model

* Try to avoid memory leak in BILUO

* Move ParserModel into spacy.ml, avoid need for subclass.

* Use updated parser model

* Remove incorrect call to model.initialize in PrecomputableAffine

* Update parser model

* Avoid divide by zero in tagger

* Add extra dropout layer in tagger

* Refine minibatch_by_words function to avoid oom

* Fix parser model after refactor

* Try to avoid div-by-zero in SimpleNER

* Fix infinite loop in minibatch_by_words

* Use SequenceCategoricalCrossentropy in Tagger

* Fix parser model when hidden layer

* Remove extra dropout from tagger

* Add extra nan check in tagger

* Fix thinc version

* Update tests and imports

* Fix test

* Update test

* Update tests

* Fix tests

* Fix test

Co-authored-by: Ines Montani <ines@ines.io>
2020-05-18 22:23:33 +02:00
Ines Montani 3100c97e69
Merge pull request #5441 from svlandeg/fix/updating 2020-05-18 10:53:41 +02:00
Ines Montani e8ff4c1e6a
Pin flake8 version 2020-05-18 10:50:21 +02:00
svlandeg 6fb6a8518c bump to 3.0.0.dev7 and thinc to 8.0.0a8 2020-05-15 13:25:54 +02:00
svlandeg 047f3d7d94 remove ops argument for Adam 2020-05-15 13:25:00 +02:00
svlandeg 79d4f196e5 pin flake8 to 3.5.0 2020-05-15 11:53:01 +02:00
svlandeg e0fda2bd81 throw warning when model_cfg is None 2020-05-15 11:02:10 +02:00
svlandeg 102c8c7e2f fix fan_in renaming 2020-05-12 13:56:10 +02:00
svlandeg 9fe1e23512 update to thinc 8.0.0a6 2020-05-12 13:51:25 +02:00
Matthew Honnibal eb117e2fce Add load_config_from_str helper 2020-05-02 14:09:21 +02:00