Commit Graph

182 Commits

Author SHA1 Message Date
Matthew Honnibal 83ac227bd3
💫 Better support for semi-supervised learning (#3035)
The new spacy pretrain command implemented BERT/ULMFit/etc-like transfer learning, using our Language Modelling with Approximate Outputs version of BERT's cloze task. Pretraining is convenient, but in some ways it's a bit of a strange solution. All we're doing is initialising the weights. At the same time, we're putting a lot of work into our optimisation so that it's less sensitive to initial conditions, and more likely to find good optima. I discuss this a bit in the pseudo-rehearsal blog post: https://explosion.ai/blog/pseudo-rehearsal-catastrophic-forgetting
Support semi-supervised learning in spacy train

One obvious way to improve these pretraining methods is to do multi-task learning, instead of just transfer learning. This has been shown to work very well: https://arxiv.org/pdf/1809.08370.pdf . This patch makes it easy to do this sort of thing.

    Add a new argument to spacy train, --raw-text. This takes a jsonl file with unlabelled data that can be used in arbitrary ways to do semi-supervised learning.

    Add a new method to the Language class and to pipeline components, .rehearse(). This is like .update(), but doesn't expect GoldParse objects. It takes a batch of Doc objects, and performs an update on some semi-supervised objective.

    Move the BERT-LMAO objective out from spacy/cli/pretrain.py into spacy/_ml.py, so we can create a new pipeline component, ClozeMultitask. This can be specified as a parser or NER multitask in the spacy train command. Example usage:

python -m spacy train en ./tmp ~/data/en-core-web/train/nw.json ~/data/en-core-web/dev/nw.json --pipeline parser --raw-textt ~/data/unlabelled/reddit-100k.jsonl --vectors en_vectors_web_lg --parser-multitasks cloze

Implement rehearsal methods for pipeline components

The new --raw-text argument and nlp.rehearse() method also gives us a good place to implement the the idea in the pseudo-rehearsal blog post in the parser. This works as follows:

    Add a new nlp.resume_training() method. This allocates copies of pre-trained models in the pipeline, setting things up for the rehearsal updates. It also returns an optimizer object. This also greatly reduces confusion around the nlp.begin_training() method, which randomises the weights, making it not suitable for adding new labels or otherwise fine-tuning a pre-trained model.

    Implement rehearsal updates on the Parser class, making it available for the dependency parser and NER. During rehearsal, the initial model is used to supervise the model being trained. The current model is asked to match the predictions of the initial model on some data. This minimises catastrophic forgetting, by keeping the model's predictions close to the original. See the blog post for details.

    Implement rehearsal updates for tagger

    Implement rehearsal updates for text categoriz
2018-12-10 16:25:33 +01:00
Matthew Honnibal 375f0dc529
💫 Make TextCategorizer default to a simpler, GPU-friendly model (#3038)
Currently the TextCategorizer defaults to a fairly complicated model, designed partly around the active learning requirements of Prodigy. The model's a bit slow, and not very GPU-friendly.

This patch implements a straightforward CNN model that still performs pretty well. The replacement model also makes it easy to use the LMAO pretraining, since most of the parameters are in the CNN.

The replacement model has a flag to specify whether labels are mutually exclusive, which defaults to True. This has been a common problem with the text classifier. We'll also now be able to support adding labels to pretrained models again.

Resolves #2934, #2756, #1798, #1748.
2018-12-10 14:37:39 +01:00
Matthew Honnibal d2ac618af1 Set cbb_maxout_pieces=3 2018-12-08 23:27:29 +01:00
Matthew Honnibal cabaadd793
Fix build error from bad import
Thinc v7.0.0.dev6 moved FeatureExtracter around and didn't add a compatibility import.
2018-12-06 15:12:39 +01:00
Ines Montani 323fc26880 Tidy up and format remaining files 2018-11-30 17:43:08 +01:00
Ines Montani eddeb36c96
💫 Tidy up and auto-format .py files (#2983)
<!--- Provide a general summary of your changes in the title. -->

## Description
- [x] Use [`black`](https://github.com/ambv/black) to auto-format all `.py` files.
- [x] Update flake8 config to exclude very large files (lemmatization tables etc.)
- [x] Update code to be compatible with flake8 rules
- [x] Fix various small bugs, inconsistencies and messy stuff in the language data
- [x] Update docs to explain new code style (`black`, `flake8`, when to use `# fmt: off` and `# fmt: on` and what `# noqa` means)

Once #2932 is merged, which auto-formats and tidies up the CLI, we'll be able to run `flake8 spacy` actually get meaningful results.

At the moment, the code style and linting isn't applied automatically, but I'm hoping that the new [GitHub Actions](https://github.com/features/actions) will let us auto-format pull requests and post comments with relevant linting information.

### Types of change
enhancement, code style

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-11-30 17:03:03 +01:00
Matthew Honnibal ef0820827a
Update hyper-parameters after NER random search (#2972)
These experiments were completed a few weeks ago, but I didn't make the PR, pending model release.

    Token vector width: 128->96
    Hidden width: 128->64
    Embed size: 5000->2000
    Dropout: 0.2->0.1
    Updated optimizer defaults (unclear how important?)

This should improve speed, model size and load time, while keeping
similar or slightly better accuracy.

The tl;dr is we prefer to prevent over-fitting by reducing model size,
rather than using more dropout.
2018-11-27 18:49:52 +01:00
Matthew Honnibal 2527ba68e5 Fix tensorizer 2018-11-02 23:29:54 +00:00
Matthew Honnibal 99a6011580 Avoid adding empty layer in model, to keep models backwards compatible 2018-09-14 22:51:58 +02:00
Matthew Honnibal afeddfff26 Fix PyTorch BiLSTM 2018-09-13 22:54:34 +00:00
Matthew Honnibal 45032fe9e1 Support option of BiLSTM in Tok2Vec (requires pytorch) 2018-09-13 19:28:35 +02:00
Matthew Honnibal 4d2d7d5866 Fix new feature flags 2018-08-27 02:12:39 +02:00
Matthew Honnibal 8051136d70 Support subword_features and conv_depth params in Tok2Vec 2018-08-27 01:50:48 +02:00
Matthew Honnibal 401213fb1f Only warn about unnamed vectors if non-zero sized. 2018-05-19 18:51:55 +02:00
Matthew Honnibal 2338e8c7fc Update develop from master 2018-05-02 01:36:12 +00:00
Matthew Honnibal 548bdff943 Update default Adam settings 2018-05-01 15:18:20 +02:00
Matthew Honnibal 2c4a6d66fa Merge master into develop. Big merge, many conflicts -- need to review 2018-04-29 14:49:26 +02:00
Ines Montani 3141e04822
💫 New system for error messages and warnings (#2163)
* Add spacy.errors module

* Update deprecation and user warnings

* Replace errors and asserts with new error message system

* Remove redundant asserts

* Fix whitespace

* Add messages for print/util.prints statements

* Fix typo

* Fix typos

* Move CLI messages to spacy.cli._messages

* Add decorator to display error code with message

An implementation like this is nice because it only modifies the string when it's retrieved from the containing class – so we don't have to worry about manipulating tracebacks etc.

* Remove unused link in spacy.about

* Update errors for invalid pipeline components

* Improve error for unknown factories

* Add displaCy warnings

* Update formatting consistency

* Move error message to spacy.errors

* Update errors and check if doc returned by component is None
2018-04-03 15:50:31 +02:00
Matthew Honnibal 4555e3e251 Dont assume pretrained_vectors cfg set in build_tagger 2018-03-28 20:12:45 +02:00
Matthew Honnibal f8dd905a24 Warn and fallback if vectors have no name 2018-03-28 18:24:53 +02:00
Matthew Honnibal 95a9615221 Fix loading of multiple pre-trained vectors
This patch addresses #1660, which was caused by keying all pre-trained
vectors with the same ID when telling Thinc how to refer to them. This
meant that if multiple models were loaded that had pre-trained vectors,
errors or incorrect behaviour resulted.

The vectors class now includes a .name attribute, which defaults to:
{nlp.meta['lang']_nlp.meta['name']}.vectors
The vectors name is set in the cfg of the pipeline components under the
key pretrained_vectors. This replaces the previous cfg key
pretrained_dims.

In order to make existing models compatible with this change, we check
for the pretrained_dims key when loading models in from_disk and
from_bytes, and add the cfg key pretrained_vectors if we find it.
2018-03-28 16:02:59 +02:00
Matthew Honnibal 1f7229f40f Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop"
This reverts commit c9ba3d3c2d, reversing
changes made to 92c26a35d4.
2018-03-27 19:23:02 +02:00
Matthew Honnibal 6e641f46d4 Create a preprocess function that gets bigrams 2017-11-12 00:43:41 +01:00
Matthew Honnibal d5537e5516 Work on Windows test failure 2017-11-08 13:25:18 +01:00
Matthew Honnibal 1d5599cd28 Fix dtype 2017-11-08 12:18:32 +01:00
Matthew Honnibal a8b592783b Make a dtype more specific, to fix a windows build 2017-11-08 11:24:35 +01:00
Matthew Honnibal 13336a6197 Fix Adam import 2017-11-06 14:25:37 +01:00
Matthew Honnibal 2eb11d60f2 Add function create_default_optimizer to spacy._ml 2017-11-06 14:11:59 +01:00
Matthew Honnibal 33bd2428db Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-11-03 13:29:56 +01:00
Matthew Honnibal c9b118a7e9 Set softmax attr in tagger model 2017-11-03 11:22:01 +01:00
Matthew Honnibal b3264aa5f0 Expose the softmax layer in the tagger model, to allow setting tensors 2017-11-03 11:19:51 +01:00
Matthew Honnibal 6771780d3f Fix backprop of padding variable 2017-11-03 01:54:34 +01:00
Matthew Honnibal 260e6ee3fb Improve efficiency of backprop of padding variable 2017-11-03 00:49:11 +01:00
Matthew Honnibal e85e31cfbd Fix backprop of d_pad 2017-11-01 19:27:26 +01:00
Matthew Honnibal d17a12c71d Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-11-01 16:38:26 +01:00
Matthew Honnibal 9f9439667b Don't create low-data text classifier if no vectors 2017-11-01 16:34:09 +01:00
Matthew Honnibal 8075726838 Restore vector usage in models 2017-10-31 19:21:17 +01:00
Matthew Honnibal cb5217012f Fix vector remapping 2017-10-31 11:40:46 +01:00
Matthew Honnibal ce876c551e Fix GPU usage 2017-10-31 02:33:34 +01:00
Matthew Honnibal 368fdb389a WIP on refactoring and fixing vectors 2017-10-31 02:00:26 +01:00
Matthew Honnibal 3b91097321 Whitespace 2017-10-28 17:05:11 +00:00
Matthew Honnibal 6ef72864fa Improve initialization for hidden layers 2017-10-28 17:05:01 +00:00
Matthew Honnibal df4803cc6d Add learned missing values for parser 2017-10-28 16:45:14 +00:00
Matthew Honnibal 64e4ff7c4b Merge 'tidy-up' changes into branch. Resolve conflicts 2017-10-28 13:16:06 +02:00
Explosion Bot b22e42af7f Merge changes to parser and _ml 2017-10-28 11:52:10 +02:00
ines d96e72f656 Tidy up rest 2017-10-27 21:07:59 +02:00
ines e33b7e0b3c Tidy up parser and ML 2017-10-27 14:39:30 +02:00
Matthew Honnibal 531142a933 Merge remote-tracking branch 'origin/develop' into feature/better-parser 2017-10-27 12:34:48 +00:00
Matthew Honnibal c9987cf131 Avoid use of numpy.tensordot 2017-10-27 10:18:36 +00:00
Matthew Honnibal f6fef30adc Remove dead code from spacy._ml 2017-10-27 10:16:41 +00:00