849 Commits

Author SHA1 Message Date
Matthew Honnibal ef0820827a
Update hyper-parameters after NER random search (#2972)
These experiments were completed a few weeks ago, but I didn't make the PR, pending model release.

    Token vector width: 128->96
    Hidden width: 128->64
    Embed size: 5000->2000
    Dropout: 0.2->0.1
    Updated optimizer defaults (unclear how important?)

This should improve speed, model size and load time, while keeping
similar or slightly better accuracy.

The tl;dr is we prefer to prevent over-fitting by reducing model size,
rather than using more dropout.
2018-11-27 18:49:52 +01:00
Matthew Honnibal 2c37e0ccf6
💫 Use Blis for matrix multiplications (#2966)
Our epic matrix multiplication odyssey is drawing to a close...

I've now finally got the Blis linear algebra routines in a self-contained Python package, with wheels for Windows, Linux and OSX. The only missing platform at the moment is Windows Python 2.7. The result is at https://github.com/explosion/cython-blis

Thinc v7.0.0 will make the change to Blis. I've put a Thinc v7.0.0.dev0 up on PyPI so that we can test these changes with the CI, and even get them out to spacy-nightly, before Thinc v7.0.0 is released. This PR also updates the other dependencies to be in line with the current versions master is using. I've also resolved the msgpack deprecation problems, and gotten spaCy and Thinc up to date with the latest Cython.

The point of switching to Blis is to have control of how our matrix multiplications are executed across platforms. When we were using numpy for this, a different library would be used on pip and conda, OSX would use Accelerate, etc. This would open up different bugs and performance problems, especially when multi-threading was introduced.

With the change to Blis, we now strictly single-thread the matrix multiplications. This will make it much easier to use multiprocessing to parallelise the runtime, since we won't have nested parallelism problems to deal with.

* Use blis

* Use -2 arg to Cython

* Update dependencies

* Fix requirements

* Update setup dependencies

* Fix requirement typo

* Fix msgpack errors

* Remove Python27 test from Appveyor, until Blis works there

* Auto-format setup.py

* Fix murmurhash version
2018-11-27 00:44:04 +01:00
Matthew Honnibal 2874b8efd8 Fix tok2vec loading in spacy train 2018-11-15 23:34:54 +00:00
Matthew Honnibal 817e1fc5e5 Fix out-of-bounds access in NER training
The helper method state.B(1) gets the index of the first token of the
buffer, or -1 if no such token exists. Normally this is safe because we
pass this to functions like state.safe_get(), which returns an empty
token. Here we used it directly as an array index, which is not okay!

This error may have been the cause of out-of-bounds access errors during
training. Similar errors may still be around, so they must be hunted down.
Hunting this one down took a long time...I printed out values across
training runs and diffed, looking for points of divergence between
runs, when no randomness should be allowed.
2018-10-27 01:12:50 +02:00
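The bug pattern described in the commit above can be sketched in plain Python. This is only an illustration, not spaCy's Cython code: `ToyState`, its fields and the `EMPTY_TOKEN` sentinel are hypothetical, and only the names `B` and `safe_get` and the -1 convention are taken from the commit message. In Python a -1 index silently wraps to the last element, whereas in C or Cython it reads outside the array.

```python
EMPTY_TOKEN = {"text": "", "index": -1}  # hypothetical sentinel for "no token"


class ToyState:
    """Toy stand-in for a parser state; not the real spaCy class."""

    def __init__(self, tokens, buffer_start=0):
        self.tokens = tokens
        self.buffer_start = buffer_start

    def B(self, i):
        """Index of the buffer token at offset i, or -1 if there is none."""
        idx = self.buffer_start + i
        return idx if 0 <= idx < len(self.tokens) else -1

    def safe_get(self, idx):
        """Return the token at idx, or the empty token when idx is out of range."""
        if idx < 0 or idx >= len(self.tokens):
            return EMPTY_TOKEN
        return self.tokens[idx]


state = ToyState([{"text": "Berlin", "index": 0}], buffer_start=0)
missing = state.B(1)            # only one token exists, so this is -1
unsafe = state.tokens[missing]  # direct indexing: -1 wraps to the last token
safe = state.safe_get(missing)  # bounds-checked lookup: the empty token
```

The direct lookup returns a real token without any error, which is why this kind of mistake can go unnoticed until results diverge between runs.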
Matthew Honnibal ad068f51be Fix out-of-bounds access in NER training
The helper method state.B(1) gets the index of the first token of the
buffer, or -1 if no such token exists. Normally this is safe because we
pass this to functions like state.safe_get(), which returns an empty
token. Here we used it directly as an array index, which is not okay!

This error may have been the cause of out-of-bounds access errors during
training. Similar errors may still be around, so they must be hunted down.
Hunting this one down took a long time...I printed out values across
training runs and diffed, looking for points of divergence between
runs, when no randomness should be allowed.
2018-10-27 00:46:30 +02:00
Matthew Honnibal f82f8ba5dd Fix serialization when empty parser model. Closes #2482 2018-09-28 15:18:52 +02:00
Matthew Honnibal 96fe314d8d Fix bug when too many entity types. Fixes #2800 2018-09-27 13:54:34 +02:00
Matthew Honnibal 500898907b Fix regression in parser.begin_training() 2018-09-25 11:08:31 +02:00
Matthew Honnibal c046392317 Trigger on_data hooks in parser model 2018-09-14 20:51:21 +02:00
Matthew Honnibal f32b52e611 Fix bug that caused deprojectivisation to run multiple times 2018-09-14 12:12:54 +02:00
Matthew Honnibal b43643a953 Support bilstm_depth option in parser 2018-09-13 19:29:49 +02:00
Matthew Honnibal 21321cd6cf Add tok2vec property to parser model 2018-09-13 14:08:43 +02:00
Matthew Honnibal 3763e20afc Pass subword_features and conv_depth params 2018-08-27 01:51:15 +02:00
Matthew Honnibal 5080760288 Add extra comment on 'add label' in parser 2018-08-15 15:37:24 +02:00
Matthew Honnibal 6ec236ab08 Fix label-clobber bug in parser.begin_training()
The parser.begin_training() method was rewritten in v2.1. The rewrite
introduced a regression, where if you added labels prior to
begin_training(), these labels were discarded. This patch fixes that.
2018-08-14 13:20:19 +02:00
Matthew Honnibal 01ace9734d Make pipeline work on empty docs 2018-06-29 19:21:38 +02:00
Matthew Honnibal ee33de8652 Fix unpickling of NER parser 2018-05-21 17:42:40 +02:00
Matthew Honnibal 7431e9c87f Fix parser for GPU 2018-05-19 17:24:34 +00:00
Matthew Honnibal a7aa49c419 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-05-16 23:20:51 +02:00
Matthew Honnibal d1b27fe5aa Revert "Improve dynamic oracle when values are missing in parse"
This reverts commit f56bd4736b.
2018-05-16 00:31:52 +02:00
Matthew Honnibal 8661218fe8
Refactor parser (#2308)
* Work on refactoring greedy parser

* Compile updated parser

* Fix refactored parser

* Update test

* Fix refactored parser

* Fix refactored parser

* Readd beam search after refactor

* Fix beam search after refactor

* Fix parser

* Fix beam parsing

* Support oracle segmentation in ud-train CLI command

* Avoid relying on final gold check in beam search

* Add a keyword argument sink to GoldParse

* Bug fixes to beam search after refactor

* Avoid importing fused token symbol in ud-run-test, until that's added

* Avoid importing fused token symbol in ud-run-test, until that's added

* Don't modify Token in global scope

* Fix error in beam gradient calculation

* Default to beam_update_prob 1

* Set a more aggressive threshold on the max violn update

* Disable some tests to figure out why CI fails

* Disable some tests to figure out why CI fails

* Add some diagnostics to travis.yml to try to figure out why build fails

* Tell Thinc to link against system blas on Travis

* Point thinc to libblas on Travis

* Try running sudo=true for travis

* Unhack travis.sh

* Restore beam_density argument for parser beam

* Require thinc 6.11.1.dev16

* Revert hacks to tests

* Revert hacks to travis.yml

* Update thinc requirement

* Fix parser model loading

* Fix size limits in training data

* Add missing name attribute for parser

* Fix appveyor for Windows
2018-05-15 22:17:29 +02:00
Matthew Honnibal f56bd4736b Improve dynamic oracle when values are missing in parse 2018-05-07 15:53:18 +02:00
Matthew Honnibal 8cd06cc763 Try to fix root-outside-sentence bug 2018-05-02 14:39:48 +00:00
Matthew Honnibal acebd01033 Set children from heads in finalize doc 2018-05-02 14:19:22 +00:00
Matthew Honnibal 2338e8c7fc Update develop from master 2018-05-02 01:36:12 +00:00
Matthew Honnibal 6d0fe67b72 Constrain subtok label to adjacent tokens 2018-05-01 17:34:27 +02:00
Matthew Honnibal 8f21953fc5 Constrain subtok to adjacent words 2018-05-01 17:29:00 +02:00
Matthew Honnibal 697bcaa34f Add some methods to ArcEager that make testing easier 2018-05-01 15:13:14 +02:00
Matthew Honnibal 5de8a36537 Fix arc_eager is_nonproj_tree 2018-04-29 15:49:11 +02:00
Matthew Honnibal 2c4a6d66fa Merge master into develop. Big merge, many conflicts -- need to review 2018-04-29 14:49:26 +02:00
Matthew Honnibal 3836199a83 Fix loading of models when custom vectors are added 2018-04-10 22:19:20 +02:00
Matthew Honnibal 96b612873b Add hyper-parameter to control whether parser makes a beam update 2018-04-03 22:02:56 +02:00
Ines Montani 3141e04822
💫 New system for error messages and warnings (#2163)
* Add spacy.errors module

* Update deprecation and user warnings

* Replace errors and asserts with new error message system

* Remove redundant asserts

* Fix whitespace

* Add messages for print/util.prints statements

* Fix typo

* Fix typos

* Move CLI messages to spacy.cli._messages

* Add decorator to display error code with message

An implementation like this is nice because it only modifies the string when it's retrieved from the containing class – so we don't have to worry about manipulating tracebacks etc.

* Remove unused link in spacy.about

* Update errors for invalid pipeline components

* Improve error for unknown factories

* Add displaCy warnings

* Update formatting consistency

* Move error message to spacy.errors

* Update errors and check if doc returned by component is None
2018-04-03 15:50:31 +02:00
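The decorator idea mentioned in the commit above (the code is only added to the string when it is retrieved from the containing class) could look roughly like the following. This is a minimal sketch under assumptions: the names `add_codes`, `ErrorsWithCodes` and `Errors`, and the message text, are illustrative and not confirmed by the commit message.

```python
def add_codes(err_cls):
    """Wrap a class of messages so each lookup is prefixed with its attribute name."""

    class ErrorsWithCodes:
        def __getattribute__(self, code):
            # The stored string is left untouched; the code is prepended
            # only at the moment the attribute is read.
            msg = getattr(err_cls, code)
            return "[{}] {}".format(code, msg)

    return ErrorsWithCodes()


@add_codes
class Errors:
    # Hypothetical message text, for illustration only.
    E001 = "No component '{name}' found in pipeline."


message = Errors.E001.format(name="ner")
```

Because the prefix is applied on attribute access, the message templates stay plain strings and nothing needs to touch exceptions or tracebacks after the fact.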
Matthew Honnibal 98165e43a7 Sometimes update beam with greedy oracle 2018-04-01 08:44:35 +00:00
Ines Montani 98e9cda677
Merge pull request #2158 from explosion/feature/fix-multiple-vectors (resolves #1660)
💫 Fix loading of multiple vector models
2018-03-28 23:08:24 +02:00
ines 3eb67bbe4b Allow entity types with dashes (resolves #1967) 2018-03-28 20:51:26 +02:00
Matthew Honnibal 79dc241caa Set pretrained_vectors in parser cfg 2018-03-28 17:35:07 +02:00
Matthew Honnibal 9bf6e93b3e Set pretrained_vectors in begin_training 2018-03-28 16:32:41 +02:00
Matthew Honnibal 95a9615221 Fix loading of multiple pre-trained vectors
This patch addresses #1660, which was caused by keying all pre-trained
vectors with the same ID when telling Thinc how to refer to them. This
meant that if multiple models were loaded that had pre-trained vectors,
errors or incorrect behaviour resulted.

The vectors class now includes a .name attribute, which defaults to:
{nlp.meta['lang']}_{nlp.meta['name']}.vectors
The vectors name is set in the cfg of the pipeline components under the
key pretrained_vectors. This replaces the previous cfg key
pretrained_dims.

In order to make existing models compatible with this change, we check
for the pretrained_dims key when loading models in from_disk and
from_bytes, and add the cfg key pretrained_vectors if we find it.
2018-03-28 16:02:59 +02:00
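The naming scheme and the backwards-compatibility check from the commit above can be sketched as follows. The helper names `default_vectors_name` and `upgrade_cfg` are hypothetical, and the exact name format (language and model name joined by an underscore, followed by `.vectors`) is an assumption based on the commit message; only the cfg keys `pretrained_dims` and `pretrained_vectors` come from the source.

```python
def default_vectors_name(meta):
    """Build a per-model vectors name from the model's meta data."""
    return "{}_{}.vectors".format(meta["lang"], meta["name"])


def upgrade_cfg(cfg, meta):
    """Return a copy of cfg that has pretrained_vectors when only the old key exists."""
    cfg = dict(cfg)
    if "pretrained_dims" in cfg and "pretrained_vectors" not in cfg:
        cfg["pretrained_vectors"] = default_vectors_name(meta)
    return cfg


meta = {"lang": "en", "name": "core_web_md"}
old_cfg = {"pretrained_dims": 300}
new_cfg = upgrade_cfg(old_cfg, meta)
```

Keying vectors by a per-model name rather than a shared ID is what lets two loaded models with pre-trained vectors coexist.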
Matthew Honnibal 18da89e04c Handle non-callable gold_tuples in parser begin_training 2018-03-27 21:08:41 +02:00
Matthew Honnibal 1f7229f40f Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop"
This reverts commit c9ba3d3c2d, reversing
changes made to 92c26a35d4.
2018-03-27 19:23:02 +02:00
Matthew Honnibal f57bfbccdc Fix non-projective label filtering 2018-03-27 13:41:33 +02:00
Matthew Honnibal d2118792e7 Merge changes from master 2018-03-27 13:38:41 +02:00
Matthew Honnibal 25280b7013 Try to make sum_state_features faster 2018-03-27 10:08:38 +00:00
Matthew Honnibal 987e1533a4 Use 8 features in parser 2018-03-27 10:08:12 +00:00
Matthew Honnibal dd54511c4f Pass data as a function in begin_training methods 2018-03-27 09:39:59 +00:00
Matthew Honnibal d9ebd78e11 Change default sizes in parser 2018-03-26 17:22:18 +02:00
Matthew Honnibal 49fbe2dfee Use thinc.openblas in spacy.syntax.nn_parser 2018-03-20 02:22:09 +01:00
Matthew Honnibal bede11b67c
Improve label management in parser and NER (#2108)
This patch does a few smallish things that tighten up the training workflow a little, and allow memory use during training to be reduced by letting the GoldCorpus stream data properly.

Previously, the parser and entity recognizer read and saved labels as lists, with extra labels noted separately. Lists were used because ordering is very important, to ensure that the label-to-class mapping is stable.

We now manage labels as nested dictionaries, first keyed by the action, and then keyed by the label. Values are frequencies. The trick is, how do we save new labels? We need to make sure we iterate over these in the same order they're added. Otherwise, we'll get different class IDs, and the model's predictions won't make sense.

To allow stable sorting, we map the new labels to negative values. If we have two new labels, they'll be noted as having "frequency" -1 and -2. The next new label will then have "frequency" -3. When we sort by (frequency, label), we then get a stable sort.
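The negative-frequency trick can be illustrated with a small sketch. The helper names `add_label` and `ordered_labels` are hypothetical, and the sort direction (descending frequency, ties broken by label) is an assumption about how the (frequency, label) key is applied, not something the message states.

```python
def add_label(labels, action, label):
    """Record a new label under an action with the next negative pseudo-frequency."""
    by_label = labels.setdefault(action, {})
    if label in by_label:
        return
    # Real frequencies are positive, so the lowest value seen so far tells us
    # how many new labels were already added: 0 -> -1, -1 -> -2, and so on.
    lowest = min([0] + list(by_label.values()))
    by_label[label] = lowest - 1


def ordered_labels(labels, action):
    """List labels by descending frequency, breaking ties by label."""
    items = labels.get(action, {}).items()
    return [label for label, freq in sorted(items, key=lambda kv: (-kv[1], kv[0]))]


labels = {"LEFT": {"nsubj": 50, "dobj": 30}}
add_label(labels, "LEFT", "zeta")
add_label(labels, "LEFT", "alpha")
order = ordered_labels(labels, "LEFT")
```

Here `alpha` was added after `zeta` and stays after it in the ordering, even though it sorts first alphabetically, so class IDs do not shift when labels are added.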

Storing frequencies then allows us to make the next nice improvement. Previously we had to iterate over the whole training set, to pre-process it for the deprojectivisation. This led to storing the whole training set in memory. This was most of the required memory during training.

To prevent this, we now store the frequencies as we stream in the data, and deprojectivize as we go. Once we've built the frequencies, we can then apply a frequency cut-off when we decide how many classes to make.
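A rough sketch of counting frequencies while streaming, then applying a cut-off, is shown below. Everything here is illustrative: `stream_docs`, `count_labels` and `labels_above_cutoff` are hypothetical names, and the toy data stands in for a real corpus reader.

```python
from collections import Counter


def stream_docs():
    """Hypothetical corpus reader: yields one document's labels at a time."""
    yield ["nsubj", "dobj", "det"]
    yield ["nsubj", "det", "amod"]
    yield ["nsubj", "det"]


def count_labels(docs):
    """Accumulate label frequencies in one pass, without storing the documents."""
    freqs = Counter()
    for labels in docs:
        freqs.update(labels)
    return freqs


def labels_above_cutoff(freqs, min_freq):
    """Keep only labels seen at least min_freq times."""
    return sorted(label for label, freq in freqs.items() if freq >= min_freq)


freqs = count_labels(stream_docs())
kept = labels_above_cutoff(freqs, min_freq=2)
```

Only the counts are held in memory, so the cut-off decision can be made after the pass without keeping the training set around.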

Finally, to allow proper data streaming, we also have to have some way of shuffling the iterator. This is awkward if the training files have multiple documents in them. To solve this, the GoldCorpus class now writes the training data to disk in msgpack files, one per document. We can then shuffle the data by shuffling the paths.
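The one-file-per-document approach can be sketched as below. Note the swap: this sketch writes JSON with the standard library instead of msgpack, purely to stay self-contained, and the helper names `write_docs` and `iter_shuffled` are hypothetical.

```python
import json
import random
import tempfile
from pathlib import Path


def write_docs(docs, directory):
    """Write each document to its own file and return the list of paths."""
    paths = []
    for i, doc in enumerate(docs):
        path = Path(directory) / "{}.json".format(i)
        path.write_text(json.dumps(doc), encoding="utf8")
        paths.append(path)
    return paths


def iter_shuffled(paths, seed=None):
    """Shuffle the paths, then load documents lazily in that order."""
    paths = list(paths)
    random.Random(seed).shuffle(paths)
    for path in paths:
        yield json.loads(path.read_text(encoding="utf8"))


docs = [{"id": i, "words": ["doc", str(i)]} for i in range(5)]
with tempfile.TemporaryDirectory() as tmp_dir:
    paths = write_docs(docs, tmp_dir)
    shuffled = list(iter_shuffled(paths, seed=0))
```

Shuffling a list of paths is cheap, and each epoch can reshuffle without loading more than one document at a time.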

This is a squash merge, as I made a lot of very small commits. Individual commit messages below.

* Simplify label management for TransitionSystem and its subclasses

* Fix serialization for new label handling format in parser

* Simplify and improve GoldCorpus class. Reduce memory use, write to temp dir

* Set actions in transition system

* Require thinc 6.11.1.dev4

* Fix error in parser init

* Add unicode declaration

* Fix unicode declaration

* Update textcat test

* Try to get model training on less memory

* Print json loc for now

* Try rapidjson to reduce memory use

* Remove rapidjson requirement

* Try rapidjson for reduced mem usage

* Handle None heads when projectivising

* Stream json docs

* Fix train script

* Handle projectivity in GoldParse

* Fix projectivity handling

* Add minibatch_by_words util from ud_train

* Minibatch by number of words in spacy.cli.train

* Move minibatch_by_words util to spacy.util

* Fix label handling

* More hacking at label management in parser

* Fix encoding in msgpack serialization in GoldParse

* Adjust batch sizes in parser training

* Fix minibatch_by_words

* Add merge_subtokens function to pipeline.pyx

* Register merge_subtokens factory

* Restore use of msgpack tmp directory

* Use minibatch-by-words in train

* Handle retokenization in scorer

* Change back-off approach for missing labels. Use 'dep' label

* Update NER for new label management

* Set NER tags for over-segmented words

* Fix label alignment in gold

* Fix label back-off for infrequent labels

* Fix int type in labels dict key

* Fix int type in labels dict key

* Update feature definition for 8 feature set

* Update ud-train script for new label stuff

* Fix json streamer

* Print the line number if conll eval fails

* Update children and sentence boundaries after deprojectivisation

* Export set_children_from_heads from doc.pxd

* Render parses during UD training

* Remove print statement

* Require thinc 6.11.1.dev6. Try adding wheel as install_requires

* Set different dev version, to flush pip cache

* Update thinc version

* Update GoldCorpus docs

* Remove print statements

* Fix formatting and links [ci skip]
2018-03-19 02:58:08 +01:00
Matthew Honnibal 307d6bf6d3 Fix parser for Thinc 6.11 2018-03-16 10:59:31 +01:00
Matthew Honnibal 9a389c4490 Fix parser for Thinc 6.11 2018-03-16 10:38:13 +01:00
Matthew Honnibal 648532d647 Don't assume blas methods are present 2018-03-16 02:48:20 +01:00
Matthew Honnibal e101f10ef0 Fix header 2018-03-13 02:12:16 +01:00
Matthew Honnibal d55620041b Switch parser to gemm from thinc.openblas 2018-03-13 02:10:58 +01:00
Matthew Honnibal 4b72c38556 Fix dropout bug in beam parser 2018-03-10 23:16:40 +01:00
Matthew Honnibal 3d6487c734 Support dropout in beam parse 2018-03-10 22:41:55 +01:00
Matthew Honnibal 14f729c72a Add subtok label to parser 2018-02-26 12:26:35 +01:00
Matthew Honnibal 7137ad8b0b Make label filtering clearer for projectivisation 2018-02-26 12:02:01 +01:00
Matthew Honnibal 7b66ec896a Revert "Revert "Improve parser oracle around sentence breaks.""
This reverts commit 36e481c584.
2018-02-26 10:57:37 +01:00
Matthew Honnibal 36e481c584 Revert "Improve parser oracle around sentence breaks."
This reverts commit 50817dc9ad.
2018-02-26 10:53:55 +01:00
Matthew Honnibal 50817dc9ad Improve parser oracle around sentence breaks. 2018-02-22 19:22:26 +01:00
Matthew Honnibal 661873ee4c Randomize the rebatch size in parser 2018-02-21 21:02:07 +01:00
Matthew Honnibal a0ddb803fd Make error when no label found more helpful 2018-02-21 16:00:59 +01:00
Matthew Honnibal ea2fc5d45f Improve length and freq cutoffs in parser 2018-02-21 16:00:38 +01:00
Matthew Honnibal e5757d4bf0 Add labels property to parser 2018-02-21 16:00:00 +01:00
Matthew Honnibal eff4ae809a Fix nonproj label filter 2018-02-21 15:59:04 +01:00
Matthew Honnibal e624405cda Temporarily remove cutoff when filtering labels in nonproj 2018-02-21 13:53:40 +01:00
Matthew Honnibal 8f06903e09 Fix multitask objectives 2018-02-17 18:41:36 +01:00
Matthew Honnibal d1246c95fb Fix model loading when using multitask objectives 2018-02-17 18:11:36 +01:00
Matthew Honnibal 7d5c720fc3 Fix multitask objective when no pipeline provided 2018-02-15 23:50:21 +01:00
Matthew Honnibal 59b7cf9db8 Add get_beam_parse method in ArcEager, for Prodigy 2018-02-15 21:03:16 +01:00
Claudiu-Vlad Ursache e28de12cbd
Ensure files opened in `from_disk` are closed
Fixes [issue 1706](https://github.com/explosion/spaCy/issues/1706).
2018-02-13 20:49:43 +01:00
Matthew Honnibal e361b4f82b Fix #1929: Incorrect NER when pre-set sentence boundaries. 2018-02-08 15:25:41 +01:00
Matthew Honnibal f74a802d09 Test and fix #1919: Error resuming training 2018-02-02 02:32:40 +01:00
Matthew Honnibal 85c942a6e3 Don't overwrite pretrained_dims setting from cfg. Fixes #1727 2018-01-23 19:10:49 +01:00
Matthew Honnibal fe4748fc38
Merge pull request #1870 from avadhpatel/master
Model Load Performance Improvement by more than 5x
2018-01-22 00:05:15 +01:00
Avadh Patel a517df55c8 Small fix
Signed-off-by: Avadh Patel <avadh4all@gmail.com>
2018-01-21 15:20:45 -06:00
Avadh Patel 5b5029890d Merge branch 'perfTuning' into perfTuningMaster
Signed-off-by: Avadh Patel <avadh4all@gmail.com>
2018-01-21 15:20:00 -06:00
Matthew Honnibal 203d2ea830 Allow multitask objectives to be added to the parser and NER more easily 2018-01-21 19:37:02 +01:00
Avadh Patel 75903949da Updated model building after suggestion from Matthew
Signed-off-by: Avadh Patel <avadh4all@gmail.com>
2018-01-18 06:51:57 -06:00
Avadh Patel fe879da2a1 Do not train model if it's going to be loaded from disk
This saves significant time in loading a model from disk.

Signed-off-by: Avadh Patel <avadh4all@gmail.com>
2018-01-17 06:16:07 -06:00
Avadh Patel 2146faffee Do not train model if it's going to be loaded from disk
This saves significant time in loading a model from disk.

Signed-off-by: Avadh Patel <avadh4all@gmail.com>
2018-01-17 06:04:22 -06:00
Matthew Honnibal f29c3925ee Fix more efficient nonproj 2017-11-23 12:48:00 +00:00
Matthew Honnibal db5c714ad2 Improve efficiency of deprojectivization 2017-11-23 12:31:34 +00:00
Matthew Honnibal d274d3a3b9 Let beam forward use minibatches 2017-11-15 00:51:42 +01:00
Matthew Honnibal 855872f872 Remove state hashing 2017-11-14 23:36:46 +01:00
Matthew Honnibal 2512ea9eeb Fix memory leak in beam parser 2017-11-14 02:11:40 +01:00
Matthew Honnibal ca73d0d8fe Cleanup states after beam parsing, explicitly 2017-11-13 18:18:26 +01:00
Matthew Honnibal 63ef9a2e73 Remove __dealloc__ from ParserBeam 2017-11-13 18:18:08 +01:00
Matthew Honnibal 25859dbb48 Return optimizer from begin_training, creating if necessary 2017-11-06 14:26:49 +01:00
Matthew Honnibal 2b35bb76ad Fix tensorizer on GPU 2017-11-05 15:34:40 +01:00
Matthew Honnibal 3ca16ddbd4 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-11-04 00:25:02 +01:00
Matthew Honnibal 98c29b7912 Add padding vector in parser, to make gradient more correct 2017-11-04 00:23:23 +01:00
Matthew Honnibal 13c8881d2f Expose parser's tok2vec model component 2017-11-03 20:20:59 +01:00
Matthew Honnibal 7fea845374 Remove print statement 2017-11-03 14:04:51 +01:00
Matthew Honnibal a5b05f85f0 Set Doc.tensor attribute in parser 2017-11-03 11:21:00 +01:00
Matthew Honnibal 7698903617 Fix GPU usage 2017-10-31 02:33:16 +01:00
Matthew Honnibal a0c7dabb72 Fix bug in 8-token parser features 2017-10-28 23:01:35 +00:00
Matthew Honnibal b713d10d97 Switch to 13 features in parser 2017-10-28 23:01:14 +00:00
Matthew Honnibal 5414e2f14b Use missing features in parser 2017-10-28 16:45:54 +00:00