Commit Graph

742 Commits

Author SHA1 Message Date
Matthew Honnibal ad068f51be Fix out-of-bounds access in NER training
The helper method state.B(1) gets the index of the first token of the
buffer, or -1 if no such token exists. Normally this is safe because we
pass this to functions like state.safe_get(), which returns an empty
token. Here we used it directly as an array index, which is not okay!

This error may have been the cause of out-of-bounds access errors during
training. Similar errors may still be around, so much be hunted down.
Hunting this one down took a long time...I printed out values across
training runs and diffed, looking for points of divergence between
runs, when no randomness should be allowed.
2018-10-27 00:46:30 +02:00
Matthew Honnibal 96fe314d8d Fix bug when too many entity types. Fixes #2800 2018-09-27 13:54:34 +02:00
Matthew Honnibal 3836199a83 Fix loading of models when custom vectors are added 2018-04-10 22:19:20 +02:00
Matthew Honnibal 96b612873b Add hyper-parameter to control whether parser makes a beam update 2018-04-03 22:02:56 +02:00
Ines Montani 3141e04822
💫 New system for error messages and warnings (#2163)
* Add spacy.errors module

* Update deprecation and user warnings

* Replace errors and asserts with new error message system

* Remove redundant asserts

* Fix whitespace

* Add messages for print/util.prints statements

* Fix typo

* Fix typos

* Move CLI messages to spacy.cli._messages

* Add decorator to display error code with message

An implementation like this is nice because it only modifies the string when it's retrieved from the containing class – so we don't have to worry about manipulating tracebacks etc.

* Remove unused link in spacy.about

* Update errors for invalid pipeline components

* Improve error for unknown factories

* Add displaCy warnings

* Update formatting consistency

* Move error message to spacy.errors

* Update errors and check if doc returned by component is None
2018-04-03 15:50:31 +02:00
Ines Montani 98e9cda677
Merge pull request #2158 from explosion/feature/fix-multiple-vectors (resolves #1660)
💫 Fix loading of multiple vector models
2018-03-28 23:08:24 +02:00
ines 3eb67bbe4b Allow entity types with dashes (resolves #1967) 2018-03-28 20:51:26 +02:00
Matthew Honnibal 79dc241caa Set pretrained_vectors in parser cfg 2018-03-28 17:35:07 +02:00
Matthew Honnibal 9bf6e93b3e Set pretrained_vectors in begin_training 2018-03-28 16:32:41 +02:00
Matthew Honnibal 95a9615221 Fix loading of multiple pre-trained vectors
This patch addresses #1660, which was caused by keying all pre-trained
vectors with the same ID when telling Thinc how to refer to them. This
meant that if multiple models were loaded that had pre-trained vectors,
errors or incorrect behaviour resulted.

The vectors class now includes a .name attribute, which defaults to:
{nlp.meta['lang']_nlp.meta['name']}.vectors
The vectors name is set in the cfg of the pipeline components under the
key pretrained_vectors. This replaces the previous cfg key
pretrained_dims.

In order to make existing models compatible with this change, we check
for the pretrained_dims key when loading models in from_disk and
from_bytes, and add the cfg key pretrained_vectors if we find it.
2018-03-28 16:02:59 +02:00
Matthew Honnibal 8f06903e09 Fix multitask objectives 2018-02-17 18:41:36 +01:00
Matthew Honnibal d1246c95fb Fix model loading when using multitask objectives 2018-02-17 18:11:36 +01:00
Matthew Honnibal 7d5c720fc3 Fix multitask objective when no pipeline provided 2018-02-15 23:50:21 +01:00
Matthew Honnibal 59b7cf9db8 Add get_beam_parse method in ArcEager, for Prodigy 2018-02-15 21:03:16 +01:00
Claudiu-Vlad Ursache e28de12cbd
Ensure files opened in `from_disk` are closed
Fixes [issue 1706](https://github.com/explosion/spaCy/issues/1706).
2018-02-13 20:49:43 +01:00
Matthew Honnibal e361b4f82b Fix #1929: Incorrect NER when pre-set sentence boundaries. 2018-02-08 15:25:41 +01:00
Matthew Honnibal f74a802d09 Test and fix #1919: Error resuming training 2018-02-02 02:32:40 +01:00
Matthew Honnibal 85c942a6e3 Dont overwrite pretrained_dims setting from cfg. Fixes #1727 2018-01-23 19:10:49 +01:00
Matthew Honnibal fe4748fc38
Merge pull request #1870 from avadhpatel/master
Model Load Performance Improvement by more than 5x
2018-01-22 00:05:15 +01:00
Avadh Patel a517df55c8 Small fix
Signed-off-by: Avadh Patel <avadh4all@gmail.com>
2018-01-21 15:20:45 -06:00
Avadh Patel 5b5029890d Merge branch 'perfTuning' into perfTuningMaster
Signed-off-by: Avadh Patel <avadh4all@gmail.com>
2018-01-21 15:20:00 -06:00
Matthew Honnibal 203d2ea830 Allow multitask objectives to be added to the parser and NER more easily 2018-01-21 19:37:02 +01:00
Avadh Patel 75903949da Updated model building after suggestion from Matthew
Signed-off-by: Avadh Patel <avadh4all@gmail.com>
2018-01-18 06:51:57 -06:00
Avadh Patel fe879da2a1 Do not train model if its going to be loaded from disk
This saves significant time in loading a model from disk.

Signed-off-by: Avadh Patel <avadh4all@gmail.com>
2018-01-17 06:16:07 -06:00
Avadh Patel 2146faffee Do not train model if its going to be loaded from disk
This saves significant time in loading a model from disk.

Signed-off-by: Avadh Patel <avadh4all@gmail.com>
2018-01-17 06:04:22 -06:00
Matthew Honnibal f29c3925ee Fix more efficient nonproj 2017-11-23 12:48:00 +00:00
Matthew Honnibal db5c714ad2 Improve efficiency of deprojectivization 2017-11-23 12:31:34 +00:00
Matthew Honnibal d274d3a3b9 Let beam forward use minibatches 2017-11-15 00:51:42 +01:00
Matthew Honnibal 855872f872 Remove state hashing 2017-11-14 23:36:46 +01:00
Matthew Honnibal 2512ea9eeb Fix memory leak in beam parser 2017-11-14 02:11:40 +01:00
Matthew Honnibal ca73d0d8fe Cleanup states after beam parsing, explicitly 2017-11-13 18:18:26 +01:00
Matthew Honnibal 63ef9a2e73 Remove __dealloc__ from ParserBeam 2017-11-13 18:18:08 +01:00
Matthew Honnibal 25859dbb48 Return optimizer from begin_training, creating if necessary 2017-11-06 14:26:49 +01:00
Matthew Honnibal 2b35bb76ad Fix tensorizer on GPU 2017-11-05 15:34:40 +01:00
Matthew Honnibal 3ca16ddbd4 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-11-04 00:25:02 +01:00
Matthew Honnibal 98c29b7912 Add padding vector in parser, to make gradient more correct 2017-11-04 00:23:23 +01:00
Matthew Honnibal 13c8881d2f Expose parser's tok2vec model component 2017-11-03 20:20:59 +01:00
Matthew Honnibal 7fea845374 Remove print statement 2017-11-03 14:04:51 +01:00
Matthew Honnibal a5b05f85f0 Set Doc.tensor attribute in parser 2017-11-03 11:21:00 +01:00
Matthew Honnibal 7698903617 Fix GPU usage 2017-10-31 02:33:16 +01:00
Matthew Honnibal a0c7dabb72 Fix bug in 8-token parser features 2017-10-28 23:01:35 +00:00
Matthew Honnibal b713d10d97 Switch to 13 features in parser 2017-10-28 23:01:14 +00:00
Matthew Honnibal 5414e2f14b Use missing features in parser 2017-10-28 16:45:54 +00:00
Matthew Honnibal 64e4ff7c4b Merge 'tidy-up' changes into branch. Resolve conflicts 2017-10-28 13:16:06 +02:00
Explosion Bot b22e42af7f Merge changes to parser and _ml 2017-10-28 11:52:10 +02:00
ines b4d226a3f1 Tidy up syntax 2017-10-27 19:45:57 +02:00
ines 9c89e2cdef Remove unused syntax iterators (now in language data) 2017-10-27 18:09:53 +02:00
ines e33b7e0b3c Tidy up parser and ML 2017-10-27 14:39:30 +02:00
Matthew Honnibal 531142a933 Merge remote-tracking branch 'origin/develop' into feature/better-parser 2017-10-27 12:34:48 +00:00
Matthew Honnibal 75a637fa43 Remove redundant imports from _ml 2017-10-27 10:19:56 +00:00