Commit Graph

169 Commits

Author SHA1 Message Date
Matthew Honnibal 3836199a83 Fix loading of models when custom vectors are added 2018-04-10 22:19:20 +02:00
ines 5ecb274764 Fix indentation error and set Doc.is_tagged correctly 2018-04-10 16:14:52 +02:00
ines 987ee27af7 Return Doc in noun chunks merger component if Doc is not parsed 2018-04-09 14:51:02 +02:00
ines e5f47cd82d Update errors 2018-04-03 21:40:29 +02:00
Ines Montani 3141e04822
💫 New system for error messages and warnings (#2163)
* Add spacy.errors module

* Update deprecation and user warnings

* Replace errors and asserts with new error message system

* Remove redundant asserts

* Fix whitespace

* Add messages for print/util.prints statements

* Fix typo

* Fix typos

* Move CLI messages to spacy.cli._messages

* Add decorator to display error code with message

An implementation like this is nice because it only modifies the string when it's retrieved from the containing class – so we don't have to worry about manipulating tracebacks etc.

* Remove unused link in spacy.about

* Update errors for invalid pipeline components

* Improve error for unknown factories

* Add displaCy warnings

* Update formatting consistency

* Move error message to spacy.errors

* Update errors and check if doc returned by component is None
2018-04-03 15:50:31 +02:00
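The access-time decorator described in the commit above can be sketched roughly as follows (class and message names here are illustrative, not spaCy's actual definitions):

```python
class _Errors:
    # Raw message templates, stored without their codes
    E001 = "No component '{name}' found in pipeline."
    E002 = "Can't find factory for '{name}'."


class ErrorsWithCodes:
    """Prepend the error code only when a message is looked up, so the
    stored strings themselves are never modified (and tracebacks etc.
    need no special handling)."""

    def __getattribute__(self, code):
        msg = getattr(_Errors, code)
        return "[{}] {}".format(code, msg)


Errors = ErrorsWithCodes()
```

Looking up `Errors.E001` then yields the template with its code prefixed, ready for `.format(...)` at the raise site.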
Ines Montani a609a1ca29
Merge pull request #2152 from explosion/feature/tidy-up-dependencies
💫 Tidy up dependencies
2018-03-29 14:35:09 +02:00
Matthew Honnibal 8308bbc617 Get msgpack and msgpack_numpy via Thinc, to avoid potential version conflicts 2018-03-29 00:14:55 +02:00
Matthew Honnibal bc4afa9881 Remove print statement 2018-03-28 17:48:37 +02:00
Matthew Honnibal 9bf6e93b3e Set pretrained_vectors in begin_training 2018-03-28 16:32:41 +02:00
Matthew Honnibal 95a9615221 Fix loading of multiple pre-trained vectors
This patch addresses #1660, which was caused by keying all pre-trained
vectors with the same ID when telling Thinc how to refer to them. This
meant that if multiple models were loaded that had pre-trained vectors,
errors or incorrect behaviour resulted.

The vectors class now includes a .name attribute, which defaults to:
{nlp.meta['lang']}_{nlp.meta['name']}.vectors
The vectors name is set in the cfg of the pipeline components under the
key pretrained_vectors. This replaces the previous cfg key
pretrained_dims.

In order to make existing models compatible with this change, we check
for the pretrained_dims key when loading models in from_disk and
from_bytes, and add the cfg key pretrained_vectors if we find it.
2018-03-28 16:02:59 +02:00
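The naming scheme and backward-compatibility shim described above can be sketched like this (function names and cfg handling are illustrative, not the actual implementation):

```python
def default_vectors_name(meta):
    # Per-model vectors name derived from the model meta, so two loaded
    # models with pre-trained vectors no longer collide on one ID
    return "{}_{}.vectors".format(meta["lang"], meta["name"])


def upgrade_cfg(cfg, meta):
    """Backward-compat shim: older models store pretrained_dims, so map
    it to the new pretrained_vectors key when loading from disk/bytes."""
    if "pretrained_dims" in cfg and "pretrained_vectors" not in cfg:
        cfg = dict(cfg)
        cfg["pretrained_vectors"] = default_vectors_name(meta)
    return cfg
```

With this, `en_core_web_sm` would key its vectors as `en_core_web_sm.vectors` rather than a shared default.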
ines f3f8bfc367 Add built-in factories for merge_entities and merge_noun_chunks
Allows adding those components to the pipeline out-of-the-box if they're defined in a model's meta.json. Also allows usage as nlp.add_pipe(nlp.create_pipe('merge_entities')).
2018-03-15 17:16:54 +01:00
Matthew Honnibal 8f06903e09 Fix multitask objectives 2018-02-17 18:41:36 +01:00
Matthew Honnibal d1246c95fb Fix model loading when using multitask objectives 2018-02-17 18:11:36 +01:00
Matthew Honnibal 3e541de440 Merge branch 'master' of https://github.com/explosion/spaCy 2018-02-15 21:02:55 +01:00
Claudiu-Vlad Ursache e28de12cbd
Ensure files opened in `from_disk` are closed
Fixes [issue 1706](https://github.com/explosion/spaCy/issues/1706).
2018-02-13 20:49:43 +01:00
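The fix follows the standard pattern of reading through a context manager, which guarantees the handle is closed even if deserialization raises; a minimal illustration (helper name hypothetical):

```python
from pathlib import Path


def read_bytes(path):
    # 'with' closes the file on success and on error alike, instead of
    # leaving an unclosed handle behind
    with Path(path).open("rb") as f:
        return f.read()
```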
Matthew Honnibal d7c9b53120 Pass kwargs into pipeline components during begin_training 2018-02-12 10:18:39 +01:00
Matthew Honnibal f3753c2453 Further model deserialization fixes re #1727 2018-01-23 19:16:05 +01:00
Matthew Honnibal 85c942a6e3 Don't overwrite pretrained_dims setting from cfg. Fixes #1727 2018-01-23 19:10:49 +01:00
Matthew Honnibal 203d2ea830 Allow multitask objectives to be added to the parser and NER more easily 2018-01-21 19:37:02 +01:00
Matthew Honnibal 61a051f2c0 Fix MultitaskObjective 2018-01-21 19:21:34 +01:00
Matthew Honnibal c27c82d5f9 Fix serialization 2017-11-08 13:08:48 +01:00
Matthew Honnibal 072ff38a01 Try to fix python3.5 serialization 2017-11-08 12:10:49 +01:00
Matthew Honnibal dd90fe09f5 Remove extraneous label from textcat class 2017-11-06 22:09:02 +01:00
Matthew Honnibal 8fea512ac8 Don't set tensor in textcat 2017-11-06 19:20:14 +01:00
Matthew Honnibal 75e1618ec3 Fix lemma clobbering 2017-11-06 16:56:19 +01:00
Matthew Honnibal 25859dbb48 Return optimizer from begin_training, creating if necessary 2017-11-06 14:26:49 +01:00
Matthew Honnibal 31babe3c3f Fix non-clobbering lemmatization 2017-11-06 12:36:05 +01:00
Matthew Honnibal 2b35bb76ad Fix tensorizer on GPU 2017-11-05 15:34:40 +01:00
uwol a2162b8908 Fix tensorizer return parameter 2017-11-05 12:25:10 +01:00
Matthew Honnibal 17c63906f9 Update tensorizer component 2017-11-03 20:20:26 +01:00
Matthew Honnibal 6681058abd Fix tensor extending in tagger 2017-11-03 13:29:36 +01:00
Matthew Honnibal d6fc39c8a6 Set Doc.tensor from Tagger 2017-11-03 11:20:05 +01:00
Matthew Honnibal b30dd36179 Allow Tagger.add_label() before training 2017-11-01 21:49:24 +01:00
Matthew Honnibal b84d99b281 Revert tagger.add_label() changes, to fix model 2017-11-01 21:10:45 +01:00
Matthew Honnibal f5855e539b Fix tagger model loading 2017-11-01 20:42:36 +01:00
Matthew Honnibal 190522efd3 Fix tagger when some tags aren't in Morphology 2017-11-01 19:27:49 +01:00
Matthew Honnibal 7ae1aacdb8 Fix add_label methods 2017-11-01 17:06:43 +01:00
Matthew Honnibal e7a9174877 Add add_label methods to Tagger and TextCategorizer 2017-11-01 16:32:44 +01:00
ines ba5e646219 Tidy up pipeline 2017-10-27 20:29:08 +02:00
Ines Montani 4033e70c71 Merge pull request #1461 from explosion/feature/disable-pipes
💫 Add Language.disable_pipes(), to temporarily edit pipeline and update code examples
2017-10-27 12:21:40 +02:00
ines 9e372913e0 Remove old 'SP' condition in tag map 2017-10-26 16:11:57 +02:00
Matthew Honnibal a8abc47811 Rename BaseThincComponent --> Pipe 2017-10-26 12:40:40 +02:00
Matthew Honnibal b0f3ea2200 Fix names of pipeline components
NeuralDependencyParser --> DependencyParser
NeuralEntityRecognizer --> EntityRecognizer
TokenVectorEncoder     --> Tensorizer
NeuralLabeller         --> MultitaskObjective
2017-10-26 12:38:23 +02:00
Matthew Honnibal ed8da9b11f Add missing return statement in SentenceSegmenter 2017-10-17 15:32:56 +02:00
Matthew Honnibal 09d61ada5e Merge pull request #1396 from explosion/feature/pipeline-management
💫 Improve pipeline and factory management
2017-10-10 04:29:54 +02:00
Matthew Honnibal 8978212ee5 Patch serialization bug raised in #1105 2017-10-10 03:58:12 +02:00
Matthew Honnibal 0384f08218 Trigger nonproj.deprojectivize as a postprocess 2017-10-07 02:00:47 +02:00
Matthew Honnibal 563f46f026 Fix multi-label support for text classification
The TextCategorizer class is supposed to support multi-label
text classification, and allow training data to contain missing
values.

For this to work, the gradient of the loss should be 0 when labels
are missing. Instead, there was no way to actually denote "missing"
in the GoldParse class, and so the TextCategorizer class treated
the label set within gold.cats as complete.

To fix this, we change GoldParse.cats to be a dict instead of a list.
The GoldParse.cats dict should map to floats, with 1. denoting
'present' and 0. denoting 'absent'. Gradients are zeroed for categories
absent from the gold.cats dict. A nice bonus is that you can also set
values between 0 and 1 for partial membership. You can also set numeric
values, if you're using a text classification model that uses an
appropriate loss function.

Unfortunately this is a breaking change, although the functionality
was only recently introduced and hasn't been properly documented
yet. I've updated the example script accordingly.
2017-10-05 18:43:02 -05:00
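The zeroed-gradient behaviour described above can be sketched as follows (a simplified illustration, not the actual TextCategorizer code):

```python
def textcat_gradient(scores, cats, labels):
    """Per-label gradient of a squared-error-style loss: (score - truth)
    for labels annotated in the cats dict, 0.0 for missing labels."""
    grad = []
    for label, score in zip(labels, scores):
        if label in cats:
            grad.append(score - cats[label])
        else:
            grad.append(0.0)  # missing label: no learning signal
    return grad


# cats as a dict: 1.0 = present, 0.0 = absent; "SPAM" is simply missing,
# so its gradient is zeroed rather than treated as a negative example
cats = {"POSITIVE": 1.0, "NEGATIVE": 0.0}
grad = textcat_gradient([0.8, 0.3, 0.6], cats, ["POSITIVE", "NEGATIVE", "SPAM"])
```

Intermediate values between 0 and 1 work the same way, which is what enables the partial-membership bonus mentioned above.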
Matthew Honnibal 5454b20cd7 Update thinc imports for 6.9 2017-10-03 20:07:17 +02:00
Matthew Honnibal 4a59f6358c Fix thinc imports 2017-10-03 19:21:26 +02:00