Commit Graph

125 Commits

Author SHA1 Message Date
Matthew Honnibal 09d61ada5e Merge pull request #1396 from explosion/feature/pipeline-management
💫 Improve pipeline and factory management
2017-10-10 04:29:54 +02:00
Matthew Honnibal 8978212ee5 Patch serialization bug raised in #1105 2017-10-10 03:58:12 +02:00
Matthew Honnibal 0384f08218 Trigger nonproj.deprojectivize as a postprocess 2017-10-07 02:00:47 +02:00
Matthew Honnibal 563f46f026 Fix multi-label support for text classification
The TextCategorizer class is supposed to support multi-label
text classification, and allow training data to contain missing
values.

For this to work, the gradient of the loss should be 0 when labels
are missing. Instead, there was no way to actually denote "missing"
in the GoldParse class, and so the TextCategorizer class treated
the label set within gold.cats as complete.

To fix this, we change GoldParse.cats to be a dict instead of a list.
The GoldParse.cats dict should map labels to floats, with 1. denoting
'present' and 0. denoting 'absent'. Gradients are zeroed for categories
whose keys are missing from the gold.cats dict. A nice bonus is that you
can also set values between 0 and 1 for partial membership. You can also
set other numeric values, if you're using a text classification model
with an appropriate loss function.

Unfortunately this is a breaking change, although the functionality
was only recently introduced and hasn't been properly documented
yet. I've updated the example script accordingly.
2017-10-05 18:43:02 -05:00
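The missing-label behaviour described in commit 563f46f026 can be illustrated with a small sketch. This is not spaCy's implementation: the function name `masked_textcat_gradient` is hypothetical, and the squared-error loss is an assumption, since the commit message does not say which loss the TextCategorizer uses. It only shows the idea that a label whose key is missing from the cats dict contributes a zero gradient, while a label explicitly set to 0. is treated as a real negative example.

```python
def masked_textcat_gradient(scores, cats, labels):
    """Per-label gradient of an assumed squared-error loss.

    scores: predicted score for each label, in the same order as `labels`.
    cats:   dict mapping label -> float (1.0 present, 0.0 absent,
            values in between for partial membership).
    Labels whose key is missing from `cats` get a gradient of 0.0.
    """
    gradient = []
    for label, score in zip(labels, scores):
        if label in cats:
            # Annotated label: d/d(score) of 0.5 * (score - target) ** 2
            gradient.append(score - float(cats[label]))
        else:
            # Missing annotation: no learning signal for this label
            gradient.append(0.0)
    return gradient


labels = ["SPORTS", "POLITICS", "TECH"]
scores = [0.9, 0.2, 0.6]

# SPORTS is present, POLITICS is explicitly absent, TECH is unannotated.
cats = {"SPORTS": 1.0, "POLITICS": 0.0}
grad = masked_textcat_gradient(scores, cats, labels)
```

Here the unannotated TECH label receives a zero gradient, whereas POLITICS, which is explicitly marked 0., still pushes its score down.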
Matthew Honnibal 5454b20cd7 Update thinc imports for 6.9 2017-10-03 20:07:17 +02:00
Matthew Honnibal 4a59f6358c Fix thinc imports 2017-10-03 19:21:26 +02:00
Matthew Honnibal 66c388ee01 Remove unhelpful multitask objectives 2017-09-27 11:44:16 -05:00
Matthew Honnibal 983201a83a Fix hard-coded vector width 2017-09-27 11:43:58 -05:00
Matthew Honnibal defb68e94f Update feature/noshare with recent develop changes 2017-09-26 08:15:14 -05:00
Matthew Honnibal ca28590ddd Use dep and ent multi-task objectives for parser 2017-09-26 08:13:52 -05:00
Matthew Honnibal 18a27c7579 Fix typo in tensorizer serialization 2017-09-26 06:45:14 -05:00
Matthew Honnibal bf917225ab Allow multi-task objectives during training 2017-09-26 05:42:52 -05:00
ines d2d35b63b7 Fix formatting 2017-09-25 18:37:13 +02:00
Matthew Honnibal 8eb0b7b779 Add docstrings for Pipe API 2017-09-25 16:22:07 +02:00
Matthew Honnibal 39f390dba7 Add docstrings for Pipe API 2017-09-25 16:20:49 +02:00
Matthew Honnibal 4348c479fc Merge pre-trained vectors and noshare patches 2017-09-22 20:07:28 -05:00
Matthew Honnibal 386c1a5bd8 Fix tagger training 2017-09-23 02:58:06 +02:00
Matthew Honnibal 05596159bf Fix serialization when using pre-trained vectors 2017-09-22 15:33:27 -05:00
Matthew Honnibal d9124f1aa3 Add link_vectors_to_models function 2017-09-22 09:38:22 -05:00
Matthew Honnibal 40a4873b70 Fix serialization of model options 2017-09-21 13:07:26 -05:00
Matthew Honnibal 20193371f5 Don't share CNN, to reduce complexities 2017-09-21 14:59:48 +02:00
Matthew Honnibal 24e85c2048 Pass values for CNN maxout pieces option 2017-09-20 19:16:12 -05:00
Matthew Honnibal b36a38f63d Fix serialization of pretrained_dims property 2017-09-19 23:42:27 +02:00
Matthew Honnibal 40837b275d Fix tensorizer with pretrained vectors 2017-09-18 18:05:38 -05:00
Matthew Honnibal 84e637e2e6 Pass option for pretrained vectors in pipeline 2017-09-16 12:46:02 -05:00
Matthew Honnibal 7fdafcc4c4 Fix config loading in tagger 2017-09-04 16:38:49 +02:00
Matthew Honnibal 382ce566eb Fix deserialization bug 2017-09-04 15:19:01 +02:00
Matthew Honnibal 9e378bdac5 Fix textcat serialization 2017-09-02 15:17:20 +02:00
Matthew Honnibal a3b69bcb3d Add low_data mode in textcat 2017-09-02 14:56:30 +02:00
Matthew Honnibal 5e6a9e7dcc Add rule-based SBD 2017-09-02 12:53:38 +02:00
Matthew Honnibal c1d3ff517a Track loss in tagger 2017-08-20 14:42:23 +02:00
Matthew Honnibal ec482580b5 Restore changes to pipeline.pyx from nn-beam-parser branch 2017-08-18 22:02:35 +02:00
Matthew Honnibal 426f84937f Resolve conflicts when merging new beam parsing stuff 2017-08-18 13:38:32 -05:00
Matthew Honnibal 1cb2f15d65 Clean up unused predict_confidences function 2017-08-16 18:22:26 -05:00
Matthew Honnibal 52c180ecf5 Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop"
This reverts commit ea8de11ad5, reversing
changes made to 08e443e083.
2017-08-14 13:00:23 +02:00
Matthew Honnibal 3e30712b62 Improve defaults 2017-08-12 19:24:17 -05:00
Matthew Honnibal 680043ebca Improve efficiency of tagger.set_annotations for GPU 2017-08-12 08:54:21 -05:00
Matthew Honnibal 3cb8f06881 Fix NeuralLabeller 2017-08-06 14:15:14 +02:00
Matthew Honnibal e9ab800e15 Fix tagging model 2017-08-06 01:50:08 +02:00
Matthew Honnibal 468c138ab3 WIP: Add fine-tuning logic to tagger model, re #1182 2017-08-06 01:13:23 +02:00
Matthew Honnibal 6780132821 Fix tagger loading 2017-07-25 19:41:11 +02:00
Matthew Honnibal c4a81a47a4 Fix deserialization 2017-07-23 14:11:07 +02:00
Matthew Honnibal 4fe77bced2 Add cfg attr to pipeline components 2017-07-23 00:52:47 +02:00
Matthew Honnibal a88a7deffe Fix save/load of textcat config 2017-07-23 00:33:43 +02:00
Matthew Honnibal b55714d5d1 Make gold_tuples arg optional in begin_training 2017-07-22 20:04:43 +02:00
Matthew Honnibal b3a749610e Fix name of TextCategorizer 2017-07-22 01:14:07 +02:00
Matthew Honnibal a231b56d40 Add text-classification hook to pipeline 2017-07-20 00:18:15 +02:00
Matthew Honnibal d59fa32df1 Add experimental SimilarityHook component 2017-06-05 15:40:03 +02:00
Matthew Honnibal b3b5521625 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-06-04 20:17:18 -05:00
Matthew Honnibal 7b2ede783d Add SP tag to tag map if missing 2017-06-04 20:16:30 -05:00