Commit Graph

196 Commits

Author SHA1 Message Date
Matthew Honnibal 3e3a309764 Fix tagger 2018-09-13 14:14:38 +02:00
Matthew Honnibal a95eea4c06 Fix multi-task objective for parser 2018-09-13 14:08:55 +02:00
Matthew Honnibal d6aa60139d Fix tagger training on GPU 2018-09-13 14:05:37 +02:00
Ines Montani e7b075565d
💫 Rule-based NER component (#2513)
* Add helper function for reading in JSONL

* Add rule-based NER component

* Fix whitespace

* Add component to factories

* Add tests

* Add option to disable indent on json_dumps compat

Otherwise, reading JSONL back in line by line won't work

* Fix error code
2018-07-18 19:43:16 +02:00
Matthew Honnibal 08c362d541 Suppress compiler warning about unreachable code 2018-07-06 11:31:22 +02:00
Matthew Honnibal 01ace9734d Make pipeline work on empty docs 2018-06-29 19:21:38 +02:00
Matthew Honnibal a1b05048d0 Fix tagger when doc is empty 2018-06-29 16:05:40 +02:00
Matthew Honnibal 3786942ff1 Fix tagger when docs are empty 2018-06-29 15:13:45 +02:00
Matthew Honnibal e0860bcfb3 Fix bug when docs are empty 2018-06-29 13:56:29 +02:00
Matthew Honnibal a4d2b0c293 Fix bug when docs are empty 2018-06-29 13:44:25 +02:00
Matthew Honnibal 5a65418c40 Fix handling of unseen labels in tagger 2018-06-25 22:28:59 +02:00
Matthew Honnibal 5b56aad4c2 Fix handling of unseen labels in tagger 2018-06-25 22:24:54 +02:00
Matthew Honnibal 3aabf621a3 Fix handling of unknown tags in tagger update 2018-06-25 22:01:02 +02:00
Matthew Honnibal 569440a6db Dont normalize gradient by batch size 2018-05-02 08:42:10 +02:00
Matthew Honnibal 5260268f70 Fix textcat after merge 2018-04-29 15:48:53 +02:00
Matthew Honnibal 2c4a6d66fa Merge master into develop. Big merge, many conflicts -- need to review 2018-04-29 14:49:26 +02:00
Matthew Honnibal 3836199a83 Fix loading of models when custom vectors are added 2018-04-10 22:19:20 +02:00
ines 5ecb274764 Fix indentation error and set Doc.is_tagged correctly 2018-04-10 16:14:52 +02:00
ines 987ee27af7 Return Doc if noun chunks merger component if Doc is not parsed 2018-04-09 14:51:02 +02:00
ines e5f47cd82d Update errors 2018-04-03 21:40:29 +02:00
Ines Montani 3141e04822
💫 New system for error messages and warnings (#2163)
* Add spacy.errors module

* Update deprecation and user warnings

* Replace errors and asserts with new error message system

* Remove redundant asserts

* Fix whitespace

* Add messages for print/util.prints statements

* Fix typo

* Fix typos

* Move CLI messages to spacy.cli._messages

* Add decorator to display error code with message

An implementation like this is nice because it only modifies the string when it's retrieved from the containing class – so we don't have to worry about manipulating tracebacks etc.

* Remove unused link in spacy.about

* Update errors for invalid pipeline components

* Improve error for unknown factories

* Add displaCy warnings

* Update formatting consistency

* Move error message to spacy.errors

* Update errors and check if doc returned by component is None
2018-04-03 15:50:31 +02:00
Ines Montani a609a1ca29
Merge pull request #2152 from explosion/feature/tidy-up-dependencies
💫 Tidy up dependencies
2018-03-29 14:35:09 +02:00
Matthew Honnibal 8308bbc617 Get msgpack and msgpack_numpy via Thinc, to avoid potential version conflicts 2018-03-29 00:14:55 +02:00
Matthew Honnibal bc4afa9881 Remove print statement 2018-03-28 17:48:37 +02:00
Matthew Honnibal 9bf6e93b3e Set pretrained_vectors in begin_training 2018-03-28 16:32:41 +02:00
Matthew Honnibal 95a9615221 Fix loading of multiple pre-trained vectors
This patch addresses #1660, which was caused by keying all pre-trained
vectors with the same ID when telling Thinc how to refer to them. This
meant that if multiple models were loaded that had pre-trained vectors,
errors or incorrect behaviour resulted.

The vectors class now includes a .name attribute, which defaults to:
{nlp.meta['lang']_nlp.meta['name']}.vectors
The vectors name is set in the cfg of the pipeline components under the
key pretrained_vectors. This replaces the previous cfg key
pretrained_dims.

In order to make existing models compatible with this change, we check
for the pretrained_dims key when loading models in from_disk and
from_bytes, and add the cfg key pretrained_vectors if we find it.
2018-03-28 16:02:59 +02:00
Matthew Honnibal 1f7229f40f Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop"
This reverts commit c9ba3d3c2d, reversing
changes made to 92c26a35d4.
2018-03-27 19:23:02 +02:00
Matthew Honnibal f57bfbccdc Fix non-projective label filtering 2018-03-27 13:41:33 +02:00
Matthew Honnibal dd54511c4f Pass data as a function in begin_training methods 2018-03-27 09:39:59 +00:00
Matthew Honnibal bede11b67c
Improve label management in parser and NER (#2108)
This patch does a few smallish things that tighten up the training workflow a little, and allow memory use during training to be reduced by letting the GoldCorpus stream data properly.

Previously, the parser and entity recognizer read and saved labels as lists, with extra labels noted separately. Lists were used becaue ordering is very important, to ensure that the label-to-class mapping is stable.

We now manage labels as nested dictionaries, first keyed by the action, and then keyed by the label. Values are frequencies. The trick is, how do we save new labels? We need to make sure we iterate over these in the same order they're added. Otherwise, we'll get different class IDs, and the model's predictions won't make sense.

To allow stable sorting, we map the new labels to negative values. If we have two new labels, they'll be noted as having "frequency" -1 and -2. The next new label will then have "frequency" -3. When we sort by (frequency, label), we then get a stable sort.

Storing frequencies then allows us to make the next nice improvement. Previously we had to iterate over the whole training set, to pre-process it for the deprojectivisation. This led to storing the whole training set in memory. This was most of the required memory during training.

To prevent this, we now store the frequencies as we stream in the data, and deprojectivize as we go. Once we've built the frequencies, we can then apply a frequency cut-off when we decide how many classes to make.

Finally, to allow proper data streaming, we also have to have some way of shuffling the iterator. This is awkward if the training files have multiple documents in them. To solve this, the GoldCorpus class now writes the training data to disk in msgpack files, one per document. We can then shuffle the data by shuffling the paths.

This is a squash merge, as I made a lot of very small commits. Individual commit messages below.

* Simplify label management for TransitionSystem and its subclasses

* Fix serialization for new label handling format in parser

* Simplify and improve GoldCorpus class. Reduce memory use, write to temp dir

* Set actions in transition system

* Require thinc 6.11.1.dev4

* Fix error in parser init

* Add unicode declaration

* Fix unicode declaration

* Update textcat test

* Try to get model training on less memory

* Print json loc for now

* Try rapidjson to reduce memory use

* Remove rapidjson requirement

* Try rapidjson for reduced mem usage

* Handle None heads when projectivising

* Stream json docs

* Fix train script

* Handle projectivity in GoldParse

* Fix projectivity handling

* Add minibatch_by_words util from ud_train

* Minibatch by number of words in spacy.cli.train

* Move minibatch_by_words util to spacy.util

* Fix label handling

* More hacking at label management in parser

* Fix encoding in msgpack serialization in GoldParse

* Adjust batch sizes in parser training

* Fix minibatch_by_words

* Add merge_subtokens function to pipeline.pyx

* Register merge_subtokens factory

* Restore use of msgpack tmp directory

* Use minibatch-by-words in train

* Handle retokenization in scorer

* Change back-off approach for missing labels. Use 'dep' label

* Update NER for new label management

* Set NER tags for over-segmented words

* Fix label alignment in gold

* Fix label back-off for infrequent labels

* Fix int type in labels dict key

* Fix int type in labels dict key

* Update feature definition for 8 feature set

* Update ud-train script for new label stuff

* Fix json streamer

* Print the line number if conll eval fails

* Update children and sentence boundaries after deprojectivisation

* Export set_children_from_heads from doc.pxd

* Render parses during UD training

* Remove print statement

* Require thinc 6.11.1.dev6. Try adding wheel as install_requires

* Set different dev version, to flush pip cache

* Update thinc version

* Update GoldCorpus docs

* Remove print statements

* Fix formatting and links [ci skip]
2018-03-19 02:58:08 +01:00
Matthew Honnibal 13067095a1 Disable broken add-after-train in textcat 2018-03-16 12:33:33 +01:00
Matthew Honnibal 565ef8c4d8 Improve argument passing in textcat 2018-03-16 12:30:51 +01:00
ines f3f8bfc367 Add built-in factories for merge_entities and merge_noun_chunks
Allows adding those components to the pipeline out-of-the-box if they're defined in a model's meta.json. Also allows usage as nlp.add_pipe(nlp.create_pipe('merge_entities')).
2018-03-15 17:16:54 +01:00
ines d854f69fe3 Add built-in factories for merge_entities and merge_noun_chunks
Allows adding those components to the pipeline out-of-the-box if they're defined in a model's meta.json. Also allows usage as nlp.add_pipe(nlp.create_pipe('merge_entities')).
2018-03-15 00:18:51 +01:00
ines 9ad5df41fe Fix whitespace 2018-03-15 00:11:18 +01:00
Matthew Honnibal 968dabdde4 Fix bug in multi-task objective 2018-02-23 23:48:09 +01:00
Matthew Honnibal 4492a33a9d Fix sent_start multi-task objective when alignment fails 2018-02-23 16:50:59 +01:00
Matthew Honnibal 12264f9296 Add multi-task objective for sentence segmentation 2018-02-23 16:25:57 +01:00
Matthew Honnibal 8f06903e09 Fix multitask objectives 2018-02-17 18:41:36 +01:00
Matthew Honnibal d1246c95fb Fix model loading when using multitask objectives 2018-02-17 18:11:36 +01:00
Matthew Honnibal 3e541de440 Merge branch 'master' of https://github.com/explosion/spaCy 2018-02-15 21:02:55 +01:00
Claudiu-Vlad Ursache e28de12cbd
Ensure files opened in `from_disk` are closed
Fixes [issue 1706](https://github.com/explosion/spaCy/issues/1706).
2018-02-13 20:49:43 +01:00
Matthew Honnibal d7c9b53120 Pass kwargs into pipeline components during begin_training 2018-02-12 10:18:39 +01:00
Matthew Honnibal f3753c2453 Further model deserialization fixes re #1727 2018-01-23 19:16:05 +01:00
Matthew Honnibal 85c942a6e3 Dont overwrite pretrained_dims setting from cfg. Fixes #1727 2018-01-23 19:10:49 +01:00
Matthew Honnibal 203d2ea830 Allow multitask objectives to be added to the parser and NER more easily 2018-01-21 19:37:02 +01:00
Matthew Honnibal 61a051f2c0 Fix MultitaskObjective 2018-01-21 19:21:34 +01:00
Matthew Honnibal c27c82d5f9 Fix serialization 2017-11-08 13:08:48 +01:00
Matthew Honnibal 072ff38a01 Try to fix python3.5 serialization 2017-11-08 12:10:49 +01:00
Matthew Honnibal dd90fe09f5 Remove extraneous label from textcat class 2017-11-06 22:09:02 +01:00