320 Commits

Author SHA1 Message Date
Matthew Honnibal dd54511c4f Pass data as a function in begin_training methods 2018-03-27 09:39:59 +00:00
Matthew Honnibal bede11b67c
Improve label management in parser and NER (#2108)
This patch does a few smallish things that tighten up the training workflow a little, and allow memory use during training to be reduced by letting the GoldCorpus stream data properly.

Previously, the parser and entity recognizer read and saved labels as lists, with extra labels noted separately. Lists were used because ordering is very important, to ensure that the label-to-class mapping is stable.

We now manage labels as nested dictionaries, first keyed by the action, and then keyed by the label. Values are frequencies. The trick is, how do we save new labels? We need to make sure we iterate over these in the same order they're added. Otherwise, we'll get different class IDs, and the model's predictions won't make sense.

To allow stable sorting, we map the new labels to negative values. If we have two new labels, they'll be noted as having "frequency" -1 and -2. The next new label will then have "frequency" -3. When we sort by (frequency, label), we then get a stable sort.
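The negative-frequency trick can be illustrated with a minimal sketch. The helper names `add_new_label` and `ordered_labels` are hypothetical, the dictionary is flat (one action only) rather than nested, and sorting in descending frequency order is an assumption; the commit itself only says the sort key is (frequency, label).

```python
def add_new_label(freqs, label):
    """Give an unseen label the next negative pseudo-frequency: -1, -2, -3, ..."""
    if label in freqs:
        return
    # Labels that were added later (not observed in data) carry negative values.
    n_new = sum(1 for value in freqs.values() if value < 0)
    freqs[label] = -(n_new + 1)


def ordered_labels(freqs):
    """Order labels by descending frequency, breaking ties alphabetically.

    Assumption: observed labels come first, and new labels follow in the
    order they were added, because -1 > -2 > -3.
    """
    return sorted(freqs, key=lambda label: (-freqs[label], label))


freqs = {"nsubj": 10, "dobj": 7}  # labels observed in the training data
add_new_label(freqs, "foo")       # stored with "frequency" -1
add_new_label(freqs, "bar")       # stored with "frequency" -2
print(ordered_labels(freqs))      # ['nsubj', 'dobj', 'foo', 'bar']
```

Because each new label gets a value lower than every earlier one, adding a label never changes the position, and therefore the class ID, of the labels already present.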

Storing frequencies then allows us to make the next nice improvement. Previously we had to iterate over the whole training set to pre-process it for the deprojectivisation. This meant holding the whole training set in memory, which accounted for most of the memory required during training.

To prevent this, we now store the frequencies as we stream in the data, and deprojectivize as we go. Once we've built the frequencies, we can then apply a frequency cut-off when we decide how many classes to make.
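As a rough illustration of counting while streaming and applying the cut-off afterwards, here is a sketch using only the standard library. The function names, the `min_freq` parameter and the toy data are assumptions made for the example, and the deprojectivisation step is left out.

```python
from collections import Counter


def count_labels(doc_stream):
    """Accumulate label frequencies in a single pass over a stream of documents.

    Each item in doc_stream is assumed to be an iterable of label strings.
    Only the counts are kept, never the documents themselves.
    """
    freqs = Counter()
    for labels in doc_stream:
        freqs.update(labels)
    return freqs


def labels_above_cutoff(freqs, min_freq):
    """Return the labels frequent enough to get their own class."""
    return sorted(label for label, freq in freqs.items() if freq >= min_freq)


def toy_stream():
    # A generator stands in for data read lazily from disk.
    yield ["nsubj", "dobj", "det"]
    yield ["nsubj", "det"]
    yield ["nsubj", "iobj"]


freqs = count_labels(toy_stream())
kept = labels_above_cutoff(freqs, min_freq=2)
```

Here `kept` contains only `det` and `nsubj`; the rarer labels would need some back-off label instead of a class of their own.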

Finally, to allow proper data streaming, we also have to have some way of shuffling the iterator. This is awkward if the training files have multiple documents in them. To solve this, the GoldCorpus class now writes the training data to disk in msgpack files, one per document. We can then shuffle the data by shuffling the paths.
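The idea of shuffling paths instead of documents can be sketched as follows. This is not the GoldCorpus implementation: it swaps msgpack for the standard-library `json` module so the example has no dependencies, and the names `write_docs` and `stream_shuffled` are hypothetical.

```python
import json
import random
import tempfile
from pathlib import Path


def write_docs(docs, directory):
    """Write each document to its own file and return the list of paths."""
    paths = []
    for i, doc in enumerate(docs):
        path = Path(directory) / f"{i}.json"
        path.write_text(json.dumps(doc), encoding="utf8")
        paths.append(path)
    return paths


def stream_shuffled(paths, seed=None):
    """Yield documents in random order while loading only one at a time."""
    order = list(paths)
    random.Random(seed).shuffle(order)  # shuffle the cheap paths, not the data
    for path in order:
        yield json.loads(path.read_text(encoding="utf8"))


with tempfile.TemporaryDirectory() as tmp_dir:
    doc_paths = write_docs([{"id": i, "text": f"doc {i}"} for i in range(5)], tmp_dir)
    shuffled_ids = [doc["id"] for doc in stream_shuffled(doc_paths, seed=0)]
```

Only the list of paths has to fit in memory, so a new shuffle for each epoch is cheap even when the corpus is large.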

This is a squash merge, as I made a lot of very small commits. Individual commit messages below.

* Simplify label management for TransitionSystem and its subclasses

* Fix serialization for new label handling format in parser

* Simplify and improve GoldCorpus class. Reduce memory use, write to temp dir

* Set actions in transition system

* Require thinc 6.11.1.dev4

* Fix error in parser init

* Add unicode declaration

* Fix unicode declaration

* Update textcat test

* Try to get model training on less memory

* Print json loc for now

* Try rapidjson to reduce memory use

* Remove rapidjson requirement

* Try rapidjson for reduced mem usage

* Handle None heads when projectivising

* Stream json docs

* Fix train script

* Handle projectivity in GoldParse

* Fix projectivity handling

* Add minibatch_by_words util from ud_train

* Minibatch by number of words in spacy.cli.train

* Move minibatch_by_words util to spacy.util

* Fix label handling

* More hacking at label management in parser

* Fix encoding in msgpack serialization in GoldParse

* Adjust batch sizes in parser training

* Fix minibatch_by_words

* Add merge_subtokens function to pipeline.pyx

* Register merge_subtokens factory

* Restore use of msgpack tmp directory

* Use minibatch-by-words in train

* Handle retokenization in scorer

* Change back-off approach for missing labels. Use 'dep' label

* Update NER for new label management

* Set NER tags for over-segmented words

* Fix label alignment in gold

* Fix label back-off for infrequent labels

* Fix int type in labels dict key

* Fix int type in labels dict key

* Update feature definition for 8 feature set

* Update ud-train script for new label stuff

* Fix json streamer

* Print the line number if conll eval fails

* Update children and sentence boundaries after deprojectivisation

* Export set_children_from_heads from doc.pxd

* Render parses during UD training

* Remove print statement

* Require thinc 6.11.1.dev6. Try adding wheel as install_requires

* Set different dev version, to flush pip cache

* Update thinc version

* Update GoldCorpus docs

* Remove print statements

* Fix formatting and links [ci skip]
2018-03-19 02:58:08 +01:00
ines f3f8bfc367 Add built-in factories for merge_entities and merge_noun_chunks
Allows adding those components to the pipeline out-of-the-box if they're defined in a model's meta.json. Also allows usage as nlp.add_pipe(nlp.create_pipe('merge_entities')).
2018-03-15 17:16:54 +01:00
ines d854f69fe3 Add built-in factories for merge_entities and merge_noun_chunks
Allows adding those components to the pipeline out-of-the-box if they're defined in a model's meta.json. Also allows usage as nlp.add_pipe(nlp.create_pipe('merge_entities')).
2018-03-15 00:18:51 +01:00
Aaron Marquez 3765d84d57 Fix issue #1959 2018-02-15 12:51:49 -08:00
Claudiu-Vlad Ursache e28de12cbd
Ensure files opened in `from_disk` are closed
Fixes [issue 1706](https://github.com/explosion/spaCy/issues/1706).
2018-02-13 20:49:43 +01:00
Motoki Wu f4a7d1a423 Make sure to pass in **cfg to each component when training 2018-01-30 18:29:54 -08:00
ines 4046823699 Only check component in factories if string (see #1911) 2018-01-30 16:29:07 +01:00
ines ce10d320c4 Fix component check in self.factories (see #1911) 2018-01-30 16:09:37 +01:00
ines 8901814248 Improve error handling if pipeline component is not callable (resolves #1911)
Also add help message if user accidentally calls nlp.add_pipe() with a string of a built-in component name.
2018-01-30 15:43:03 +01:00
ines a31506e060 Fix off-by-one error in nlp.add_pipe(after=name) (fixes #1654) 2017-11-28 20:37:55 +01:00
Ines Montani 6362024cf8
Merge pull request #1645 from GreenRiverRUS/fix_default_meta
Fixed spaCy version string in default meta
2017-11-27 11:58:02 +00:00
Vadim Mazaev 59f03ab1d7 Fixed spacy version string in default meta 2017-11-26 23:02:07 +03:00
Matthew Honnibal 8fec7268eb Move string cleanup under a setting flag 2017-11-23 12:19:18 +00:00
Matthew Honnibal 5949777b12 Fix misleading multi-threading docstring 2017-11-23 12:18:59 +00:00
Roman Domrachev 61d28d03e4 Try again to selectively remove cache 2017-11-15 19:11:12 +03:00
Roman Domrachev 505c6a2f2f Completely cleanup tokenizer cache
Tokenizer cache can have keys other than strings

This modification can slow down the tokenizer and needs to be measured
2017-11-15 17:55:48 +03:00
Roman Domrachev a33d5a068d Try to hold original data instead of restoring it 2017-11-14 22:40:03 +03:00
Roman Domrachev 91e2fa6561 Clean all caches 2017-11-14 21:15:04 +03:00
Roman Domrachev 86ca434c93 Merge github.com:explosion/spaCy 2017-11-14 17:46:22 +03:00
Roman Domrachev a2745b0e84 StringStore now actually cleaned
Do not lose docs in ref tracking
2017-11-14 17:45:50 +03:00
Matthew Honnibal dd1678eab3 Edit comment 2017-11-11 18:37:08 +01:00
Roman Domrachev ee60a52ee7 Fix test imports and last batch cleanup 2017-11-11 11:32:16 +03:00
Roman Domrachev 4a6b094e09 Remove unused import 2017-11-11 03:13:05 +03:00
Roman Domrachev 3c600adf23 Try to fix StringStore clean up (see #1506) 2017-11-11 03:11:27 +03:00
Matthew Honnibal 45e0617e61 Allow Language.update to take unicode text and dict objects 2017-11-06 22:07:38 +01:00
Matthew Honnibal 5c85bf3791 Fix missing import 2017-11-06 15:06:27 +01:00
Matthew Honnibal 465adfee94 Remove unused resume_training method, and pass optimizer through 2017-11-06 14:26:00 +01:00
Matthew Honnibal 38109a0e4a Register SentenceSegmenter in Language.factories 2017-11-05 18:45:57 +01:00
Matthew Honnibal d185927998 Undo harmful pickling hacks on Language class 2017-11-04 23:07:03 +01:00
Matthew Honnibal 2bf21cbe29 Update model after optimising it instead of waiting 2017-11-03 20:20:01 +01:00
ines 5f661a1b3a Remove tensorizer from pre-set pipe_names 2017-11-01 19:48:33 +01:00
ines bfe17b7df1 Fix begin_training if get_gold_tuples is None 2017-11-01 13:14:31 +01:00
ines 37e62ab0e2 Update vector meta in meta.json 2017-11-01 01:25:09 +01:00
ines 8e02294241 Add vectors to Language.meta 2017-10-30 18:39:48 +01:00
ines d96e72f656 Tidy up rest 2017-10-27 21:07:59 +02:00
ines 91899d337b Tidy up language, lemmatizer and scorer 2017-10-27 14:40:14 +02:00
Ines Montani 4033e70c71 Merge pull request #1461 from explosion/feature/disable-pipes
💫 Add Language.disable_pipes(), to temporarily edit pipeline and update code examples
2017-10-27 12:21:40 +02:00
ines 2d6ec99884 Set 'model' as default model name to prevent meta.json errors 2017-10-26 16:12:23 +02:00
Matthew Honnibal 90d1d9b230 Remove obsolete parser code 2017-10-26 13:22:45 +02:00
Matthew Honnibal b0f3ea2200 Fix names of pipeline components
NeuralDependencyParser --> DependencyParser
NeuralEntityRecognizer --> EntityRecognizer
TokenVectorEncoder     --> Tensorizer
NeuralLabeller         --> MultitaskObjective
2017-10-26 12:38:23 +02:00
ines 1a722dac31 Merge branch 'develop' into feature/disable-pipes 2017-10-25 15:18:18 +02:00
ines 6a00de4f77 Fix check of unexpected pipe names in restore() 2017-10-25 14:56:35 +02:00
ines 7f03932477 Return self on __enter__ 2017-10-25 14:56:16 +02:00
Matthew Honnibal e70f80f29e Add Language.disable_pipes() 2017-10-25 13:46:41 +02:00
ines 3484174e48 Add Language.path 2017-10-25 11:57:43 +02:00
Matthew Honnibal 65bf5e85bd Improve piping in language.pipe 2017-10-18 21:46:12 +02:00
Matthew Honnibal e35a83d142 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-10-17 18:22:06 +02:00
Matthew Honnibal 1cc85a89ef Allow reasonably efficient pickling of Language class, using to_bytes() and from_bytes(). 2017-10-17 18:18:49 +02:00
Ines Montani afa67de7ee Merge pull request #1428 from roanuz/develop
Fix trailing whitespace and Language.from_disk overwrites
2017-10-17 16:29:15 +02:00
Anto Binish Kaspar 8f5b60c168 Fix Language.from_disk overwriting the meta.json file. 2017-10-17 17:15:32 +05:30
ines 8ca344712d Add Language.has_pipe method 2017-10-17 11:20:07 +02:00
Matthew Honnibal 2bc06e4b22 Bump rolling buffer size to 10k 2017-10-16 19:38:29 +02:00
Matthew Honnibal 5c14f3f033 Create a rolling buffer for the StringStore in Language.pipe() 2017-10-16 19:22:40 +02:00
Ines Montani 37aa523a8e Merge pull request #1408 from explosion/feature/dot-underscore
💫 Custom attributes via Doc._, Token._ and Span._
2017-10-11 18:35:56 +02:00
ines 9620c1a640 Add lemma_lookup to Language defaults 2017-10-11 13:26:05 +02:00
ines 67350fa496 Use better logic for auto-generating component name
Instances don't have __name__, so we try __class__.__name__ as well,
before giving up and defaulting to repr(component).
2017-10-10 04:23:05 +02:00
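The name fallback described in commit 67350fa496 can be sketched roughly as below. The function name `component_name` is hypothetical and the code follows only the three steps named in the commit message; the library's actual helper may check other attributes as well.

```python
import functools


def component_name(component):
    """Guess a pipeline component's name from the object itself."""
    # Plain functions and classes carry __name__ directly.
    name = getattr(component, "__name__", None)
    if name is not None:
        return name
    # Instances have no __name__, so fall back to the name of their class.
    cls_name = getattr(getattr(component, "__class__", None), "__name__", None)
    if cls_name is not None:
        return cls_name
    # Last resort if neither attribute is available.
    return repr(component)


def my_component(doc):
    return doc


class EntityMerger:
    def __call__(self, doc):
        return doc


names = [
    component_name(my_component),                     # function: uses __name__
    component_name(EntityMerger()),                   # instance: uses the class name
    component_name(functools.partial(my_component)),  # partial objects have no __name__
]
```

In Python 3 almost every object exposes `__class__.__name__`, so the `repr` branch acts mainly as a safety net.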
Matthew Honnibal 0384f08218 Trigger nonproj.deprojectivize as a postprocess 2017-10-07 02:00:47 +02:00
ines e43530269c Update docstrings 2017-10-07 01:04:50 +02:00
ines 2586b61b15 Fix formatting, tidy up and remove unused imports 2017-10-07 00:26:05 +02:00
ines 212c8f0711 Implement new Language methods and pipeline API 2017-10-07 00:25:54 +02:00
Matthew Honnibal 96da86b3e5 Add support for verbose flag to Language 2017-10-03 09:14:57 -05:00
Matthew Honnibal 4ae9ea7684 Remove unused argument in Language 2017-09-26 05:41:35 -05:00
Matthew Honnibal 8716ffe57d Serialize vocab last 2017-09-24 05:01:45 -05:00
Matthew Honnibal 5a7fd0fd36 Fix vector linkage 2017-09-22 20:11:52 -05:00
Matthew Honnibal 4348c479fc Merge pre-trained vectors and noshare patches 2017-09-22 20:07:28 -05:00
Matthew Honnibal 7dc61b3f43 Whitespace 2017-09-22 20:00:50 -05:00
Matthew Honnibal 20193371f5 Don't share CNN, to reduce complexities 2017-09-21 14:59:48 +02:00
Matthew Honnibal b832f89ff8 Add resume_training function 2017-09-20 19:15:20 -05:00
Matthew Honnibal c858927271 Copy vectors to GPU on begin training 2017-09-18 18:04:16 -05:00
Matthew Honnibal 43210abacc Resolve fine-tuning conflict 2017-09-17 05:30:04 -05:00
Matthew Honnibal e37a50a436 Pass documents to tensorizer, not 'features' 2017-09-16 12:46:36 -05:00
Matthew Honnibal 70da88a3a7 Update comment on Language.begin_training 2017-09-14 16:18:30 +02:00
Matthew Honnibal 78a5f842e9 Fix update when update_shared=False 2017-08-20 15:58:34 -05:00
Matthew Honnibal 8875590081 Add optimizer in Language.update if sgd=None 2017-08-20 14:42:07 +02:00
Matthew Honnibal a3c51a0355 Fix creation of pipeline 2017-08-19 21:58:57 +02:00
Matthew Honnibal 97aabafb5f Document as_tuples keyword arg of Language.pipe 2017-08-19 12:21:33 +02:00
Matthew Honnibal 11c31d285c Restore changes from nn-beam-parser 2017-08-18 22:26:12 +02:00
Matthew Honnibal 52c180ecf5 Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop"
This reverts commit ea8de11ad5, reversing
changes made to 08e443e083.
2017-08-14 13:00:23 +02:00
Matthew Honnibal 4363b4aa4a Fix redundant tokvecs updates during update 2017-08-13 12:36:55 +02:00
Matthew Honnibal 0acce0521b Fix Language.update for pipeline 2017-08-06 14:13:03 +02:00
Matthew Honnibal 0eec7c9e9b Fix Language.evaluate 2017-08-06 02:18:31 +02:00
Matthew Honnibal cc19ea0e7c Add update_tensors flag to Language.update. Experimental, re #1182 2017-08-06 02:17:10 +02:00
Matthew Honnibal 2e00361522 Fix update when 0 docs 2017-08-01 22:10:17 +02:00
Matthew Honnibal 523b0df2c9 Update text classification model 2017-07-25 18:57:59 +02:00
Matthew Honnibal d8aa721664 Compute Language.meta with a property 2017-07-23 00:50:18 +02:00
Matthew Honnibal baa3d81c35 Add text categorizer to Language 2017-07-22 01:13:36 +02:00
Matthew Honnibal 836bfa2d0f Add factory for experimental SimilarityHook component 2017-06-05 15:40:22 +02:00
Matthew Honnibal 2479cde446 Support disable keyword in Language.__init__ 2017-06-05 13:13:07 +02:00
Matthew Honnibal 8f8f90b46b Disable labeller if not parsing 2017-06-04 20:18:54 -05:00
Matthew Honnibal 939e8ed567 Add lookup properties for components in Language 2017-06-04 15:52:09 -05:00
Matthew Honnibal 92ae36f84e Improve way noun chunks iterator is looked up 2017-06-04 21:53:39 +02:00
Matthew Honnibal 21eef90dbc Support specifying which GPU 2017-06-03 16:10:23 -05:00
Matthew Honnibal fea1144e6d Set max batch size in evaluate 2017-06-03 13:31:33 -05:00
ines a3e4f91f4a Only load vocab if it exists 2017-06-01 14:38:35 +02:00
Matthew Honnibal 33e5ec737f Fix to/from disk methods 2017-05-31 13:43:10 +02:00
Matthew Honnibal 1e6df0a2a1 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-05-29 14:30:12 -05:00
ines 6145fe6a93 Catch all kwargs on Language 2017-05-29 20:43:48 +02:00
Matthew Honnibal 9c9ee24411 Fix broken lambda scoping in Python 2 2017-05-29 13:23:28 -05:00
Matthew Honnibal aa4c33914b Work on serialization 2017-05-29 08:40:45 -05:00