Commit Graph

3 Commits

Author SHA1 Message Date
Sofie Van Landeghem 2998131416
Reproducibility for TextCat and Tok2Vec (#6218)
* ensure fixed seed in HashEmbed layers

* forgot about the joys of python 2
2020-10-08 00:43:46 +02:00
Matthew Honnibal 3e78e82a83
Experimental character-based pretraining (#5700)
* Use cosine loss in Cloze multitask

* Fix char_embed for gpu

* Call resume_training for base model in train CLI

* Fix bilstm_depth default in pretrain command

* Implement character-based pretraining objective

* Use chars loss in ClozeMultitask

* Add method to decode predicted characters

* Fix number characters

* Rescale gradients for mlm

* Fix char embed+vectors in ml

* Fix pipes

* Fix pretrain args

* Move get_characters_loss

* Fix import

* Fix import

* Mention characters loss option in pretrain

* Remove broken 'self attention' option in pretrain

* Revert "Remove broken 'self attention' option in pretrain"

This reverts commit 56b820f6af.

* Document 'characters' objective of pretrain
2020-07-05 15:48:39 +02:00
Matthew Honnibal e82306937e Put Tok2Vec refactor behind feature flag (#4563)
* Add back pre-2.2.2 tok2vec

* Add simple tok2vec tests

* Add simple tok2vec tests

* Reformat

* Fix CharacterEmbed in new tok2vec

* Fix legacy tok2vec

* Resolve circular imports

* Fix test for Python 2
2019-10-31 15:01:15 +01:00