Commit Graph

488 Commits

Author SHA1 Message Date
Giovanni Campagna dbeb5a4fdd Update Pipfile.lock 2020-03-27 13:08:34 -07:00
Mehrad Moradshahi 50fa36f354
Merge pull request #10 from stanford-oval/wip/multilanguage
Wip/multilanguage
2020-03-26 12:40:04 -07:00
mehrad 60adad8173 adding tests + bug fixes 2020-03-25 23:51:04 -07:00
mehrad 3bf3719edd addressing pr comments 2020-03-25 17:43:04 -07:00
mehrad 10b759f568 move general data util files to scripts dir 2020-03-25 11:36:20 -07:00
mehrad d870b634db remove obsolete local data 2020-03-25 11:31:38 -07:00
mehrad a2d2e740de adding multilingual task 2020-03-25 02:33:19 -07:00
mehrad 87536fe2cb allow caching multiple transformer embeddings 2020-03-24 19:52:42 -07:00
Giovanni Campagna 6810668b94 Post release version bump 2020-03-24 19:24:35 -07:00
Giovanni Campagna 02d9003539 v0.2.0a2 2020-03-24 18:55:59 -07:00
Giovanni Campagna 36b1197c9a
numericalizer/transformer: remove bogus assertions (#9)
These assertions do not mean much, because those tokens are guaranteed
to be in the decoder vocabulary regardless of the assertion, and
they won't necessarily have the same ID in the decoder and the true
vocabulary. Also, the mask_id assertion fails for XLM-R, because
its mask_id is 250004.
2020-03-24 18:54:44 -07:00
mehrad 0acf64bdc3 address mask_id issue for XLM-R 2020-03-24 16:05:01 -07:00
Giovanni Campagna 1e2dbce017
Fix loading embeddings with untied embeddings (#8)
If embeddings for context & questions are untied with the "@" suffix,
we must not pass the suffix to the transformer library.
2020-03-23 00:56:41 -07:00
Giovanni Campagna 5a72ac7ff6 Post-release version bump 2020-03-21 19:31:38 -07:00
Giovanni Campagna 65338cb05d v0.2.0a1 2020-03-21 19:08:19 -07:00
Giovanni Campagna bb6018ba01 Add Pipfile script
So you can run "pipenv run genienlp" instead of "pipenv run python3 -m genienlp"
2020-03-19 10:37:16 -07:00
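
The [scripts] section is standard Pipenv syntax; the entry added by bb6018ba01 would presumably look like this minimal sketch (the actual Pipfile is not shown in this log):

    [scripts]
    # lets "pipenv run genienlp" stand in for "pipenv run python3 -m genienlp"
    genienlp = "python3 -m genienlp"
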
Giovanni Campagna f39cfac2a6
Merge pull request #6 from stanford-oval/wip/contextual
First batch of changes from dialogue work
2020-03-19 10:04:42 -07:00
Giovanni Campagna 2106ef1cb0
Merge pull request #7 from stanford-oval/wip/export
Add "genienlp export" command
2020-03-17 20:49:44 -07:00
Giovanni Campagna 123ea6802b Add "genienlp export" command
The command copies over the model files that are needed for inference,
without intermediate checkpoints.
2020-03-17 15:42:26 -07:00
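
A typical invocation of the new command would look something like the sketch below; the flag names are assumptions, as this log only documents what the command copies:

    # hypothetical flags: --path (training directory), --output (inference-only copy)
    genienlp export --path ./models/my_model --output ./models/my_model-deploy
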
Giovanni Campagna 69e6707773 Fix tests 2020-03-16 13:17:23 -07:00
Giovanni Campagna f971c31dde Merge remote-tracking branch 'origin/master' into wip/contextual 2020-03-16 12:42:21 -07:00
Giovanni Campagna 3618a169a0 fix 2020-03-11 14:04:53 -07:00
Giovanni Campagna 87a6716b97 BiLSTMEncoder: separate context & question embeddings 2020-03-11 12:21:18 -07:00
s-jse d4ff046c1a
Merge pull request #5 from stanford-oval/wip/sinaj/clean
Wip/sinaj/clean
2020-03-03 14:40:02 -08:00
Sina 6b4e29ae04 bug fixes 2020-03-03 14:26:03 -08:00
Sina 06131f12dc Merge branch 'master' into wip/sinaj/clean 2020-03-02 22:27:15 -08:00
Sina cc258c0e2a Fixed beam search 2020-03-02 22:06:03 -08:00
Mehrad Moradshahi 69997c6485
Merge pull request #4 from stanford-oval/wip/mehrad/multi-language-v3
Bug fixes and updates
2020-03-02 17:44:39 -08:00
Sina 11f37c590b Multilayer LSTM bug fixed 2020-03-02 16:57:49 -08:00
Sina ea078d8e46 Added tests and instructions for paraphrasing 2020-03-02 16:57:35 -08:00
Sina 4ffc93a65f Added beam search; it is disabled by default (--num_beams=1) 2020-03-02 16:56:50 -08:00
Sina 6b0cc5549b option to train with very large batch sizes 2020-03-02 16:51:34 -08:00
Sina 7a5683b3ce Added paraphrasing train and generation scripts 2020-03-02 16:49:58 -08:00
Sina 190f953833 newer versions of the transformers package are not backward compatible 2020-03-02 16:47:57 -08:00
mehrad 843913c951 cleanup 2020-03-02 11:23:56 -08:00
mehrad f9f38bc019 initialize embedding when resuming training 2020-03-01 14:53:17 -08:00
Giovanni Campagna bc94192f6e Add option to delay finetuning of BERT until the model is almost trained
So that we only do a few thousand iterations of finetuning
2020-02-28 15:25:45 -08:00
Giovanni Campagna 7a544cb502 Add option to force subword tokenization of ThingTalk tokens
For comparison: all tokens are treated as English words and split.
2020-02-28 14:19:09 -08:00
Giovanni Campagna 843a52c6c2 pretraining: fix GPU 2020-02-27 16:34:02 -08:00
Giovanni Campagna d929c7a7bd Add an option to pretrain the context encoder
With MLM objective
2020-02-27 16:16:51 -08:00
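
As a rough illustration of the MLM objective named in d929c7a7bd (a sketch with the transformers library, not the actual genienlp code): mask a fraction of the input tokens and train the encoder to reconstruct them.

    import torch
    from transformers import BertTokenizer, BertForMaskedLM

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertForMaskedLM.from_pretrained('bert-base-uncased')

    inputs = tokenizer("show me nearby restaurants", return_tensors='pt')
    labels = inputs['input_ids'].clone()

    # mask ~15% of tokens; real code would also skip special tokens like [CLS]/[SEP]
    mask = torch.rand(labels.shape) < 0.15
    labels[~mask] = -100                      # compute the loss only on masked positions
    inputs['input_ids'][mask] = tokenizer.mask_token_id

    loss = model(**inputs, labels=labels)[0]  # one MLM pretraining step
    loss.backward()
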
Giovanni Campagna b399b68448 train: simplify training loop
Make the code more readable by separating as much as possible
into helper functions, which keeps the indentation down to a
manageable level.
2020-02-27 14:53:38 -08:00
Giovanni Campagna 7f55fe4796 Add NLU task for the agent
Which can be used to facilitate annotation of human-human dialogues
(like MultiWOZ dialogues)
2020-02-26 09:05:40 -08:00
mehrad daa7131675 fix decoding 2020-02-26 02:07:31 -08:00
mehrad 9d2b1142be misc. updates 2020-02-25 11:59:59 -08:00
Giovanni Campagna b8d21d5c1d Fix tensorboard when training crashes or is restarted
Pass "purge_step" to clean up old events
2020-02-24 17:19:01 -08:00
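
purge_step is a real parameter of PyTorch's SummaryWriter: when a run restarts, events previously logged at or after that step are discarded, so the pre-crash tail of the run does not overlap the resumed one. Sketched usage (the variable name is illustrative):

    from torch.utils.tensorboard import SummaryWriter

    # resume_step: the iteration training restarted from (illustrative name)
    writer = SummaryWriter(log_dir='runs/experiment', purge_step=resume_step)
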
Giovanni Campagna d04bab948d Introduce separate options for context & question embeddings
So those embeddings can be untied
2020-02-24 14:54:24 -08:00
Giovanni Campagna f97b872d84 embeddings: allow specifying the same embedding twice, untied
Use a "@..." suffix (e.g. "bert-base-uncased@0", "bert-base-uncased@1")
to specify two untied instances of the same pretrained embedding.
This is useful so they can be fine-tuned separately.
2020-02-24 14:19:08 -08:00
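
Together with the separate context/question options from d04bab948d above, the suffix would be used along these lines; only the "@..." suffix convention is documented by the commit, so the flag names here are assumptions:

    # hypothetical flag names; "@0"/"@1" mark two untied copies of the same embedding
    genienlp train \
        --context_embeddings bert-base-uncased@0 \
        --question_embeddings bert-base-uncased@1
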
Mehrad Moradshahi 5ec1f9fe80
Merge pull request #2 from stanford-oval/wip/mehrad/multi-language-v3
XLM-R model as encoder
2020-02-20 11:56:04 -08:00
mehrad 1a5f979cea fixing lgtm alerts 2020-02-20 11:41:16 -08:00
mehrad c1487ce1db minor changes 2020-02-20 11:36:00 -08:00