Giovanni Campagna
dbeb5a4fdd
Update Pipfile.lock
2020-03-27 13:08:34 -07:00
Mehrad Moradshahi
50fa36f354
Merge pull request #10 from stanford-oval/wip/multilanguage
Wip/multilanguage
2020-03-26 12:40:04 -07:00
mehrad
60adad8173
adding tests + bug fixes
2020-03-25 23:51:04 -07:00
mehrad
3bf3719edd
addressing pr comments
2020-03-25 17:43:04 -07:00
mehrad
10b759f568
move general data util files to scripts dir
2020-03-25 11:36:20 -07:00
mehrad
d870b634db
remove obsolete local data
2020-03-25 11:31:38 -07:00
mehrad
a2d2e740de
adding multilingual task
2020-03-25 02:33:19 -07:00
mehrad
87536fe2cb
allow caching multiple transformer embeddings
2020-03-24 19:52:42 -07:00
Giovanni Campagna
6810668b94
Post release version bump
2020-03-24 19:24:35 -07:00
Giovanni Campagna
02d9003539
v0.2.0a2
2020-03-24 18:55:59 -07:00
Giovanni Campagna
36b1197c9a
numericalizer/transformer: remove bogus assertions (#9)
These assertions do not mean much, because those tokens are guaranteed
to be in the decoder vocabulary regardless of the assertion, and
they won't necessarily have the same ID in the decoder and the true
vocabulary. Also, the mask_id assertion fails for XLM-R, because
its mask_id is 250004.
2020-03-24 18:54:44 -07:00
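For context, a minimal sketch of the shape of the removed check, with illustrative values (the real code lives in genienlp's numericalizer; all names here are assumptions):

    # The removed assertion effectively required a special token to keep
    # the same ID in the pretrained vocabulary and the decoder vocabulary.
    pretrained_mask_id = 250004  # the XLM-R mask_id mentioned above
    decoder_mask_id = 4          # assumption: decoder vocab assigns its own compact IDs

    # Not guaranteed to hold, and false for XLM-R:
    print(pretrained_mask_id == decoder_mask_id)  # False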
mehrad
0acf64bdc3
address mask_id issue for XLM-R
2020-03-24 16:05:01 -07:00
Giovanni Campagna
1e2dbce017
Fix loading embeddings with untied embeddings (#8)
If embeddings for context & questions are untied with the "@" suffix,
we must not pass the suffix to the transformer library.
2020-03-23 00:56:41 -07:00
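A minimal sketch of the idea behind the fix, using a hypothetical helper (the actual genienlp code differs): the "@" suffix is internal to genienlp, so only the base name may reach the transformers library.

    from transformers import AutoModel, AutoTokenizer

    def load_pretrained(name):
        # "bert-base-uncased@0" and "bert-base-uncased@1" are two untied
        # instances of the same model; transformers only knows the base name
        base_name = name.split('@')[0]
        tokenizer = AutoTokenizer.from_pretrained(base_name)
        model = AutoModel.from_pretrained(base_name)
        return tokenizer, model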
Giovanni Campagna
5a72ac7ff6
Post-release version bump
2020-03-21 19:31:38 -07:00
Giovanni Campagna
65338cb05d
v0.2.0a1
2020-03-21 19:08:19 -07:00
Giovanni Campagna
bb6018ba01
Add Pipfile script
So you can run "pipenv run genienlp" instead of "pipenv run python3 -m genienlp"
2020-03-19 10:37:16 -07:00
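Concretely this uses pipenv's [scripts] table; the entry plausibly looks like this (the exact Pipfile contents are an assumption):

    [scripts]
    genienlp = "python3 -m genienlp"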
Giovanni Campagna
f39cfac2a6
Merge pull request #6 from stanford-oval/wip/contextual
First batch of changes from dialogue work
2020-03-19 10:04:42 -07:00
Giovanni Campagna
2106ef1cb0
Merge pull request #7 from stanford-oval/wip/export
Add "genienlp export" command
2020-03-17 20:49:44 -07:00
Giovanni Campagna
123ea6802b
Add "genienlp export" command
The command copies over the model files that are needed for inference,
without intermediate checkpoints.
2020-03-17 15:42:26 -07:00
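An illustrative sketch of the described behavior, not the actual genienlp implementation (the checkpoint naming pattern is an assumption):

    import os
    import shutil

    def export_model(model_dir, out_dir):
        os.makedirs(out_dir, exist_ok=True)
        for name in os.listdir(model_dir):
            if name.startswith('iteration_'):  # assumption: intermediate checkpoints
                continue                       # not needed for inference
            src = os.path.join(model_dir, name)
            if os.path.isfile(src):
                shutil.copy2(src, os.path.join(out_dir, name))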
Giovanni Campagna
69e6707773
Fix tests
2020-03-16 13:17:23 -07:00
Giovanni Campagna
f971c31dde
Merge remote-tracking branch 'origin/master' into wip/contextual
2020-03-16 12:42:21 -07:00
Giovanni Campagna
3618a169a0
fix
2020-03-11 14:04:53 -07:00
Giovanni Campagna
87a6716b97
BiLSTMEncoder: separate context & question embeddings
2020-03-11 12:21:18 -07:00
s-jse
d4ff046c1a
Merge pull request #5 from stanford-oval/wip/sinaj/clean
Wip/sinaj/clean
2020-03-03 14:40:02 -08:00
Sina
6b4e29ae04
bug fixes
2020-03-03 14:26:03 -08:00
Sina
06131f12dc
Merge branch 'master' into wip/sinaj/clean
2020-03-02 22:27:15 -08:00
Sina
cc258c0e2a
Fixed beam search
2020-03-02 22:06:03 -08:00
Mehrad Moradshahi
69997c6485
Merge pull request #4 from stanford-oval/wip/mehrad/multi-language-v3
Bug fixes and updates
2020-03-02 17:44:39 -08:00
Sina
11f37c590b
Multilayer LSTM bug fixed
2020-03-02 16:57:49 -08:00
Sina
ea078d8e46
Added tests and instructions for paraphrasing
2020-03-02 16:57:35 -08:00
Sina
4ffc93a65f
Added beam search; disabled by default (--num_beams=1)
2020-03-02 16:56:50 -08:00
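Assuming the paraphraser decodes through the transformers generate() API (how the flag is wired through is an assumption), --num_beams maps directly onto num_beams:

    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
    model = GPT2LMHeadModel.from_pretrained('gpt2')
    input_ids = tokenizer.encode('a paraphrase of the sentence', return_tensors='pt')

    # --num_beams=1 (default): beam search disabled, greedy decoding
    greedy = model.generate(input_ids, max_length=20, num_beams=1)

    # --num_beams=4: beam search over four hypotheses
    beams = model.generate(input_ids, max_length=20, num_beams=4)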
Sina
6b0cc5549b
option to train with very large batch sizes
2020-03-02 16:51:34 -08:00
Sina
7a5683b3ce
Added paraphrasing train and generation scripts
2020-03-02 16:49:58 -08:00
Sina
190f953833
Newer versions of the transformers package are not backward compatible
2020-03-02 16:47:57 -08:00
mehrad
843913c951
cleanup
2020-03-02 11:23:56 -08:00
mehrad
f9f38bc019
initialize embedding when resuming training
2020-03-01 14:53:17 -08:00
Giovanni Campagna
bc94192f6e
Add option to delay finetuning of BERT until the model is almost trained
So that we only do a few thousand iterations of finetuning
2020-02-28 15:25:45 -08:00
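A sketch of the idea under assumed names (the actual option and training-loop wiring differ): keep the BERT parameters frozen for most of training and unfreeze them a few thousand iterations before the end.

    def set_bert_trainable(bert_model, trainable):
        for param in bert_model.parameters():
            param.requires_grad = trainable

    def maybe_start_finetuning(bert_model, iteration, total_iterations,
                               finetune_iterations=5000):
        # assumption: unfreeze exactly once, near the end of training,
        # so only the last few thousand iterations finetune BERT
        if iteration == total_iterations - finetune_iterations:
            set_bert_trainable(bert_model, True)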
Giovanni Campagna
7a544cb502
Add option to force subword tokenization of ThingTalk tokens
For comparison. All tokens are treated as English words and split.
2020-02-28 14:19:09 -08:00
Giovanni Campagna
843a52c6c2
pretraining: fix GPU
2020-02-27 16:34:02 -08:00
Giovanni Campagna
d929c7a7bd
Add an option to pretrain the context encoder
With an MLM (masked language modeling) objective
2020-02-27 16:16:51 -08:00
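The objective follows the standard BERT-style MLM recipe in spirit; a simplified masking sketch (always substituting the mask token, omitting the usual 80/10/10 split; the exact genienlp recipe is an assumption):

    import torch

    def mask_for_mlm(input_ids, mask_token_id, mlm_probability=0.15):
        labels = input_ids.clone()
        mask = torch.rand(input_ids.shape) < mlm_probability
        labels[~mask] = -100                  # compute loss only on masked positions
        masked_inputs = input_ids.clone()
        masked_inputs[mask] = mask_token_id   # replace chosen tokens with the mask token
        return masked_inputs, labels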
Giovanni Campagna
b399b68448
train: simplify training loop
Make the code more readable by separating as much as possible
into helper functions, which keeps the indentation down to a
manageable level.
2020-02-27 14:53:38 -08:00
Giovanni Campagna
7f55fe4796
Add NLU task for the agent
Which can be used to facilitate annotation of human-human dialogues
(like MultiWOZ dialogues)
2020-02-26 09:05:40 -08:00
mehrad
daa7131675
fix decoding
2020-02-26 02:07:31 -08:00
mehrad
9d2b1142be
misc. updates
2020-02-25 11:59:59 -08:00
Giovanni Campagna
b8d21d5c1d
Fix tensorboard when training crashes or is restarted
Pass "purge_step" to clean up old events
2020-02-24 17:19:01 -08:00
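purge_step is a real argument of torch.utils.tensorboard.SummaryWriter; a sketch of the restart path (the surrounding names are assumptions):

    from torch.utils.tensorboard import SummaryWriter

    resume_step = 12000  # illustrative: the iteration the job restarts from

    # events with step >= resume_step left over from the crashed run are
    # purged, so curves do not overlap after resuming
    writer = SummaryWriter(log_dir='runs/exp1', purge_step=resume_step)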
Giovanni Campagna
d04bab948d
Introduce separate options for context & question embeddings
So those embeddings can be untied
2020-02-24 14:54:24 -08:00
Giovanni Campagna
f97b872d84
embeddings: allow specifying the same embedding twice, untied
Use a "@..." suffix (e.g. "bert-base-uncased@0", "bert-base-uncased@1")
to specify two untied instances of the same pretrained embedding.
This is useful so they can be fine-tuned separately.
2020-02-24 14:19:08 -08:00
Mehrad Moradshahi
5ec1f9fe80
Merge pull request #2 from stanford-oval/wip/mehrad/multi-language-v3
XLM-R model as encoder
2020-02-20 11:56:04 -08:00
mehrad
1a5f979cea
fixing LGTM alerts
2020-02-20 11:41:16 -08:00
mehrad
c1487ce1db
minor changes
2020-02-20 11:36:00 -08:00