Commit Graph

567 Commits

Author SHA1 Message Date
mehrad 50c4de9d60 merge master branch 2020-05-04 10:06:35 -07:00
mehrad 8ba1f4c490 code refactoring 2020-05-04 10:03:08 -07:00
mehrad 2da4d203c7 move script files
moving them to paraphrase directory where they are used
2020-05-03 19:43:20 -07:00
Sina 9274b7cb07 Added no_repeat_ngram_size option 2020-05-03 17:23:47 -07:00
Sina f677a04763 on multi-gpu machines, splitting files is balanced 2020-05-03 15:17:07 -07:00
Sina be7b8a5b42 fixed padding issues of GPT2 2020-05-03 14:13:16 -07:00
Sina 8c796beeab fixed maximum length issue 2020-05-02 17:12:04 -07:00
Sina 80d4afbdc6 option to remove prompt from the output 2020-05-02 16:47:40 -07:00
Sina 53846e142f move paraphrasing files to a separate folder 2020-05-01 23:36:26 -07:00
Sina f886b6f409 Added more logging 2020-05-01 23:09:50 -07:00
Sina 68ffaa1561 minor change 2020-05-01 16:17:11 -07:00
Sina 1be712284f Merge branch 'wip/mbart' into wip/paraphrase 2020-04-30 20:04:41 -07:00
Sina 517f87b66b fix 2020-04-30 16:08:11 -07:00
Sina 77ad950ba1 evaluation bug fixed 2020-04-30 02:00:10 -07:00
mehrad 5f871512a6 bart_evaluation is integrated with run_generation script 2020-04-29 21:14:03 -07:00
Sina fa0ef5b687 arguments of the generation script are now model-agnostic 2020-04-27 23:57:23 -07:00
mehrad a6a16895d6 update ckpt default value 2020-04-27 21:00:18 -07:00
Mehrad Moradshahi 9dc73b5ce4 Delete test_bart.sh 2020-04-27 19:59:10 -07:00
mehrad d0b934bd0d resolve conflicts 2020-04-27 19:27:13 -07:00
Sina 78fb8ab2bc combined BART and GPT2 generation into one script 2020-04-27 18:58:35 -07:00
mehrad 9d5945376f move files to test dir 2020-04-27 16:36:36 -07:00
mehrad 2eb78d7080 add tests 2020-04-27 16:30:19 -07:00
mehrad 4734a3a47c fix ckpt names 2020-04-27 16:30:08 -07:00
mehrad 6b56b4f2cb update BART code 2020-04-27 15:57:21 -07:00
Sina e7e6e3a1c4 simplified generation code using GPTseq2seq 2020-04-26 22:53:35 -07:00
Sina 03e09eddc5 removed unused models in generation scripts 2020-04-26 00:48:52 -07:00
Sina af8d097485 Added GPT2seq2seq model 2020-04-26 00:38:02 -07:00
Sina 37388d82b7 renamed script 2020-04-24 17:15:43 -07:00
Sina 9a7b37786a update transformers version to 2.8 2020-04-24 17:15:10 -07:00
Sina 7db4f8fb21 basic code for BART training and generation 2020-04-24 14:56:08 -07:00
Sina 73cf026f55 data script for n to 1 paraphrasing experiments 2020-04-24 14:26:38 -07:00
Sina b00932ccec simplified transform_dataset's inputs 2020-04-24 14:26:38 -07:00
Sina eda4f26502 paraphrasing now defaults to stdin and stdout 2020-04-24 14:26:38 -07:00
Sina d27323ebe6 efficient multi-gpu predictions 2020-04-24 14:26:28 -07:00
Sina b33fb529ab paraphrasing now keeps ids unique 2020-04-22 17:01:49 -07:00
Sina 2429d6c800 removed unused code in train.py 2020-04-22 17:01:13 -07:00
Mehrad Moradshahi 4b1a5fd95a Merge pull request #15 from stanford-oval/wip/multilanguage
Wip/multilanguage
2020-04-19 18:58:31 -07:00
mehrad 36b0fb5317 fix bug in iter 2020-04-19 18:03:11 -07:00
Sina 52ca2a6caf fixed LGTM alerts 2020-04-18 20:23:55 -07:00
Sina e24e426a3c test datasets should not be ignored 2020-04-18 19:54:48 -07:00
Sina c68fb694e5 removed old scripts 2020-04-18 19:12:46 -07:00
Sina 02dbbc3bad scripts to support paraphrase generation for dialogues 2020-04-18 19:11:22 -07:00
Sina 1a5a5ebf9b improved paraphrase generation
- can copy from input, or specify the beginning of the output
- calculate BLEU score and EM for outputs
- generation accepts multiple hyperparameter values
- reverse_position_ids support for when output length is known
- can select the best output based on various criteria
2020-04-18 19:10:28 -07:00
Sina eeca171e46 - improved filtering of paraphrasing dataset
- better normalization during generation for punctuation and special tokens
- normalization for cased paraphrasing models
2020-04-18 19:06:32 -07:00
Sina b0a0398576 more features and fixes for paraphraser training
- auxiliary train set for mixing seq2seq and LM modeling loss
- auxiliary dev set to calculate perplexity on
- support training of masked LMs
- transformers==2.5.1
- reversed position ids for when the length of output is assumed to be known
2020-04-18 19:03:45 -07:00
Sina 423cc2330f improved speed for prediction
- can override batch size during prediction
- can skip calculating unnecessary metrics
2020-04-18 18:55:40 -07:00
Sina a122c05062 paraphrase training is now done with tsv files 2020-04-18 18:46:45 -07:00
mehrad 0c68277dfe add sep token 2020-04-17 14:57:59 -07:00
mehrad fb7f5fe979 fix shuffle + cap number of paired examples 2020-04-16 20:26:38 -07:00
mehrad 6c7f14a34b filter out same-sentence pairs 2020-04-13 22:19:32 -07:00