# Genie NLP library

[![Build Status](https://travis-ci.com/stanford-oval/genienlp.svg?branch=master)](https://travis-ci.com/stanford-oval/genienlp) [![Language grade: Python](https://img.shields.io/lgtm/grade/python/g/stanford-oval/genienlp.svg?logo=lgtm&logoWidth=18)](https://lgtm.com/projects/g/stanford-oval/genienlp/context:python)

This library contains the NLP models for the [Genie](https://github.com/stanford-oval/genie-toolkit) toolkit for
virtual assistants. It is derived from the [decaNLP](https://github.com/salesforce/decaNLP) library by Salesforce,
but has diverged significantly.

The library is suitable for all NLP tasks that can be framed as Contextual Question Answering, that is, tasks with three components:
- text or structured input as _context_
- text input as _question_
- text or structured output as _answer_

As the [decaNLP paper](https://arxiv.org/abs/1806.08730) shows, many different NLP tasks can be framed in this way.
Genie primarily uses the library for semantic parsing, dialogue state tracking, and natural language generation
given a formal dialogue state, and the models work best on these tasks.
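
As a rough illustration of this framing, the sketch below represents a few tasks as (context, question, answer) triples. The `ContextualQAExample` class and the example strings are hypothetical and are not part of the genienlp API; the question wordings only loosely follow the style of the decaNLP paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextualQAExample:
    """One task instance in the Contextual Question Answering format.

    Hypothetical container for illustration only; not a genienlp class.
    """
    context: str   # text or structured input
    question: str  # text input that says what to do with the context
    answer: str    # text or structured output


# Illustrative, made-up examples showing how different tasks fit the same shape.
examples = [
    # Sentiment classification: the review is the context.
    ContextualQAExample(
        context="The food was great and the staff were friendly.",
        question="Is this review negative or positive?",
        answer="positive",
    ),
    # Summarization: the document is the context.
    ContextualQAExample(
        context="The city council met on Monday and approved the new budget.",
        question="What is the summary?",
        answer="Council approves budget.",
    ),
    # Semantic parsing: the utterance is the context and the answer is a
    # formal program (left as a placeholder here).
    ContextualQAExample(
        context="show me restaurants nearby",
        question="What is the program for this command?",
        answer="<formal program>",
    ),
]

for ex in examples:
    print(f"context={ex.context!r} | question={ex.question!r} | answer={ex.answer!r}")
```

Because every task shares the same three fields, a single model can in principle be trained on any of them without task-specific input handling.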

## Installation

genienlp is available on PyPI. You can install it with:
```bash
pip3 install genienlp
```

After installation, a `genienlp` command becomes available.

You will likely also want to download the word embeddings ahead of time:

```bash
genienlp cache-embeddings --embeddings glove+char -d <embeddingdir>
```

## Usage

Train a model:
```bash
genienlp train --tasks almond --train_iterations 50000 --embeddings <embeddingdir> --data <datadir> --save <modeldir>
```

Generate predictions:
```bash
genienlp predict --tasks almond --data <datadir> --path <modeldir>
```
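
If you drive training and prediction from a script, the two commands above can be assembled programmatically. The following is a minimal sketch that uses only the flags shown in this README; the helper names `train_command` and `predict_command` and the placeholder paths are hypothetical, and the script only prints the commands rather than running them.

```python
import shlex


def train_command(embedding_dir, data_dir, model_dir, iterations=50000, task="almond"):
    """Build the argument list for `genienlp train` (flags taken from this README)."""
    return [
        "genienlp", "train",
        "--tasks", task,
        "--train_iterations", str(iterations),
        "--embeddings", embedding_dir,
        "--data", data_dir,
        "--save", model_dir,
    ]


def predict_command(data_dir, model_dir, task="almond"):
    """Build the argument list for `genienlp predict` (flags taken from this README)."""
    return [
        "genienlp", "predict",
        "--tasks", task,
        "--data", data_dir,
        "--path", model_dir,
    ]


if __name__ == "__main__":
    # Placeholder paths; replace them with your own directories.
    commands = [
        train_command("embeddings", "data", "model"),
        predict_command("data", "model"),
    ]
    for cmd in commands:
        # shlex.join quotes arguments so the printed line can be pasted into a shell.
        print(shlex.join(cmd))
        # To execute instead of print, assuming genienlp is installed:
        #   import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell-quoting problems when directory names contain spaces.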

Train a paraphrasing model:
```bash
genienlp train-paraphrase --train_data_file <train_data_file> --eval_data_file <dev_data_file> --output_dir <modeldir> --model_type gpt2 --do_train --do_eval --evaluate_during_training --logging_steps 1000 --save_steps 1000 --max_steps 40000 --save_total_limit 2 --gradient_accumulation_steps 16 --per_gpu_eval_batch_size 4 --per_gpu_train_batch_size 4 --num_train_epochs 1 --model_name_or_path <gpt2/gpt2-medium/gpt2-large/gpt2-xl>
```

Generate paraphrases:
```bash
genienlp run-paraphrase --model_type gpt2 --model_name_or_path <modeldir> --temperature 0.3 --repetition_penalty 1.0 --num_samples 4 --length 15 --batch_size 32 --input_file <input tsv file> --input_column 1
```

See `genienlp --help` and `genienlp <command> --help` for details about each argument.

## Citation

If you use the MultiTask Question Answering model in your work, please cite [*The Natural Language Decathlon: Multitask Learning as Question Answering*](https://arxiv.org/abs/1806.08730).

```bibtex
@article{McCann2018decaNLP,
  title={The Natural Language Decathlon: Multitask Learning as Question Answering},
  author={Bryan McCann and Nitish Shirish Keskar and Caiming Xiong and Richard Socher},
  journal={arXiv preprint arXiv:1806.08730},
  year={2018}
}
```

If you use the BERT-LSTM model (Identity encoder + MQAN decoder), please cite [*Schema2QA: High-Quality and Low-Cost Q&A Agents for the Structured Web*](https://arxiv.org/abs/2001.05609).

```bibtex
@InProceedings{xu2020schema2qa,
  title={{Schema2QA}: High-Quality and Low-Cost {Q\&A} Agents for the Structured Web},
  author={Silei Xu and Giovanni Campagna and Jian Li and Monica S. Lam},
  booktitle={Proceedings of the 29th ACM International Conference on Information and Knowledge Management},
  year={2020},
  doi={10.1145/3340531.3411974}
}
```

If you use the paraphrasing model (BART or GPT-2 fine-tuned on a paraphrasing dataset), please cite [*AutoQA: From Databases to QA Semantic Parsers with Only Synthetic Training Data*](https://arxiv.org/abs/2010.04806).

```bibtex
@inproceedings{xu2020autoqa,
  title={Auto{QA}: From Databases to {QA} Semantic Parsers with Only Synthetic Training Data},
  author={Silei Xu and Sina J. Semnani and Giovanni Campagna and Monica S. Lam},
  booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing},
  year={2020}
}
```