# Genie NLP library

[![Build Status](https://travis-ci.com/stanford-oval/genienlp.svg?branch=master)](https://travis-ci.com/stanford-oval/genienlp) [![Language grade: Python](https://img.shields.io/lgtm/grade/python/g/stanford-oval/genienlp.svg?logo=lgtm&logoWidth=18)](https://lgtm.com/projects/g/stanford-oval/genienlp/context:python)

This library contains the NLP models for the [Genie](https://github.com/stanford-oval/genie-toolkit) toolkit for
virtual assistants. It is derived from the [decaNLP](https://github.com/salesforce/decaNLP) library by Salesforce,
but has diverged significantly.

The library is suitable for all NLP tasks that can be framed as Contextual Question Answering, that is, as three parts:
- text or structured input as _context_
- text input as _question_
- text or structured output as _answer_

As the [decaNLP paper](https://arxiv.org/abs/1806.08730) shows, many different NLP tasks can be framed in this way.
Genie primarily uses the library for semantic parsing, dialogue state tracking, and natural language generation
given a formal dialogue state, and these are the tasks the models work best for.
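
To make the framing concrete, here is a minimal Python sketch of how two different tasks reduce to the same (context, question, answer) shape. The `CQAExample` class, the example strings, and the query notation are all hypothetical illustrations; none of them are part of the genienlp API or of Genie's formal language.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CQAExample:
    """One task instance framed as contextual question answering (hypothetical type)."""

    context: str   # text or serialized structured input
    question: str  # text input saying what to do with the context
    answer: str    # text or serialized structured output


# Illustrative instances only; the strings are made up for this sketch.
examples = [
    # Sentiment classification: the question stays fixed across the task.
    CQAExample(
        context="The soup was cold and the service was slow.",
        question="Is this review positive or negative?",
        answer="negative",
    ),
    # Semantic parsing: the answer is a formal representation
    # (made-up notation, not Genie's actual target language).
    CQAExample(
        context="show me restaurants near here",
        question="Translate from English to a formal query.",
        answer="search(type=restaurant, near=here)",
    ),
]

for ex in examples:
    print(f"{ex.question!r} + {ex.context!r} -> {ex.answer!r}")
```

Because every task shares this shape, a single model interface can in principle serve all of them; only the data changes.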
## Installation
genienlp is available on PyPI. You can install it with:
```bash
pip3 install genienlp
```
After installation, the `genienlp` command becomes available.

You will likely also want to download the word embeddings ahead of time:
```bash
genienlp cache-embeddings --embeddings glove+char -d <embeddingdir>
```
## Usage
Train a model:
```bash
genienlp train --tasks almond --train_iterations 50000 --embeddings <embeddingdir> --data <datadir> --save <modeldir>
```
Generate predictions:
```bash
genienlp predict --tasks almond --data <datadir> --path <modeldir>
```
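
If you prefer to drive training and prediction from a script, the two commands above can be assembled in Python and passed to `subprocess`. This is only a sketch: the helper names and the `./embeddings`, `./data`, and `./model` paths are hypothetical, and the only flags used are the ones shown in this README. It defaults to a dry run that prints the command rather than executing it.

```python
import shlex
import subprocess


def build_train_command(embedding_dir, data_dir, model_dir, iterations=50000):
    """Assemble the `genienlp train` invocation shown in this README."""
    return [
        "genienlp", "train",
        "--tasks", "almond",
        "--train_iterations", str(iterations),
        "--embeddings", embedding_dir,
        "--data", data_dir,
        "--save", model_dir,
    ]


def build_predict_command(data_dir, model_dir):
    """Assemble the `genienlp predict` invocation shown in this README."""
    return [
        "genienlp", "predict",
        "--tasks", "almond",
        "--data", data_dir,
        "--path", model_dir,
    ]


def run(command, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(shlex.join(command))
    if not dry_run:
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(command, check=True)


# Hypothetical directories; replace them with your own paths.
run(build_train_command("./embeddings", "./data", "./model"))
run(build_predict_command("./data", "./model"))
```

Passing the arguments as a list avoids shell quoting problems when a path contains spaces. Set `dry_run=False` once genienlp is installed and the directories exist.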
Train a paraphrasing model:
```bash
genienlp train-paraphrase \
  --train_data_file <train_data_file> \
  --eval_data_file <dev_data_file> \
  --output_dir <modeldir> \
  --model_type gpt2 \
  --do_train --do_eval --evaluate_during_training \
  --logging_steps 1000 --save_steps 1000 --max_steps 40000 --save_total_limit 2 \
  --gradient_accumulation_steps 16 \
  --per_gpu_eval_batch_size 4 --per_gpu_train_batch_size 4 \
  --num_train_epochs 1 \
  --model_name_or_path <gpt2 / gpt2-medium / gpt2-large / gpt2-xlarge>
```
Generate paraphrases:
```bash
genienlp run-paraphrase \
  --model_type gpt2 \
  --model_name_or_path <modeldir> \
  --temperature 0.3 --repetition_penalty 1.0 \
  --num_samples 4 --length 15 --batch_size 32 \
  --input_file <input tsv file> \
  --input_column 1
```
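
The `--input_file` and `--input_column` options above select one column of a tab-separated file as the sentences to paraphrase. The sketch below shows that selection in plain Python so you can check your file before running the command. It assumes a zero-based column index and no header row; this README confirms neither, so verify with `genienlp run-paraphrase --help`. The `read_column` helper and the sample rows are hypothetical.

```python
import csv
import io


def read_column(tsv_text, column):
    """Return the given column from tab-separated text, skipping short rows."""
    # QUOTE_NONE treats quote characters as ordinary text, as in a plain TSV.
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row[column] for row in reader if len(row) > column]


# Made-up sample: an id in column 0, the sentence in column 1.
sample = "1\tshow me restaurants nearby\n2\tplay some jazz music\n"

# ASSUMPTION: the index is zero-based, so 1 selects the second field.
sentences = read_column(sample, 1)
for sentence in sentences:
    print(sentence)
```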
See `genienlp --help` and `genienlp <command> --help` for details about each argument.
## Citation
If you use the MultiTask Question Answering model in your work, please cite [*The Natural Language Decathlon: Multitask Learning as Question Answering*](https://arxiv.org/abs/1806.08730).
```bibtex
@article{McCann2018decaNLP,
title={The Natural Language Decathlon: Multitask Learning as Question Answering},
author={Bryan McCann and Nitish Shirish Keskar and Caiming Xiong and Richard Socher},
journal={arXiv preprint arXiv:1806.08730},
year={2018}
}
```
If you use the BERT-LSTM model (Identity encoder + MQAN decoder), please cite [*Schema2QA: High-Quality and Low-Cost Q&A Agents for the Structured Web*](https://arxiv.org/abs/2001.05609).
```bibtex
@InProceedings{xu2020schema2qa,
title={{Schema2QA}: High-Quality and Low-Cost {Q\&A} Agents for the Structured Web},
author={Silei Xu and Giovanni Campagna and Jian Li and Monica S. Lam},
booktitle={Proceedings of the 29th ACM International Conference on Information and Knowledge Management},
year={2020},
doi={10.1145/3340531.3411974}
}
```
If you use the paraphrasing model (BART or GPT-2 fine-tuned on a paraphrasing dataset), please cite [*AutoQA: From Databases to QA Semantic Parsers with Only Synthetic Training Data*](https://arxiv.org/abs/2010.04806).
```bibtex
@inproceedings{xu2020autoqa,
title={Auto{QA}: From Databases to {QA} Semantic Parsers with Only Synthetic Training Data},
author={Silei Xu and Sina J. Semnani and Giovanni Campagna and Monica S. Lam},
booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing},
year={2020}
}
```