# Genie NLP library

[![Build Status](https://travis-ci.com/stanford-oval/genienlp.svg?branch=master)](https://travis-ci.com/stanford-oval/genienlp) [![Language grade: Python](https://img.shields.io/lgtm/grade/python/g/stanford-oval/genienlp.svg?logo=lgtm&logoWidth=18)](https://lgtm.com/projects/g/stanford-oval/genienlp/context:python)

This library contains the NLP models for the [Genie](https://github.com/stanford-oval/genie-toolkit) toolkit for
virtual assistants. It is derived from the [decaNLP](https://github.com/salesforce/decaNLP) library by Salesforce,
but has diverged significantly.

The library is suitable for all NLP tasks that can be framed as Contextual Question Answering, that is, with three components:

- text or structured input as _context_
- text input as _question_
- text or structured output as _answer_

As the work by [McCann et al.](https://arxiv.org/abs/1806.08730) shows, many NLP tasks can be framed in this way.
Genie primarily uses the library for semantic parsing, paraphrasing, translation, and dialogue state tracking, and these are
the tasks the models work best for.

## Installation

genienlp is available on PyPI. You can install it with:

```bash
pip3 install genienlp
```

After installation, the `genienlp` command becomes available.

## Usage

### Training a semantic parser

The general form is:

```bash
genienlp train --train_tasks almond --train_iterations 50000 --data <datadir> --save <model_dir> <flags>
```

The `<datadir>` should contain a single folder called "almond" (the name of the task). That folder should
contain the files "train.tsv" and "eval.tsv" for the training and dev sets, respectively.
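
Before starting a long training run, it can help to confirm that the data directory has the layout described above. The following is a minimal sketch, not part of genienlp; the helper name `missing_dataset_files` is hypothetical, and it checks only that the two files exist, not their contents.

```python
from pathlib import Path

# Layout described above: <datadir>/<task>/train.tsv and <datadir>/<task>/eval.tsv
REQUIRED_FILES = ("train.tsv", "eval.tsv")


def missing_dataset_files(datadir, task="almond"):
    """Return the required files that are absent from <datadir>/<task>/.

    Hypothetical helper for illustration; genienlp does not ship this function.
    """
    task_dir = Path(datadir) / task
    return [str(task_dir / name) for name in REQUIRED_FILES if not (task_dir / name).is_file()]
```

An empty list means both files are in place; otherwise the list names the paths that still need to be created.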

To train a BERT-LSTM model (or another MLM-based model), use:

```bash
genienlp train --train_tasks almond --train_iterations 50000 --data <datadir> --save <model_dir> \
  --model TransformerLSTM --pretrained_model bert-base-cased --trainable_decoder_embedding 50
```

To train a BART or other Seq2Seq model, use:

```bash
genienlp train --train_tasks almond --train_iterations 50000 --data <datadir> --save <model_dir> \
  --model TransformerSeq2Seq --pretrained_model facebook/bart-large --gradient_accumulation_steps 20
```

The default batch sizes are tuned for training on a single V100 GPU. Use `--train_batch_tokens` and `--val_batch_size`
to control the batch sizes. See `genienlp train --help` for the full list of options.

**NOTE**: the BERT-LSTM model used by the current version of the library is not comparable with the
one used in our published paper (cited below), because the input preprocessing is different. If you
wish to compare with published results you should use genienlp <= 0.5.0.

### Inference on a semantic parser

In batch mode:

```bash
genienlp predict --tasks almond --data <datadir> --path <model_dir> --eval_dir <output>
```

The `<datadir>` should contain a single folder called "almond" (the name of the task). That folder should
contain the files "train.tsv" and "eval.tsv" for the training and dev sets, respectively. Batch predictions
are saved in `<output>/almond/valid.tsv`, a TSV file containing the ID and prediction for each example.
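
The prediction file can be loaded for further analysis with a few lines of Python. This is only a sketch: it assumes the ID is in the first column and the prediction in the second, with no header row, which the description above does not spell out. The function name `read_predictions` is hypothetical.

```python
import csv


def read_predictions(path):
    """Map example ID -> prediction from a tab-separated file.

    Assumption: column 0 is the ID, column 1 is the prediction, and there is
    no header row. Adjust the indices if your file is laid out differently.
    """
    predictions = {}
    with open(path, newline="", encoding="utf-8") as f:
        # QUOTE_NONE keeps quote characters inside predictions untouched.
        for row in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
            if len(row) >= 2:
                predictions[row[0]] = row[1]
    return predictions
```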

In interactive mode:

```bash
genienlp server --path <model_dir>
```

This opens a TCP server that listens for requests, formatted as JSON objects containing `id` (the ID of the request),
`task` (the name of the task), `context`, and `question`. The server writes out JSON objects containing `id` and
`answer`. The server listens on port 8401 by default. Use `--port` to specify a different port or `--stdin` to
use standard input/output instead of TCP.
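
A client for this protocol can be sketched in Python as follows. The field names and the default port come from the description above; the framing (one JSON object per line, terminated by a newline) is an assumption, and the function names are hypothetical rather than part of genienlp.

```python
import json
import socket


def build_request(request_id, task, context, question):
    """Serialize one request with the four fields the server expects."""
    payload = {"id": request_id, "task": task, "context": context, "question": question}
    return json.dumps(payload) + "\n"  # assumption: one JSON object per line


def parse_response(line):
    """Extract (id, answer) from one JSON response line."""
    response = json.loads(line)
    return response["id"], response["answer"]


def query_server(request_line, host="localhost", port=8401, timeout=30.0):
    """Send one request to a running `genienlp server` and return the first response line."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(request_line.encode("utf-8"))
        with sock.makefile("r", encoding="utf-8") as reader:
            return reader.readline()
```

With a server running locally, `parse_response(query_server(build_request("1", "almond", "<context>", "<question>")))` would return the request ID and the model's answer. What belongs in `context` and `question` depends on the task the model was trained on.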

### Calibrating a trained model

Calibrate the confidence scores of a trained model:

1. Calculate and save confidence features of the evaluation set in a pickle file:

   ```bash
   genienlp predict --tasks almond --data <datadir> --path <model_dir> --save_confidence_features --confidence_feature_path <confidence_feature_file>
   ```
2. Train a boosted tree to map confidence features to a score between 0 and 1:

   ```bash
   genienlp calibrate --confidence_path <confidence_feature_file> --save <calibrator_directory> --name_prefix <calibrator_name>
   ```
3. If you now pass `--calibrator_paths` during prediction, the output includes a confidence score for each prediction:

   ```bash
   genienlp predict --tasks almond --data <datadir> --path <model_dir> --calibrator_paths <calibrator_directory>/<calibrator_name>.calib
   ```
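
The three steps above can be chained from a script. The sketch below only assembles and runs the commands already shown, using the same flags; the function names are hypothetical, and it assumes `genienlp` is on the `PATH`.

```python
import subprocess


def calibration_commands(datadir, model_dir, feature_file, calibrator_dir, calibrator_name, task="almond"):
    """Return the three genienlp invocations from the steps above, as argv lists."""
    predict = ["genienlp", "predict", "--tasks", task, "--data", datadir, "--path", model_dir]
    # Step 2 writes <calibrator_dir>/<calibrator_name>.calib, which step 3 reads.
    calibrator_path = f"{calibrator_dir}/{calibrator_name}.calib"
    return [
        predict + ["--save_confidence_features", "--confidence_feature_path", feature_file],
        ["genienlp", "calibrate", "--confidence_path", feature_file,
         "--save", calibrator_dir, "--name_prefix", calibrator_name],
        predict + ["--calibrator_paths", calibrator_path],
    ]


def run_calibration(*args, **kwargs):
    """Run the three steps in order, stopping at the first failure."""
    for command in calibration_commands(*args, **kwargs):
        subprocess.run(command, check=True)
```

Building the commands separately from running them makes it easy to print and inspect each one before committing to a long prediction pass.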

### Paraphrasing

Train a paraphrasing model:

```bash
genienlp train-paraphrase --train_data_file <train_data_file> --eval_data_file <dev_data_file> --output_dir <model_dir> --model_type gpt2 --do_train --do_eval --evaluate_during_training --logging_steps 1000 --save_steps 1000 --max_steps 40000 --save_total_limit 2 --gradient_accumulation_steps 16 --per_gpu_eval_batch_size 4 --per_gpu_train_batch_size 4 --num_train_epochs 1 --model_name_or_path <gpt2/gpt2-medium/gpt2-large/gpt2-xl>
```

Generate paraphrases:

```bash
genienlp run-paraphrase --model_name_or_path <model_dir> --temperature 0.3 --repetition_penalty 1.0 --num_samples 4 --batch_size 32 --input_file <input_tsv_file> --input_column 1
```

### Translation

Use the following command to train or fine-tune an NMT model:

```bash
genienlp train --train_tasks almond_translate --data <data_directory> --train_languages <src_lang> --eval_languages <tgt_lang> --no_commit --train_iterations <iterations> --preserve_case --save <save_dir> --exist_ok --skip_cache --model TransformerSeq2Seq --pretrained_model <hf_model_name>
```

We currently support MarianMT, MBART, MT5, and M2M100 models.<br>
To save a pretrained model in genienlp format without any fine-tuning, set `--train_iterations` to 0. You can then use this model for inference.

To produce translations for an eval or test set, run the following command:

```bash
genienlp predict --tasks almond_translate --data <data_directory> --pred_languages <src_lang> --pred_tgt_languages <tgt_lang> --path <path_to_saved_model> --eval_dir <eval_dir> --skip_cache --val_batch_size 4000 --evaluate <valid/test>  --overwrite --silent
```

If your dataset is a document or contains long examples, pass `--translate_example_split` to break the examples down into individual sentences before translation for better results. <br>
To use [alignment](https://aclanthology.org/2020.emnlp-main.481.pdf), pass `--do_alignment`, which ensures that the tokens between quotation marks in the sentence are preserved during translation.

### Named Entity Disambiguation

First, run a Bootleg model to extract mentions, entity candidates, and contextual embeddings for the mentions:
```bash
genienlp bootleg-dump-features --train_tasks <train_task_names> --save <savedir> --preserve_case --data <dataset_dir> --train_batch_tokens 1200 --val_batch_size 2000 --database_type json --database_dir <database_dir> --min_entity_len 1 --max_entity_len 4 --bootleg_model <bootleg_model>
```
This command generates several output files. In `<dataset_dir>` you should see a `prep` directory, which contains preprocessed data (e.g. data converted to memory-mapped format and several arrays that facilitate embedding lookup). If your dataset doesn't change, you can reuse the same files.
The command also generates several files in the `<results_temp>` folder. In `eval_bootleg/[train|eval]/<bootleg_model>/bootleg_lables.jsonl` you can see the examples, mentions, predicted candidates, and their probabilities according to Bootleg.

You can now use the features extracted by Bootleg in downstream tasks such as semantic parsing to improve named entity understanding and, consequently, generation:
```bash
genienlp train --train_tasks <train_task_names> --train_iterations <iterations> --preserve_case --save <savedir> --data <dataset_dir> --model TransformerSeq2Seq --pretrained_model facebook/bart-base --train_batch_tokens 1000 --val_batch_size 1000 --do_ned --database_dir <database_dir> --ned_retrieve_method bootleg --entity_attributes type_id type_prob --add_entities_to_text append --bootleg_model <bootleg_model>
```


See `genienlp --help` and `genienlp <command> --help` for more details about each argument.


## Citation

If you use the MultiTask Question Answering model in your work, please cite [*The Natural Language Decathlon: Multitask Learning as Question Answering*](https://arxiv.org/abs/1806.08730).

```bibtex
@article{McCann2018decaNLP,
  title={The Natural Language Decathlon: Multitask Learning as Question Answering},
  author={Bryan McCann and Nitish Shirish Keskar and Caiming Xiong and Richard Socher},
  journal={arXiv preprint arXiv:1806.08730},
  year={2018}
}
```

If you use the BERT-LSTM model (Identity encoder + MQAN decoder), please cite [Schema2QA: High-Quality and Low-Cost Q&A Agents for the Structured Web](https://arxiv.org/abs/2001.05609).

```bibtex
@InProceedings{xu2020schema2qa,
  title={{Schema2QA}: High-Quality and Low-Cost {Q\&A} Agents for the Structured Web},
  author={Silei Xu and Giovanni Campagna and Jian Li and Monica S. Lam},
  booktitle={Proceedings of the 29th ACM International Conference on Information and Knowledge Management},
  year={2020},
  doi={10.1145/3340531.3411974}
}
```

If you use the paraphrasing model (BART or GPT-2 fine-tuned on a paraphrasing dataset), please cite [AutoQA: From Databases to QA Semantic Parsers with Only Synthetic Training Data](https://arxiv.org/abs/2010.04806).

```bibtex
@inproceedings{xu-etal-2020-autoqa,
    title = "{A}uto{QA}: From Databases to {QA} Semantic Parsers with Only Synthetic Training Data",
    author = "Xu, Silei  and Semnani, Sina  and Campagna, Giovanni  and Lam, Monica",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.31",
    pages = "422--434",
}
```

If you use multilingual models such as MarianMT, MBART, MT5, or XLMR-LSTM for Seq2Seq tasks, please cite [Localizing Open-Ontology QA Semantic Parsers in a Day Using Machine Translation](https://aclanthology.org/2020.emnlp-main.481/),
[Contextual Semantic Parsing for Multilingual Task-Oriented Dialogues](https://arxiv.org/abs/2111.02574), and the original paper that introduced the model.

```bibtex
@inproceedings{moradshahi-etal-2020-localizing,
    title = "Localizing Open-Ontology {QA} Semantic Parsers in a Day Using Machine Translation",
    author = "Moradshahi, Mehrad and Campagna, Giovanni and Semnani, Sina and Xu, Silei and Lam, Monica",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.481",
    pages = "5970--5983",
}
```

```bibtex
@article{moradshahi2021contextual,
  title={Contextual Semantic Parsing for Multilingual Task-Oriented Dialogues},
  author={Moradshahi, Mehrad and Tsai, Victoria and Campagna, Giovanni and Lam, Monica S},
  journal={arXiv preprint arXiv:2111.02574},
  year={2021}
}
```