# GenieNLP
GenieNLP is suitable for all NLP tasks, including text generation (e.g. translation, paraphrasing), token classification (e.g. named entity recognition) and sequence classification (e.g. NLI, sentiment analysis).
This library contains the code to run NLP models for the [Genie Toolkit](https://github.com/stanford-oval/genie-toolkit) and the [Genie Virtual Assistant](https://genie.stanford.edu/).
Genie primarily uses this library for semantic parsing, paraphrasing, translation, and dialogue state tracking. Therefore, GenieNLP has a lot of extra features for these tasks.
Works with [🤗 models](https://huggingface.co/models) and [🤗 Datasets](https://huggingface.co/datasets).
## Table of Contents
- [Installation](#installation)
- [Usage](#usage)
- [Training a semantic parser](#training-a-semantic-parser)
- [Inference on a semantic parser](#inference-on-a-semantic-parser)
- [Calibrating a trained model](#calibrating-a-trained-model)
- [Paraphrasing](#paraphrasing)
- [Translation](#translation)
- [Named Entity Disambiguation](#named-entity-disambiguation)
- [Citation](#citation)
## Installation
GenieNLP is tested with Python 3.8.
You can install the latest release with pip from PyPI:
```bash
pip install genienlp
```
Or from source:
```bash
git clone https://github.com/stanford-oval/genienlp.git
cd genienlp
pip install -e . # -e means your changes to the code will automatically take effect without the need to reinstall
```
After installation, the `genienlp` command becomes available.
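As a quick sanity check, you can print the help text of the newly installed command:
```bash
genienlp --help   # lists the available subcommands, e.g. train and predict
```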
Some GenieNLP commands have additional dependencies for plotting and entity detection. If you use those commands, install their dependencies by running:
```bash
pip install matplotlib~=3.0 seaborn~=0.9
python -m spacy download en_core_web_sm
```
## Usage
### Training a semantic parser
The general form is:
```bash
genienlp train --train_tasks almond --train_iterations 50000 --data <datadir> --save <model_dir>
```
The `<datadir>` should contain a single folder called "almond" (the name of the task). That folder should
contain the files "train.tsv" and "eval.tsv" for the train and dev sets, respectively.
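For example, a minimal layout could be created like this (the directory names `dataset` and `model` are illustrative, not required):
```bash
# Illustrative layout; "dataset" and "model" are arbitrary directory names.
mkdir -p dataset/almond
cp train.tsv eval.tsv dataset/almond/   # train and dev files for the "almond" task
genienlp train --train_tasks almond --train_iterations 50000 --data dataset --save model
```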
To train a BERT-LSTM (or another MLM-based model), use:
```bash
genienlp train --train_tasks almond --train_iterations 50000 --data <datadir> --save <model_dir> \
--model TransformerLSTM --pretrained_model bert-base-cased --trainable_decoder_embedding 50
```
To train a BART or other Seq2Seq model, use:
```bash
genienlp train --train_tasks almond --train_iterations 50000 --data <datadir> --save <model_dir> \
--model TransformerSeq2Seq --pretrained_model facebook/bart-large --gradient_accumulation_steps 20
```
The default batch sizes are tuned for training on a single V100 GPU. Use `--train_batch_tokens` and `--val_batch_size`
to control the batch sizes. See `genienlp train --help` for the full list of options.
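For example, you might shrink both batch sizes to fit a smaller GPU (the values below are illustrative only, not recommendations):
```bash
genienlp train --train_tasks almond --train_iterations 50000 --data <datadir> --save <model_dir> \
  --train_batch_tokens 2000 --val_batch_size 1000   # illustrative values; tune for your GPU memory
```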
**NOTE**: the BERT-LSTM model used by the current version of the library is not comparable with the
one used in our published paper (cited below), because the input preprocessing is different. If you
wish to compare with published results, you should use genienlp <= 0.5.0.
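If you need to reproduce those results, one of the older releases can be installed from PyPI, for example:
```bash
pip install 'genienlp<=0.5.0'   # any release at or below 0.5.0, as noted above
```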
### Inference on a semantic parser
In batch mode:
```bash
genienlp predict --tasks almond --data <datadir> --path <model_dir> --eval_dir <eval_dir>