torchtext

This repository consists of:

  • torchtext.data : Generic data loaders, abstractions, and iterators for text (including vocabulary and word vectors)
  • torchtext.datasets : Pre-built loaders for common NLP datasets
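The examples below refer to these modules as data and datasets; a minimal sketch of the corresponding imports (assuming the package is importable as a top-level torchtext module — the exact import path may differ in this fork):

from torchtext import data
from torchtext import datasets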

Data

The data module provides the following:

  • Ability to describe declaratively how to load a custom NLP dataset that's in a "normal" format:
pos = data.TabularDataset(
    path='data/pos/pos_wsj_train.tsv', format='tsv',
    fields=[('text', data.Field()),
            ('labels', data.Field())])

sentiment = data.TabularDataset(
    path='data/sentiment/train.json', format='json',
    fields={'sentence_tokenized': ('text', data.Field(sequential=True)),
            'sentiment_gold': ('labels', data.Field(sequential=False))})
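# Sketch of inspecting a loaded dataset (illustrative only; the paths above
# must exist for this to run). Each entry in .examples carries one attribute
# per field name given above, holding the preprocessed (tokenized) values.
print(len(pos))
print(pos.examples[0].text, pos.examples[0].labels)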
  • Ability to define a preprocessing pipeline:
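# my_custom_tokenizer is not defined in this README; as a stand-in (an
# assumption, not the project's actual tokenizer) any callable mapping a
# string to a list of tokens will do, e.g.:
def my_custom_tokenizer(text):
    return text.split()
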
src = data.Field(tokenize=my_custom_tokenizer)
trg = data.Field(tokenize=my_custom_tokenizer)
mt_train = datasets.TranslationDataset(
    path='data/mt/wmt16-ende.train', exts=('.en', '.de'),
    fields=(src, trg))
  • Batching, padding, and numericalizing (including building a vocabulary object):
# continuing from above
mt_dev = datasets.TranslationDataset(
    path='data/mt/newstest2014', exts=('.en', '.de'),
    fields=(src, trg))
src.build_vocab(mt_train, max_size=80000)
trg.build_vocab(mt_train, max_size=40000)
# mt_dev shares the fields, so it shares their vocab objects

train_iter = data.BucketIterator(
    dataset=mt_train, batch_size=32, 
    sort_key=lambda x: data.interleave_keys(len(x.src), len(x.trg)))
# usage
>>> next(iter(train_iter))
<data.Batch(batch_size=32, src=[LongTensor (32, 25)], trg=[LongTensor (32, 28)])>
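# the shared Vocab objects can be inspected too (sketch; the token and index
# below are arbitrary examples, not output from a real run)
>>> len(src.vocab)         # vocabulary size
>>> src.vocab.stoi['the']  # token -> integer id
>>> src.vocab.itos[10]     # integer id -> token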
  • Wrapper for dataset splits (train, validation, test):
TEXT = data.Field()
LABELS = data.Field()

train, val, test = data.TabularDataset.splits(
    path='/data/pos_wsj/pos_wsj', train='_train.tsv',
    validation='_dev.tsv', test='_test.tsv', format='tsv',
    fields=[('text', TEXT), ('labels', LABELS)])

train_iter, val_iter, test_iter = data.BucketIterator.splits(
    (train, val, test), batch_sizes=(16, 256, 256),
    sort_key=lambda x: len(x.text), device=0)

TEXT.build_vocab(train)
LABELS.build_vocab(train)
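
Each batch produced by these iterators exposes one attribute per field, named after the first element of the corresponding (name, field) pair; a minimal sketch of reading them (mirroring the usage example above):

# grab a single batch and read the per-field tensors
batch = next(iter(train_iter))
print(batch.text.size())    # LongTensor of token ids, padded within the batch
print(batch.labels.size())  # LongTensor of label ids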

Datasets

The datasets module currently contains:

  • Sentiment analysis: SST and IMDb
  • Question classification: TREC
  • Entailment: SNLI
  • Language modeling: abstract class + WikiText-2
  • Machine translation: abstract class + Multi30k, IWSLT, WMT14
  • Sequence tagging (e.g. POS/NER): abstract class + UDPOS

Others are planned or in progress:

  • Question answering: SQuAD

See the "test" directory for examples of dataset usage.
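
For instance, a rough sketch of loading one of the bundled datasets (assuming SST follows the same Field/splits conventions as the examples above; download locations and optional arguments are omitted):

# sentence field and (non-sequential) label field for SST
TEXT = data.Field()
LABEL = data.Field(sequential=False)

# splits() returns the standard train/dev/test portions
train, val, test = datasets.SST.splits(TEXT, LABEL)

TEXT.build_vocab(train)
LABEL.build_vocab(train)

train_iter, val_iter, test_iter = data.BucketIterator.splits(
    (train, val, test), batch_sizes=(32, 256, 256),
    sort_key=lambda x: len(x.text))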