Commit Graph

321 Commits

Author SHA1 Message Date
Giovanni Campagna 1c0a71aa0b Add debugging to docker hooks
To see exactly how they get called...
2019-08-30 21:37:42 +02:00
Giovanni Campagna fb72c84f73 docker: move embeddings to a shared directory
This way, the image can be used as a base image by almond-cloud
(which runs as a different user)
2019-08-30 21:25:54 +02:00
Giovanni Campagna ff311bd35b Add docker hub hooks to build both CPU and GPU images 2019-08-30 21:01:47 +02:00
Giovanni Campagna a9264a699d Remove obsolete Dockerfiles, and replace with a new one
The new Dockerfile wraps the new decanlp command and installs all
the dependencies correctly.
2019-08-30 18:55:28 +02:00
Giovanni Campagna 48b88317b7 Refuse to run with pytorch 1.2.0
Because pytorch 1.2.0 changed the behavior of bool vs. uint8 tensors
and that broke us...
2019-08-08 16:07:48 -07:00
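A guard like the one this commit describes usually amounts to inspecting torch.__version__ at startup; the sketch below is a minimal illustration, and whether the real check also rejects later releases or uses a different error message is an assumption.

```python
import sys
import torch

def check_pytorch_version():
    # pytorch 1.2.0 changed how bool and uint8 tensors behave, which breaks
    # this code, so bail out early with a clear error instead of failing later
    if torch.__version__.startswith('1.2.'):
        sys.exit('pytorch 1.2 is not supported; please use an earlier 1.x release')
```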
Giovanni Campagna f5ea63ecb0
Merge pull request #15 from stanford-oval/wip/mehrad/multi-language
Wip/mehrad/multi language
2019-05-30 08:37:37 -07:00
mehrad 21db011ad2 fixes 2019-05-29 20:57:09 -07:00
mehrad 4919b28627 fixes 2019-05-29 19:38:24 -07:00
mehrad 672aa14117 merge branch wip/mehrad/multi_lingual 2019-05-29 18:43:10 -07:00
mehrad 4004664259 fixes 2019-05-29 16:42:36 -07:00
mehrad a27c3a8d8b bug fixes 2019-05-29 13:37:18 -07:00
mehrad d52d862310 fixing bugs 2019-05-29 12:18:57 -07:00
mehrad eb10a788b0 updates 2019-05-28 18:12:31 -07:00
mehrad bb90a35bc0 updates 2019-05-28 18:11:32 -07:00
mehrad 612e3bdd4d output context sentences as well for predict.py 2019-05-28 17:43:10 -07:00
mehrad 85c7a99ec2 bunch of updates 2019-05-22 14:04:16 -07:00
mehrad 90308a84e3 minor fix 2019-05-20 17:50:50 -07:00
mehrad acfcbf88c4 updating prediction scripts 2019-05-20 17:43:45 -07:00
mehrad 7db36a90af fixes 2019-05-20 14:29:25 -07:00
mehrad ec77faacca Gluing the models together
end-to-end finetuning works now
2019-05-20 13:41:59 -07:00
mehrad 9bf2217324 adding arguments 2019-05-20 11:02:42 -07:00
Giovanni Campagna 25db020107 Reduce memory usage while loading almond datasets
Don't load all lines in memory
2019-05-15 09:27:00 -07:00
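Not loading all lines in memory typically means iterating the file object instead of calling readlines(); a generic sketch of that pattern, assuming a tab-separated dataset file (the actual almond loader is more involved):

```python
def iter_examples(path):
    # stream the dataset one line at a time, so memory usage stays
    # roughly constant regardless of how large the file is
    with open(path, encoding='utf-8') as fp:
        for line in fp:
            line = line.rstrip('\n')
            if not line:
                continue
            yield line.split('\t')
```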
Giovanni Campagna 488a4feb64 Fix contextual almond 2019-05-15 09:26:54 -07:00
mehrad df94f7dd3a ignore malformed sentences 2019-05-13 14:50:08 -07:00
mehrad 0495d4d0eb Updates
1) use FastText for encoding Persian text
2) let the user choose the question for the almond task
3) bug fixes
2019-05-13 13:03:51 -07:00
Giovanni Campagna 27a3a8b173
Merge pull request #14 from stanford-oval/wip/contextual
Contextual Almond
2019-05-10 09:54:51 -07:00
mehrad 925d839e15 adding an end-to-end combined model 2019-04-29 13:56:39 -07:00
Giovanni Campagna 46eaae8ba8 fix almond dataset name 2019-04-23 09:35:57 -07:00
Giovanni Campagna 64020a497f Add ContextualAlmond task
Its training files have four columns: <id> <context> <sentence> <target_program>
2019-04-23 09:30:49 -07:00
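Given the four-column layout described above, one line of a ContextualAlmond training file could be parsed roughly like this; the tab separator and the field names are assumptions, not taken from the repository.

```python
def parse_contextual_almond_line(line):
    # each line carries four tab-separated columns:
    # <id> <context> <sentence> <target_program>
    _id, context, sentence, target_program = line.rstrip('\n').split('\t')
    return {
        'id': _id,
        'context': context,
        'sentence': sentence,
        'target': target_program,
    }
```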
Giovanni Campagna 6837cc7e7b Add missing dependency
Previously it was probably being pulled in indirectly by another dependency, such as allennlp.
2019-04-17 12:10:45 -07:00
Giovanni Campagna 73ffec5365 Populate "install_requires" package metadata
This is necessary to automatically install dependencies when
the user installs the library with pip.
2019-04-17 11:40:40 -07:00
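Declaring dependencies in install_requires is what lets pip resolve and install them automatically alongside the library; a minimal setup.py sketch, where the package name, version, and dependency list are illustrative placeholders rather than the repository's actual metadata:

```python
from setuptools import setup, find_packages

setup(
    name='decanlp',            # illustrative; matches the command mentioned elsewhere in this log
    version='0.1.0',           # placeholder version
    packages=find_packages(),
    install_requires=[
        # pip installs these automatically when the library itself is installed
        'numpy',
        'torch',
    ],
)
```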
Giovanni Campagna 68e76f7990 Remove unused dependencies
These are not used anywhere I can see.
2019-04-17 11:39:54 -07:00
Giovanni Campagna 13e1c0335e Load allennlp, cove libraries lazily
These libraries are only needed if one passes --elmo or --cove
on the command line. They are annoyingly big libraries, so
it makes sense to keep them optional.
2019-04-17 11:39:15 -07:00
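Lazy loading usually means moving the import from module top level into the code path that needs it, so users who never pass --elmo or --cove never pay the import cost. One common way to express that, sketched generically (the caching helper below is hypothetical, not the repository's code):

```python
_cove = None

def get_cove():
    # import cove only on first use; callers that never request --cove
    # never trigger this (large) import at all
    global _cove
    if _cove is None:
        import cove  # heavy optional dependency, deliberately deferred
        _cove = cove
    return _cove
```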
mehrad 19067c71ba option to retrain encoder embeddings 2019-04-15 16:11:10 -07:00
Giovanni Campagna bb84b2b130
Merge pull request #13 from stanford-oval/wip/mmap-embeddings
Memory-mappable embeddings
2019-04-10 23:21:33 -07:00
Giovanni Campagna 8399064f15 vocab: restore "dim" property on load 2019-04-10 11:21:31 -07:00
Giovanni Campagna aed5576756 vocab: use a better hash function
The previous one was not great, and it was particularly bad for
char ngrams, where it would produce collisions almost constantly
2019-04-10 10:59:57 -07:00
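The commit does not say which hash was adopted, but a standard choice that behaves well on short strings such as character n-grams is FNV-1a; the sketch below illustrates that idea and is not necessarily what the repository uses.

```python
def fnv1a_hash(word, num_buckets):
    # FNV-1a over the UTF-8 bytes: every byte is mixed into the state,
    # so near-identical short strings (e.g. char n-grams) rarely collide
    h = 0xcbf29ce484222325
    for byte in word.encode('utf-8'):
        h ^= byte
        h = (h * 0x100000001b3) & 0xffffffffffffffff  # keep it a 64-bit value
    return h % num_buckets
```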
Giovanni Campagna 94bebc4435 update tests 2019-04-10 10:38:16 -07:00
Giovanni Campagna 335c792a27 mmappable embeddings: make it work
- handle integer overflow correctly in hashing
- store table, itos and vectors in separate files, because numpy
  ignores mmap_mode for npz files
- optimize the loading of the txt vectors and free memory eagerly
  because otherwise we run out of memory before saving
2019-04-10 10:31:25 -07:00
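The note about npz is the key detail: np.load honors mmap_mode only for plain .npy files, so each array has to live in its own file. A sketch of how saving and memory-mapped loading could look, with file names chosen here only for illustration:

```python
import numpy as np

def save_embeddings(prefix, table, vectors):
    # one .npy file per array: np.load ignores mmap_mode for .npz archives,
    # so bundling everything into one archive would defeat memory mapping
    np.save(prefix + '.table.npy', table)
    np.save(prefix + '.vectors.npy', vectors)

def load_embeddings(prefix):
    # mmap_mode='r' maps the files instead of reading them into RAM; pages
    # are faulted in on demand and shared between processes
    table = np.load(prefix + '.table.npy', mmap_mode='r')
    vectors = np.load(prefix + '.vectors.npy', mmap_mode='r')
    return table, vectors
```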
Giovanni Campagna 8112a985c8 Add "cache-embeddings" subcommand to download embeddings
It's useful to download the embeddings as a separate step
from training or deployment, for example to train on a
firewalled machine.
2019-04-09 16:54:12 -07:00
Giovanni Campagna 3f8f836d02 torchtext.Vocab: store word embeddings in mmap-friendly format on disk
torch.load/save uses pickle, which is not mmappable and causes high
memory usage: the vectors must be completely stored in memory.
This is fine during training, because the training machines are
large and have a lot of RAM, but during inference we want to reduce
memory usage to deploy more models on one machine.

Instead, if we use numpy's npz format (uncompressed), all the word
vectors can be stored on disk and loaded on demand when the page
is faulted in. Furthermore, all pages are shared between processes
(so multiple models only use one copy of the embeddings), and the
kernel can free the memory back to disk under pressure.

The annoying part is that we can only store numpy ndarrays in this
format, and not Python native dicts. So instead we need a custom
HashTable implementation that is backed by numpy ndarrays.
As a side bonus, the custom implementation keeps only one copy
of all the words in memory, so memory usage is lower.
2019-04-09 16:54:12 -07:00
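Because .npy files can only hold ndarrays and not Python dicts, the word-to-index mapping has to be rebuilt on top of arrays. Below is a minimal open-addressing sketch of that idea; the real HashTable in the repository will differ, and the inline stable hash is only there to keep the example self-contained.

```python
import numpy as np

def _stable_hash(word):
    # Python's built-in hash() is randomized per process, so a table that is
    # persisted to disk needs a deterministic hash; FNV-1a is good enough here
    h = 0xcbf29ce484222325
    for b in word.encode('utf-8'):
        h = ((h ^ b) * 0x100000001b3) & 0xffffffffffffffff
    return h

class NumpyHashTable:
    """Word -> embedding-row mapping backed only by ndarrays, so it can be
    saved as .npy files and memory-mapped later."""

    def __init__(self, words, num_buckets):
        # num_buckets must exceed len(words), otherwise probing never finds a free slot
        self.itos = list(words)
        self.table = np.full(num_buckets, -1, dtype=np.int64)  # -1 marks an empty bucket
        for idx, word in enumerate(self.itos):
            slot = _stable_hash(word) % num_buckets
            while self.table[slot] != -1:          # linear probing on collision
                slot = (slot + 1) % num_buckets
            self.table[slot] = idx

    def lookup(self, word):
        slot = _stable_hash(word) % len(self.table)
        while self.table[slot] != -1:
            idx = int(self.table[slot])
            if self.itos[idx] == word:
                return idx
            slot = (slot + 1) % len(self.table)
        return None                                # out-of-vocabulary word
```

Keeping itos as the only place the strings live is what gives the side bonus mentioned above: each word is stored once, and the table itself is just integers.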
Giovanni Campagna 1021c4851c word vectors: ignore all words longer than 100 characters
There are ~100 of these in GloVe and they are all garbage (horizontal
lines, sequences of numbers, and URLs). This keeps the maximum
word length in check.
2019-04-09 16:54:11 -07:00
mehrad 4905ad6ce8 Fixes
Apparently the layer norm implementation can't be tampered with!
Reverting the change for now and switching to a new branch to fix this properly.
2019-04-08 17:24:02 -07:00
mehrad 03cdc2d0c1 consistent formatting 2019-04-08 16:18:30 -07:00
mehrad a7a2d752d2 Fixes
std() in layer normalization is the culprit for generating NaN.
It happens in the backward pass for values with zero variance.
Just update the mean for these batches.
2019-04-08 14:48:23 -07:00
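The NaN comes from the gradient of std(), which divides by the standard deviation and therefore blows up when the variance is exactly zero. A common way to sidestep that, shown below as a sketch, is to normalize with sqrt(var + eps) instead of std(); this is not necessarily the fix this commit took, which only recenters the affected batches.

```python
import torch

def layer_norm(x, eps=1e-5):
    # normalize over the last dimension; putting eps under the sqrt keeps
    # the backward pass finite even when a row has exactly zero variance
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    return (x - mean) / torch.sqrt(var + eps)
```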
mehrad 4acdba6c22 fix for NaN loss 2019-04-05 10:26:35 -07:00
Giovanni Campagna d16277b4d3 stop if loss is less than 1e-5 for more than 100 iterations 2019-03-31 17:12:38 -07:00
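That stopping rule amounts to a small counter in the training loop; sketched here in isolation, with variable and parameter names that are assumptions rather than the repository's own:

```python
def should_stop(loss, state, threshold=1e-5, patience=100):
    # count consecutive iterations whose loss is below the threshold and
    # stop training once there have been more than `patience` in a row
    if loss < threshold:
        state['low_loss_iters'] = state.get('low_loss_iters', 0) + 1
    else:
        state['low_loss_iters'] = 0
    return state['low_loss_iters'] > patience
```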
Giovanni Campagna 09c6e77525
Merge pull request #12 from Stanford-Mobisocial-IoT-Lab/wip/thingtalk-lm
Pretrained decoder language model
2019-03-28 17:58:58 -07:00
mehrad 34ba4d2600 skip batches with NaN loss 2019-03-28 12:37:01 -07:00
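Skipping a batch with a NaN loss typically just means checking before the backward pass, so the NaN never reaches the gradients or the optimizer state; a generic sketch of one training step, assuming a scalar loss:

```python
import math

def train_step(model, batch, optimizer, compute_loss):
    # if the loss came out as NaN, drop the batch entirely instead of
    # letting NaNs propagate into the gradients and optimizer state
    loss = compute_loss(model, batch)
    if math.isnan(loss.item()):
        return None
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```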
Giovanni Campagna 3e3755b19b use a slightly different strategy to make the pretrained lm non-trainable 2019-03-28 00:31:36 -07:00