Commit Graph

289 Commits

Author SHA1 Message Date
Giovanni Campagna 68e76f7990 Remove unused dependencies
These are not used anywhere I can see.
2019-04-17 11:39:54 -07:00
Giovanni Campagna 13e1c0335e Load allenlp, cove libraries lazily
These libraries are only needed if one passes --elmo or --cove
on the command line. They are annoyingly big libraries, so
it makes sense to keep them optional.
2019-04-17 11:39:15 -07:00
Giovanni Campagna bb84b2b130
Merge pull request #13 from stanford-oval/wip/mmap-embeddings
Memory-mappable embeddings
2019-04-10 23:21:33 -07:00
Giovanni Campagna 8399064f15 vocab: restore "dim" property on load 2019-04-10 11:21:31 -07:00
Giovanni Campagna aed5576756 vocab: use a better hash function
The previous one was not great, and it was particularly bad for
char ngrams, where it would produce collisions almost constantly
2019-04-10 10:59:57 -07:00
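
A minimal sketch of the kind of hash that behaves well on short char ngrams, assuming an FNV-1a-style design (the commit does not say which function was actually adopted):

```python
# Illustrative only: a 64-bit FNV-1a-style string hash. Mixing every byte
# avoids the near-constant collisions a naive additive hash produces on
# short char ngrams. Explicit masking emulates fixed-width overflow.
_FNV_OFFSET = 0xcbf29ce484222325
_FNV_PRIME = 0x100000001b3
_MASK64 = (1 << 64) - 1

def hash_string(s: str, num_buckets: int) -> int:
    h = _FNV_OFFSET
    for byte in s.encode('utf-8'):
        h = ((h ^ byte) * _FNV_PRIME) & _MASK64
    return h % num_buckets

# e.g. bucketing the char trigrams of a word:
trigrams = ['<ex', 'exa', 'xam', 'amp', 'mpl', 'ple', 'le>']
buckets = [hash_string(t, num_buckets=2 ** 20) for t in trigrams]
```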
Giovanni Campagna 94bebc4435 update tests 2019-04-10 10:38:16 -07:00
Giovanni Campagna 335c792a27 mmappable embeddings: make it work
- handle integer overflow correctly in hashing
- store table, itos and vectors in separate files, because numpy
  ignores mmap_mode for npz files
- optimize the loading of the txt vectors and free memory eagerly
  because otherwise we run out of memory before saving
2019-04-10 10:31:25 -07:00
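
A minimal sketch of the split-file layout this commit describes (file names are illustrative, not the ones used in the repository). np.load only honors mmap_mode for plain .npy files, so each array gets its own file:

```python
import numpy as np

def save_vocab(prefix, table, itos, vectors):
    # Separate .npy files instead of one .npz archive: np.load ignores
    # mmap_mode for members of an .npz, so the vectors would be read
    # fully into memory anyway.
    np.save(prefix + '.table.npy', table)          # hash table of int indices
    np.save(prefix + '.itos.npy', np.array(itos))  # index -> word strings
    np.save(prefix + '.vectors.npy', vectors)      # embedding matrix

def load_vocab(prefix):
    table = np.load(prefix + '.table.npy')
    itos = np.load(prefix + '.itos.npy')
    # Only the big matrix needs mmap_mode: pages are faulted in on first
    # access and shared read-only across processes.
    vectors = np.load(prefix + '.vectors.npy', mmap_mode='r')
    return table, itos, vectors
```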
Giovanni Campagna 8112a985c8 Add "cache-embeddings" subcommand to download embeddings
It's useful to download the embeddings as a separate step
from training or deployment, for example to train on a
firewalled machine.
2019-04-09 16:54:12 -07:00
Giovanni Campagna 3f8f836d02 torchtext.Vocab: store word embeddings in mmap-friendly format on disk
torch.load/save uses pickle, which is not mmappable and causes high
memory usage: the vectors must be completely stored in memory.
This is fine during training, because the training machines are
large and have a lot of RAM, but during inference we want to reduce
memory usage to deploy more models on one machine.

Instead, if we use numpy's npz format (uncompressed), all the word
vectors can be stored on disk and loaded on demand when the page
is faulted in. Furthermore, all pages are shared between processes
(so multiple models only use one copy of the embeddings), and the
kernel can free the memory back to disk under pressure.

The annoying part is that we can only store numpy ndarrays in this
format, and not Python native dicts. So instead we need a custom
HashTable implementation that is backed by numpy ndarrays.
As a side bonus, the custom implementation keeps only one copy
of all the words in memory, so memory usage is lower.
2019-04-09 16:54:12 -07:00
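
A rough sketch of such a numpy-backed table, using open addressing with linear probing (illustrative only; the class in the repository differs, and `_hash` is the FNV-style stand-in from the note above):

```python
import numpy as np

def _hash(word, num_buckets):
    # Deterministic stand-in hash (see the FNV-1a sketch further up).
    h = 0xcbf29ce484222325
    for b in word.encode('utf-8'):
        h = ((h ^ b) * 0x100000001b3) & ((1 << 64) - 1)
    return h % num_buckets

class NumpyHashTable:
    """word -> row-index table backed by flat ndarrays.

    Only flat arrays (not Python dicts) can be stored in mmap-friendly
    .npy files, hence the open-addressing layout with linear probing.
    """

    def __init__(self, words, num_buckets=None):
        self.itos = list(words)                      # the only copy of the words
        num_buckets = num_buckets or 2 * len(self.itos) + 1
        self.table = np.full(num_buckets, -1, dtype=np.int64)
        for index, word in enumerate(self.itos):
            slot = _hash(word, num_buckets)
            while self.table[slot] != -1:            # linear probing
                slot = (slot + 1) % num_buckets
            self.table[slot] = index

    def lookup(self, word):
        num_buckets = len(self.table)
        slot = _hash(word, num_buckets)
        while self.table[slot] != -1:
            index = int(self.table[slot])
            if self.itos[index] == word:
                return index                         # row in the vectors matrix
            slot = (slot + 1) % num_buckets
        return -1                                    # out-of-vocabulary
```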
Giovanni Campagna 1021c4851c word vectors: ignore all words longer than 100 characters
There are ~100 of these in GloVe and they are all garbage (horizontal
lines, sequences of numbers and URLs). This will keep the maximum
word length in check.
2019-04-09 16:54:11 -07:00
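
In practice this is a one-line guard while parsing the GloVe .txt file (hypothetical helper, not the repository's code):

```python
MAX_WORD_LENGTH = 100

def parse_vector_line(line):
    word, *values = line.rstrip('\n').split(' ')
    if len(word) > MAX_WORD_LENGTH:
        return None  # skip the junk entries: separator lines, number runs, URLs
    return word, [float(v) for v in values]
```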
mehrad 4905ad6ce8 Fixes
Apparently the layer norm implementation can't be tampered with!
Reverting the change for now and switching to a new branch to fix this properly.
2019-04-08 17:24:02 -07:00
mehrad 03cdc2d0c1 consistent formatting 2019-04-08 16:18:30 -07:00
mehrad a7a2d752d2 Fixes
std() in layer normalization is the culprit for generating NaN.
It happens in the backward pass for values with zero variance.
Just update the mean for these batches.
2019-04-08 14:48:23 -07:00
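
The change described here was reverted in the commit above it. For reference, a common way to keep a hand-rolled layer norm NaN-free is to put the epsilon inside the square root rather than adding it to the std, since the gradient of std() is infinite at zero variance (a sketch, not the repository's code):

```python
import torch
import torch.nn as nn

class SafeLayerNorm(nn.Module):
    """Layer norm written so zero-variance rows cannot produce NaN gradients.

    (x - mean) / (x.std() + eps) is the fragile form: the backward pass of
    std() divides by the std itself, which is zero for constant rows.
    Putting eps under the square root keeps every gradient finite.
    """

    def __init__(self, features, eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(features))
        self.beta = nn.Parameter(torch.zeros(features))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        centered = x - mean
        var = centered.pow(2).mean(-1, keepdim=True)
        return self.gamma * centered / torch.sqrt(var + self.eps) + self.beta
```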
mehrad 4acdba6c22 fix for NaN loss 2019-04-05 10:26:35 -07:00
Giovanni Campagna d16277b4d3 stop if loss is less than 1e-5 for more than 100 iterations 2019-03-31 17:12:38 -07:00
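
A sketch of that stopping rule as it might sit in a training loop (train_iterator and train_step are hypothetical names, not the repository's):

```python
LOSS_THRESHOLD = 1e-5
PATIENCE = 100  # iterations

low_loss_iterations = 0
for batch in train_iterator:
    loss = train_step(batch)  # assumed to return the scalar loss
    low_loss_iterations = low_loss_iterations + 1 if loss < LOSS_THRESHOLD else 0
    if low_loss_iterations > PATIENCE:
        print(f'loss below {LOSS_THRESHOLD:g} for more than {PATIENCE} iterations, stopping')
        break
```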
Giovanni Campagna 09c6e77525
Merge pull request #12 from Stanford-Mobisocial-IoT-Lab/wip/thingtalk-lm
Pretrained decoder language model
2019-03-28 17:58:58 -07:00
mehrad 34ba4d2600 skip batches with NaN loss 2019-03-28 12:37:01 -07:00
Giovanni Campagna 3e3755b19b use a slightly different strategy to make the pretrained lm non-trainable 2019-03-28 00:31:36 -07:00
Giovanni Campagna 25cc4ee55e support pretrained embeddings smaller than the model size
add a feed-forward layer in that case
2019-03-27 23:50:14 -07:00
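
A sketch of that bridging layer (names are illustrative, not the ones in the repository):

```python
import torch.nn as nn

class ProjectedEmbedding(nn.Module):
    """Frozen pretrained embedding bridged up to the model dimension.

    The feed-forward layer is only needed when the pretrained vectors are
    smaller than d_model; otherwise it degenerates to the identity.
    """

    def __init__(self, pretrained_vectors, d_model):
        super().__init__()
        d_emb = pretrained_vectors.size(1)
        self.embedding = nn.Embedding.from_pretrained(pretrained_vectors, freeze=True)
        self.projection = nn.Linear(d_emb, d_model) if d_emb != d_model else nn.Identity()

    def forward(self, token_ids):
        return self.projection(self.embedding(token_ids))
```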
Giovanni Campagna 182d2698da fix prediction
*_elmo was renamed to *_tokens
2019-03-27 14:07:30 -07:00
Giovanni Campagna 82d15a4ae3 load pretrained_decoder_lm from config.json 2019-03-27 14:06:44 -07:00
Giovanni Campagna 6a97970b13 fix typo 2019-03-27 12:46:20 -07:00
Giovanni Campagna fbe17b565e make it work
Fix time/batch confusion
2019-03-27 12:18:47 -07:00
Giovanni Campagna 9814d6bf4f Implement using a pretrained language model for the decoder embedding
Let's see if it makes a difference
2019-03-27 11:40:59 -07:00
Giovanni Campagna cea6092f90 Fix evaluating
- fix loading old config.json files that are missing some parameters
- fix expanding the trained embedding
- add a default context for "almond_with_thingpedia_as_context"
  (to include thingpedia)
- fix handling empty sentences
2019-03-23 17:28:22 -07:00
Giovanni Campagna d22e13f6c5
Merge pull request #9 from Stanford-Mobisocial-IoT-Lab/wip/thingpedia_as_context
Wip/thingpedia as context
2019-03-23 16:59:42 -07:00
mehrad d6198efc77 fix small bug 2019-03-21 21:15:29 -07:00
mehrad 487bdb8317 suppress logging epoch number 2019-03-21 21:12:44 -07:00
mehrad a85923264b Bug fixes 2019-03-21 16:01:14 -07:00
mehrad 91e6f5ded8 merge master + updates 2019-03-21 14:38:34 -07:00
Mehrad Moradshahi 48bd1d67ef
Merge pull request #8 from Stanford-Mobisocial-IoT-Lab/wip/curriculum
Wip/curriculum
2019-03-21 12:24:08 -07:00
mehrad 7555ec6b82 master updates + additional tweaks 2019-03-21 11:20:48 -07:00
Giovanni Campagna e41c9d89c3
Merge pull request #10 from Stanford-Mobisocial-IoT-Lab/wip/grammar
Grammar support
2019-03-20 17:33:03 -07:00
Giovanni Campagna 799d8c4993 fix syntax 2019-03-19 20:40:01 -07:00
Giovanni Campagna d18eca650b add new argument to load_json 2019-03-19 20:38:24 -07:00
Giovanni Campagna a3cf02cbe7 Add a way to disable glove embeddings on the decoder side
With grammar, they just add noise and overfit badly
2019-03-19 20:36:20 -07:00
Giovanni Campagna 7f1a8b2578 fix 2019-03-19 18:34:02 -07:00
Giovanni Campagna 63c96cd76a Fix plain thingtalk grammar
I copied the wrong version of genieparser...
2019-03-19 18:32:23 -07:00
Giovanni Campagna d67ef67fb8 Fix 2019-03-19 17:49:42 -07:00
Giovanni Campagna 2769cc96e3 Add the option to train a portion of decoder embeddings
This will be needed because GloVe/char embeddings are meaningless
for tokens that encode grammar productions (which are of the form
"R<id>")
2019-03-19 17:31:53 -07:00
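
One way to wire this up, sketched under the assumption that the grammar-production tokens occupy the tail of the decoder vocabulary (the repository's actual mechanism may differ):

```python
import torch
import torch.nn as nn

class PartiallyTrainableEmbedding(nn.Module):
    """Frozen pretrained vectors for ordinary words, freshly trained vectors
    for the grammar-production tokens ("R<id>") appended after them."""

    def __init__(self, pretrained_vectors, num_grammar_tokens):
        super().__init__()
        self.num_pretrained = pretrained_vectors.size(0)
        self.frozen = nn.Embedding.from_pretrained(pretrained_vectors, freeze=True)
        self.trainable = nn.Embedding(num_grammar_tokens, pretrained_vectors.size(1))

    def forward(self, token_ids):
        is_grammar = (token_ids >= self.num_pretrained).unsqueeze(-1)
        frozen_out = self.frozen(token_ids.clamp(max=self.num_pretrained - 1))
        grammar_out = self.trainable((token_ids - self.num_pretrained).clamp(min=0))
        # Gradients only ever reach the trainable table, and only at the
        # positions where a grammar token actually occurs.
        return torch.where(is_grammar, grammar_out, frozen_out)
```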
Giovanni Campagna 112bb0bbbf Fix 2019-03-19 17:23:36 -07:00
Giovanni Campagna c4ba6d7bcd Add a progbar when loading the almond dataset
Because it takes a while
2019-03-19 14:53:11 -07:00
Giovanni Campagna 7325ca1cc7 Add option to use grammar in Almond task 2019-03-19 14:38:18 -07:00
Giovanni Campagna 17f4381ea3 Import the grammar code from genie-parser
Now purged of unnecessary messing with numpy, and of unnecessary
tensorflow
2019-03-19 12:06:22 -07:00
Giovanni Campagna f40f168f17 Reshuffle code around
Move task specific stuff into tasks/
2019-03-19 11:22:54 -07:00
Giovanni Campagna 02e4d6ddac Prepare for supporting grammar
Use a consistent preprocessing function, provided by the task class,
between server and train/predict, and load the tasks once.
2019-03-19 11:14:32 -07:00
Giovanni Campagna 14caf01e49 server: update to use task classes 2019-03-19 10:58:34 -07:00
Giovanni Campagna 83d113dc48 Fix 2019-03-19 10:50:53 -07:00
Giovanni Campagna 42331a3c08 Fix JSON serialization of arguments 2019-03-19 10:07:00 -07:00
Giovanni Campagna 6f777425ea Remove --reverse_task argument
If you want to train on the reverse Almond task, use "reverse_almond"
as a task name, as you should.
2019-03-19 10:03:28 -07:00