Giovanni Campagna
68e76f7990
Remove unused dependencies
...
These are not used anywhere, as far as I can see.
2019-04-17 11:39:54 -07:00
Giovanni Campagna
13e1c0335e
Load allenlp, cove libraries lazily
...
These libraries are only needed if one passes --elmo or --cove
on the command line. They are annoyingly big, so it makes sense
to keep them optional.
2019-04-17 11:39:15 -07:00
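A minimal sketch of the lazy-import pattern described in the commit above; the function name is made up, and the exact module paths (allennlp.modules.elmo, cove) are assumptions rather than the project's actual code.

```python
def load_optional_embedders(use_elmo, use_cove):
    """Import the heavy contextual-embedding libraries only when their flags are set."""
    embedders = {}
    if use_elmo:
        # allennlp is only imported when --elmo was passed on the command line.
        from allennlp.modules.elmo import Elmo
        embedders['elmo'] = Elmo
    if use_cove:
        # cove is only imported when --cove was passed on the command line.
        from cove import MTLSTM
        embedders['cove'] = MTLSTM
    return embedders
```

When neither flag is set, neither library is imported at all, so startup stays fast and the dependencies remain optional.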
Giovanni Campagna
bb84b2b130
Merge pull request #13 from stanford-oval/wip/mmap-embeddings
...
Memory-mappable embeddings
2019-04-10 23:21:33 -07:00
Giovanni Campagna
8399064f15
vocab: restore "dim" property on load
2019-04-10 11:21:31 -07:00
Giovanni Campagna
aed5576756
vocab: use a better hash function
...
The previous hash was not great, and it was particularly bad for
char ngrams, where it produced collisions almost constantly.
2019-04-10 10:59:57 -07:00
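The commit does not say which hash it switched to; as an illustration of a string hash that mixes every byte and therefore behaves well on short character ngrams, here is a 32-bit FNV-1a sketch (an assumption, not necessarily the function the commit adopted).

```python
def fnv1a_32(s: str) -> int:
    """32-bit FNV-1a hash; every byte is mixed in, so short char ngrams rarely collide."""
    h = 0x811c9dc5                            # FNV offset basis
    for byte in s.encode('utf-8'):
        h ^= byte
        h = (h * 0x01000193) & 0xffffffff     # FNV prime, kept in 32 bits
    return h


print(fnv1a_32('ab'), fnv1a_32('ba'))          # the two orderings hash to different values
```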
Giovanni Campagna
94bebc4435
update tests
2019-04-10 10:38:16 -07:00
Giovanni Campagna
335c792a27
mmappable embeddings: make it work
...
- handle integer overflow correctly in hashing
- store table, itos and vectors in separate files, because numpy
ignores mmap_mode for npz files
- optimize the loading of the txt vectors and free memory eagerly
because otherwise we run out of memory before saving
2019-04-10 10:31:25 -07:00
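A small demonstration of the numpy behaviour the second bullet works around: mmap_mode is ignored for .npz archives, so each array has to live in its own .npy file to be memory-mapped (the file names below are illustrative).

```python
import numpy as np

vectors = np.random.rand(1000, 300).astype(np.float32)

# In an .npz archive the arrays are read fully into memory on access;
# np.load's mmap_mode has no effect here (the behaviour this commit works around).
np.savez('embeddings.npz', vectors=vectors)

# A plain .npy file per array can be truly memory-mapped.
np.save('vectors.npy', vectors)
mapped = np.load('vectors.npy', mmap_mode='r')  # pages are faulted in on demand
print(type(mapped))                             # <class 'numpy.memmap'>
```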
Giovanni Campagna
8112a985c8
Add "cache-embeddings" subcommand to download embeddings
...
It's useful to download the embeddings as a separate step
from training or deployment, for example to train on a
firewalled machine.
2019-04-09 16:54:12 -07:00
Giovanni Campagna
3f8f836d02
torchtext.Vocab: store word embeddings in mmap-friendly format on disk
...
torch.load/save uses pickle, which is not mmappable and causes high
memory usage: the vectors must be completely stored in memory.
This is fine during training, because the training machines are
large and have a lot of RAM, but during inference we want to reduce
memory usage to deploy more models on one machine.
Instead, if we use numpy's npz format (uncompressed), all the word
vectors can be stored on disk and loaded on demand when the page
is faulted in. Furthermore, all pages are shared between processes
(so multiple models only use one copy of the embeddings), and the
kernel can drop the pages under memory pressure and re-read them
from disk later.
The annoying part is that we can only store numpy ndarrays in this
format, and not Python native dicts. So instead we need a custom
HashTable implementation that is backed by numpy ndarrays.
As a side bonus, the custom implementation keeps only one copy
of all the words in memory, so memory usage is lower.
2019-04-09 16:54:12 -07:00
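A rough sketch of the load path described above, with illustrative file names: itos and the vectors are stored as separate memory-mappable .npy arrays. For brevity, the word-to-index map below is a plain dict; the commit replaces that with a hash table backed by numpy ndarrays so it, too, can be stored on disk.

```python
import numpy as np

# Save side: write the word list and the vectors as separate .npy files.
itos = np.array(['the', 'cat', 'sat'])
vectors = np.random.rand(3, 300).astype(np.float32)
np.save('itos.npy', itos)
np.save('vectors.npy', vectors)

# Load side (inference): rows are paged in on demand, and the pages are
# shared between processes that map the same file.
itos = np.load('itos.npy')
vectors = np.load('vectors.npy', mmap_mode='r')
stoi = {word: i for i, word in enumerate(itos)}   # stand-in for the numpy-backed HashTable

def lookup(word):
    idx = stoi.get(word)
    if idx is None:
        return np.zeros(vectors.shape[1], dtype=vectors.dtype)
    return vectors[idx]

print(lookup('cat').shape)   # (300,)
```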
Giovanni Campagna
1021c4851c
word vectors: ignore all words longer than 100 characters
...
There are ~100 of these in GloVe and they are all garbage (horizontal
lines, sequences of numbers, and URLs). This keeps the maximum
word length in check.
2019-04-09 16:54:11 -07:00
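A tiny illustration of the filter, assuming the standard GloVe text format of a word followed by space-separated floats; the function name is made up.

```python
MAX_WORD_LENGTH = 100

def parse_vector_line(line):
    """Return (word, vector) for a GloVe-style line, or None for garbage entries."""
    word, *values = line.rstrip('\n').split(' ')
    if len(word) > MAX_WORD_LENGTH:
        return None                      # horizontal lines, digit runs, URLs, ...
    return word, [float(v) for v in values]


print(parse_vector_line('cat 0.1 0.2 0.3'))
print(parse_vector_line('-' * 500 + ' 0.0'))   # None
```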
mehrad
4905ad6ce8
Fixes
...
Apparently the layer norm implementation can't be tampered with!
Reverting the change for now and switching to a new branch to fix this properly.
2019-04-08 17:24:02 -07:00
mehrad
03cdc2d0c1
consistent formatting
2019-04-08 16:18:30 -07:00
mehrad
a7a2d752d2
Fixes
...
std() in layer normalization is the culprit for generating NaN.
It happens in the backward pass for values with zero variance.
Just update the mean for these batches.
2019-04-08 14:48:23 -07:00
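A short PyTorch reproduction of the failure described above (not the project's layer-norm code): the backward pass of std() divides by the standard deviation, so a zero-variance slice yields NaN gradients.

```python
import torch

x = torch.ones(4, requires_grad=True)      # a slice with zero variance
x.std().backward()
print(x.grad)                               # tensor([nan, nan, nan, nan])

# A common workaround is sqrt(var + eps) rather than adding eps after std(),
# so the backward pass never divides by an exact zero.
y = torch.ones(4, requires_grad=True)
(y.var() + 1e-6).sqrt().backward()
print(y.grad)                               # finite gradients (all zeros here)
```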
mehrad
4acdba6c22
fix for NaN loss
2019-04-05 10:26:35 -07:00
Giovanni Campagna
d16277b4d3
stop if loss is less than 1e-5 for more than 100 iterations
2019-03-31 17:12:38 -07:00
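A hedged sketch of a stopping rule like the one in the commit above; the loss curve, the loop, and the counter-reset behaviour are all illustrative.

```python
LOSS_THRESHOLD = 1e-5
PATIENCE = 100

# Stand-in for a real training loop: losses shrink toward zero.
training_losses = [1.0 / (step + 1) ** 2 for step in range(100_000)]

below_threshold = 0
for iteration, loss in enumerate(training_losses):
    below_threshold = below_threshold + 1 if loss < LOSS_THRESHOLD else 0
    if below_threshold > PATIENCE:
        print(f'loss < {LOSS_THRESHOLD} for more than {PATIENCE} iterations; stopping at step {iteration}')
        break
```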
Giovanni Campagna
09c6e77525
Merge pull request #12 from Stanford-Mobisocial-IoT-Lab/wip/thingtalk-lm
...
Pretrained decoder language model
2019-03-28 17:58:58 -07:00
mehrad
34ba4d2600
skip batches with NaN loss
2019-03-28 12:37:01 -07:00
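A minimal sketch of the batch-skipping idea, assuming a standard PyTorch training step; the function and its arguments are hypothetical.

```python
import torch

def training_step(loss, optimizer):
    """Skip the optimizer step entirely when the loss came out as NaN."""
    if torch.isnan(loss):
        optimizer.zero_grad()
        return False                 # drop this batch
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return True
```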
Giovanni Campagna
3e3755b19b
use a slightly different strategy to make the pretrained lm non-trainable
2019-03-28 00:31:36 -07:00
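The commit doesn't spell out the new strategy; freezing the module's parameters with requires_grad = False is one common way to make a pretrained LM non-trainable, sketched here as an assumption.

```python
import torch.nn as nn

pretrained_lm = nn.LSTM(input_size=300, hidden_size=300, batch_first=True)

# Freeze the pretrained module so the optimizer never updates it.
for param in pretrained_lm.parameters():
    param.requires_grad = False

# The optimizer is then built only from parameters that still require gradients.
trainable = [p for p in pretrained_lm.parameters() if p.requires_grad]
print(len(trainable))   # 0
```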
Giovanni Campagna
25cc4ee55e
support pretrained embeddings smaller than the model size
...
add a feed-forward layer in that case
2019-03-27 23:50:14 -07:00
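A sketch of the idea above with made-up dimensions: when the pretrained embedding is narrower than the model, a feed-forward layer projects it up to the model size.

```python
import torch
import torch.nn as nn

embedding_dim, model_dim = 100, 400            # pretrained vectors narrower than the model

embed = nn.Embedding(num_embeddings=10_000, embedding_dim=embedding_dim)
project = nn.Linear(embedding_dim, model_dim)  # only needed when embedding_dim != model_dim

tokens = torch.randint(0, 10_000, (2, 7))      # (batch, time)
hidden = project(embed(tokens))
print(hidden.shape)                            # torch.Size([2, 7, 400])
```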
Giovanni Campagna
182d2698da
fix prediction
...
*_elmo was renamed to *_tokens
2019-03-27 14:07:30 -07:00
Giovanni Campagna
82d15a4ae3
load pretrained_decoder_lm from config.json
2019-03-27 14:06:44 -07:00
Giovanni Campagna
6a97970b13
fix typo
2019-03-27 12:46:20 -07:00
Giovanni Campagna
fbe17b565e
make it work
...
Fix time/batch confusion
2019-03-27 12:18:47 -07:00
Giovanni Campagna
9814d6bf4f
Implement using a pretrained language model for the decoder embedding
...
Let's see if it makes a difference
2019-03-27 11:40:59 -07:00
Giovanni Campagna
cea6092f90
Fix evaluating
...
- fix loading old config.json files that are missing some parameters
- fix expanding the trained embedding
- add a default context for "almond_with_thingpedia_as_context"
(to include thingpedia)
- fix handling empty sentences
2019-03-23 17:28:22 -07:00
Giovanni Campagna
d22e13f6c5
Merge pull request #9 from Stanford-Mobisocial-IoT-Lab/wip/thingpedia_as_context
...
Wip/thingpedia as context
2019-03-23 16:59:42 -07:00
mehrad
d6198efc77
fix small bug
2019-03-21 21:15:29 -07:00
mehrad
487bdb8317
suppress logging epoch number
2019-03-21 21:12:44 -07:00
mehrad
a85923264b
Bug fixes
2019-03-21 16:01:14 -07:00
mehrad
91e6f5ded8
merge master + updates
2019-03-21 14:38:34 -07:00
Mehrad Moradshahi
48bd1d67ef
Merge pull request #8 from Stanford-Mobisocial-IoT-Lab/wip/curriculum
...
Wip/curriculum
2019-03-21 12:24:08 -07:00
mehrad
7555ec6b82
master updates + additional tweaks
2019-03-21 11:20:48 -07:00
Giovanni Campagna
e41c9d89c3
Merge pull request #10 from Stanford-Mobisocial-IoT-Lab/wip/grammar
...
Grammar support
2019-03-20 17:33:03 -07:00
Giovanni Campagna
799d8c4993
fix syntax
2019-03-19 20:40:01 -07:00
Giovanni Campagna
d18eca650b
add new argument to load_json
2019-03-19 20:38:24 -07:00
Giovanni Campagna
a3cf02cbe7
Add a way to disable glove embeddings on the decoder side
...
With grammar, they just add noise and overfit badly
2019-03-19 20:36:20 -07:00
Giovanni Campagna
7f1a8b2578
fix
2019-03-19 18:34:02 -07:00
Giovanni Campagna
63c96cd76a
Fix plain thingtalk grammar
...
I copied the wrong version of genieparser...
2019-03-19 18:32:23 -07:00
Giovanni Campagna
d67ef67fb8
Fix
2019-03-19 17:49:42 -07:00
Giovanni Campagna
2769cc96e3
Add the option to train a portion of decoder embeddings
...
This will be needed because GloVe/char embeddings are meaningless
for tokens that encode grammar productions (which are of the form
"R<id>")
2019-03-19 17:31:53 -07:00
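A rough sketch of one way to train only a portion of the decoder embeddings, assuming the grammar-production tokens ("R<id>") sit at the end of the vocabulary; the class below is illustrative, not the commit's implementation.

```python
import torch
import torch.nn as nn


class PartiallyTrainableEmbedding(nn.Module):
    """Frozen pretrained rows for ordinary words, trainable rows for grammar tokens."""

    def __init__(self, pretrained, num_grammar_tokens, dim):
        super().__init__()
        # Ordinary words keep their frozen pretrained (e.g. GloVe) vectors.
        self.frozen = nn.Embedding.from_pretrained(pretrained, freeze=True)
        # Grammar productions like "R<id>" get freshly trained vectors.
        self.trainable = nn.Embedding(num_grammar_tokens, dim)
        self.num_pretrained = pretrained.size(0)

    def forward(self, tokens):
        is_grammar = tokens >= self.num_pretrained
        out = self.frozen(tokens.clamp(max=self.num_pretrained - 1))
        grammar_ids = (tokens - self.num_pretrained).clamp(min=0)
        return torch.where(is_grammar.unsqueeze(-1), self.trainable(grammar_ids), out)


embed = PartiallyTrainableEmbedding(torch.randn(100, 16), num_grammar_tokens=10, dim=16)
tokens = torch.tensor([[3, 99, 105]])       # the last id refers to a grammar token
print(embed(tokens).shape)                  # torch.Size([1, 3, 16])
```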
Giovanni Campagna
112bb0bbbf
Fix
2019-03-19 17:23:36 -07:00
Giovanni Campagna
c4ba6d7bcd
Add a progbar when loading the almond dataset
...
Because it takes a while
2019-03-19 14:53:11 -07:00
Giovanni Campagna
7325ca1cc7
Add option to use grammar in Almond task
2019-03-19 14:38:18 -07:00
Giovanni Campagna
17f4381ea3
Import the grammar code from genie-parser
...
Now purged of the unnecessary messing with numpy, and of the
unnecessary TensorFlow dependency.
2019-03-19 12:06:22 -07:00
Giovanni Campagna
f40f168f17
Reshuffle code around
...
Move task-specific stuff into tasks/
2019-03-19 11:22:54 -07:00
Giovanni Campagna
02e4d6ddac
Prepare for supporting grammar
...
Use a consistent preprocessing function, provided by the task class,
between server and train/predict, and load the tasks once.
2019-03-19 11:14:32 -07:00
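An illustrative sketch of the arrangement described above, with hypothetical class names: each task class owns its preprocessing, so the server and train/predict share exactly the same function, and the task objects are built only once.

```python
class Task:
    """Base task: one preprocessing function shared by the server and by train/predict."""

    def preprocess(self, sentence):
        return sentence.lower().split()


class ReverseAlmond(Task):
    def preprocess(self, sentence):
        # The reverse task simply flips the token order of the base preprocessing.
        return list(reversed(super().preprocess(sentence)))


# Tasks are loaded once and reused everywhere instead of being re-created per request.
TASKS = {'almond': Task(), 'reverse_almond': ReverseAlmond()}
print(TASKS['reverse_almond'].preprocess('Get my tweets'))   # ['tweets', 'my', 'get']
```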
Giovanni Campagna
14caf01e49
server: update to use task classes
2019-03-19 10:58:34 -07:00
Giovanni Campagna
83d113dc48
Fix
2019-03-19 10:50:53 -07:00
Giovanni Campagna
42331a3c08
Fix JSON serialization of arguments
2019-03-19 10:07:00 -07:00
Giovanni Campagna
6f777425ea
Remove --reverse_task argument
...
If you want to train on the reverse Almond task, use "reverse_almond"
as a task name, as you should.
2019-03-19 10:03:28 -07:00