Commit Graph

769 Commits

Author SHA1 Message Date
Hendrik Schröter 36f0b5bbd0 Use getter instead of python property for the dataloaders (#275)
* Use getter instead of python property for the dataloaders

* Fix lint

* Update trainer.py
2019-10-04 15:35:02 -04:00
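A minimal sketch of the pattern this commit describes, using a plain class rather than the real LightningModule; the class and data here are illustrative only:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


class MyModel:
    # Before: the dataloader was exposed as a Python @property.
    # After: it is a plain getter the trainer calls explicitly, e.g. model.train_dataloader().
    def train_dataloader(self):
        dataset = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))
        return DataLoader(dataset, batch_size=8)
```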
William Falcon 32e74b8f36
Ddp2 (#261)
* adds ddp2 option where on each node a single process uses all gpus

* added ddp2 test

* added ddp2 docs

* Update Distributed training.md

* delete ref to old update_training_log_metrics

* delete ref to old update_training_log_metrics

* debug (same message repeated ×28)

* banana pancakes (same message repeated ×48)

* debug (same message repeated ×8)

* cheesecake
2019-10-04 15:07:54 -04:00
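A hedged usage sketch of the ddp2 mode described above (DDP across nodes, with a single process per node driving all of that node's GPUs); the argument names follow the early Lightning Trainer API and are assumptions for this exact version:

```python
from pytorch_lightning import Trainer

# ddp2: one process per node uses all of that node's GPUs (DP inside the node,
# DDP across nodes). Argument names are assumptions for this era of the API.
trainer = Trainer(
    gpus=8,                      # GPUs per node
    nb_gpu_nodes=4,              # number of nodes
    distributed_backend='ddp2',  # the option added in this PR
)
```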
Hendrik Schröter 42764d18c7 Better error message if no loss was returned from model.training_step() (#294) 2019-10-04 07:15:19 -04:00
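The kind of guard that commit implies, sketched here as illustrative logic rather than the actual Lightning code:

```python
def check_training_step_output(output):
    """Illustrative guard: fail with a descriptive error if training_step()
    returned nothing usable as a loss."""
    if output is None or (isinstance(output, dict) and 'loss' not in output):
        raise RuntimeError(
            'No loss was returned from model.training_step(). '
            'Return a scalar loss tensor or a dict containing a "loss" key.'
        )
```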
kvhooreb 41236c7bbb WIP: Moved grad_norm tracking code to __run_tng_batch (#278)
* Moved grad_norm tracking code to __run_tng_batch + added norms to tqdm_metrics

* Update trainer.py (same message repeated ×4)
2019-10-02 11:11:08 -04:00
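A rough sketch of the grad-norm tracking idea in that PR: collect per-parameter gradient norms after a training batch so they can be merged into the tqdm metrics. The helper name and key format are illustrative:

```python
import torch


def grad_norm_dict(model: torch.nn.Module, norm_type: float = 2.0):
    """Collect per-parameter gradient norms plus a total norm, formatted as a
    metrics dict that could be merged into the progress-bar (tqdm) metrics."""
    norms, total = {}, 0.0
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        norm = p.grad.data.norm(norm_type).item()
        norms[f'grad_{norm_type}_norm_{name}'] = round(norm, 3)
        total += norm ** norm_type
    norms[f'grad_{norm_type}_norm_total'] = round(total ** (1.0 / norm_type), 3)
    return norms
```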
Nic Eggert 614cb3c03b Initialize loggers only once (#270)
* Create underlying loggers lazily

This avoids creating duplicate experiments or run in multi-node DDP.

* Save hyperparameters automatically

* Update docs for snapshotting hyperparams

* Fix test tube

* Fix test tube pickling
2019-10-02 11:10:40 -04:00
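A sketch of the lazy-initialization pattern that commit describes; the class and method names are placeholders, not the actual Lightning logger API:

```python
class LazyLogger:
    """Create the underlying experiment object only on first use, so spawning
    extra DDP processes does not create duplicate experiments or runs."""

    def __init__(self, save_dir, name):
        self.save_dir = save_dir
        self.name = name
        self._experiment = None  # nothing created yet

    @property
    def experiment(self):
        if self._experiment is None:
            self._experiment = self._create_experiment()
        return self._experiment

    def _create_experiment(self):
        # Placeholder for creating a TestTube/MLflow/etc. experiment object.
        return object()
```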
Nic Eggert 480eed5cb6 Enable any ML experiment tracking framework (#223)
* Implement generic loggers for experiment tracking

* Add tests for loggers

* Get model tests passing

* Test and fix logger pickling

* Expand pickle test and fix bug

* Missed exp -> logger conversion

* Remove commented code

* Add docstrings

* Update logging docs

* Add mlflow to test requirements

* Make linter happy

* Fix mlflow timestamp

* Update Logging.md

* Update test_models.py (same message repeated ×3)

* Update properties.md

* Fix tests

* Line length
2019-09-27 12:05:29 -04:00
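A minimal sketch of the generic logger interface that PR introduces: any experiment-tracking framework plugs in by implementing a couple of hooks. The base-class and method names here are assumptions and may not match the real API exactly:

```python
class LoggerBase:
    """Hypothetical minimal logger interface."""

    def log_hyperparams(self, params):
        raise NotImplementedError

    def log_metrics(self, metrics, step_num=None):
        raise NotImplementedError


class PrintLogger(LoggerBase):
    """Toy implementation that just prints, standing in for MLflow, TestTube, etc."""

    def log_hyperparams(self, params):
        print('hparams:', params)

    def log_metrics(self, metrics, step_num=None):
        print(f'step {step_num}:', metrics)
```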
William Falcon 481aa24974
always calls the lr scheduler with epoch nb. Fixes #98 (#252)
* always calls the lr scheduler with epoch nb

* added docs for cluster grid search

* added docs for cluster grid search

* undo test changes

* undo test changes
2019-09-26 16:36:41 -04:00
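A sketch of what "always call the LR scheduler with the epoch number" looks like at the call site; passing the explicit epoch argument matches the PyTorch scheduler API of that era (it is deprecated in newer releases):

```python
import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10)

for epoch in range(3):
    # ... run one training epoch, calling optimizer.step() per batch ...
    scheduler.step(epoch=epoch)  # pass the epoch index explicitly
```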
William Falcon 25d2f93256
enables samplers which don't need set_epoch (or when people don't need a sampler) (#254)
* enables samplers which don't need set_epoch

* added docs for single gpu ddp

* added docs for single gpu ddp

* added docs for cluster grid search (same message repeated ×20)
2019-09-26 14:39:04 -04:00
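The guard that PR title implies, sketched as a standalone helper: only call set_epoch() when the sampler actually supports it.

```python
def maybe_set_epoch(dataloader, epoch):
    """Call sampler.set_epoch(epoch) only if a sampler exists and supports it,
    so plain samplers (or no custom sampler at all) keep working."""
    sampler = getattr(dataloader, 'sampler', None)
    if sampler is not None and hasattr(sampler, 'set_epoch'):
        sampler.set_epoch(epoch)
```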
William Falcon 8b2a2aeda3
Dim 0 warning (#256)
* added ignore warnings module

* added ignore warnings module

* Fixes #249

* Update ignored_warnings.py
2019-09-26 13:20:54 -04:00
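A sketch of what an "ignored warnings" helper for the dim-0 gather warning could look like; the exact message pattern filtered here is an assumption:

```python
import warnings


def ignore_scalar_return_in_dp():
    """Silence the noisy dim-0 gather UserWarning raised when DataParallel
    gathers scalar outputs (message pattern assumed for illustration)."""
    warnings.filterwarnings(
        'ignore',
        message='.*Was asked to gather along dimension 0.*',
        category=UserWarning,
    )
```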
Alok Singh b0a0a47a0b Rename variables (#124)
-   data_batch → batch
-   batch_i → batch_idx
-   dataloader_i → dataloader_idx
-   tng → training
-   training_dataloader → train_dataloader
-   add_log_row_interval → row_log_interval
-   gradient_clip → gradient_clip_val
-   prog → progress
-   tqdm_dic → tqdm_dict
2019-09-25 19:05:06 -04:00
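A before/after sketch of how the renamed hook arguments read in user code; the model body is a placeholder:

```python
import torch


class MyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(10, 1)

    # Before the rename: def training_step(self, data_batch, batch_i): ...
    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self.layer(x), y)
        return {'loss': loss}
```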
William Falcon 87708157bc
Update trainer.py (#233) 2019-09-19 08:23:48 -04:00
Ananya Harsh Jha c0f3b6b035 added set_epoch for distributed sampler, fix for #224 (#225) 2019-09-16 10:21:00 -04:00
William Falcon 9576dd28b2
added load on CPU first (#221)
* added load on CPU first (same message repeated ×27)

* added print logs

* added print logs

* changed close order

* changed close order
2019-09-11 07:52:36 -04:00
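The core of "load on CPU first", sketched as a standalone helper: map all storages to CPU when loading, so a checkpoint saved on GPU can be restored on a machine or process without that device.

```python
import torch


def load_checkpoint_on_cpu(path):
    """Load a checkpoint onto CPU regardless of the device it was saved from;
    map_location='cpu' would work equally well here."""
    return torch.load(path, map_location=lambda storage, loc: storage)
```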
William Falcon 30b25c8146
Sai prasanna master (#219)
* Fix incorrect warning for DistributedSampler.

Check whether `dataloader.sampler` is an instance of DistributedSampler instead of checking the `dataloader`.

* Update trainer.py

* merged
2019-09-09 11:36:24 -04:00
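The corrected check from that PR, sketched in isolation: inspect dataloader.sampler rather than the dataloader object itself.

```python
from torch.utils.data.distributed import DistributedSampler


def missing_distributed_sampler(dataloader):
    """Return True when the dataloader is not using a DistributedSampler,
    i.e. when the warning should fire."""
    return not isinstance(getattr(dataloader, 'sampler', None), DistributedSampler)
```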
William Falcon 506d5da68b
enable single gpu per node (#218)
* enable single gpu per node (same message repeated ×6)
2019-09-09 07:37:20 -04:00
William Falcon 10d190e045
Simplified gpu api. No NVIDIA flag managing by lightning for cluster (#213)
* added nvidia flag set (same message repeated ×8)

* added simple cluster template

* sets correct backend for possible combinations of gpu inputs (same message repeated ×16)
2019-09-08 15:36:58 -04:00
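An illustrative sketch of "sets correct backend for possible combinations of gpu inputs"; the rules below are assumptions for illustration, not the exact Lightning logic:

```python
def pick_backend(gpus, nb_gpu_nodes=1, requested=None):
    """Guess a distributed backend from the GPU configuration (illustrative)."""
    if requested is not None:
        return requested       # user explicitly chose dp/ddp/ddp2
    if not gpus:
        return None            # CPU training
    if nb_gpu_nodes > 1:
        return 'ddp'           # multi-node training needs DDP
    if (isinstance(gpus, int) and gpus > 1) or (isinstance(gpus, (list, tuple)) and len(gpus) > 1):
        return 'dp'            # single node, multiple GPUs
    return None                # single GPU, no special backend needed
```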
Alok Singh 81df2259ef Make print_nan_grads print grad (#208)
This seems more useful for debugging.
2019-09-07 01:08:09 -04:00
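A sketch of the behavior change: print the gradient tensor itself (not just a flag) for any parameter whose gradient contains NaNs.

```python
import torch


def print_nan_gradients(model: torch.nn.Module):
    """Print the parameter name and its gradient whenever the gradient has NaNs."""
    for name, param in model.named_parameters():
        if param.grad is not None and torch.isnan(param.grad).any():
            print(name, param.grad)
```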
William Falcon 0c7fbc7178
Weights path (#211)
* added docs. removed options. added weights_save option

* removed old restore

* cleaned up save path

* cleaned up save path

* flake8
2019-09-06 17:01:03 -04:00
William Falcon 7099f8dbfb
split trainer mixins (#209)
* split trainer mixins

* Update multi_node_cluster_template.py

* Update single_cpu_template.py

* Update single_gpu_node_16bit_template.py

* Update single_gpu_node_ddp_template.py

* Update single_gpu_node_dp_template.py

* Update trainer_cpu_template.py

* Update trainer_io.py

* split trainer mixins

* Update multi_node_cluster_template.py

* deconflicted (same message repeated ×3)
2019-09-06 14:11:07 -04:00
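A sketch of the mixin split that PR describes: the large Trainer is broken into focused mixins and the public class composes them. The mixin names below are placeholders, not the actual module layout.

```python
class TrainerIOMixin:
    """Checkpoint save/restore logic (placeholder)."""
    def save_checkpoint(self, path):
        ...


class TrainerTrainLoopMixin:
    """The training-loop logic (placeholder)."""
    def run_training_loop(self):
        ...


class Trainer(TrainerIOMixin, TrainerTrainLoopMixin):
    """The public Trainer simply composes the focused mixins."""
    def fit(self, model):
        self.run_training_loop()
```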