769 Commits

Author SHA1 Message Date
Vadim Bereznyuk 9f8ab7c29e Fixed total number of batches (#439)
* Fixed total number of batches

* Fixed flake8 warning

* Update train_loop_mixin.py
2019-10-30 12:13:40 -04:00
William Falcon 8347a6c87e
mem clear (#440)
* mem clear
2019-10-30 12:11:21 -04:00
William Falcon b86d223889
makes checkpoint process safe (#431) 2019-10-25 08:57:05 -04:00
William Falcon d5ca464cc6
Back hook (#424)
* Fixes #356
2019-10-24 07:56:56 -04:00
William Falcon a4b43ce095
Loaders (#422)
* refactor dataloading
2019-10-24 06:43:35 -04:00
William Falcon 5db90e32eb
hpc restore takes priority over non hpc weights (#419)
* hpc restore takes priority over non hpc weights
2019-10-23 20:18:26 -04:00
William Falcon c6244594a6
clear memory cache before train starts (#418)
* clear memory cache before train starts
2019-10-23 11:41:00 -04:00
David Kossnick 56fa2075a5 Move `global_step` incrementing (#412)
* Move global_step incrementing to the end of a batch loop, per https://github.com/williamFalcon/pytorch-lightning/issues/411

* Move met_batch_limit condition to the end

* cleanup whitespace

* Update train_loop_mixin.py
2019-10-23 06:11:18 -04:00
Vismantas 2aba70e228 parse_gpu_ids fix (#382)
* Unit tests for num_gpu property as proxy for __parse_gpu_ids.

* Refactoring __parse_gpu_ids

* Moved the function outside the class as it is
a utility function and did not depend on the class in any way.
* Added unit tests for it.

* Mocked torch.cuda.device_count function in tests.

This allows the tests to be run on machines that do not have gpus.

* Fixed the parse_gpu_ids function to handle -1 case.

Function now handles -1 the same way as it does for '-1'.

* Unit tests for root_gpu added.

Added backend as a parameter because, depending on whether the backend
is set, the code currently fails with an exception in certain
circumstances before giving a wrong answer.

* Moved __set_root_gpu function out of the class.

This function does not depend on the class and can be tested
more easily this way.
Also added unit tests for this function. They simply reuse
data for the root_gpu property.

* determine_root_gpu_device passes unit tests.

* num_gpus passes unit tests.

Also added a None test for this function.

* parse_gpu_ids tests changed to reflect desired state after refactoring.

Planning to refactor parse_gpu_ids to always return list of ints.
This will simplify code that uses the output of this function.

* * parse_gpu_ids always returns lists
* parse_gpu_ids checks given ids against available ids
* parse_gpu_ids raises exception for nonexistent ids
* parse_gpu_ids returns None when no gpus are available
* cleaned up determine_root_gpu_device
* cleaned up num_gpus property
* Updated unit tests to reflect changes in the functions

* Flake8 fixes

* Moved fixture code up before where it is used.

* Updated documentation.

* Changed tests to match the API:
* gpus=-1 or gpus='-1' should use all available gpu devices
* gpus=N
    * N=0: no gpus should be used.
    * N>0: N gpus should be used
* gpus=list of ints or a comma separated string of numbers:
    Use the gpus indicated by the list or the string.

* Fixed code to pass all the changed tests for parsing gpus param.

* Refactoring parse_gpu_ids function.

* flake8 fixes.

* Updating documentation.

* flake8 fixes.

* flake8 fixes

* Update trainer.py

* Update dp_mixin.py

* Make reduce_distributed_output a stand alone function.
Fix imports.
Fix flake8.

* Add comet_ml dependency to tests requirements.txt

* Revert "Make reduce_distributed_output a stand alone function. Fix imports. Fix flake8."

This reverts commit eac0338

* Merge with master.
2019-10-23 05:05:09 -04:00
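The `gpus` argument rules listed in the parse_gpu_ids fix (#382) above can be illustrated with a minimal sketch. This is not the code from the commit: the function name `parse_gpus_sketch`, the `available` parameter (a stand-in for `torch.cuda.device_count()`), the use of `ValueError`, taking the first N devices for `gpus=N`, and treating any string other than '-1' as a comma-separated id list are all assumptions made for illustration.

```python
from typing import List, Optional, Union


def parse_gpus_sketch(
    gpus: Union[int, str, List[int], None], available: int
) -> Optional[List[int]]:
    """Hypothetical helper: normalize a `gpus` value to a list of device ids.

    `available` stands in for torch.cuda.device_count(), so the sketch runs
    on machines without GPUs.
    """
    # None or 0 means "use no GPUs"; no devices present also yields None.
    if gpus is None or gpus == 0 or available == 0:
        return None

    # -1 (int or string) selects every available device.
    if gpus == -1 or gpus == "-1":
        return list(range(available))

    if isinstance(gpus, int):
        if gpus < 0:
            raise ValueError("gpus must be -1, 0, or a positive integer")
        # Assumption: a positive integer N requests the first N devices.
        ids = list(range(gpus))
    elif isinstance(gpus, str):
        # Assumption: any other string is a comma-separated list of ids.
        ids = [int(part) for part in gpus.split(",") if part.strip()]
    else:
        ids = list(gpus)

    # Reject ids that do not correspond to an available device.
    missing = [i for i in ids if i not in range(available)]
    if missing:
        raise ValueError(f"requested GPU ids {missing} are not available")
    return ids
```

With four devices assumed available, `parse_gpus_sketch("1, 3", 4)` selects devices 1 and 3, while `parse_gpus_sketch([5], 4)` raises because id 5 does not exist.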
Nic Eggert 05cea3ff8b Save / Load Hyperparameters with checkpoint (#415)
* Save and load hparams from checkpoints

* Update docs

* Add warning when not saving hparams

* Missing import

* Update .run_local_tests.sh

* Update lm_test_module_mixins.py

* Update lightning_module_template.py
2019-10-23 04:48:24 -04:00
Hata Ryosuke e7c12d936e fixed bug callback=False or None at trainer_io.py (#409) 2019-10-22 13:07:48 -04:00
Jirka Borovec f18aee30a5 Minor imports cleaning (#402)
* code cleaning

* drop unused imports

* optimize imports
2019-10-22 11:32:40 +03:00
William Falcon 792ad00ff9
Fixed val interval (#405)
* added fixed frequency val batch check

* Finished IterableDataset support

* flake8
2019-10-22 05:10:00 +03:00
William Falcon 1424157731
Refactor (#407)
* moved dp, ddp outside of trainer

* added main mixins

* finished major mixin refactor

* flake8

* finished major mixin refactor
2019-10-22 04:16:51 +03:00
tamyiuchau 4103a5ca73 Provide backward compatibility for #124 (#400)
* Provide backward compatibility for e681253

* typo fix
2019-10-21 08:16:55 +02:00
William Falcon 6111edaf82
Test fx (#390)
* changes to test fx
2019-10-19 00:39:30 +02:00
William Falcon 699bd2cb50
removed mlflow and custom logger tests (#389)
* changes to seed for tests
2019-10-18 23:03:28 +02:00
William Falcon 3dfcef6994
Loss keys (#387)
* any key in logs or progress bar is a candidate for callback metric
2019-10-18 15:28:13 +02:00
Hiroyuki Vincent Yamazaki 0fac2d64cf Fix off-by-one epoch length (#377) 2019-10-18 10:18:05 +02:00
William Falcon e5050700ce docs 2019-10-18 00:17:27 +02:00
William Falcon 2044126821
fixing tests (#372)
* fixing tests

* fixed tests
2019-10-16 07:28:47 -04:00
William Falcon e2cabb03ba
fix val logging (#362)
* fix test

* no warnings always
2019-10-15 12:44:20 -04:00
Nic Eggert 19c2b8fc9e Allow disabling default logger, checkpoint_callback, and early_stop_callback (#360)
* Allow disabling logger, early stopping, and checkpoints

* Typo

* Get tests passing

* Update trainer.py
2019-10-12 06:00:24 -04:00
Yasser Souri 792ba59b78 Pad experiment version with zero for easier listing (#355) 2019-10-10 19:39:26 -04:00
William Falcon 426bb19846
Update trainer.py 2019-10-10 18:17:26 -04:00
William Falcon 46322b906b
fixed ckpt tests (#352)
* fixed ckpt tests
2019-10-10 15:16:19 -04:00
William Falcon 96c2a2de50 fixes Flake8 2019-10-09 17:49:29 -04:00
William Falcon 453568179b
Logger default (#351)
* weights go into default logger folder

* ckpt callback in pretrain routine so exp already has version
2019-10-09 17:46:27 -04:00
William Falcon d95e693598
Logger default (#350)
* weights go into default logger folder
2019-10-09 16:25:04 -04:00
William Falcon 6e0a562ecb fixed callback metrics ddp bug 2019-10-09 12:53:33 -04:00
William Falcon 5f1f3f6acc removed pdb 2019-10-09 10:45:06 -04:00
William Falcon 608a90a490
fixes non python type callback metrics and fast_dev_run (#345)
* fixes non python type callback metrics

* fixed fast dev run
2019-10-09 10:23:08 -04:00
Nic Eggert 8088052825 Finalize logger (#337)
* Ensure logger.finalize is called

* Call logger.finalize

* Update mlflow_logger.py

* Update test_logging.py

* Update trainer.py
2019-10-08 17:33:33 -04:00
William Falcon 49e04de5ac
Ports (#338)
* remove os.exit from early stopping

* fixed weight summary
2019-10-08 17:11:47 -04:00
William Falcon dcaba55251
Early stopping (#332)
* callbacks use all other keys in return dict

* remove os.exit from early stopping
2019-10-08 16:21:00 -04:00
Adrian Wälchli 6e3e740a7f Param printing (#336)
* print thousands as K, M, B, T, ...

* add option to print top-level modules only

* added doc string and added spacing

* do not print summary if neither "full" nor "top"

* updated docs showing summary print options

* fix line length for travis
2019-10-08 15:30:06 -04:00
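The thousands formatting mentioned in the Param printing commit (#336) above might look like the following sketch. The function name `human_readable_count`, the truncation to one decimal place, and the exact output format are assumptions for illustration, not the code from the commit.

```python
SUFFIXES = ["", "K", "M", "B", "T"]  # units, thousands, millions, billions, trillions


def human_readable_count(number: int) -> str:
    """Hypothetical helper: abbreviate a parameter count with K, M, B, T."""
    if number < 0:
        raise ValueError("number must be non-negative")

    # Find the largest suffix whose scale does not exceed the number.
    index = 0
    scaled = number
    while scaled >= 1000 and index < len(SUFFIXES) - 1:
        scaled //= 1000
        index += 1

    if index == 0:
        return str(number)

    # Integer arithmetic keeps one decimal digit without float rounding.
    tenths = number * 10 // 1000 ** index
    whole, frac = divmod(tenths, 10)
    return f"{whole}.{frac}{SUFFIXES[index]}"
```

Truncating rather than rounding avoids results such as "1000.0K" for counts just under one million.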
William Falcon ff2a21a08a
default to O1 (#334) 2019-10-08 09:09:57 -04:00
Jon Tamir 1cf2e228ba fix CONTRIBUTING link and silence checkpoint callback message (#325) 2019-10-08 07:40:14 -04:00
William Falcon ac6d0154c2
Fixes lack of logging in logger (#319)
* changed rank 0

* models wait to restore weights
2019-10-06 17:57:23 -04:00
William Falcon 491100abdd
Docs (#315)
* cleaning up demos

* cleaning up docs

* cleaned up test_tube logger
2019-10-05 23:52:32 -04:00
William Falcon ef98931d18 flake8 2019-10-05 16:56:24 -04:00
William Falcon 07c5d22ae3
cleaning up demos (#313)
* cleaning up demos
2019-10-05 16:39:05 -04:00
William Falcon cdfcb01073
Fixes #234 (#311)
* Fixes #234

* default logger version is now slurm job id
2019-10-05 14:45:37 -04:00
William Falcon 6cc3f1757f
decouple returns from each step (#307)
* decoupled training metrics from logging metrics

* decoupled validation metrics from log metrics

* updated docs

* Fixed test

* merged master
2019-10-05 13:35:20 -04:00
William Falcon 8f5a06bfb8
Gpu mem (#308)
* Fixes #289

* added lbfgs support

* Fixes #280 (#309)

* added test seeds (#306)

* added test seeds

* updated docs

* added lbfgs support (#310)

* added lbfgs support

* Fixes #280 (#309)

* added test seeds (#306)

* added test seeds

* updated docs

* added lbfgs support

* Fixes #289

* merged master
2019-10-05 11:29:34 -04:00
William Falcon 75fd89106f
added lbfgs support (#310)
* added lbfgs support

* Fixes #280 (#309)

* added test seeds (#306)

* added test seeds

* updated docs

* added lbfgs support
2019-10-05 11:10:21 -04:00
William Falcon 2ac9f1aea7
Fixes #280 (#309) 2019-10-05 10:55:50 -04:00
William Falcon 967957e55c added lbfgs support 2019-10-05 10:47:18 -04:00
William Falcon bf09060fef
Fixes #292 (#303)
* early stopping callback is not default

* added a default logger

* added default checkpoint callback

* added default checkpoint/loggers

* updated docs

* cleaned demos

* clean up docs around loggers
2019-10-04 19:48:57 -04:00
William Falcon a578de511d
clean up docs around loggers (#304) 2019-10-04 18:53:38 -04:00
Hendrik Schröter 36f0b5bbd0 Use getter instead of python property for the dataloaders (#275)
* Use getter instead of python property for the dataloaders

* Fix lint

* Update trainer.py
2019-10-04 15:35:02 -04:00
William Falcon 32e74b8f36
Ddp2 (#261)
* adds ddp2 option where on each node a single process uses all gpus

* added ddp2 test

* added ddp2 docs

* Update Distributed training.md

* delete ref to old update_training_log_metrics

* debug

* banana pancakes

* debug

* cheesecake
2019-10-04 15:07:54 -04:00
Hendrik Schröter 42764d18c7 Better error message if no loss was returned from model.training_step() (#294) 2019-10-04 07:15:19 -04:00
kvhooreb 41236c7bbb WIP: Moved grad_norm tracking code to __run_tng_batch (#278)
* Moved grad_norm tracking code to __run_tng_batch + added norms to tqdm_metrics

* Update trainer.py
2019-10-02 11:11:08 -04:00
Nic Eggert 614cb3c03b Initialize loggers only once (#270)
* Create underlying loggers lazily

This avoids creating duplicate experiments or runs in multi-node DDP.

* Save hyperparameters automatically

* Update docs for snapshotting hyperparams

* Fix test tube

* Fix test tube pickling
2019-10-02 11:10:40 -04:00
Nic Eggert 480eed5cb6 Enable any ML experiment tracking framework (#223)
* Implement generic loggers for experiment tracking

* Add tests for loggers

* Get model tests passing

* Test and fix logger pickling

* Expand pickle test and fix bug

* Missed exp -> logger conversion

* Remove commented code

* Add docstrings

* Update logging docs

* Add mlflow to test requirements

* Make linter happy

* Fix mlflow timestamp

* Update Logging.md

* Update test_models.py

* Update properties.md

* Fix tests

* Line length
2019-09-27 12:05:29 -04:00
William Falcon 481aa24974
always calls the lr scheduler with epoch nb. Fixes #98 (#252)
* always calls the lr scheduler with epoch nb

* added docs for cluster grid search

* undo test changes
2019-09-26 16:36:41 -04:00
William Falcon 25d2f93256
enables samplers which don't need set epoch (or when ppl don't need a sampler) (#254)
* enables samplers which dont need set epoch

* added docs for single gpu ddp

* added docs for cluster grid search
2019-09-26 14:39:04 -04:00
William Falcon 8b2a2aeda3
Dim 0 warning (#256)
* added ignore warnings module

* Fixes #249

* Update ignored_warnings.py
2019-09-26 13:20:54 -04:00
Alok Singh b0a0a47a0b Rename variables (#124)
-   data_batch → batch
-   batch_i → batch_idx
-   dataloader_i → dataloader_idx
-   tng → training
-   training_dataloader → train_dataloader
-   add_log_row_interval → row_log_interval
-   gradient_clip → gradient_clip_val
-   prog → progress
-   tqdm_dic → tqdm_dict
2019-09-25 19:05:06 -04:00
William Falcon 87708157bc
Update trainer.py (#233) 2019-09-19 08:23:48 -04:00
Ananya Harsh Jha c0f3b6b035 added set_epoch for distributed sampler, fix for #224 (#225) 2019-09-16 10:21:00 -04:00
William Falcon 9576dd28b2
added load on CPU first (#221)
* added load on CPU first

* added print logs

* changed close order
2019-09-11 07:52:36 -04:00
William Falcon 30b25c8146
Sai prasanna master (#219)
* Fix incorrect warning for DistributedSampler.

Check whether `dataloader.sampler` is an instance of DistributedSampler instead of checking the `dataloader`.

* Update trainer.py

* merged
2019-09-09 11:36:24 -04:00
William Falcon 506d5da68b
enable single gpu per node (#218)
* enable single gpu per node
2019-09-09 07:37:20 -04:00
William Falcon 10d190e045
Simplified gpu api. No NVIDIA flag managing by lightning for cluster (#213)
* added nvidia flag set

* added simple cluster template

* sets correct backend for possible combinations of gpu inputs
2019-09-08 15:36:58 -04:00
Alok Singh 81df2259ef Make print_nan_grads print grad (#208)
This seems more useful for debugging.
2019-09-07 01:08:09 -04:00
William Falcon 0c7fbc7178
Weights path (#211)
* added docs. removed options. added weights_save option

* removed old restore

* cleaned up save path

* flake8
2019-09-06 17:01:03 -04:00
William Falcon 7099f8dbfb
split trainer mixins (#209)
* split trainer mixins

* Update multi_node_cluster_template.py

* Update single_cpu_template.py

* Update single_gpu_node_16bit_template.py

* Update single_gpu_node_ddp_template.py

* Update single_gpu_node_dp_template.py

* Update trainer_cpu_template.py

* Update trainer_io.py

* split trainer mixins

* Update multi_node_cluster_template.py

* deconflicted
2019-09-06 14:11:07 -04:00