* add ClusterEnvironment for LSF systems
* update init file
* add available cluster environments
* clean up LSFEnvironment
* add ddp_hpc as a distributed backend
* clean up SLURMEnvironment
* remove extra blank line
* init device for DDPHPCAccelerator
We need to do this so that multiple ranks do not all send the model to the same device
* committing current state
* add additional methods to ClusterEnvironments
* add NVIDIA mixin for setting up CUDA environment variables
* remove troubleshooting prints
* cleanup SLURMEnvironment
* fix docstring
* cleanup TorchElasticEnvironment and add documentation
* PEP8 puts a cork in it
* add set_ranks_to_trainer
* remove unused import
* move to new location
* update LSF environment
* remove mixin
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* changelog
* reset slurm env
* add tests
* add licence
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* test node_rank
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* add lsf env to docs
* add auto-detection for the LSF environment (see the detection sketch after this section)
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* fix is_using_lsf() and test
* [pre-commit.ci] auto fixes from pre-commit.com hooks
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
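The commits above add an LSF `ClusterEnvironment` with auto-detection and per-rank device initialization (so the model is not moved to the same device from multiple ranks). Below is a minimal sketch of that idea, assuming the standard LSF/jsrun variables `LSB_JOBID` and `JSM_NAMESPACE_*`; the real `LSFEnvironment` in PyTorch Lightning may read different variables and implements the full `ClusterEnvironment` interface.

```python
import os


def is_using_lsf() -> bool:
    # Sketch of auto-detection: treat the job as LSF-launched when the
    # scheduler's variables are present (assumed names, not Lightning's exact check).
    required = ("LSB_JOBID", "JSM_NAMESPACE_SIZE", "JSM_NAMESPACE_RANK", "JSM_NAMESPACE_LOCAL_RANK")
    return all(var in os.environ for var in required)


class LSFEnvironmentSketch:
    """Hypothetical, minimal stand-in for an LSF cluster environment."""

    def world_size(self) -> int:
        return int(os.environ["JSM_NAMESPACE_SIZE"])

    def global_rank(self) -> int:
        return int(os.environ["JSM_NAMESPACE_RANK"])

    def local_rank(self) -> int:
        # Each rank can use this to pick its own CUDA device, e.g.
        # torch.cuda.set_device(self.local_rank()), which is why the device
        # is initialized per rank in the commits above.
        return int(os.environ["JSM_NAMESPACE_LOCAL_RANK"])
```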
* Add callback to hook tests and add predict test
* Fix lambda callback test
* Simplify lambda call test
* Use LambdaCallback
* Dynamically append to `called` for the model
* Remove print
* Consistency
* Prepare args/kwargs testing
* yapf doesn't like dict literals
* Add arguments for fit no val test
* add before_backward_hook
* add test
* resolve flake8
* resolve tests
* update changelog
* add on_before_backward to LightningModule (see the hook sketch after this section)
* update based on review comments
* Test arguments
* Datamodule refactor
* Fix eval test
* remove extra file
* resolve bug
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* move to hooks
* update
* resolve flake8
* update based on review comments
* Update full fit + val test
* Update test
* Remove FIXME
* Undo change
* Fix
* Parametrize fit hook test
* Comment
* Parametrize fit hook test with different precision plugins
* Fix tests
* Parametrize fit hook test with manual optimization
* Unnecessary parenthesis
* WIP
* Comments
* Fix message
* Test CI error
* Revert "Test CI error"
This reverts commit 39c4a85a83.
* Add ddp training type teardown
* Update CHANGELOG
* Adrian's fix
* Use destructor
* Update CHANGELOG.md
* RPC destructor
* Update pytorch_lightning/plugins/training_type/ddp.py
* Why do you not work :(
* Missing condition
* Fix deepspeed test
* GC collect in conftest
* Do not show warnings for special tests
* Needs to run on 1.8
To avoid: "RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:32, unhandled cuda error, NCCL version 2.4.8"
* Run torch 1.8
* Skip test due to 'Python bus error'
* Debug NCCL
* shm size
* Disable warnings for special tests
* Remove NCCL_DEBUG statement
* Try smaller shm size
* Revert "Skip test due to 'Python bus error'"
This reverts commit e0a3e8785d.
* README and adjust versions
* Avoid self.on_gpu call
* empty cache cleanup
* More garbage collection
* Unroll parametrizations
* Do not reuse mock
* Undo changes
* Undo notebooks modification
* resolve test
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* update
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* delete file
* Undo
* Fix test
* Revert "WIP"
This reverts commit f5828a8c42.
* Rename
* Remove optimizers
* Fix bug with LightningOptimizer
* Add optimizers
* update
* Update CHANGELOG
* On after backward refactor
* Do not call super
* Fixes
* Remove should_accumulate
* pre/post backward refactor
* Call the LM backward hook
* Update tests
* Remove dev debug patch
* Fix test
* Remove optimizer arguments and typing
* Docs fixes
* Fix comment
* Undo changes
* Split manual and auto
* Undo change
* Deepsource
* Remove optimizers
* Undo changes
* Call the hook
* Docs
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
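Several commits in this batch add and wire up the `on_before_backward` hook on `LightningModule` alongside the pre/post backward refactor. As a hedged illustration only, a user module might override it roughly as below, assuming the hook receives the loss tensor right before `backward()` is called; the release's hooks documentation defines the exact signature.

```python
import torch
import torch.nn.functional as F
import pytorch_lightning as pl


class LitClassifier(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return F.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

    def on_before_backward(self, loss: torch.Tensor) -> None:
        # Called just before loss.backward(); a convenient place to inspect
        # the loss about to be backpropagated, e.g. to catch non-finite values early.
        if not torch.isfinite(loss):
            raise RuntimeError(f"non-finite loss before backward: {loss.item()}")
```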
* edit arg to reload_dataloaders_every_n_epoch (see the Trainer usage sketch after this section)
* init reload_dataloaders_every_n_epoch
* edit logic to reload dl
* update arg to test datamodule
* update arg test dataloader
* edit reload dl logic in eval loop
* fix var name in reset_train_val_dataloaders
* fix error, use current_epoch attribute
* edit every_n_epoch to every_n_epochs
* assert reload_dataloaders_every_n_epochs positive
* add trainer property should reload dl
* update should reload dl in train loop
* condition on should reload dl in eval loop
* pep8
* fix update should reload dl in train loop
* add test case
* replace assertion with misconfig exception
* remove unused variable
* remove unnecessary checks
* replace to BoringModel
* remove unnecessary comment
* deprecate _every_epoch
* add deprecated argument to trainer
* test case for deprecated arg
* remove unnecessary assertion in train loop
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
* modify misconfig exception for int
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
* convert bool to int for deprecated _every_epoch
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
* update description of deprecated param
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
* update deprecation warning
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
* modify argument to int only
* fix deprecated test function name
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
* merge tests for reload dls
* add property should reload dl
* removed and added to trainer property
* use property in train loop
* remove deprecated test
* add deprecated test to new file
* test case for exception
* update test datamodule every_n_epochs
* update trainer docs
* update hooks with every_n_epochs
* edit if-statement formatting
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
* Update CHANGELOG.md
* Apply suggestions from code review
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
* typo in exception
* pytest check only misconfig exception
* remove unnecessary code in test
* remove unnecessary code in deprec test
* added match in test
* typo in comment
* revert to previous, keep only what is required in the context manager
* Apply suggestions from code review
* docs
* rebase
* Apply suggestions from code review
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* fix import: model_helpers instead of model_utils
* fix, add reload_dataloaders_every_n_epochs argument to data connector
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* add required imports
* move deprecated log
* add missing import rank_zero_warn
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* update varname in should_reload_dl_epoch
suggestion from code review
* Fix CHANGELOG. Update deprecation versions
* Minor change
* change property name, mark protected
* update property name
* Remove deprecated *_loop.py files
* Rename test func
* Update CHANGELOG.md
* use rank_zero_deprecation
* update deprecation message in trainer api docs
* test deprecation with real arg name in message
* fix typo in trainer docs
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
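The block of commits above replaces the boolean `reload_dataloaders_every_epoch` flag with the integer-valued `reload_dataloaders_every_n_epochs` Trainer argument and deprecates the old name. A small usage sketch of the migration (argument names as described in the commits; see the CHANGELOG for the exact deprecation versions):

```python
import pytorch_lightning as pl

# New integer-valued argument: reload train/val dataloaders every 2 epochs.
trainer = pl.Trainer(max_epochs=10, reload_dataloaders_every_n_epochs=2)

# Previous boolean flag, now deprecated in favour of the argument above:
# trainer = pl.Trainer(max_epochs=10, reload_dataloaders_every_epoch=True)
```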
* Fix mypy for utilities.device_parser
* Fix remaining mypy issues + disable ignoring mypy errors
* Return one Optional type annotation back
* Fix annotation for the parse_tpu_cores method
* Remove unused import
* Include carmocca's suggestion and fix mypy issue
* include carmocca's suggestion
* add `else` statement to `parse_gpu_ids` to inform mypy that `gpus` is of type `List[int]` (see the narrowing sketch below)
* [pre-commit.ci] auto fixes from pre-commit.com hooks
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
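The final commit adds an explicit `else` branch so mypy can narrow `gpus` to `List[int]` inside `parse_gpu_ids`. The snippet below is a generic, hypothetical sketch of that narrowing pattern, not the actual `device_parser` code:

```python
from typing import List, Optional, Union


def parse_gpu_ids_sketch(gpus: Optional[Union[int, str, List[int]]]) -> Optional[List[int]]:
    if gpus is None:
        return None
    if isinstance(gpus, int):
        gpus = list(range(gpus))
    elif isinstance(gpus, str):
        gpus = [int(index) for index in gpus.split(",") if index.strip()]
    else:
        # Explicit else branch: after ruling out None, int and str, the remaining
        # case is a List[int], which both the reader and mypy can now see, so the
        # declared return type checks cleanly.
        assert isinstance(gpus, list)
    return gpus
```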