* expose extract_batch and make public
* first pass
* early return
* add changelog
* move to utilities/data.py
* add test_data.py
* tests are passing
* precommit hook
* address pep8 failure
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
* add ClusterEnvironment for LSF systems
* update init file
* add available cluster environments
* clean up LSFEnvironment
* add ddp_hpc as a distributed backend
* clean up SLURMEnvironment
* remove extra blank line
* init device for DDPHPCAccelerator
We need to do this so we don't send the model to the same device from multiple ranks
* committing current state
* add additional methods to ClusterEnvironments
* add NVIDIA mixin for setting up CUDA environment variables
* remove troubleshooting prints
* cleanup SLURMEnvironment
* fix docstring
* cleanup TorchElasticEnvironment and add documentation
* PEP8 puts a cork in it
* add set_ranks_to_trainer
* remove unused import
* move to new location
* update LSF environment
* remove mixin
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* changelog
* reset slurm env
* add tests
* add licence
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* test node_rank
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* add lsf env to docs
* add auto detection for lsf environment
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix is_using_lsf() and test
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Add callback to hook tests and add predict test
* Fix lambda callback test
* Simplify lambda call test
* Use LambdaCallback
* Dynamically append to called for the model
* Remove print
* Consistency
* Consistency
* Prepare args/kwargs testing
* yapf doesn't like dict literals
* Add arguments for fit no val test
* Add arguments for fit no val test
* add before_backward_hook
* add test
* resolve flake8
* resolve tests
* update changelog
* add on_before_backward to LightningModule
* update on comments
* Test arguments
* Datamodule refactor
* Fix eval test
* remove extra file
* resolve bug
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* move to hooks
* update
* resolve flake8
* update on comments
* Update full fit + val test
* Update test
* Remove FIXME
* Remove FIXME
* Undo change
* Fix
* Parametrize fit hook test
* Comment
* Parametrize fit hook test with different precision plugins
* Fix tests
* Parametrize fit hook test with manual optimization
* Unnecessary parenthesis
* WIP
* Comments
* Fix message
* Test CI error
* Revert "Test CI error"
This reverts commit 39c4a85a83.
* Add ddp training type teardown
* Update CHANGELOG
* Adrian's fix
* Use destructor
* Update CHANGELOG.md
* RPC destructor
* Update pytorch_lightning/plugins/training_type/ddp.py
* Why do you not work :(
* Missing condition
* Fix deepspeed test
* GC collect in conftest
* Do not show warnings for special tests
* Needs to run on 1.8
To avoid: "RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:32, unhandled cuda error, NCCL version 2.4.8"
* Run torch 1.8
* Skip test due to 'Python bus error'
* Debug NCCL
* shm size
* Disable warnings for special tests
* Remove NCCL_DEBUG statement
* Try smaller shm size
* Revert "Skip test due to 'Python bus error'"
This reverts commit e0a3e8785d.
* README and adjust versions
* Avoid self.on_gpu call
* empty cache cleanup
* More garbage collection
* Unroll parametrizations
* Do not reuse mock
* Undo changes
* Undo notebooks modification
* resolve test
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* update
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* delete file
* Undo
* Fix test
* Revert "WIP"
This reverts commit f5828a8c42.
* Rename
* Remove optimizers
* Fix bug with LightningOptimizer
* Add optimizers
* update
* update
* Update CHANGELOG
* On after backward refactor
* Do not call super
* Fixes
* Remove should_accumulate
* pre/post backward refactor
* Call the LM backward hook
* Update tests
* Remove dev debug patch
* Fix test
* Remove optimizer arguments and typing
* Docs fixes
* Fix comment
* Undo changes
* Split manual and auto
* Undo change
* Deepsource
* Remove optimizers
* Undo changes
* Call the hook
* Docs
* Docs
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* edit arg to reload_dataloaders_every_n_epoch
* init reload_dataloaders_every_n_epoch
* edit logic to reload dl
* update arg to test datamodule
* update arg test dataloader
* edit reload dl logic in eval loop
* fix var name in reset_train_val_dataloaders
* fix error, use current_epoch attribute
* edit every_n_epoch to every_n_epochs
* edit every_n_epoch to every_n_epochs
* edit every_n_epoch to every_n_epochs
* edit every_n_epoch to every_n_epochs
* edit every_n_epoch to every_n_epochs
* edit every_n_epoch to every_n_epochs
* assert reload_dataloaders_every_n_epochs positive
* assert reload_dataloaders_every_n_epochs positive
* add trainer property should reload dl
* update should reload dl in train loop
* condition on should reload dl in eval loop
* pep8
* fix update should reload dl in train loop
* add test case
* replace assertion with misconfig exception
* remove unused variable
* remove unnecessary checks
* replace to BoringModel
* remove unrequired comment
* deprecate _every_epoch
* add deprecated argument to trainer
* test case for deprecated arg
* remove unrequired assertion in train loop
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
* modify misconfig exception for int
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
* convert bool to int for deprecated _every_epoch
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
* update description of deprecated param
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
* update deprecation warning
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
* modify argument to int only
* fix deprecated test function name
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
* merge tests for reload dls
* add property should reload dl
* removed and added to trainer property
* use property in train loop
* remove deprecated test
* add deprecated test to new file
* test case for exception
* update test datamodule every_n_epochs
* update trainer docs
* update hooks with every_n_epochs
* edit format if statement
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
* Update CHANGELOG.md
* Apply suggestions from code review
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
* typo in exception
* pytest check only misconfig exception
* remove unnecessary code in test
* remove unnecessary code in deprec test
* added match in test
* typo in comment
* revert to previous, keep only required code in context manager
* Apply suggestions from code review
* docs
* rebase
* Apply suggestions from code review
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix import: model_helpers instead of model_utils
* fix, add reload_dataloaders_every_n_epochs argument to data connector
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* add required imports
* move deprecated log
* add missing import rank_zero_warn
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* update varname in should_reload_dl_epoch
suggestion from code review
* Fix CHANGELOG. Update deprecation versions
* Minor change
* change property name, mark protected
* update property name
* update property name
* Remove deprecated *_loop.py files
* Rename test func
* Update CHANGELOG.md
* use rank_zero_deprecation
* update deprecation message in trainer api docs
* test deprecation with real arg name in message
* fix typo in trainer docs
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* Move result teardown to loops
* Update CHANGELOG
* Remove teardown from run
* Move previous teardown to on_run_end
* Add comment
* Merge 8250
* Remove stage set to None where it shouldn't
* Skip test due to 'Python bus error'
* Debug NCCL
* Remove NCCL_DEBUG statement
* Revert "Skip test due to 'Python bus error'"
This reverts commit e0a3e8785d.
* fix
* add test
* changelog
* yapf
* patch os environ
* make a special test
* destroy pg
* debug
* revert
* revert
* problematic test
* skip
* try the fixture
* test
* update sensitive test
* update changelog
* remove comment
* update wrong test
* update test name
* parameterization
* Revert "parameterization"
This reverts commit b0542f43f59c5ce66800883b5e2f0c66a97408cc.
* remove conftest
* ignore test
* teardown
* fix merge
* deep speed parameterization
* uncomment test
* update chlog
* update changelog
* split tests
* update test
* update test comments
* unroll test
* unroll test
* unroll test
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* increase shm
* sudo
* unroll ipu
* Revert "sudo"
This reverts commit 6cc68c1478.
* Revert "increase shm"
This reverts commit 8c27163483.
* x
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* find guilty test
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* POPTORCH_WAIT_FOR_IPU=1
* move test
* redo parameterize for ipu
* de-comment test
* move chlog
* Update tests/accelerators/test_accelerator_connector.py
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
* Update tests/accelerators/test_accelerator_connector.py
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
* device ids in barrier
same fix for spawn
fix non-nccl
* add changelog
* get nccl backend
* get backend
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
* add mechanism to prevent deadlock
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* resolve flake8 + update changelog
* update on comments
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* update
* remove space
* resolve bugs
* overwrite config
* update on comments
* update on comments
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* update
* update
* update test with comments
* Update pytorch_lightning/plugins/training_type/parallel.py
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
* update on comments
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>