* Add property to delay checkpoint loading; move checkpoint-file loading into the run function so that the DeepSpeed engine can be loaded first
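A minimal sketch of the idea (class and property names are hypothetical, not the exact PyTorch Lightning API): the training-type plugin advertises that checkpoint loading must wait until its wrapped engine exists, and the trainer checks this flag before restoring.

```python
# Sketch only; names and surrounding API are illustrative.
class DeepSpeedLikePlugin:
    @property
    def restore_checkpoint_after_pre_dispatch(self) -> bool:
        # DeepSpeed wraps the model in its own engine, so the engine must be
        # initialized before a DeepSpeed checkpoint can be restored.
        return True
```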
* Add a small test
* Apply suggestions from code review
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* Update pytorch_lightning/accelerators/accelerator.py
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* Address review
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* reduce memory leak
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* update changelog
* Apply suggestions from code review
Co-authored-by: Ethan Harris <ewah1g13@soton.ac.uk>
* resolve flake8
* update on comments
* resolve bug
* update
* Undo whitespace changes
* remove bug
* resolve flake8
* revert change
* update on comments
* delete the DDP wrapper as it holds memory
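An illustrative teardown along these lines, assuming the plugin keeps the wrapped model on `self.model` (the attribute name is an assumption):

```python
import gc

import torch
from torch.nn.parallel import DistributedDataParallel


class PluginLike:
    def teardown(self) -> None:
        # Drop the DDP wrapper (its reducer holds gradient buckets) and keep
        # only the bare module, then release cached CUDA memory.
        if isinstance(self.model, DistributedDataParallel):
            self.model = self.model.module
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
```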
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* resolve flake8
* update on comments
* update changelog
* resolve test
* Update CHANGELOG
* Refactor teardown
* Fix comment
* Do it for non-gpu too
* remove reference when the model is not a LightningModule
* Fix import error
* move down
* resolve bug
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* resolve assignment
* update
* move above
* Fix device calls to support TPU training
* Update todo
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Ethan Harris <ewah1g13@soton.ac.uk>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: Kaushik B <kaushikbokka@gmail.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: thomas chaton <thomas@grid.ai>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
These reports can be quite large and take some processing to produce, so on larger models there is a noticeable performance hit when generating the cycles/memory reports.
* [DDP] Remove the outdated limitations on DDP communication hooks since 1.9
1. DDP communication hooks work with multiple backends since 1.9.
2. SPMD mode in DDP is completely retired in 1.9; SPSD (single-process single-device) is the only option.
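For context, this is how a built-in communication hook is registered on a `DistributedDataParallel` model (here the fp16 compression hook, which since PyTorch 1.9 is no longer restricted to the NCCL backend):

```python
import torch
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes a process group has already been initialized.
ddp_model = DDP(torch.nn.Linear(8, 8))
# Compress gradients to fp16 before the all-reduce; works across backends
# since PyTorch 1.9.
ddp_model.register_comm_hook(state=None, hook=default_hooks.fp16_compress_hook)
```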
* Update ddp.py
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Remove error, add 'mixed' to the precision check
* Add test
* Remove test
* Add changelog
* Add test for mixed
* Update tests/plugins/test_deepspeed_plugin.py
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Add special
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* add ClusterEnvironment for LSF systems
* update init file
* add available cluster environments
* clean up LSFEnvironment
* add ddp_hpc as a distributed backend
* clean up SLURMEnvironment
* remove extra blank line
* init device for DDPHPCAccelerator
We need to do this so that we don't send the model to the same device from multiple ranks.
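A sketch of that device initialization (function name illustrative):

```python
import torch


def init_device(local_rank: int) -> None:
    # Pin this process to its own GPU before the model is moved; without
    # this, every rank would place its copy on the default device (cuda:0).
    if torch.cuda.is_available():
        torch.cuda.set_device(local_rank)
```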
* committing current state
* add additional methods to ClusterEnvironments
* add NVIDIA mixin for setting up CUDA env vars
* remove troubleshooting prints
* cleanup SLURMEnvironment
* fix docstring
* cleanup TorchElasticEnvironment and add documentation
* PEP8 puts a cork in it
* add set_ranks_to_trainer
* remove unused import
* move to new location
* update LSF environment
* remove mixin
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* changelog
* reset SLURM env
* add tests
* add licence
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* test node_rank
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* add LSF env to docs
* add auto-detection for the LSF environment
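A hypothetical sketch of such auto-detection; the exact variables PyTorch Lightning checks may differ, but LSF batch jobs export identifiers such as `LSB_JOBID`:

```python
import os


def is_using_lsf() -> bool:
    # Hypothetical: treat the presence of LSF's job variables as evidence
    # that we are running inside an LSF-managed job.
    required = ("LSB_JOBID", "LSB_HOSTS")
    return all(var in os.environ for var in required)
```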
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix is_using_lsf() and test
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Add callback to hook tests and add predict test
* Fix lambda callback test
* Simplify lambda call test
* Use LambdaCallback
* Dynamically append to `called` for the model
* Remove print
* Consistency
* Consistency
* Prepare args/kwargs testing
* yapf doesn't like dict literals
* Add arguments for fit no val test
* Add arguments for fit no val test
* add before_backward_hook
* add test
* resolve flake8
* resolve tests
* update changelog
* add on_before_backward to LightningModule
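With the hook in place, a user can override it on their `LightningModule` to inspect the (possibly precision-scaled) loss just before `backward()` runs. A minimal sketch, assuming the hook signature `on_before_backward(self, loss)`:

```python
import torch
from pytorch_lightning import LightningModule


class LitModel(LightningModule):
    def on_before_backward(self, loss: torch.Tensor) -> None:
        # Called right before loss.backward(); the loss may already have
        # been scaled by the precision plugin at this point.
        self.log("loss_before_backward", loss.detach())
```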
* update on comments
* Test arguments
* Datamodule refactor
* Fix eval test
* remove extra file
* resolve bug
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* move to hooks
* update
* resolve flake8
* update on comments
* Update full fit + val test
* Update test
* Remove FIXME
* Remove FIXME
* Undo change
* Fix
* Parametrize fit hook test
* Comment
* Parametrize fit hook test with different precision plugins
* Fix tests
* Parametrize fit hook test with manual optimization
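The parametrization pattern, roughly (test body elided; argument names illustrative):

```python
import pytest


@pytest.mark.parametrize("precision", [16, 32])
@pytest.mark.parametrize("automatic_optimization", [True, False])
def test_fit_hook_calls(precision, automatic_optimization):
    # One test covers every precision / optimization-mode combination,
    # asserting the same ordered list of hook calls for each.
    ...
```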
* Unnecessary parenthesis
* WIP
* Comments
* Fix message
* Test CI error
* Revert "Test CI error"
This reverts commit 39c4a85a83.
* Add ddp training type teardown
* Update CHANGELOG
* Adrian's fix
* Use destructor
* Update CHANGELOG.md
* RPC destructor
* Update pytorch_lightning/plugins/training_type/ddp.py
* Why do you not work :(
* Missing condition
* Fix deepspeed test
* GC collect in conftest
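A sketch of what a conftest-level collection step can look like (fixture name illustrative):

```python
import gc

import pytest
import torch


@pytest.fixture(autouse=True)
def collect_garbage():
    yield
    # Run after every test: break reference cycles and return cached CUDA
    # blocks, so one test's leftovers cannot starve the next of GPU memory.
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
```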
* Do not show warnings for special tests
* Needs to run on 1.8
To avoid: "RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:32, unhandled cuda error, NCCL version 2.4.8"
* Run torch 1.8
* Skip test due to 'Python bus error'
* Debug NCCL
* shm size
* Disable warnings for special tests
* Remove NCCL_DEBUG statement
* Try smaller shm size
* Revert "Skip test due to 'Python bus error'"
This reverts commit e0a3e8785d.
* README and adjust versions
* Avoid self.on_gpu call
* empty cache cleanup
* More garbage collection
* Unroll parametrizations
* Do not reuse mock
* Undo changes
* Undo notebooks modification
* resolve test
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* update
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* delete file
* Undo
* Fix test
* Revert "WIP"
This reverts commit f5828a8c42.
* Rename
* Remove optimizers
* Fix bug with LightningOptimizer
* Add optimizers
* update
* update
* Update CHANGELOG
* On after backward refactor
* Do not call super
* Fixes
* Remove should_accumulate
* pre/post backward refactor
* Call the LM backward hook
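The counterpart hook after the refactor, sketched from the user's side: `on_after_backward` fires once gradients have been populated (the logging shown is just one illustrative use).

```python
import torch
from pytorch_lightning import LightningModule


class LitModel(LightningModule):
    def on_after_backward(self) -> None:
        # Gradients exist here; e.g. log the global gradient norm.
        norms = [p.grad.norm() for p in self.parameters() if p.grad is not None]
        if norms:
            self.log("grad_norm", torch.stack(norms).norm())
```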
* Update tests
* Remove dev debug patch
* Fix test
* Remove optimizer arguments and typing
* Docs fixes
* Fix comment
* Undo changes
* Split manual and auto
* Undo change
* DeepSource
* Remove optimizers
* Undo changes
* Call the hook
* Docs
* Docs
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>