* Remove error, add mixed to check
* Add test
* Remove test
* Add changelog
* Add test for mixed
* Update tests/plugins/test_deepspeed_plugin.py
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Add special
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* add ClusterEnvironment for LSF systems
* update init file
* add available cluster environments
* clean up LSFEnvironment
* add ddp_hpc as a distributed backend
* clean up SLURMEnvironment
* remove extra blank line
* init device for DDPHPCAccelerator
We need to do this so we don't send the model to the same device from multiple ranks
* committing current state
* add additional methods to ClusterEnvironments
* add NVIDIA mixin for setting up CUDA environment variables
* remove troubleshooting prints
* cleanup SLURMEnvironment
* fix docstring
* cleanup TorchElasticEnvironment and add documentation
* PEP8 puts a cork in it
* add set_ranks_to_trainer
* remove unused import
* move to new location
* update LSF environment
* remove mixin
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* changelog
* reset slurm env
* add tests
* add licence
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* test node_rank
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* add lsf env to docs
* add auto detection for lsf environment
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix is_using_lsf() and test
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Add callback to hook tests and add predict test
* Fix lambda callback test
* Simplify lambda call test
* Use LambdaCallback
* Dynamically append to called for the model
* Remove print
* Consistency
* Consistency
* Prepare args/kwargs testing
* yapf doesn't like dict literals
* Add arguments for fit no val test
* Add arguments for fit no val test
* add before_backward_hook
* add test
* resolve flake8
* resolve tests
* update changelog
* add on_before_backward to LightningModule
* update on comments
* Test arguments
* Datamodule refactor
* Fix eval test
* remove extra file
* resolve bug
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* move to hooks
* update
* resolve flake8
* update on comments
* Update full fit + val test
* Update test
* Remove FIXME
* Remove FIXME
* Undo change
* Fix
* Parametrize fit hook test
* Comment
* Parametrize fit hook test with different precision plugins
* Fix tests
* Parametrize fit hook test with manual optimization
* Unnecessary parenthesis
* WIP
* Comments
* Fix message
* Test CI error
* Revert "Test CI error"
This reverts commit 39c4a85a83.
* Add ddp training type teardown
* Update CHANGELOG
* Adrian's fix
* Use destructor
* Update CHANGELOG.md
* RPC destructor
* Update pytorch_lightning/plugins/training_type/ddp.py
* Why do you not work :(
* Missing condition
* Fix deepspeed test
* GC collect in conftest
* Do not show warnings for special tests
* Needs to run on 1.8
To avoid: "RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:32, unhandled cuda error, NCCL version 2.4.8"
* Run torch 1.8
* Skip test due to 'Python bus error'
* Debug NCCL
* shm size
* Disable warnings for special tests
* Remove NCCL_DEBUG statement
* Try smaller shm size
* Revert "Skip test due to 'Python bus error'"
This reverts commit e0a3e8785d.
* README and adjust versions
* Avoid self.on_gpu call
* empty cache cleanup
* More garbage collection
* Unroll parametrizations
* Do not reuse mock
* Undo changes
* Undo notebooks modification
* resolve test
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* update
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* delete file
* Undo
* Fix test
* Revert "WIP"
This reverts commit f5828a8c42.
* Rename
* Remove optimizers
* Fix bug with LightningOptimizer
* Add optimizers
* update
* update
* Update CHANGELOG
* On after backward refactor
* Do not call super
* Fixes
* Remove should_accumulate
* pre/post backward refactor
* Call the LM backward hook
* Update tests
* Remove dev debug patch
* Fix test
* Remove optimizer arguments and typing
* Docs fixes
* Fix comment
* Undo changes
* Split manual and auto
* Undo change
* Deepsource
* Remove optimizers
* Undo changes
* Call the hook
* Docs
* Docs
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* Skip test due to 'Python bus error'
* Debug NCCL
* Remove NCCL_DEBUG statement
* Revert "Skip test due to 'Python bus error'"
This reverts commit e0a3e8785d.
* fix
* add test
* changelog
* yapf
* patch os environ
* make a special test
* destroy pg
* debug
* revert
* revert
* problematic test
* skip
* try the fixture
* test
* update sensitive test
* update changelog
* remove comment
* update wrong test
* update test name
* parameterization
* Revert "parameterization"
This reverts commit b0542f43f59c5ce66800883b5e2f0c66a97408cc.
* remove conftest
* ignore test
* teardown
* fix merge
* DeepSpeed parameterization
* uncomment test
* update chlog
* update changelog
* split tests
* update test
* update test comments
* unroll test
* unroll test
* unroll test
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* increase shm
* sudo
* unroll ipu
* Revert "sudo"
This reverts commit 6cc68c1478.
* Revert "increase shm"
This reverts commit 8c27163483.
* x
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* find guilty test
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* POPTORCH_WAIT_FOR_IPU=1
* move test
* redo parameterize for ipu
* de-comment test
* move chlog
* Update tests/accelerators/test_accelerator_connector.py
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
* Update tests/accelerators/test_accelerator_connector.py
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
* device ids in barrier
same fix for spawn
fix non-nccl
* add changelog
* get nccl backend
* get backend
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
* resolve manual optimization
* resolve manual optimization
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* update changelog
* Simplify message
* Move from deprecated
* Split model parallel/manual model
* Use property
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: SeanNaren <sean@grid.ai>
* Update configs to match latest API
* Ensure we move the entire model to device before configure_optimizers is called
* Add missing param
* Expose parameters
* Update references, drop local rank as it's now inferred from the environment variable
* Fix ref
* Force install deepspeed 0.3.16
* Add guard for init
* Update pytorch_lightning/plugins/training_type/deepspeed.py
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
* Revert type checking
* Install master for CI for testing purposes
* Update CI
* Fix tests
* Add check
* Update versions
* Set precision
* Fix
* See if I can force upgrade
* Attempt to fix
* Drop
* Add changelog
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
* Add kubeflow cluster environment
* Add KubeflowEnvironment to docs
* Add KubeflowEnvironment to the changelog
* break up a long line
* Add method to detect kubeflow environment
* Select Kubeflow environment when available
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Run pre-commit
* task_idx == 0
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: thomas chaton <thomas@grid.ai>
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
* deepspeed add train_micro_batch_size_per_gpu argument
* Update naming and doc
* Modify to use auto naming convention, add test
* Add iterable tests
* Fix tests, attempt by mocking
* Import correct package
* Fix comparison
* Set as special test
* Remove import
* Add Changelog
Co-authored-by: SeanNaren <sean@grid.ai>