* bump cuda in docker images to 11.6.1
* PUSH TO HUB. REVERT THIS!
* conda forge for 11.6
* cuda 11.5
* revert conda changes
* 11.6 back again
* 11.6 back again, all of them
* maybe all passes now
* maybe all passes now
* final push
* Revert "PUSH TO HUB. REVERT THIS!"
This reverts commit 602bfce224.
* Apply suggestions from code review
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
* unpin CUDA docker image for GPU CI
* Apply suggestions from code review
Co-authored-by: Aki Nitta <nitta@akihironitta.com>
Co-authored-by: Akihiro Nitta <akihiro@pytorchlightning.ai>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Aki Nitta <nitta@akihironitta.com>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* try skip horovod 0.24.0 only
* HOROVOD_BUILD_CUDA_CC_LIST
* fix test
Co-authored-by: Aki Nitta <nitta@akihironitta.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
* Update configs to match latest API
* Ensure we move the entire model to device before configure optimizer is called
* Add missing param
* Expose parameters
* Update references, drop local rank as it's now infered from the environment variable
* Fix ref
* Force install deepspeed 0.3.16
* Add guard for init
* Update pytorch_lightning/plugins/training_type/deepspeed.py
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
* Revert type checking
* Install master for CI for testing purposes
* Update CI
* Fix tests
* Add check
* Update versions
* Set precision
* Fix
* See if i can force upgrade
* Attempt to fix
* Drop
* Add changelog
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
* Try updating CI to latest fairscale
* Update availability of imports.py
* Remove some of the fairscale custom ci stuff
* Update grad scaler within the new process as reference is incorrect for spawn
* Remove fairscale from mocks
* Install fairscale 0.3.4 into the base container, remove from extra.txt
* Update docs/source/conf.py
* Fix import issues
* Mock fairscale for docs
* Fix DeepSpeed and FairScale to specific versions
* Swap back to greater than
* extras
* Revert "extras"
This reverts commit 7353479f
* ci
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: jirka <jirka.borovec@seznam.cz>
* ban TB 2.5
* note
* push
* Ban tb==2.5.0 and deepspeed==0.3.15
* Fix pip command
* pull
* up
* up
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
* Add single checkpoint capability
* Fix checkpointing in test, few cleanups
* Add comment
* Change restore logic
* Move vars around, add better explanation, make todo align with DeepSpeed team
* Fix checkpointing
* Remove deepspeed from extra, install in Dockerfile
* push
* pull
* Split to two tests to see if it fixes Deepspeed error
* Add comment
* Add context to call hook to handle all modules defined within the hook
* Expose some additional parameters
* Added docs, exposed parameters
* Make sure we only configure if necessary
* Setup activation checkpointing regardless, saves the user having to do it manually
* Add some tests that fail currently
* update
* update
* update
* add tests
* change docstring
* resolve accumulate_grad_batches
* resolve flake8
* Update DeepSpeed to use latest version, add some comments
* add metrics
* update
* Small formatting fixes, clean up some code
* Few cleanups
* No need for default state
* Fix tests, add some boilerplate that should move eventually
* Add hook removal
* Add a context manager to handle hook
* Small naming cleanup
* wip
* move save_checkpoint responsability to accelerator
* resolve flake8
* add BC
* Change recommended scale to 16
* resolve flake8
* update test
* update install
* update
* update test
* update
* update
* update test
* resolve flake8
* update
* update
* update on comments
* Push
* pull
* Update pytorch_lightning/plugins/training_type/deepspeed.py
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* Update pytorch_lightning/plugins/training_type/deepspeed.py
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* update
* Apply suggestions from code review
* Swap to using world size defined by plugin
* update
* update todo
* Remove deepspeed from extra, keep it in the base cuda docker install
* Push
* pull
* update
* update
* update
* update
* Minor changes
* duplicate
* format
* format2
Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>