* Add property to delay checkpointing; move checkpoint file loading into the run function to allow the DeepSpeed engine to be loaded
* Add a small test
* Apply suggestions from code review
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* Update pytorch_lightning/accelerators/accelerator.py
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* Address review
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Add test for poptorch Options
* Hacks to get manual plugin support
* Revert changes
* Fix tests and ensure logic follows suit
* Update pytorch_lightning/plugins/training_type/ipu.py
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Cleaner
* Cleaner
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fixes to ensure ipu options are respected
* Better setter
* Add test for poptorch Options
* Fix test
* fix ipu test
* Update pytorch_lightning/plugins/training_type/ipu.py
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
* Skip test due to 'Python bus error'
* Debug NCCL
* Remove NCCL_DEBUG statement
* Revert "Skip test due to 'Python bus error'"
This reverts commit e0a3e8785d.
* fix
* add test
* changelog
* yapf
* patch os environ
* make a special test
* destroy pg
* debug
* revert
* revert
* problematic test
* skip
* try the fixture
* test
* update sensitive test
* update changelog
* remove comment
* update wrong test
* update test name
* parameterization
* Revert "parameterization"
This reverts commit b0542f43f59c5ce66800883b5e2f0c66a97408cc.
* remove conftest
* ignore test
* teardown
* fix merge
* DeepSpeed parameterization
* uncomment test
* update chlog
* update changelog
* split tests
* update test
* update test comments
* unroll test
* unroll test
* unroll test
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* increase shm
* sudo
* unroll ipu
* Revert "sudo"
This reverts commit 6cc68c1478.
* Revert "increase shm"
This reverts commit 8c27163483.
* x
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* find guilty test
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* POPTORCH_WAIT_FOR_IPU=1
* move test
* redo parameterize for ipu
* de-comment test
* move chlog
* Update tests/accelerators/test_accelerator_connector.py
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
* Update tests/accelerators/test_accelerator_connector.py
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
* Add kubeflow cluster environment
* Add KubeflowEnvironment to docs
* Add KubeflowEnvironment to the changelog
* break up a long line
* Add method to detect kubeflow environment
* Select Kubeflow environment when available
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Run pre-commit
* task_idx == 0
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Update docs and error message to specify that half precision is not available on CPU
* update messages
Co-authored-by: Martin Kristiansen <martinkristiansen@sixgill.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: jirka <jirka.borovec@seznam.cz>
* Add base hook for model parallel
* fix callback signature
* Simplify hook
* Add hook logic
* add tests
* add property setter
* add logic for being called once
* Update changelog
* Fix
* fix return type
* fix lambda callback test
* Fix tests
* Apply code suggestions
* add logic for setup_optimizers_predispatch
* add common dummy model
* Swap call order
* Remove test that isn't needed anymore
* Update tests
* Add a bit more doc
* Few code review fixes
* Update pytorch_lightning/accelerators/accelerator.py
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
* Change hook name
* Fix test
* Test setup hook, refactor names
* Swap call order of callbacks and model initialization
* Change name of context manager
Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
* Move connection setup into the setup function. Call setup hook after we set up the accelerator
* Added CHANGELOG.md
* fix setup order in callback test
* fix input arguments in test
* Mock distributed function, remove protection to turn into training type hook
* Remove import
* Add missing mock, ensure custom plugin does not create child processes
* Skip test on windows
* Update deepspeed to init connection in setup
* Do not initialize distributed module
* Move DeepSpeed tests to special tests since dist communication is being set up
* Mark the test as special to see if this fixes CI
* Delete accelerator connector test to see if it's causing the build to fail
* Delete deepspeed test
* Revert "Delete accelerator connector test to see if it's causing the build to fail"
This reverts commit edde60b8
* Revert "Delete deepspeed test"
This reverts commit 9d317429
* Reverse hook
* Reverse setup hooks to debug again
* Add TODO so I know where I left off
* For single device move in pre_dispatch after setup function
* Add additional model to device hook if any additional parameters have been set
* See if we can enable deepspeed tests
* Revert "See if we can enable deepspeed tests"
This reverts commit b5450def
* See if this hook approach works
* Introduce new granular hooks
* Remove import, fix tpu spawn by moving the function to setup
* Added missing special test
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>