* Add kubeflow cluster environment
* Add KubeflowEnvironment to docs
* Add KubeflowEnvironment to the changelog
* break up a long line
* Add method to detect kubeflow environment
* Select Kubeflow environment when available (see the detection sketch below)
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Run pre-commit
* task_idx == 0
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: thomas chaton <thomas@grid.ai>
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
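The Kubeflow commits above add a cluster environment that is selected automatically when its variables are present. The following is a minimal detection sketch, assuming the PyTorchJob-operator-style rendezvous variables (MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK) plus the KUBERNETES_SERVICE_HOST variable that Kubernetes injects into every pod; it illustrates the idea rather than the exact check KubeflowEnvironment performs:

```python
import os


def detect_kubeflow() -> bool:
    """Heuristic: are we running inside a Kubeflow PyTorchJob pod?

    Kubernetes injects KUBERNETES_SERVICE_HOST into every pod, and the
    PyTorchJob operator exports the torch.distributed rendezvous variables.
    """
    required = ("KUBERNETES_SERVICE_HOST", "MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")
    return all(var in os.environ for var in required)


if __name__ == "__main__":
    print("Kubeflow environment detected:", detect_kubeflow())
```

When such a check returns True, the trainer can prefer the Kubeflow environment over the default one, which is what the "Select Kubeflow environment when available" commit does.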
* DeepSpeed: add train_micro_batch_size_per_gpu argument (see the config sketch below)
* Update naming and doc
* Modify to use auto naming convention, add test
* Add iterable tests
* Fix tests, attempt by mocking
* Import correct package
* Fix comparison
* Set as special test
* Remove import
* Add Changelog
Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: thomas chaton <thomas@grid.ai>
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
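The train_micro_batch_size_per_gpu commits above expose the per-GPU micro-batch size to the DeepSpeed config. As an illustration of the DeepSpeed-side key only (the commit subjects do not show the Lightning plumbing), here is a minimal config dict sketch:

```python
# Sketch of a DeepSpeed config dict; 32 stands in for the training
# dataloader's per-device batch size.
ds_config = {
    "train_micro_batch_size_per_gpu": 32,
    "gradient_accumulation_steps": 1,
    "zero_optimization": {"stage": 2},
    "fp16": {"enabled": True},
}
# DeepSpeed requires, across the world size, that
#   train_batch_size == train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size
# so providing the micro-batch size lets the remaining batch-size field be derived.
```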
* Try updating CI to latest fairscale
* Update availability checks in imports.py
* Remove some of the custom FairScale CI steps
* Update the grad scaler within the new process, as the reference is incorrect for spawn
* Remove fairscale from mocks
* Install fairscale 0.3.4 into the base container, remove from extra.txt
* Update docs/source/conf.py
* Fix import issues
* Mock fairscale for docs
* Pin DeepSpeed and FairScale to specific versions
* Swap back to greater than
* extras
* Revert "extras"
This reverts commit 7353479f
* ci
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: jirka <jirka.borovec@seznam.cz>
* Add single checkpoint capability (see the consolidation sketch below)
* Fix checkpointing in test, few cleanups
* Add comment
* Change restore logic
* Move vars around, add better explanation, make todo align with DeepSpeed team
* Fix checkpointing
* Remove deepspeed from extra, install in Dockerfile
* push
* pull
* Split into two tests to see if it fixes the DeepSpeed error
* Add comment
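The checkpointing commits above deal with saving and restoring a single checkpoint through the DeepSpeed integration. As a related illustration only, not necessarily what these commits implement, DeepSpeed ships a zero_to_fp32 helper that consolidates a sharded ZeRO checkpoint directory into one fp32 state dict; the path below is a hypothetical example:

```python
# Sketch assuming DeepSpeed's zero_to_fp32 utility and an existing ZeRO
# checkpoint directory (the path is hypothetical).
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

checkpoint_dir = "lightning_logs/version_0/checkpoints/last.ckpt"  # hypothetical path
state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir)

# The consolidated fp32 weights can then be loaded into an unsharded model:
# model.load_state_dict(state_dict)
```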
* Fix some test errors
* checkpoint consolidation
* Update ddp_spawn.py
* Update test_metric_result_integration.py
* Update test_results.py
* Update utils.py
* Update utils.py
* Update test_all_gather_grad.py
* Update test_all_gather_grad.py
* Update test_results.py
* Revert "Update test_results.py"
This reverts commit 9d4a2b891d.
* Revert "Merge pull request #1 from shuyingsunshine21/shuyingsunshine21-checkpoint_consolidate"
This reverts commit c5053da789, reversing
changes made to 0d23d75bc9.
* Revert "Update test_all_gather_grad.py"
This reverts commit 0d23d75bc9.
* Revert "Update utils.py"
This reverts commit 70fe5da9c6.
* Revert "Update utils.py"
This reverts commit a9aae99f6e.
* Revert "Update test_results.py"
This reverts commit ea74906878.
* Revert "Update test_metric_result_integration.py"
This reverts commit bf70e431b3.
* Revert "Update ddp_spawn.py"
This reverts commit f17210183b.
* Revert "checkpoint consolidation"
This reverts commit 536c1323b0.
* Revert "Revert "checkpoint consolidation""
This reverts commit 3a9fde915a.
* Revert "Revert "Revert "checkpoint consolidation"""
This reverts commit 7a369f47e1.
* Revert "Revert "Update ddp_spawn.py""
This reverts commit 8222dc98ea.
* Revert "Revert "Update test_metric_result_integration.py""
This reverts commit 6c095b2370.
* Revert "Revert "Update test_results.py""
This reverts commit 250d0aaaa2.
* Revert "Revert "Update utils.py""
This reverts commit 8651d54d79.
* Revert "Revert "Update test_all_gather_grad.py""
This reverts commit dcdcd29731.
* modify distributed environment to make test pass
* add DDP communication hook
* remove test related setting
* remove more test related setting
* fix ddp comm hook util import issue
* comments
* one more fix for test_custom_plugin
* fix ddp spawn
* fix sgd
* address comments and add tests
* 1. add a GPU check 2. modify the test a bit 3. formatting
* formatting nit
* fix the conda Python 3.7 / PyTorch 1.7 issue where the torch.distributed.algorithms module is unavailable
* need at least PyTorch 1.8.0
* minor fix
* modify changelog
* changelog should link to PR number instead of issue number
* refine the docs for the register_ddp_comm_hook function a bit, e.g. the ddp_comm_wrapper explanation, and add hyperparameters for the PowerSGD state to the example usage (see the sketch below)
* move the single-device check before the call to register_ddp_comm_hook
* formatting
* comments
* typo
* pre-commit formatting
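The commits above register a DDP communication hook, with PowerSGD as the example the documentation refinement mentions. Below is a minimal sketch using the raw PyTorch API rather than Lightning's register_ddp_comm_hook wrapper; it spins up a single-process gloo group purely so the snippet is self-contained, and it needs PyTorch >= 1.8 for torch.distributed.algorithms:

```python
import os

import torch
import torch.distributed as dist
from torch.distributed.algorithms.ddp_comm_hooks import powerSGD_hook
from torch.nn.parallel import DistributedDataParallel as DDP


def main() -> None:
    # Single-process process group purely for illustration.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)

    model = DDP(torch.nn.Linear(8, 8))

    # PowerSGD gradient compression; matrix_approximation_rank is the kind of
    # hyperparameter the "example usage" commit above refers to.
    state = powerSGD_hook.PowerSGDState(process_group=None, matrix_approximation_rank=1)
    model.register_comm_hook(state, powerSGD_hook.powerSGD_hook)

    # Alternative (a hook can only be registered once per DDP module):
    # from torch.distributed.algorithms.ddp_comm_hooks import default_hooks
    # model.register_comm_hook(None, default_hooks.fp16_compress_hook)

    # One dummy step so the hook actually fires during backward.
    model(torch.randn(4, 8)).sum().backward()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```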