Commit Graph

196 Commits

Kaushik B bf46730d92
Support TPU Pod Training (n/n) (#7296) 2021-05-17 11:33:44 +00:00
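For context, single TPU-host training in Lightning at this point looks roughly like the sketch below; pod-scale runs add multi-host scaling on top, launched externally (e.g. via `xla_dist`), which the snippet does not cover. `model` stands in for any `LightningModule`.

```python
# Hedged sketch of single TPU-host training (requires a TPU + torch_xla);
# TPU Pod support scales this across hosts via an external launcher.
import pytorch_lightning as pl

trainer = pl.Trainer(tpu_cores=8)  # 8 cores = one full TPU host
# trainer.fit(model)  # `model` is any LightningModule (assumed defined)
```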
Nic Eggert f4f51e0dcf
Add kubeflow cluster environment (#7300)
* Add kubeflow cluster environment

* Add KubeflowEnvironment to docs

* Add KubeflowEnvironment to the changelog

* break up a long line

* Add method to detect kubeflow environment

* Select Kubeflow environment when available

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Run pre-commit

* task_idx == 0

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-05-17 09:05:24 +01:00
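The new environment can also be selected explicitly rather than auto-detected; a minimal sketch, with illustrative values:

```python
import pytorch_lightning as pl
from pytorch_lightning.plugins.environments import KubeflowEnvironment

# Hedged sketch: explicitly select the Kubeflow environment; per the PR,
# Lightning also auto-detects it when running inside a Kubeflow pod.
trainer = pl.Trainer(
    accelerator="ddp",
    gpus=1,
    num_nodes=2,  # illustrative values
    plugins=[KubeflowEnvironment()],
)
```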
shuyingsunshine21 8538c1f61e
Accelerator model state dict (#7474)
* Fix some test errors

* checkpoint consolidation

* Update ddp_spawn.py

* Update test_metric_result_integration.py

* Update test_results.py

* Update utils.py

* Update utils.py

* Update test_all_gather_grad.py

* Update test_all_gather_grad.py

* Update test_results.py

* Revert "Update test_results.py"

This reverts commit 9d4a2b891d.

* Revert "Merge pull request #1 from shuyingsunshine21/shuyingsunshine21-checkpoint_consolidate"

This reverts commit c5053da789, reversing
changes made to 0d23d75bc9.

* Revert "Update test_all_gather_grad.py"

This reverts commit 0d23d75bc9.

* Revert "Update utils.py"

This reverts commit 70fe5da9c6.

* Revert "Update utils.py"

This reverts commit a9aae99f6e.

* Revert "Update test_results.py"

This reverts commit ea74906878.

* Revert "Update test_metric_result_integration.py"

This reverts commit bf70e431b3.

* Revert "Update ddp_spawn.py"

This reverts commit f17210183b.

* Revert "checkpoint consolidation"

This reverts commit 536c1323b0.

* Revert "Revert "checkpoint consolidation""

This reverts commit 3a9fde915a.

* Revert "Revert "Revert "checkpoint consolidation"""

This reverts commit 7a369f47e1.

* Revert "Revert "Update ddp_spawn.py""

This reverts commit 8222dc98ea.

* Revert "Revert "Update test_metric_result_integration.py""

This reverts commit 6c095b2370.

* Revert "Revert "Update test_results.py""

This reverts commit 250d0aaaa2.

* Revert "Revert "Update utils.py""

This reverts commit 8651d54d79.

* Revert "Revert "Update test_all_gather_grad.py""

This reverts commit dcdcd29731.

* modify distributed environment to make test pass

* modify model state dict to training type plugin

* remove changes

* add changelog

* fixing isort for pre-commit failure

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Address code review

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: SeanNaren <sean@grid.ai>
2021-05-11 16:39:04 +01:00
shuyingsunshine21 987530cd38
Set `num_nodes` and `sync_batchnorm` From Trainer for Manually Passed Training Type Plugin (#7026)
Co-authored-by: thomas chaton <thomas@grid.ai>
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-05-08 11:25:51 +00:00
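In practice this means a manually passed training type plugin now inherits these Trainer arguments instead of silently using plugin defaults; a hedged sketch with illustrative values:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.plugins import DDPPlugin

# Hedged sketch: num_nodes and sync_batchnorm set on the Trainer now
# propagate into the manually constructed DDPPlugin.
trainer = Trainer(
    gpus=2,
    num_nodes=2,
    sync_batchnorm=True,
    plugins=[DDPPlugin(find_unused_parameters=False)],
)
```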
Carlos Mocholí 8208c330eb
Use `torch.nn.utils.clip_grad_norm_` and add `clip_grad_by_value` support for TPU (#7025)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
2021-05-07 16:41:39 +00:00
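For reference, these are the two underlying PyTorch utilities the commit standardizes on; a hedged sketch where the `Linear` model is a stand-in:

```python
import torch

model = torch.nn.Linear(4, 4)  # stand-in for any model with gradients
loss = model(torch.randn(2, 4)).sum()
loss.backward()

torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)     # clip by norm
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)  # clip by value
```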
Leonard Lausen 98b94b810c
Fix DeepSpeedPlugin with IterableDataset (#7362)
* deepspeed add train_micro_batch_size_per_gpu argument

* Update naming and doc

* Modify to use auto naming convention, add test

* Add iterable tests

* Fix tests, attempt by mocking

* Import correct package

* Fix comparison

* Set as special test

* Remove import

* Add Changelog

Co-authored-by: SeanNaren <sean@grid.ai>
2021-05-07 10:46:03 +01:00
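Since an `IterableDataset` has no length to infer a batch size from, the micro-batch size is handed to DeepSpeed explicitly; a hedged sketch using DeepSpeed's standard config key (exact plugin argument naming per the PR):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.plugins import DeepSpeedPlugin

# Hedged sketch: pass the micro-batch size through DeepSpeed's own config
# rather than inferring it from the dataloader (illustrative values).
ds_config = {"train_micro_batch_size_per_gpu": 8}
trainer = Trainer(gpus=1, precision=16, plugins=DeepSpeedPlugin(config=ds_config))
```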
Kaushik B e21b7a62d7
Add ddp_find_unused_parameters_false to Registry (#7224) 2021-05-04 22:40:00 +00:00
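The registered name gives a one-string shorthand for DDP without unused-parameter detection; a hedged sketch:

```python
from pytorch_lightning import Trainer

# Hedged sketch: the registry resolves this string to a DDPPlugin configured
# with find_unused_parameters=False, skipping the per-step parameter scan.
trainer = Trainer(gpus=2, plugins="ddp_find_unused_parameters_false")
```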
Carlos Mocholí 8c0ea92af2
`TrainerState` refactor [5/5] (#7173)
* `TrainerState` refactor

* flake8

* Update finished check

* Test cleanup

* Fix tests

* Fixes

* Reorder

* flake8

* Update CHANGELOG

* Better docs

* Better docs

* Remove default

* Update tests

* Bad merge
2021-05-04 12:50:56 +02:00
Kaushik B 6d7c6d6403
Update Accelerator Connector for Registry (#7214) 2021-05-03 21:03:21 +00:00
Kaushik B 490cc57809
Device updates for TPU Pod (#7243) 2021-04-30 23:14:06 +05:30
thomas chaton 16d6c9828d
[bugfix] Apex never instantiated. (#7274)
* update

* update

* update apex

* update

* update

* update

* remove test.py

* update

* update

* update on comments

* update changelog

* update

* update

* typo
2021-04-30 13:16:28 -04:00
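The configuration this fix exercises, roughly (requires NVIDIA Apex to be installed; a hedged sketch):

```python
from pytorch_lightning import Trainer

# Hedged sketch: with amp_backend="apex" the ApexMixedPrecisionPlugin must
# actually be instantiated for the chosen optimization level to take effect.
trainer = Trainer(gpus=1, precision=16, amp_backend="apex", amp_level="O2")
```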
Adrian Wälchli ea2287e723
update training type plugin docs regarding result caching (#7261)
* add docs

* typo

* update
2021-04-30 13:03:10 +00:00
Carlos Mocholí bdc4272e99
`_launch` refactor and types [1/n] (#7232) 2021-04-28 17:41:08 +02:00
Kaushik B 94fcaaf5d7
Add `debug` flag to TPU Training Plugins (PT_XLA_DEBUG) (#7219) 2021-04-27 20:34:25 +00:00
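A hedged sketch of the new flag, which surfaces torch_xla's PT_XLA_DEBUG diagnostics:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.plugins import TPUSpawnPlugin

# Hedged sketch: debug=True sets PT_XLA_DEBUG so torch_xla reports
# compilation and execution metrics that often explain slow TPU runs.
trainer = Trainer(tpu_cores=8, plugins=TPUSpawnPlugin(debug=True))
```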
Kaushik B c6d9f52cb3
Add a check for TPU Spawn barrier (#7241) 2021-04-27 19:45:55 +00:00
ananthsub bab7225507
[fix] Add barriers before and after setup hook is run (#7202)
* Update data_connector.py

* move-barrier

* Update trainer.py

* Update ddp.py

* changelog

* Spacing

Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-04-27 17:19:43 +01:00
Carlos Mocholí ca6c87ffbe
Add back `clip_gradients(model)` (#7231) 2021-04-27 11:34:02 +00:00
Adrian Wälchli 3b36d81c03
Fixed `num_sanity_val_steps` affecting reproducibility of training data shuffling (#7014)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: thomas chaton <thomas@grid.ai>
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
2021-04-27 09:51:39 +00:00
Kaushik B 5cf9afa176
Add fairscale install msg for Sharded Plugins (#7213) 2021-04-27 08:22:44 +00:00
shuyingsunshine21 52a5cee0a7
Set smarter default for DDP sharded for performance optimization (#6937)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-04-27 04:01:34 +05:30
Sean Naren 8439aead66
Update FairScale on CI (#7017)
* Try updating CI to latest fairscale

* Update availability of imports.py

* Remove some of the fairscale custom ci stuff

* Update grad scaler within the new process as reference is incorrect for spawn

* Remove fairscale from mocks

* Install fairscale 0.3.4 into the base container, remove from extra.txt

* Update docs/source/conf.py

* Fix import issues

* Mock fairscale for docs

* Fix DeepSpeed and FairScale to specific versions

* Swap back to greater than

* extras

* Revert "extras"

This reverts commit 7353479f

* ci

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: jirka <jirka.borovec@seznam.cz>
2021-04-23 12:37:00 +01:00
ananthsub 3f1a08ab00
Fix mypy checks for double precision plugin (#7151) 2021-04-22 11:29:38 +01:00
Sean Naren ce14565ed9
[FSDP] Move on save checkpoint outside of zero check (#7134)
* Move on save checkpoint outside of zero check

* Remove unnecessary override

Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-04-22 01:54:47 +02:00
thomas chaton 013756404b
[bugfix] Add set_default_tensor_type to torch.DoubleTensor with precision=64 (#7108)
* update

* Update pytorch_lightning/plugins/precision/double.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Update pytorch_lightning/plugins/precision/double.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Update pytorch_lightning/plugins/precision/double.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* resolve tests

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-04-20 15:25:37 +00:00
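The user-facing effect, sketched (hedged):

```python
from pytorch_lightning import Trainer

# Hedged sketch: precision=64 now sets the default tensor type to
# torch.DoubleTensor, so tensors created inside the LightningModule during
# training default to float64 and match the double-precision parameters.
trainer = Trainer(precision=64)
```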
Kaushik B f168a535ca
Add MpModelWrapper in TPU Spawn (#7045)
Co-authored-by: tchaton <thomas@grid.ai>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-04-20 13:05:27 +00:00
Adrian Wälchli 6b15ca95f0
fix logger experiment version in multiple run DDP (#7077)
* fix

* changelog
2021-04-19 17:12:05 +00:00
Carlos Mocholí 898ec8a94a
Create pytorch_lightning/utilities/types.py (#7048) 2021-04-19 14:43:16 +02:00
Kaushik B 30b7440e12
TPU Spawn Rank & root device Error (#7074)
* TPU Spawn Rank Error

* Update tpu spawn

* Fix root device property for tpu spawn

* Update changelog
2021-04-18 23:42:48 +02:00
Kaushik B 97be843226
Better approach to register plugins (#7063)
* Better approach to register plugins

* Add ddp_with_find_unused_parameters_false

* Remove unnecessary break

* Revert back the ddp commit

* Update register override logic

* Update register override logic

* fix mypy
2021-04-18 11:23:12 +02:00
ananthsub 8bcd169767
[fix] Fix multi-node DDP launch by using local rank instead of global rank for main process (#7061)
* Update ddp.py

* Update CHANGELOG.md
2021-04-16 21:18:54 +01:00
Kaushik B 6a7b4cf5d3
Fix mypy for plugins registry (#7062) 2021-04-17 01:33:41 +05:30
Kaushik B 832a03af7c
Add Training Type Plugins Registry (#6982)
Co-authored-by: Sean Naren <sean@grid.ai>
Co-authored-by: thomas chaton <thomas@grid.ai>
2021-04-16 18:01:56 +05:30
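A hedged sketch of registering a custom alias, assuming the registry is exported as `TrainingTypePluginsRegistry` and that `register` takes a name, a plugin class, and default init parameters (exact signature per the PR):

```python
from pytorch_lightning.plugins import DDPPlugin, TrainingTypePluginsRegistry

# Hedged sketch: register an alias that instantiates DDPPlugin with
# find_unused_parameters=False whenever the name is passed to the Trainer.
TrainingTypePluginsRegistry.register(
    "ddp_no_unused",  # hypothetical alias for illustration
    DDPPlugin,
    description="DDP without unused-parameter detection",
    find_unused_parameters=False,
)
```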
Carlos Mocholí f29ecbfd90
Typing for accelerators and plugins (#7022) 2021-04-15 16:48:16 +00:00
ananthsub f6f81f0430
[fix] Add a cluster environment teardown to clean up environment state (#6942)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-04-15 16:06:54 +00:00
Ethan Harris f645df5e9a
Add typings for evaluation_loop.py and remove some dead code (#7015) 2021-04-15 07:36:04 +00:00
Adrian Wälchli d3f73a0a74
Plugin Docs (#6952)
Co-authored-by: edenlightning <66261195+edenlightning@users.noreply.github.com>
Co-authored-by: William Falcon <waf2107@columbia.edu>
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
2021-04-14 20:53:21 +00:00
Adrian Wälchli 33cc9fe138
Clean up environment access in plugins (#6941)
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
2021-04-13 20:07:40 +02:00
Peng Zhang 89074fa2ad
Fix Multi-GPU join for horovod (#6954)
* fixjoin

* fix join on cpu

* fix typo

* try to undo horovod skip

* undo

* Try removing skip

* Update CHANGELOG

* add back skip for test_horovod_multi_optimizer

* Add back skip

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-04-13 17:44:41 +01:00
Hinrich B. Winther b37b58a73e
Fix Checkpoint issue when using Horovod distributed backend (PyTorchLightning#6947) (#6958)
Co-Authored-By: Adrian Wälchli <aedu.waelchli@gmail.com>

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-04-13 09:18:52 +00:00
Kaushik B 1b3e4f9fb9
Fix sync_dist for tpus (#6950) 2021-04-13 14:17:15 +05:30
Sean Naren b46cc557ef
[Feat] DeepSpeed single file saving (#6900)
* Add single checkpoint capability

* Fix checkpointing in test, few cleanups

* Add comment

* Change restore logic

* Move vars around, add better explanation, make todo align with DeepSpeed team

* Fix checkpointing

* Remove deepspeed from extra, install in Dockerfile

* push

* pull

* Split to two tests to see if it fixes Deepspeed error

* Add comment
2021-04-12 22:44:09 +00:00
Adrian Wälchli fe0d08899e
Fix ShardedDataParallel has no attribute require_backward_grad_sync (#6915)
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
2021-04-10 16:14:37 +00:00
Kaushik B 55525031c6
Fix TPU Spawn gather (#6896) 2021-04-09 18:30:59 +05:30
Ethan Harris 1c2ecbf70c
TPUSpawn + IterableDataset error message (#6875)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-04-08 19:57:48 +05:30
shuyingsunshine21 313e81638d
Supporting Adding DDP Communication Hooks (#6736)
* Fix some test errors

* checkpoint consolidation

* Update ddp_spawn.py

* Update test_metric_result_integration.py

* Update test_results.py

* Update utils.py

* Update utils.py

* Update test_all_gather_grad.py

* Update test_all_gather_grad.py

* Update test_results.py

* Revert "Update test_results.py"

This reverts commit 9d4a2b891d.

* Revert "Merge pull request #1 from shuyingsunshine21/shuyingsunshine21-checkpoint_consolidate"

This reverts commit c5053da789, reversing
changes made to 0d23d75bc9.

* Revert "Update test_all_gather_grad.py"

This reverts commit 0d23d75bc9.

* Revert "Update utils.py"

This reverts commit 70fe5da9c6.

* Revert "Update utils.py"

This reverts commit a9aae99f6e.

* Revert "Update test_results.py"

This reverts commit ea74906878.

* Revert "Update test_metric_result_integration.py"

This reverts commit bf70e431b3.

* Revert "Update ddp_spawn.py"

This reverts commit f17210183b.

* Revert "checkpoint consolidation"

This reverts commit 536c1323b0.

* Revert "Revert "checkpoint consolidation""

This reverts commit 3a9fde915a.

* Revert "Revert "Revert "checkpoint consolidation"""

This reverts commit 7a369f47e1.

* Revert "Revert "Update ddp_spawn.py""

This reverts commit 8222dc98ea.

* Revert "Revert "Update test_metric_result_integration.py""

This reverts commit 6c095b2370.

* Revert "Revert "Update test_results.py""

This reverts commit 250d0aaaa2.

* Revert "Revert "Update utils.py""

This reverts commit 8651d54d79.

* Revert "Revert "Update test_all_gather_grad.py""

This reverts commit dcdcd29731.

* modify distributed environment to make test pass

* add DDP communication hook

* remove test related setting

* remove more test related setting

* fix ddp comm hook util import issue

* comments

* one more fix for test_custom_plugin

* fix ddp spawn

* fix sgd

* address comments and add tests

* 1. add a GPU check 2. modify test a bit 3. formatting

* formatting nit

* fix conda 3.7 1.7 issue for no torch.distributed.algorithms module

* need at least 1.8.0

* minor fix

* modify changelog

* changelog should link to PR number instead of issue number

* refine the docs for the register_ddp_comm_hook function (e.g. the ddp_comm_wrapper explanation) and add a hyperparameter for PowerSGD states in the example usage

* move single device checking before call register_ddp_comm_hook

* formatting

* comments

* typo

* pre-commit formatting
2021-04-07 12:35:57 +01:00
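A hedged sketch of the resulting API, compressing gradients to fp16 during all-reduce (per the commit this requires torch >= 1.8 and GPU-based DDP):

```python
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks as default
from pytorch_lightning import Trainer
from pytorch_lightning.plugins import DDPPlugin

# Hedged sketch: attach a built-in communication hook so gradients are
# cast to fp16 for the all-reduce, halving communication volume.
trainer = Trainer(
    gpus=2,
    accelerator="ddp",
    plugins=[DDPPlugin(ddp_comm_hook=default.fp16_compress_hook)],
)
```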
Anthony Kim 7f6154fcad
Add `Trainer(gradient_clip_algorithm='value'|'norm')` (#6123)
* add changelog

* add clip by value

* fix bug in training_tricks.rst

* fix bug in trainer.rst

* Update trainer.rst

* Update trainer.rst

* Update CHANGELOG.md

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/plugins/precision/deepspeed_precision.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/utilities/enums.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* yapf formatting

* update training tricks

* update based on comment

* update based on comment

* Update pytorch_lightning/trainer/trainer.py

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>

* update based on comment

* pep8

* mypy

* mypy

* Update docs/source/advanced/training_tricks.rst

Co-authored-by: thomas chaton <thomas@grid.ai>

* Update sharded_native_amp.py

* Update test_sharded_parity.py

* update test codes

* Update test_tpu.py

* Update pytorch_lightning/trainer/connectors/training_trick_connector.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Update test_trainer.py

* Update enums.py

* Update enums.py

* add super-class initialization to precision plugins.

* add clip_grad horovod cpu test

* add clip_grad horovod cpu test

* use subprocess check_call

* change order of horovod tests

* set max_epochs 2 in horovod test

* remove clip_grad_val test from horovod-cpu

* remove "type: ignore"

* divide clip grad val test in horovod

* update based on comments

* add super-class initialization to precision plugins.

* bugfix

* bugfix

* revert some changes

* revert some changes

* Update tests/models/test_horovod.py

* merge master

* Delete signature test

No point in testing a signature

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: thomas chaton <thomas@grid.ai>
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
2021-04-06 08:27:37 -05:00
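The new argument in use (a hedged sketch; `"norm"` remains the default, preserving prior behaviour):

```python
from pytorch_lightning import Trainer

# Clip each gradient element into [-0.5, 0.5] instead of rescaling by norm.
trainer = Trainer(gradient_clip_val=0.5, gradient_clip_algorithm="value")
```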
Kaushik B cf8e828559
[Fix] TPU Training Type Plugin (#6816) 2021-04-06 15:02:44 +05:30
Eugene Khvedchenya eafec7d425
Fix DDP + SyncBN (#6838)
* Fix DDP + SyncBN

Ensure that model is already on correct GPU before applying SyncBN conversion

* Fix order of SyncBN for ddp_spawn
2021-04-06 08:40:29 +01:00
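The configuration the fix targets, sketched (hedged):

```python
from pytorch_lightning import Trainer

# Hedged sketch: SyncBatchNorm conversion must happen only after the model
# has been moved to its GPU, which this fix enforces in the DDP plugins.
trainer = Trainer(gpus=2, accelerator="ddp", sync_batchnorm=True)
```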
ananthsub bb9ace4333
[typing] Add typehint for broadcast in training type plugin (#6777)
* Update training_type_plugin.py

* Update accelerator.py

* Update pytorch_lightning/plugins/training_type/training_type_plugin.py

Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>

Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
2021-04-02 20:55:34 +02:00
thomas chaton 3e3175d074
resolve bug (#6781) 2021-04-01 11:43:23 +01:00