Commit Graph

196 Commits

Kaushik B bf46730d92
Support TPU Pod Training (n/n) (#7296) 2021-05-17 11:33:44 +00:00
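For context, single TPU-host training in Lightning at this point looks roughly like the sketch below; pod-scale runs add multi-host scaling on top, launched externally (e.g. via `xla_dist`), which the snippet does not cover. `model` stands in for any `LightningModule`.

```python
# Hedged sketch of single TPU-host training (requires a TPU + torch_xla);
# TPU Pod support scales this across hosts via an external launcher.
import pytorch_lightning as pl

trainer = pl.Trainer(tpu_cores=8)  # 8 cores = one full TPU host
# trainer.fit(model)  # `model` is any LightningModule (assumed defined)
```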
Nic Eggert f4f51e0dcf
Add kubeflow cluster environment (#7300)
* Add kubeflow cluster environment

* Add KubeflowEnvironment to docs

* Add KubeflowEnvironment to the changelog

* break up a long line

* Add method to detect kubeflow environment

* Select Kubeflow environment when available

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Run pre-commit

* task_idx == 0

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-05-17 09:05:24 +01:00
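The new environment can also be selected explicitly rather than auto-detected; a minimal sketch, with illustrative values:

```python
import pytorch_lightning as pl
from pytorch_lightning.plugins.environments import KubeflowEnvironment

# Hedged sketch: explicitly select the Kubeflow environment; per the PR,
# Lightning also auto-detects it when running inside a Kubeflow pod.
trainer = pl.Trainer(
    accelerator="ddp",
    gpus=1,
    num_nodes=2,  # illustrative values
    plugins=[KubeflowEnvironment()],
)
```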
shuyingsunshine21 8538c1f61e
Accelerator model state dict (#7474)
* Fix some test errors

* checkpoint consolidation

* Update ddp_spawn.py

* Update test_metric_result_integration.py

* Update test_results.py

* Update utils.py

* Update utils.py

* Update test_all_gather_grad.py

* Update test_all_gather_grad.py

* Update test_results.py

* Revert "Update test_results.py"

This reverts commit 9d4a2b891d.

* Revert "Merge pull request #1 from shuyingsunshine21/shuyingsunshine21-checkpoint_consolidate"

This reverts commit c5053da789, reversing
changes made to 0d23d75bc9.

* Revert "Update test_all_gather_grad.py"

This reverts commit 0d23d75bc9.

* Revert "Update utils.py"

This reverts commit 70fe5da9c6.

* Revert "Update utils.py"

This reverts commit a9aae99f6e.

* Revert "Update test_results.py"

This reverts commit ea74906878.

* Revert "Update test_metric_result_integration.py"

This reverts commit bf70e431b3.

* Revert "Update ddp_spawn.py"

This reverts commit f17210183b.

* Revert "checkpoint consolidation"

This reverts commit 536c1323b0.

* Revert "Revert "checkpoint consolidation""

This reverts commit 3a9fde915a.

* Revert "Revert "Revert "checkpoint consolidation"""

This reverts commit 7a369f47e1.

* Revert "Revert "Update ddp_spawn.py""

This reverts commit 8222dc98ea.

* Revert "Revert "Update test_metric_result_integration.py""

This reverts commit 6c095b2370.

* Revert "Revert "Update test_results.py""

This reverts commit 250d0aaaa2.

* Revert "Revert "Update utils.py""

This reverts commit 8651d54d79.

* Revert "Revert "Update test_all_gather_grad.py""

This reverts commit dcdcd29731.

* modify distributed environment to make test pass

* modify model state dict to training type plugin

* remove changes

* add changelog

* fixing isort for pre-commit failure

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Address code review

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: SeanNaren <sean@grid.ai>
2021-05-11 16:39:04 +01:00
shuyingsunshine21 987530cd38
Set `num_nodes` and `sync_batchnorm` From Trainer for Manually Passed Training Type Plugin (#7026)
Co-authored-by: thomas chaton <thomas@grid.ai>
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-05-08 11:25:51 +00:00
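In practice this means a manually passed training type plugin now inherits these Trainer arguments instead of silently using plugin defaults; a hedged sketch with illustrative values:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.plugins import DDPPlugin

# Hedged sketch: num_nodes and sync_batchnorm set on the Trainer now
# propagate into the manually constructed DDPPlugin.
trainer = Trainer(
    gpus=2,
    num_nodes=2,
    sync_batchnorm=True,
    plugins=[DDPPlugin(find_unused_parameters=False)],
)
```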
Carlos Mocholí 8208c330eb
Use `torch.nn.utils.clip_grad_norm_` and add `clip_grad_by_value` support for TPU (#7025)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
2021-05-07 16:41:39 +00:00
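For reference, these are the two underlying PyTorch utilities the commit standardizes on; a hedged sketch where the `Linear` model is a stand-in:

```python
import torch

model = torch.nn.Linear(4, 4)  # stand-in for any model with gradients
loss = model(torch.randn(2, 4)).sum()
loss.backward()

torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)     # clip by norm
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)  # clip by value
```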
Leonard Lausen 98b94b810c
Fix DeepSpeedPlugin with IterableDataset (#7362)
* deepspeed add train_micro_batch_size_per_gpu argument

* Update naming and doc

* Modify to use auto naming convention, add test

* Add iterable tests

* Fix tests, attempt by mocking

* Import correct package

* Fix comparison

* Set as special test

* Remove import

* Add Changelog

Co-authored-by: SeanNaren <sean@grid.ai>
2021-05-07 10:46:03 +01:00
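Since an `IterableDataset` has no length to infer a batch size from, the micro-batch size is handed to DeepSpeed explicitly; a hedged sketch using DeepSpeed's standard config key (exact plugin argument naming per the PR):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.plugins import DeepSpeedPlugin

# Hedged sketch: pass the micro-batch size through DeepSpeed's own config
# rather than inferring it from the dataloader (illustrative values).
ds_config = {"train_micro_batch_size_per_gpu": 8}
trainer = Trainer(gpus=1, precision=16, plugins=DeepSpeedPlugin(config=ds_config))
```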
Kaushik B e21b7a62d7
Add ddp_find_unused_parameters_false to Registry (#7224) 2021-05-04 22:40:00 +00:00
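The registered name gives a one-string shorthand for DDP without unused-parameter detection; a hedged sketch:

```python
from pytorch_lightning import Trainer

# Hedged sketch: the registry resolves this string to a DDPPlugin configured
# with find_unused_parameters=False, skipping the per-step parameter scan.
trainer = Trainer(gpus=2, plugins="ddp_find_unused_parameters_false")
```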
Carlos Mocholí 8c0ea92af2
`TrainerState` refactor [5/5] (#7173)
* `TrainerState` refactor

* flake8

* Update finished check

* Test cleanup

* Fix tests

* Fixes

* Reorder

* flake8

* Update CHANGELOG

* Better docs

* Better docs

* Remove default

* Update tests

* Bad merge
2021-05-04 12:50:56 +02:00
Kaushik B 6d7c6d6403
Update Accelerator Connector for Registry (#7214) 2021-05-03 21:03:21 +00:00
Kaushik B 490cc57809
Device updates for TPU Pod (#7243) 2021-04-30 23:14:06 +05:30
thomas chaton 16d6c9828d
[bugfix] Apex never instantiated. (#7274)
* update

* update

* update apex

* update

* update

* update

* remove test.py

* update

* update

* update on comments

* update changelog

* update

* update

* typo
2021-04-30 13:16:28 -04:00
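The configuration this fix exercises, roughly (requires NVIDIA Apex to be installed; a hedged sketch):

```python
from pytorch_lightning import Trainer

# Hedged sketch: with amp_backend="apex" the ApexMixedPrecisionPlugin must
# actually be instantiated for the chosen optimization level to take effect.
trainer = Trainer(gpus=1, precision=16, amp_backend="apex", amp_level="O2")
```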
Adrian Wälchli ea2287e723
update training type plugin docs regarding result caching (#7261)
* add docs

* typo

* update
2021-04-30 13:03:10 +00:00
Carlos Mocholí bdc4272e99
`_launch` refactor and types [1/n] (#7232) 2021-04-28 17:41:08 +02:00
Kaushik B 94fcaaf5d7
Add `debug` flag to TPU Training Plugins (PT_XLA_DEBUG) (#7219) 2021-04-27 20:34:25 +00:00
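A hedged sketch of the new flag, which surfaces torch_xla's PT_XLA_DEBUG diagnostics:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.plugins import TPUSpawnPlugin

# Hedged sketch: debug=True sets PT_XLA_DEBUG so torch_xla reports
# compilation and execution metrics that often explain slow TPU runs.
trainer = Trainer(tpu_cores=8, plugins=TPUSpawnPlugin(debug=True))
```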
Kaushik B c6d9f52cb3
Add a check for TPU Spawn barrier (#7241) 2021-04-27 19:45:55 +00:00
ananthsub bab7225507
[fix] Add barriers before and after setup hook is run (#7202)
* Update data_connector.py

* move-barrier

* Update trainer.py

* Update ddp.py

* changelog

* Spacing

Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-04-27 17:19:43 +01:00
Carlos Mocholí ca6c87ffbe
Add back `clip_gradients(model)` (#7231) 2021-04-27 11:34:02 +00:00
Adrian Wälchli 3b36d81c03
Fixed `num_sanity_val_steps` affecting reproducibility of training data shuffling (#7014)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: thomas chaton <thomas@grid.ai>
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
2021-04-27 09:51:39 +00:00
Kaushik B 5cf9afa176
Add fairscale install msg for Sharded Plugins (#7213) 2021-04-27 08:22:44 +00:00
shuyingsunshine21 52a5cee0a7
Set smarter default for DDP sharded for performance optimization (#6937)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-04-27 04:01:34 +05:30
Sean Naren 8439aead66
Update FairScale on CI (#7017)
* Try updating CI to latest fairscale

* Update availability of imports.py

* Remove some of the fairscale custom ci stuff

* Update grad scaler within the new process as reference is incorrect for spawn

* Remove fairscale from mocks

* Install fairscale 0.3.4 into the base container, remove from extra.txt

* Update docs/source/conf.py

* Fix import issues

* Mock fairscale for docs

* Fix DeepSpeed and FairScale to specific versions

* Swap back to greater than

* extras

* Revert "extras"

This reverts commit 7353479f

* ci

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: jirka <jirka.borovec@seznam.cz>
2021-04-23 12:37:00 +01:00
ananthsub 3f1a08ab00
Fix mypy checks for double precision plugin (#7151) 2021-04-22 11:29:38 +01:00
Sean Naren ce14565ed9
[FSDP] Move on save checkpoint outside of zero check (#7134)
* Move on save checkpoint outside of zero check

* Remove unnecessary override

Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-04-22 01:54:47 +02:00
thomas chaton 013756404b
[bugfix] Add set_default_tensor_type to torch.DoubleTensor with precision=64 (#7108)
* update

* Update pytorch_lightning/plugins/precision/double.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Update pytorch_lightning/plugins/precision/double.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Update pytorch_lightning/plugins/precision/double.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* resolve tests

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-04-20 15:25:37 +00:00
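The user-facing effect, sketched (hedged):

```python
from pytorch_lightning import Trainer

# Hedged sketch: precision=64 now sets the default tensor type to
# torch.DoubleTensor, so tensors created inside the LightningModule during
# training default to float64 and match the double-precision parameters.
trainer = Trainer(precision=64)
```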
Kaushik B f168a535ca
Add MpModelWrapper in TPU Spawn (#7045)
Co-authored-by: tchaton <thomas@grid.ai>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-04-20 13:05:27 +00:00
Adrian Wälchli 6b15ca95f0
fix logger experiment version in multiple run DDP (#7077)
* fix

* changelog
2021-04-19 17:12:05 +00:00
Carlos Mocholí 898ec8a94a
Create pytorch_lightning/utilities/types.py (#7048) 2021-04-19 14:43:16 +02:00
Kaushik B 30b7440e12
TPU Spawn Rank & root device Error (#7074)
* TPU Spawn Rank Error

* Update tpu spawn

* Fix root device property for tpu spawn

* Update changelog
2021-04-18 23:42:48 +02:00
Kaushik B 97be843226
Better approach to register plugins (#7063)
* Better approach to register plugins

* Add ddp_with_find_unused_parameters_false

* Remove unnecessary break

* Revert back the ddp commit

* Update register override logic

* Update register override logic

* fix mypy
2021-04-18 11:23:12 +02:00
ananthsub 8bcd169767
[fix] Fix multi-node DDP launch by using local rank instead of global rank for main process (#7061)
* Update ddp.py

* Update CHANGELOG.md
2021-04-16 21:18:54 +01:00
Kaushik B 6a7b4cf5d3
Fix mypy for plugins registry (#7062) 2021-04-17 01:33:41 +05:30
Kaushik B 832a03af7c
Add Training Type Plugins Registry (#6982)
Co-authored-by: Sean Naren <sean@grid.ai>
Co-authored-by: thomas chaton <thomas@grid.ai>
2021-04-16 18:01:56 +05:30
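A hedged sketch of registering a custom alias, assuming the registry is exported as `TrainingTypePluginsRegistry` and that `register` takes a name, a plugin class, and default init parameters (exact signature per the PR):

```python
from pytorch_lightning.plugins import DDPPlugin, TrainingTypePluginsRegistry

# Hedged sketch: register an alias that instantiates DDPPlugin with
# find_unused_parameters=False whenever the name is passed to the Trainer.
TrainingTypePluginsRegistry.register(
    "ddp_no_unused",  # hypothetical alias for illustration
    DDPPlugin,
    description="DDP without unused-parameter detection",
    find_unused_parameters=False,
)
```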
Carlos Mocholí f29ecbfd90
Typing for accelerators and plugins (#7022) 2021-04-15 16:48:16 +00:00
ananthsub f6f81f0430
[fix] Add a cluster environment teardown to clean up environment state (#6942)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-04-15 16:06:54 +00:00
Ethan Harris f645df5e9a
Add typings for evaluation_loop.py and remove some dead code (#7015) 2021-04-15 07:36:04 +00:00
Adrian Wälchli d3f73a0a74
Plugin Docs (#6952)
Co-authored-by: edenlightning <66261195+edenlightning@users.noreply.github.com>
Co-authored-by: William Falcon <waf2107@columbia.edu>
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
2021-04-14 20:53:21 +00:00
Adrian Wälchli 33cc9fe138
Clean up environment access in plugins (#6941)
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
2021-04-13 20:07:40 +02:00
Peng Zhang 89074fa2ad
Fix Multi-GPU join for horovod (#6954)
* fixjoin

* fix join on cpu

* fix typo

* try to undo horovod skip

* undo

* Try removing skip

* Update CHANGELOG

* add back skip for test_horovod_multi_optimizer

* Add back skip

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-04-13 17:44:41 +01:00
Hinrich B. Winther b37b58a73e
Fix Checkpoint issue when using Horovod distributed backend (PyTorchLightning#6947) (#6958)
Co-Authored-By: Adrian Wälchli <aedu.waelchli@gmail.com>

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-04-13 09:18:52 +00:00
Kaushik B 1b3e4f9fb9
Fix sync_dist for tpus (#6950) 2021-04-13 14:17:15 +05:30
Sean Naren b46cc557ef
[Feat] DeepSpeed single file saving (#6900)
* Add single checkpoint capability

* Fix checkpointing in test, few cleanups

* Add comment

* Change restore logic

* Move vars around, add better explanation, make todo align with DeepSpeed team

* Fix checkpointing

* Remove deepspeed from extra, install in Dockerfile

* push

* pull

* Split to two tests to see if it fixes Deepspeed error

* Add comment
2021-04-12 22:44:09 +00:00
Adrian Wälchli fe0d08899e
Fix ShardedDataParallel has no attribute require_backward_grad_sync (#6915)
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
2021-04-10 16:14:37 +00:00
Kaushik B 55525031c6
Fix TPU Spawn gather (#6896) 2021-04-09 18:30:59 +05:30
Ethan Harris 1c2ecbf70c
TPUSpawn + IterableDataset error message (#6875)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-04-08 19:57:48 +05:30
shuyingsunshine21 313e81638d
Supporting Adding DDP Communication Hooks (#6736)
* Fix some test errors

* checkpoint consolidation

* Update ddp_spawn.py

* Update test_metric_result_integration.py

* Update test_results.py

* Update utils.py

* Update utils.py

* Update test_all_gather_grad.py

* Update test_all_gather_grad.py

* Update test_results.py

* Revert "Update test_results.py"

This reverts commit 9d4a2b891d.

* Revert "Merge pull request #1 from shuyingsunshine21/shuyingsunshine21-checkpoint_consolidate"

This reverts commit c5053da789, reversing
changes made to 0d23d75bc9.

* Revert "Update test_all_gather_grad.py"

This reverts commit 0d23d75bc9.

* Revert "Update utils.py"

This reverts commit 70fe5da9c6.

* Revert "Update utils.py"

This reverts commit a9aae99f6e.

* Revert "Update test_results.py"

This reverts commit ea74906878.

* Revert "Update test_metric_result_integration.py"

This reverts commit bf70e431b3.

* Revert "Update ddp_spawn.py"

This reverts commit f17210183b.

* Revert "checkpoint consolidation"

This reverts commit 536c1323b0.

* Revert "Revert "checkpoint consolidation""

This reverts commit 3a9fde915a.

* Revert "Revert "Revert "checkpoint consolidation"""

This reverts commit 7a369f47e1.

* Revert "Revert "Update ddp_spawn.py""

This reverts commit 8222dc98ea.

* Revert "Revert "Update test_metric_result_integration.py""

This reverts commit 6c095b2370.

* Revert "Revert "Update test_results.py""

This reverts commit 250d0aaaa2.

* Revert "Revert "Update utils.py""

This reverts commit 8651d54d79.

* Revert "Revert "Update test_all_gather_grad.py""

This reverts commit dcdcd29731.

* modify distributed environment to make test pass

* add DDP communication hook

* remove test related setting

* remove more test related setting

* fix ddp comm hook util import issue

* comments

* one more fix for test_custom_plugin

* fix ddp spawn

* fix sgd

* address comments and add tests

* 1. add a GPU check 2. modify test a bit 3. formatting

* formatting nit

* fix conda 3.7 1.7 issue for no torch.distributed.algorithms module

* need at least 1.8.0

* minor fix

* modify changelog

* changelog should link to PR number instead of issue number

* refine the docs for the register_ddp_comm_hook function (e.g. the ddp_comm_wrapper explanation) and add a hyperparameter for PowerSGD states in the example usage

* move single device checking before call register_ddp_comm_hook

* formatting

* comments

* typo

* pre-commit formatting
2021-04-07 12:35:57 +01:00
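A hedged sketch of the resulting API, compressing gradients to fp16 during all-reduce (per the commit this requires torch >= 1.8 and GPU-based DDP):

```python
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks as default
from pytorch_lightning import Trainer
from pytorch_lightning.plugins import DDPPlugin

# Hedged sketch: attach a built-in communication hook so gradients are
# cast to fp16 for the all-reduce, halving communication volume.
trainer = Trainer(
    gpus=2,
    accelerator="ddp",
    plugins=[DDPPlugin(ddp_comm_hook=default.fp16_compress_hook)],
)
```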
Anthony Kim 7f6154fcad
Add `Trainer(gradient_clip_algorithm='value'|'norm')` (#6123)
* add changelog

* add clip by value

* fix bug in training_tricks.rst

* fix bug in trainer.rst

* Update trainer.rst

* Update trainer.rst

* Update CHANGELOG.md

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/plugins/precision/deepspeed_precision.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/utilities/enums.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* yapf formatting

* update training tricks

* update based on comment

* update based on comment

* Update pytorch_lightning/trainer/trainer.py

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>

* update based on comment

* pep8

* mypy

* mypy

* Update docs/source/advanced/training_tricks.rst

Co-authored-by: thomas chaton <thomas@grid.ai>

* Update sharded_native_amp.py

* Update test_sharded_parity.py

* update test codes

* Update test_tpu.py

* Update pytorch_lightning/trainer/connectors/training_trick_connector.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Update test_trainer.py

* Update enums.py

* Update enums.py

* add super-class initialization to precision plugins.

* add clip_grad horovod cpu test

* add clip_grad horovod cpu test

* use subprocess check_call

* change order of horovod tests

* set max_epochs 2 in horovod test

* remove clip_grad_val test from horovod-cpu

* remove "type: ignore"

* divide clip grad val test in horovod

* update based on comments

* add super-class initialization to precision plugins.

* bugfix

* bugfix

* revert some changes

* revert some changes

* Update tests/models/test_horovod.py

* merge master

* Delete signature test

No point in testing a signature

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: thomas chaton <thomas@grid.ai>
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
2021-04-06 08:27:37 -05:00
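The new argument in use (a hedged sketch; `"norm"` remains the default, preserving prior behaviour):

```python
from pytorch_lightning import Trainer

# Clip each gradient element into [-0.5, 0.5] instead of rescaling by norm.
trainer = Trainer(gradient_clip_val=0.5, gradient_clip_algorithm="value")
```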
Kaushik B cf8e828559
[Fix] TPU Training Type Plugin (#6816) 2021-04-06 15:02:44 +05:30
Eugene Khvedchenya eafec7d425
Fix DDP + SyncBN (#6838)
* Fix DDP + SyncBN

Ensure that model is already on correct GPU before applying SyncBN conversion

* Fix order of SyncBN for ddp_spawn
2021-04-06 08:40:29 +01:00
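The configuration the fix targets, sketched (hedged):

```python
from pytorch_lightning import Trainer

# Hedged sketch: SyncBatchNorm conversion must happen only after the model
# has been moved to its GPU, which this fix enforces in the DDP plugins.
trainer = Trainer(gpus=2, accelerator="ddp", sync_batchnorm=True)
```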
ananthsub bb9ace4333
[typing] Add typehint for broadcast in training type plugin (#6777)
* Update training_type_plugin.py

* Update accelerator.py

* Update pytorch_lightning/plugins/training_type/training_type_plugin.py

Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>

Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
2021-04-02 20:55:34 +02:00
thomas chaton 3e3175d074
resolve bug (#6781) 2021-04-01 11:43:23 +01:00