293 Commits

Author SHA1 Message Date
four4fish a451997c4d
Avoid wrapping LightningModule in DDP plugins when not fitting (#9096)
* Avoid wrapping LightningModule in DDP plugins when not fitting

* Avoid wrapping LightningModule in DDP plugins when not fitting
2021-09-02 02:23:59 +00:00
B. Kerim Tshimanga 65b3dc4495
scheduled removal of DeepSpeedPlugin.cpu_offload* parameters (#9244) 2021-09-01 12:02:30 +02:00
four4fish b497fb80e5
Remove reference to DistributedDataParallel from parallel plugin teardown (#8943) 2021-08-26 17:51:05 -07:00
Yi Wang 366fb39d2e
Support post-localSGD in Lightning DDP plugin (#8967)
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-08-26 08:24:49 +01:00
four4fish f01a9a6cd2
Remove `BasePlugin` (#9066)
* Remove BasePlugin

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-08-25 19:10:28 +00:00
Sean Naren bac8b1be81
Add support for CPU AMP autocast (#9084) 2021-08-25 12:18:00 +00:00
Sean Naren 1bab0a17a9
Fix torch bfloat import version (#9089) 2021-08-24 19:18:12 +00:00
Sean Naren 1feec8c601
Add bfloat16 support to Lightning Trainer (#9049) 2021-08-24 09:47:21 +00:00
ananthsub 1e4d8929fb
Simplify checkpoint connector loading after Checkpoint IO plugin's introduction (#9045)
* Simplify checkpoint connector loading after Checkpoint IO plugins introduction

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
2021-08-23 13:12:18 -07:00
four4fish c912ebf889
Remove TrainingTypePlugin.on_save and Accelerator.on_save (#9023)
* Remove TrainingTypePlugin.on_save and Accelerator.on_save
2021-08-23 10:11:00 -07:00
Adrian Wälchli 49c52b0d4b
update an outdated error message in DDPPlugin (#9005) 2021-08-23 15:29:07 +00:00
Kaushik B 0461107972
Move `init_ddp_connection` to distributed utilities (#9044)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-08-23 14:01:01 +05:30
Sean Naren c6b6888387
Add DeepSpeed Stage 1 + doc improvements for model parallel (#8974)
* Add stage 1 support + small doc improvements

* Add CHANGELOG.md
2021-08-18 19:40:19 +05:30
Danielle Pintz 77bc5d4004
Replace instances of `self.lightning_module.trainer` with `trainer` directly in ddp_spawn and tpu_spawn (#8942)
* Replace instances of `self.lightning_module.trainer` with `trainer` directly in ddp_spawn and tpu_spawn
2021-08-17 13:15:33 -07:00
Yifu Wang 14f1475c25
Ensure the existence of `DDPPlugin._sync_dir` in `reconciliate_processes` (#8939)
Co-authored-by: Yifu Wang <yifuwang@2012@gmail.com>
2021-08-17 13:47:33 +05:30
Carlos Mocholí 93ab24d1ee
Replace DataLoader sampler once for IPUs (#8858) 2021-08-16 11:28:05 +02:00
Sean Naren b2973a035e
Introduce CheckpointIO Plugin (#8743) 2021-08-13 17:35:31 +01:00
Carlos Mocholí a1264a6850
Automatic string fixes (#8886) 2021-08-13 14:28:14 +00:00
Binh Tang efec3d461c
Move logger and profiler finalization to trainer's teardown (#8685)
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-08-05 10:09:43 +02:00
Carlos Mocholí ed13040729
Connect the model to the training type plugin at the start of run (#8536) 2021-08-04 17:43:34 +02:00
Sean Naren 49d03f87fe
[docs] Update deepspeed docs, add some more information and link to streamlit (#8691) 2021-08-03 16:12:36 +00:00
Sean Naren e5d9e21dea
Fix save/load/resume from checkpoint for DeepSpeed Plugin (#8397) 2021-08-02 22:31:05 +00:00
thomas chaton 9e61de2063
Torch Elastic DDP DeadLock bug fix (#8655)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-08-02 21:48:43 +02:00
Jirka Borovec f67892ea96
CI: yesqa (#8564)
* add yesqa
* fix flake8

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-08-02 16:05:56 +00:00
Sean Naren 7a1e97203e
Add property to skip restoring optimizers and schedulers via plugin (#8644) 2021-07-31 10:08:10 +02:00
Sean Naren 07b7dc9c17
[Fix] Add delay property for checkpointing, refactor loading checkpoint (DeepSpeed Checkpointing Fix 1/n) (#8627)
* Add property to delay checkpointing, move loading checkpoint file into the run function to allow deepspeed engine to be loaded

* Add a small test

* Apply suggestions from code review

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Update pytorch_lightning/accelerators/accelerator.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Address review

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-07-30 11:31:08 +01:00
thomas chaton c7f8c8c3c8
[bugfix] DeepSpeed with no schedulers (#8580)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-07-27 15:28:10 +00:00
Carlos Mocholí e63968ab88
Add `pyupgrade` to `pre-commit` (#8557)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-07-26 14:38:12 +02:00
Carlos Mocholí a64cc37394
Replace `yapf` with `black` (#7783)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-07-26 13:37:35 +02:00
deepsource-autofix[bot] 2cf03af155
Remove undefined name from `__all__` (#8468)
* Remove undefined name from `__all__`

Co-authored-by: deepsource-autofix[bot] <62050782+deepsource-autofix[bot]@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-07-26 10:52:35 +02:00
Kaushik B ef7d41692c
Add `ddp_*_find_unused_parameters_false` to Plugins Registry. (#8483)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-07-24 04:02:54 +00:00
Carlos Mocholí 4a64bc3fd3
Fix DeepSpeed lr scheduler logic (#8527)
* Fix deepspeed scheduler logic

* Fix tests

* Minor changes

* Improve tests

* inference fix

* CHANGELOG

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2021-07-23 10:08:58 +01:00
Adrian Wälchli 0ad7f3a829
Fix log_dir tracking in case of multiple Trainer instances + DDP (#7403)
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2021-07-23 09:18:23 +02:00
Carlos Mocholí f7027a8701
Remove `torch >= 1.6` checks (#8523) 2021-07-23 04:03:20 +00:00
Kaushik B 5452590872
fix: Enable manual optimization for TPUs (#8458) 2021-07-22 15:33:35 +05:30
thomas chaton c9af1a7aec
[bugfix] Reduce memory leaks (#8490)
* reduce memory leak

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update changelog

* Apply suggestions from code review

Co-authored-by: Ethan Harris <ewah1g13@soton.ac.uk>

* resolve flake8

* update on comments

* resolve bug

* update

* Undo whitespace changes

* remove bug

* resolve flake8

* revert change

* update on comments

* delete the ddp wrapper as it holds memory

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* resolve flake8

* update on comments

* update changelog

* resolve test

* Update CHANGELOG

* Refactor teardown

* Fix comment

* Do it for non-gpu too

* remove ref when the model is not a lightning_module

* Fix import error

* move down

* resolve bug

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* resolve assignment

* update

* move above

* Fix device calls to support tpu training

* Update todo

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Ethan Harris <ewah1g13@soton.ac.uk>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: Kaushik B <kaushikbokka@gmail.com>
2021-07-21 11:37:05 +02:00
marsggbo d0038b521c
Bugfix: horovod optimizer missing 2 required positional arguments (#7840)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: thomas chaton <thomas@grid.ai>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-07-21 08:11:26 +00:00
Sean Naren 8a9ee403be
Add Windows Support for DeepSpeed (#8488)
* Modify deepspeed distributed to support windows

* Add weak test

* Cleanups

* Capture more in tests

* Add comment

* Cleaner asserts
2021-07-20 13:55:52 +00:00
deepsource-autofix[bot] 3628c314e5
Merge `isinstance` calls (#8469)
Co-authored-by: deepsource-autofix[bot] <62050782+deepsource-autofix[bot]@users.noreply.github.com>
2021-07-19 14:34:37 +00:00
Stephen McGroarty b7e5bc7a36
Only output IPU report on request (#8340)
These reports can be quite large and involve some processing
to produce. It means on larger models there's a noticeable performance
hit to produce the cycles/memory reports.
2021-07-19 12:52:58 +00:00
Yi Wang adaa32f47a
[DDP] Remove the outdated limitations of DDP communication hook since 1.9 (#8346)
* [DDP] Remove the outdated limitations of DDP communication hook since 1.9

1. DDP communication hook can work on multiple backends since 1.9.
2. SPMD in DDP is completely retired in 1.9, and SPSD is the only option.

* Update ddp.py

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-07-19 13:55:42 +02:00
Sean Naren 06ac7d9649
[Fix] Remove DeepSpeed Plugin FP16 exception (#8462)
* Remove error, add mixed to check

* Add test

* Remove test

* Add changelog

* Add test for mixed

* Update tests/plugins/test_deepspeed_plugin.py

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add special

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-07-19 11:12:31 +00:00
Adrian Wälchli b42efa7d86
support launching Lightning ddp with traditional command (#7480)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2021-07-14 11:25:36 +00:00
deepsource-autofix[bot] b2ba2e6333
Use literal syntax instead of function calls to create data structure (#8406) 2021-07-14 10:32:13 +00:00
Andrew Tritt 3102922647
Add LSF support (#5102)
* add ClusterEnvironment for LSF systems

* update init file

* add available cluster environments

* clean up LSFEnvironment

* add ddp_hpc as a distributed backend

* clean up SLURMEnvironment

* remove extra blank line

* init device for DDPHPCAccelerator

We need to do this so we don't send the model to the same device from multiple ranks

* committing current state

* add additional methods to ClusterEnvironments

* add NVIDIA mixin for setting up CUDA envars

* remove troubleshooting prints

* cleanup SLURMEnvironment

* fix docstring

* cleanup TorchElasticEnvironment and add documentation

* PEP8 puts a cork in it

* add set_ranks_to_trainer

* remove unused import

* move to new location

* update LSF environment

* remove mixin

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* changelog

* reset slurm env

* add tests

* add licence

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* test node_rank

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add lsf env to docs

* add auto detection for lsf environment

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix is_using_lsf() and test

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-07-09 16:14:26 +02:00
Dusan Drevicky 1b06edf2f2
Add the `on_before_optimizer_step` hook (#8048)
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-07-09 13:30:52 +02:00
thomas chaton 1c825a2a9c
Add the `on_before_backward` hook (#7865)
* Add callback to hook tests and add predict test

* Fix lambda callback test

* Simplify lambda call test

* Use LambdaCallback

* Dynamically append to called for the model

* Remove print

* Consistency

* Consistency

* Prepare args/kwargs testing

* yapf doesn't like dict literals

* Add arguments for fit no val test

* Add arguments for fit no val test

* add before_backward_hook

* add test

* resolve flake8

* resolve tests

* update changelog

* add on_before_backward to LightningModule

* update on comments

* Test arguments

* Datamodule refactor

* Fix eval test

* remove extra file

* resolve bug

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* move to hooks

* update

* resolve flake8

* update on comments

* Update full fit + val test

* Update test

* Remove FIXME

* Remove FIXME

* Undo change

* Fix

* Parametrize fit hook test

* Comment

* Parametrize fit hook test with different precision plugins

* Fix tests

* Parametrize fit hook test with manual optimization

* Unnecessary parenthesis

* WIP

* Comments

* Fix message

* Test CI error

* Revert "Test CI error"

This reverts commit 39c4a85a83.

* Add ddp training type teardown

* Update CHANGELOG

* Adrian's fix

* Use destructor

* Update CHANGELOG.md

* RPC destructor

* Update pytorch_lightning/plugins/training_type/ddp.py

* Why do you not work :(

* Missing condition

* Fix deepspeed test

* GC collect in conftest

* Do not show warnings for special tests

* Needs to run on 1.8

To avoid: "RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:32, unhandled cuda error, NCCL version 2.4.8"

* Run torch 1.8

* Skip test due to 'Python bus error'

* Debug NCCL

* shm size

* Disable warnings for special tests

* Remove NCCL_DEBUG statement

* Try smaller shm size

* Revert "Skip test due to 'Python bus error'"

This reverts commit e0a3e8785d.

* README and adjust versions

* Avoid self.on_gpu call

* empty cache cleanup

* More garbage collection

* Unroll parametrizations

* Do not reuse mock

* Undo changes

* Undo notebooks modification

* resolve test

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* delete file

* Undo

* Fix test

* Revert "WIP"

This reverts commit f5828a8c42.

* Rename

* Remove optimizers

* Fix bug with LightningOptimizer

* Add optimizers

* update

* update

* Update CHANGELOG

* On after backward refactor

* Do not call super

* Fixes

* Remove should_accumulate

* pre/post backward refactor

* Call the LM backward hook

* Update tests

* Remove dev debug patch

* Fix test

* Remove optimizer arguments and typing

* Docs fixes

* Fix comment

* Undo changes

* Split manual and auto

* Undo change

* Deepsource

* Remove optimizers

* Undo changes

* Call the hook

* Docs

* Docs

Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-07-09 06:15:57 +00:00
Carlos Mocholí eb6d991218
Refactor plugins backward (#8328) 2021-07-08 16:02:09 +02:00
Carlos Mocholí c4353ea702
Remove `dev_debugger.call_count` (#8317) 2021-07-07 19:59:59 +02:00
Carlos Mocholí 368ac1c622
[CLI] Drop `ArgumentParser` when pickling and save before spawning (#8017)
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-07-07 17:56:13 +00:00