Commit Graph

277 Commits

Author SHA1 Message Date
Sean Naren b2973a035e
Introduce CheckpointIO Plugin (#8743) 2021-08-13 17:35:31 +01:00
Carlos Mocholí a1264a6850
Automatic string fixes (#8886) 2021-08-13 14:28:14 +00:00
Binh Tang efec3d461c
Move logger and profiler finalization to trainer's teardown (#8685)
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-08-05 10:09:43 +02:00
Carlos Mocholí ed13040729
Connect the model to the training type plugin at the start of run (#8536) 2021-08-04 17:43:34 +02:00
Sean Naren 49d03f87fe
[docs] Update deepspeed docs, add some more information and link to streamlit (#8691) 2021-08-03 16:12:36 +00:00
Sean Naren e5d9e21dea
Fix save/load/resume from checkpoint for DeepSpeed Plugin (#8397) 2021-08-02 22:31:05 +00:00
thomas chaton 9e61de2063
Torch Elastic DDP DeadLock bug fix (#8655)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-08-02 21:48:43 +02:00
Jirka Borovec f67892ea96
CI: yesqa (#8564)
* add yesqa
* fix flake8

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-08-02 16:05:56 +00:00
Sean Naren 7a1e97203e
Add property to skip restoring optimizers and schedulers via plugin (#8644) 2021-07-31 10:08:10 +02:00
Sean Naren 07b7dc9c17
[Fix] Add delay property for checkpointing, refactor loading checkpoint (DeepSpeed Checkpointing Fix 1/n) (#8627)
* Add property to delay checkpointing, move loading checkpoint file into the run function to allow deepspeed engine to be loaded

* Add a small test

* Apply suggestions from code review

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Update pytorch_lightning/accelerators/accelerator.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Address review

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-07-30 11:31:08 +01:00
thomas chaton c7f8c8c3c8
[bugfix] DeepSpeed with no schedulers (#8580)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-07-27 15:28:10 +00:00
Carlos Mocholí e63968ab88
Add `pyupgrade` to `pre-commit` (#8557)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-07-26 14:38:12 +02:00
Carlos Mocholí a64cc37394
Replace `yapf` with `black` (#7783)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-07-26 13:37:35 +02:00
deepsource-autofix[bot] 2cf03af155
Remove undefined name from `__all__` (#8468)
* Remove undefined name from `__all__`

Co-authored-by: deepsource-autofix[bot] <62050782+deepsource-autofix[bot]@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-07-26 10:52:35 +02:00
Kaushik B ef7d41692c
Add `ddp_*_find_unused_parameters_false` to Plugins Registry. (#8483)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-07-24 04:02:54 +00:00
Carlos Mocholí 4a64bc3fd3
Fix DeepSpeed lr scheduler logic (#8527)
* Fix deepspeed scheduler logic

* Fix tests

* Minor changes

* Improve tests

* inference fix

* CHANGELOG

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2021-07-23 10:08:58 +01:00
Adrian Wälchli 0ad7f3a829
Fix log_dir tracking in case of multiple Trainer instances + DDP (#7403)
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2021-07-23 09:18:23 +02:00
Carlos Mocholí f7027a8701
Remove `torch >= 1.6` checks (#8523) 2021-07-23 04:03:20 +00:00
Kaushik B 5452590872
fix: Enable manual optimization for TPUs (#8458) 2021-07-22 15:33:35 +05:30
thomas chaton c9af1a7aec
[bugfix] Reduce memory leaks (#8490)
* reduce memory leak

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update changelog

* Apply suggestions from code review

Co-authored-by: Ethan Harris <ewah1g13@soton.ac.uk>

* resolve flake8

* update on comments

* resolve bug

* update

* Undo whitespace changes

* remove bug

* resolve flake8

* revert change

* update on comments

* delete the ddp wrapper as it holds memory

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* resolve flake8

* update on comments

* update changelog

* resolve test

* Update CHANGELOG

* Refactor teardown

* Fix comment

* Do it for non-gpu too

* remove ref when the model is not a lightning_module

* Fix import error

* move down

* resolve bug

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* resolve assignment

* update

* move above

* Fix device calls to support tpu training

* Update todo

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Ethan Harris <ewah1g13@soton.ac.uk>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: Kaushik B <kaushikbokka@gmail.com>
2021-07-21 11:37:05 +02:00
marsggbo d0038b521c
Bugfix: horovod optimizer missing 2 required positional arguments (#7840)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: thomas chaton <thomas@grid.ai>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-07-21 08:11:26 +00:00
Sean Naren 8a9ee403be
Add Windows Support for DeepSpeed (#8488)
* Modify deepspeed distributed to support windows

* Add weak test

* Cleanups

* Capture more in tests

* Add comment

* Cleaner asserts
2021-07-20 13:55:52 +00:00
deepsource-autofix[bot] 3628c314e5
Merge `isinstance` calls (#8469)
Co-authored-by: deepsource-autofix[bot] <62050782+deepsource-autofix[bot]@users.noreply.github.com>
2021-07-19 14:34:37 +00:00
Stephen McGroarty b7e5bc7a36
Only output IPU report on request (#8340)
These reports can be quite large and involve some processing
to produce. It means on larger models there's a noticeable performance
hit to produce the cycles/memory reports.
2021-07-19 12:52:58 +00:00
Yi Wang adaa32f47a
[DDP] Remove the outdated limitations of DDP communication hook since 1.9 (#8346)
* [DDP] Remove the outdated limitations of DDP communication hook since 1.9

1. DDP communication hook can work on multiple backends since 1.9.
2. SPMD in DDP is completely retired in 1.9, and SPSD is the only option.

* Update ddp.py

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-07-19 13:55:42 +02:00
Sean Naren 06ac7d9649
[Fix] Remove DeepSpeed Plugin FP16 exception (#8462)
* Remove error, add mixed to check

* Add test

* Remove test

* Add changelog

* Add test for mixed

* Update tests/plugins/test_deepspeed_plugin.py

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add special

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-07-19 11:12:31 +00:00
Adrian Wälchli b42efa7d86
support launching Lightning ddp with traditional command (#7480)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2021-07-14 11:25:36 +00:00
deepsource-autofix[bot] b2ba2e6333
Use literal syntax instead of function calls to create data structure (#8406) 2021-07-14 10:32:13 +00:00
Andrew Tritt 3102922647
Add LSF support (#5102)
* add ClusterEnvironment for LSF systems

* update init file

* add available cluster environments

* clean up LSFEnvironment

* add ddp_hpc as a distributed backend

* clean up SLURMEnvironment

* remove extra blank line

* init device for DDPHPCAccelerator

We need to do this so we don't send the model to the same device from multiple ranks

* committing current state

* add additional methods to ClusterEnvironments

* add NVIDIA mixin for setting up CUDA envars

* remove troubleshooting prints

* cleanup SLURMEnvironment

* fix docstring

* cleanup TorchElasticEnvironment and add documentation

* PEP8 puts a cork in it

* add set_ranks_to_trainer

* remove unused import

* move to new location

* update LSF environment

* remove mixin

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* changelog

* reset slurm env

* add tests

* add licence

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* test node_rank

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add lsf env to docs

* add auto detection for lsf environment

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix is_using_lsf() and test

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-07-09 16:14:26 +02:00
Dusan Drevicky 1b06edf2f2
Add the `on_before_optimizer_step` hook (#8048)
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-07-09 13:30:52 +02:00
thomas chaton 1c825a2a9c
Add the `on_before_backward` hook (#7865)
* Add callback to hook tests and add predict test

* Fix lambda callback test

* Simplify lambda call test

* Use LambdaCallback

* Dynamically append to called for the model

* Remove print

* Consistency

* Consistency

* Prepare args/kwargs testing

* yapf doesn't like dict literals

* Add arguments for fit no val test

* Add arguments for fit no val test

* add before_backward_hook

* add test

* resolve flake8

* resolve tests

* update changelog

* add on_before_backward to LightningModule

* update on comments

* Test arguments

* Datamodule refactor

* Fix eval test

* remove extra file

* resolve bug

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* move to hooks

* update

* resolve flake8

* update on comments

* Update full fit + val test

* Update test

* Remove FIXME

* Remove FIXME

* Undo change

* Fix

* Parametrize fit hook test

* Comment

* Parametrize fit hook test with different precision plugins

* Fix tests

* Parametrize fit hook test with manual optimization

* Unnecessary parenthesis

* WIP

* Comments

* Fix message

* Test CI error

* Revert "Test CI error"

This reverts commit 39c4a85a83.

* Add ddp training type teardown

* Update CHANGELOG

* Adrian's fix

* Use destructor

* Update CHANGELOG.md

* RPC destructor

* Update pytorch_lightning/plugins/training_type/ddp.py

* Why do you not work :(

* Missing condition

* Fix deepspeed test

* GC collect in conftest

* Do not show warnings for special tests

* Needs to run on 1.8

To avoid: "RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:32, unhandled cuda error, NCCL version 2.4.8"

* Run torch 1.8

* Skip test due to 'Python bus error'

* Debug NCCL

* shm size

* Disable warnings for special tests

* Remove NCCL_DEBUG statement

* Try smaller shm size

* Revert "Skip test due to 'Python bus error'"

This reverts commit e0a3e8785d.

* README and adjust versions

* Avoid self.on_gpu call

* empty cache cleanup

* More garbage collection

* Unroll parametrizations

* Do not reuse mock

* Undo changes

* Undo notebooks modification

* resolve test

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* delete file

* Undo

* Fix test

* Revert "WIP"

This reverts commit f5828a8c42.

* Rename

* Remove optimizers

* Fix bug with LightningOptimizer

* Add optimizers

* update

* update

* Update CHANGELOG

* On after backward refactor

* Do not call super

* Fixes

* Remove should_accumulate

* pre/post backward refactor

* Call the LM backward hook

* Update tests

* Remove dev debug patch

* Fix test

* Remove optimizer arguments and typing

* Docs fixes

* Fix comment

* Undo changes

* Split manual and auto

* Undo change

* Deepsource

* Remove optimizers

* Undo changes

* Call the hook

* Docs

* Docs

Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-07-09 06:15:57 +00:00
Carlos Mocholí eb6d991218
Refactor plugins backward (#8328) 2021-07-08 16:02:09 +02:00
Carlos Mocholí c4353ea702
Remove `dev_debugger.call_count` (#8317) 2021-07-07 19:59:59 +02:00
Carlos Mocholí 368ac1c622
[CLI] Drop `ArgumentParser` when pickling and save before spawning (#8017)
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-07-07 17:56:13 +00:00
Carlos Mocholí 398eed508f
Fix `self.optimizers()` not returning a single `LightningOptimizer` (#8326) 2021-07-07 18:57:45 +02:00
Adrian Wälchli d73c32ab51
move `torch.cuda.set_device()` to enable collective calls earlier in setup (#8312) 2021-07-07 13:15:41 +02:00
Sean Naren 6d558961e3
[IPU] Allow poptorch.Options to override Trainer (#8233)
* Add test for poptorch Options

* Hacks to get manual plugin support

* Revert changes

* Fix tests + ensure logic follows suit

* Update pytorch_lightning/plugins/training_type/ipu.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Cleaner

* Cleaner

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-07-05 13:42:00 +00:00
Carlos Mocholí ea88105b88
Parametrize fit hook test with different precision plugins (#8070)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-07-05 10:50:01 +00:00
Kaushik B 7b6d0a842c
Fix progress bar updates for Pod Training (#8258)
* Fix progress bar updates for Pod Training

* Fix progress bar updates for Pod Training

* Add _pod_progress_bar_force_stdout
2021-07-05 10:38:38 +01:00
Sean Naren 07b1ce227c
[IPU] Fix Custom Poptorch options to IPUPlugin (#8241)
* Fixes to ensure ipu options are respected

* Better setter

* Add test for poptorch Options

* Fix test

* fix ipu test

* Update pytorch_lightning/plugins/training_type/ipu.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-07-02 11:23:57 +00:00
Adrian Wälchli e7139ab9f7
Support `DDPPlugin` to be used on CPU (#6208)
* Skip test due to 'Python bus error'

* Debug NCCL

* Remove NCCL_DEBUG statement

* Revert "Skip test due to 'Python bus error'"

This reverts commit e0a3e8785d.

* fix

* add test

* changelog

* yapf

* patch os environ

* make a special test

* destroy pg

* debug

* revert

* revert

* problematic test

* skip

* try the fixture

* test

* update sensitive test

* update changelog

* remove comment

* update wrong test

* update test name

* parameterization

* Revert "parameterization"

This reverts commit b0542f43f59c5ce66800883b5e2f0c66a97408cc.

* remove conftest

* ignore test

* teardown

* fix merge

* deep speed parameterization

* uncomment test

* update chlog

* update changelog

* split tests

* update test

* update test comments

* unroll test

* unroll test

* unroll test

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* increase shm

* sudo

* unroll ipu

* Revert "sudo"

This reverts commit 6cc68c1478.

* Revert "increase shm"

This reverts commit 8c27163483.

* x

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* find guilty test

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* POPTORCH_WAIT_FOR_IPU=1

* move test

* redo parameterize for ipu

* de-comment test

* move chlog

* Update tests/accelerators/test_accelerator_connector.py

Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>

* Update tests/accelerators/test_accelerator_connector.py

Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>

Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
2021-07-02 12:00:24 +01:00
deepsource-autofix[bot] 7e2f84e050
Remove methods with unnecessary super delegation. (#8148)
* Remove methods with unnecessary super delegation.

* Update fully_sharded.py

* replace init in test

Co-authored-by: deepsource-autofix[bot] <62050782+deepsource-autofix[bot]@users.noreply.github.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Ethan Harris <ethanwharris@gmail.com>
2021-07-02 08:00:55 +00:00
Carlos Mocholí 74eb6cc7e9
Clean `cuda.empty_cache` usage (#8199) 2021-06-30 13:04:24 +02:00
Ethan Harris 57dce7244c
Fix double precision casting complex buffers (#8208)
* Fix double precision casting complex buffers

* Update CHANGELOG.md

* Fixes

* Fixes

* Fix

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-06-30 10:57:42 +01:00
Carlos Mocholí 2e537b75e3
Deprecate `DDPPlugin.task_idx` (#8203) 2021-06-30 01:02:55 +02:00
Carlos Mocholí df601405d9
Use full `torch.distributed` import (#8200) 2021-06-29 22:44:10 +00:00
Kaushik B 9444a08d56
Fix Deprecation warning in DDPSpawn (#8193) 2021-06-29 09:29:51 -07:00
Kaushik B 2a7fad92b9
Avoid passing unnecessary params from TPUSpawn to DDPSpawn (#8192) 2021-06-29 14:30:54 +02:00
Adrian Wälchli bf54ac1cad
fix NCCL error with non-consecutive trainer gpus (#8165)
* device ids in barrier

* same fix for spawn

* fix non-nccl

* add changelog

* get nccl backend

* get backend

Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
2021-06-28 22:08:10 +02:00
thomas chaton c521624a92
[bugfix] Add mechanism to prevent deadlock for DDP on Exception Trigger (#8167)
* add mechanism to prevent deadlock

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* resolve flake8 + update changelog

* update on comments

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update

* remove space

* resolve bugs

* overwrite config

* update on comments

* update on comments

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update

* update

* update test with comments

* Update pytorch_lightning/plugins/training_type/parallel.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* update on comments

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2021-06-28 19:26:03 +00:00