Commit Graph

111 Commits

Author | SHA1 | Message | Date
Jirka Borovec 58a6d59784
simplify skip-if tests >> 0/n (#5920)
* skipif + yapf + isort

* tests

* docs

* pp
2021-03-01 12:17:09 +00:00
Justus Schock 0647340f3b
Add mypy typing to precision plugins. (#6149)
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
2021-02-26 14:27:16 +01:00
Kaushik B e7298b5d38
fix parallel devices return type & add copyright (#6215) 2021-02-26 11:09:08 +01:00
Justus Schock 3ed8ef8af9
type accelerators (#6148) 2021-02-25 06:42:23 +00:00
Sean Naren 432e5637d6
Expose DeepSpeed FP16 parameters due to loss instability (#6115)
* Expose deepspeed config parameters to init function due to instability in parameters

* See if tests can run on normal CI, without special tests

* Add changelog

* Update pytorch_lightning/plugins/training_type/deepspeed.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-02-21 21:43:11 +01:00
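
A rough sketch of what #6115 enables: passing DeepSpeed's FP16 loss-scaling knobs straight to the plugin constructor instead of through a raw config dict. The argument names below mirror DeepSpeed's fp16 config keys and are assumptions, not taken from the commit text.

    from pytorch_lightning import Trainer
    from pytorch_lightning.plugins import DeepSpeedPlugin

    # Hypothetical usage sketch for #6115; argument names are assumed.
    trainer = Trainer(
        gpus=4,
        precision=16,
        plugins=DeepSpeedPlugin(
            loss_scale=0,            # 0 requests dynamic loss scaling
            initial_scale_power=32,  # initial scale = 2 ** 32
            loss_scale_window=1000,
            hysteresis=2,
            min_loss_scale=1,
        ),
    )
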
Adrian Wälchli 3bdc0673ea
consistent behavior for reduce method across all Plugins (#6011)
* reduction docs

* docs for abstract base method

* make mean the default

* add preliminary chlog

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2021-02-20 13:30:21 +01:00
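
Illustrative sketch of the behaviour unified by #6011 above: every training-type plugin exposes a reduce method and mean is the default reduction. The exact signature is an assumption drawn from the commit bullets, not from the code.

    import torch

    def reduce_metric(training_type_plugin, value: torch.Tensor) -> torch.Tensor:
        # Sketch for #6011: with no reduce_op given, plugins average the value
        # across processes; other reductions (e.g. "sum") would be requested
        # explicitly via the reduce_op argument.
        return training_type_plugin.reduce(value)
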
Adrian Wälchli 4b7c0fae00
Fix amp autocast (#6080)
* precision fixes

* add amp test model

* fix test

* revert

* move assert to training step

* fix test

* fix test

* remove unrelated changes

* add changelog

* remove unused import
2021-02-19 17:00:27 +00:00
Rohit Gupta 5d6a091531
default sched (#6062) 2021-02-18 21:02:50 +01:00
Adrian Wälchli fc9bb53e13
fix flake8 for new plugins (#5951)
* flake8

* fix cyclic import

* isort
2021-02-18 18:28:23 +00:00
Adrian Wälchli 6cc1a06078
rename accelerator_backend -> accelerator (#6034)
* rename accelerator backend

* rename new additions from master

* add proper deprecation

* pep8

* warning match

* add missing warning type
2021-02-18 15:54:12 +00:00
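
In user code, the rename in #6034 above amounts to the sketch below; the old trainer.accelerator_backend attribute is kept only as a deprecated alias (the exact warning wording and removal version are not stated in the commit).

    from pytorch_lightning import Trainer

    trainer = Trainer()

    # Deprecated after #6034:
    # accelerator = trainer.accelerator_backend
    # Renamed property:
    accelerator = trainer.accelerator
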
Sean Naren ffdcb62e8f
Make parallel devices optional across all plugins (#6051)
* Make parallel devices optional across all plugins so that they can be instantiated

* Add any to types to capture vars passed in
2021-02-18 12:09:53 +00:00
Sean Naren 8d7ac8f0f8
Address code review for deepspeed (#6042) 2021-02-17 22:53:20 +00:00
Sean Naren 7189d673f6
DeepSpeed Integration (#5954)
* Add initial deepspeed changes

* Address code review

* Move static method outside of function

* Fixes

* Add missing annotation

* Remove seed setting

* Doc changes

* Doc changes, add address reviews

* Fix docs

* Try fixing issue by moving to torch adam

* Clean up check

* Changes, better APIs!

* Add wrapper, swap to git install revision

* Add special test

* Add warning

* Address review

* Add better disclaimer

* Turn off ZeRO for testing due to compilation

* Add description on modifying parameters via the plugin

* Doc strings clear

* Small doc fixes

* Fix hash, reduce test

* Added CI change

* Move to azure pipeline

* Fix test name

* Add missing flag

* Remove sudo...

* Try conda instead

* Swap to conda base

* Try suggested install

* Apply suggestions from code review

* Apply suggestions from code review

* Revert "Apply suggestions from code review"

This reverts commit 41cca05a

* Revert "Apply suggestions from code review"

This reverts commit e06ec29e

* Remove setter

* Address most review

* Move out function, remove DeepSpeed from requirements

* Install deepspeed/mpi4py within container

* Use special tests, move to master commit for deepspeed

* Export path

* Force compile to happen first

* Remove!

* Debugging ninja

* Fix error in optimizer step logic

* Attempt to fix symbolic link

* Reverse to aid debugging

* Export path again

* Clean up mess

* var

* Revert "var"

This reverts commit 3450eaca

* Address review, add todo

* Add note about unsupported functionality

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: tchaton <thomas@grid.ai>
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
2021-02-17 15:23:42 -05:00
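
A minimal sketch of driving the new integration from #5954 above, assuming deepspeed is installed, a multi-GPU environment, and that the shorthand plugin string is accepted in this release; a configured DeepSpeedPlugin instance can be passed instead.

    from pytorch_lightning import Trainer

    # Minimal DeepSpeed sketch for #5954 (assumes the `deepspeed` package
    # is installed and several GPUs are available).
    trainer = Trainer(
        gpus=4,
        precision=16,        # DeepSpeed is normally run with FP16
        plugins="deepspeed",
    )
    # trainer.fit(model, datamodule=dm)  # model / dm defined elsewhere
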
chaton 6bc4490d01
[HotFix] Resolve TPU Training (#6027)
* fix tpus

* update

* add back reduction in val_loss

* resolve some bugs with TPUs

* update changelog

* update on comments

* forgot status

* Fix train_bn arg

* resolve comments

* update on comments

Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-02-17 16:40:13 +00:00
chaton 5700fd091f
[Feat] Add TORCH_DISTRIBUTED_BACKEND env variable (#5981)
* add backend support

* resolve flake8

* update changelog

* update

* Apply suggestions from code review

* Update docs/source/advanced/multi_gpu.rst

* add patch as context manager

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-02-17 16:37:39 +00:00
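
Sketch of how the variable from #5981 above is meant to be used: it overrides the torch.distributed backend (for example gloo instead of the default nccl). The variable name follows the commit title; some releases expose it with a PL_ prefix, so treat it as an assumption.

    import os

    from pytorch_lightning import Trainer

    # Assumed variable name per the #5981 title; later releases may use
    # PL_TORCH_DISTRIBUTED_BACKEND. Set it before distributed init happens.
    os.environ["TORCH_DISTRIBUTED_BACKEND"] = "gloo"

    trainer = Trainer(accelerator="ddp_cpu", num_processes=2)
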
Sean Naren ba6290029a
Re-introduce fix for Hydra directory sync with multiple process (#5993)
* Add hydra fix that was missing from master

* Remove error commas

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
2021-02-17 11:24:57 +00:00
chaton e982800b81
Add PredictLoop (#5752)
* integrate distrib_type

* sync changes

* sync

* fixes

* add forgotten generators

* add missing logic

* update

* import

* missed imports

* import fixes

* isort

* mv f

* changelog

* format

* move helper to parallel plugin

* d

* add world size

* clean up

* duplicate

* activate ddp_sharded and tpu

* set nvidia flags

* remove unused colab var

* use_tpu <-> on_tpu attrs

* make some ddp_cpu and clusterplugin tests pass

* Ref/accelerator connector (#5742)

* final cleanup

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* connector cleanup

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* trainer cleanup

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* accelerator cleanup + missing logic in accelerator connector

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* add missing changes to callbacks

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* reflect accelerator changes to lightning module

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* clean cluster envs

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* cleanup plugins

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* add broadcasting

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* yapf

* remove plugin connector

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* plugins

* add predict_loop

* manual optimization

* clean predictloop

* update optimizer routing

* add predict loop on new accelerator

* resolve a bug

* add rank to torchelastic

* add predict_loop

* add predict loop on new accelerator

* resolve a bug

* fix memory mixed precision

* update

* setstate on trainer for pickling in ddp spawn

* add predict_loop

* clean predictloop

* add predict loop on new accelerator

* resolve a bug

* add predict_loop

* add predict loop on new accelerator

* resolve a bug

* add predict_loop

* add predict loop on new accelerator

* resolve a bug

* add predict_loop

* add predict loop on new accelerator

* resolve a bug

* add predict_loop

* clean predictloop

* add predict loop on new accelerator

* resolve a bug

* add predict_loop

* add predict loop on new accelerator

* resolve a bug

* resolve tests

* add predict method

* add back commented accelerator code

* adapt test for sync_batch_norm to new plugin

* fix deprecated tests

* fix ddp cpu choice when no num_processes are given

* yapf format

* skip a memory test that cannot pass anymore

* remove sanitize

* rename train to run_train

* remove useless hooks

* add misconfigurationException

* remove wrong naming

* resolve some legacy

* update docstring

* fix pickle error in spawn plugin

* x

* avoid

* x

* fix cyclic import in docs build

* add support for sharded

* update typing

* add sharded and sharded_spawn to distributed types

* make unwrap model default

* refactor LightningShardedDataParallel similar to LightningDistributedDataParallel

* update sharded spawn to reflect changes

* update sharded to reflect changes

* Merge 1.1.5 changes

* fix merge

* fix merge

* yapf isort

* fix merge

* yapf isort

* fix indentation in test

* copy over reinit scheduler implementation from dev1.2

* fix apex tracking calls with dev_debugger

* reduce diff to dev1.2, clean up

* fix trainer config test  when gpus>0 and num_processes >0 and ddp_cpu

* sort plugin tests legacy/new

* fix error handling for amp on cpu

* fix merge


fix merge


fix merge

* [Feat] Resolve manual_backward (#5837)

* resolve manual_backward

* resolve flake8

* update

* resolve for ddp_spawn

* resolve flake8

* resolve flake8

* resolve flake8

Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>

* fix tests/accelerator tests on cpu

* [BugFix] Resolve manual optimization (#5852)

* resolve manual_optimization

* update

* update

Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>

* Remove copy trainer parameters to happen earlier within the loop and add safe guard to get ref model (#5856)

* resolve a bug

* Accelerator refactor sharded rpc (#5854)

* rpc branch

* merge

* update handling of rpc

* make devices etc. Optional in RPC

* set devices etc. later if necessary

* remove devices from sequential

* make devices optional in rpc

* fix import

* uncomment everything

* fix cluster selection

Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>

* resolve bug

* fix assert in rpc test

* resolve a test

* fix docs compilation

* accelerator refactor - fix for sharded parity test (#5866)

* fix memory issue with ddp_spawn

* x


x


x


x


x


x


x


x


x

* x

* Remove DDP2 as this does not apply

* Add missing pre optimizer hook to ensure lambda closure is called

* fix apex docstring

* [accelerator][BugFix] Resolve some test for 1 gpu (#5863)

* update

* revert init

* resolve a bug

* update

* resolve flake8

* update

* update

* update

* revert init

* resolve a bug

* update

* resolve flake8

* update

* update

* update

* update

* update

* revert init

* resolve a bug

* update

* resolve flake8

* update

* update

* update

* revert init

* update

* resolve flake8

* update

* update

* update

* update

* update

* all_gather

* update

* make plugins work, add misconfig for RPC

* update

* update

* remove breaking test

* resolve some tests

* resolve flake8

* revert to ddp_spawn

Co-authored-by: root <root@ip-172-31-88-60.ec2.internal>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>
Co-authored-by: Justus Schock <justus.schock@rwth-aachen.de>

* yapf isort

* resolve flake8

* fix apex doctests

* fix apex doctests 2

* resolve docs

* update drone

* clean env

* update

* update

* update

* update

* merge

* Fix RPC related tests, clean out old API, update for new accelerator API [skip ci] (#5881)

* Fix RPC related tests, clean out old API, update for new accelerator API

* Move tests out of legacy folder, update paths and names

* Update test_remove_1-4.py

* Expose properties for tpu cores/gpus/num_gpus

* Add root GPU property

* Move properties to properties.py

* move tests that were previously in drone

* Fix root GPU property (#5908)

* Move root GPU to property, remove horovod set as this is handled in horovod plugin, ensure we mock correctly to set GPU accelerator

* Add missing tests back

* fix best model path transfer when no checkpoint callback available

* Fix setup hook order [wip] (#5858)

* Call trainer setup hook before accelerator setup

* Add test case

* add new test

* typo

* fix callback order in test

Co-authored-by: tchaton <thomas@grid.ai>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* rename ddp sequential -> rpc sequential for special test

* revert

* fix stupid merge problem

* Use property in connector for sampler (#5913)

* merge the import conflicts

* fix spawning of processes in slurm

* [wip] Fix some bugs for TPU [skip ci] (#5878)

* fixed for single tpu

* fixed spawn

* fixed spawn

* update

* update

* wip

* resolve bugs

* resolve bug

* update on comment

* removed decorator

* resolve comments

* set to 4

* update

* update

* need cleaning

* update

* update

* update

* resolve flake8

* resolve bugs

* exclude broadcast

* resolve bugs

* change test

* update

* update

* skip if meet fails

* properly raise trace

* update

* add catch

* wrap test

* resolve typo

* update

* typo

Co-authored-by: Lezwon Castelino <lezwon@gmail.com>
Co-authored-by: Your Name <you@example.com>

* resolve some tests

* update

* fix imports

* update

* resolve flake8

* update azure pipeline

* skip a sharded test on cpu that requires a gpu

* resolve tpus

* resolve bug

* resolve flake8

* update

* update utils

* revert permission change on files

* suggestions from carlos

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* remove unrelated formatting changes

* remove incomplete comment

* Update pytorch_lightning/accelerators/__init__.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* remove unrelated formatting change

* add types

* warn 1.7 ddp manual backward only if ddp kwarg unset

* yapf + isort

* pep8 unused imports

* fix cyclic import in docs

* Apply suggestions from code review

* typer in accelerator.py

* typo

* resolve flake8

* update code

* update

* Update pytorch_lightning/trainer/predict_loop.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Update pytorch_lightning/trainer/predict_loop.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* fix merge

* fix merge

* reset legacy accelerator

* add missing rename dispatch

* rename post training

* update code

* resolved comments

* typo

* typo

* add flow description

* resolve comments

* update on comments

* update flow

* add backticks

* resolve tpu

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
Co-authored-by: justusschock <justus.schock@posteo.de>
Co-authored-by: Justus Schock <justus.schock@rwth-aachen.de>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: root <root@ip-172-31-88-60.ec2.internal>
Co-authored-by: Lezwon Castelino <lezwon@gmail.com>
Co-authored-by: Your Name <you@example.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
2021-02-16 17:11:56 -05:00
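
The user-facing outcome of the PredictLoop work in #5752 above is a predict entry point on the Trainer. A rough sketch, assuming batches are routed to the module's forward and that dataloaders are passed positionally:

    import torch
    from torch.utils.data import DataLoader
    from pytorch_lightning import LightningModule, Trainer


    class TinyModel(LightningModule):
        """Toy module for illustration only."""

        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(8, 1)

        def forward(self, x):
            return self.layer(x)


    # Sketch of the predict entry point around #5752; hook routing details
    # are assumptions, not taken from the commit text.
    model = TinyModel()
    loader = DataLoader(torch.randn(16, 8), batch_size=4)
    predictions = Trainer(max_epochs=1).predict(model, loader)
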
chaton a52be5bb07
[Hot Fix] Ensure process_dataloader is called when tpu_cores > 1 to use Parallel DataLoader (#6015)
* hotfix for tpu

* update changelog

* Update CHANGELOG.md

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
2021-02-16 17:02:25 -05:00
chaton 6e79bef996
[accelerator][FeatBugFix] Improve manual optimization API (#5771)
* fix trainer.model access

* move properties

* fix test_transfer_batch_hook

* fix auto_select_gpus

* fix omegaconf test

* fix test that needs to simulate slurm ddp

* add horovod plugin

* fix test with named arguments

* clean up whitespace

* fix datamodules test

* remove old accelerators

* fix naming

* move old plugins

* move to plugins

* create precision subpackage

* create training_type subpackage

* fix all new import errors

* fix wrong arguments order passed to test

* fix LR finder

* Added sharded training type and amp plugin

* Move clip grad to precision plugin

* Added sharded spawn, select accelerators based on distributed_backend + enable custom fp16 plugin automatically

* Fix import issue, attempting to fix tests

* Fix initial test

* Reflect hook logic from master, should wrap model after move to device

* Optional state consolidation, since master has optimizers not wrapped

* change attribute for instance test

* reset optimizers

optimizers are not used in main process, so state would be wrong.

* legacy

* imports in accel

* legacy2

* trainer imports

* fix import errors after rebase

* move hook to new setup location

* provide unwrapping logic

* fix trainer callback system

* added ddp2 implementation

* fix imports .legacy

* move plugins

* restore legacy

* drop test.py from root

* add tpu accelerator and plugins

* fixes

* fix lightning optimizer merge

* reset bugreportmodel

* unwrapping

* step routing forward

* model access

* unwrap

* opt

* integrate distrib_type

* sync changes

* sync

* fixes

* add forgotten generators

* add missing logic

* update

* import

* missed imports

* import fixes

* isort

* mv f

* changelog

* format

* move helper to parallel plugin

* d

* add world size

* clean up

* duplicate

* activate ddp_sharded and tpu

* set nvidia flags

* remove unused colab var

* use_tpu <-> on_tpu attrs

* make some ddp_cpu and clusterplugin tests pass

* Ref/accelerator connector (#5742)

* final cleanup

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* connector cleanup

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* trainer cleanup

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* accelerator cleanup + missing logic in accelerator connector

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* add missing changes to callbacks

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* reflect accelerator changes to lightning module

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* clean cluster envs

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* cleanup plugins

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* add broadcasting

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* yapf

* remove plugin connector

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* plugins

* manual optimization

* update optimizer routing

* add rank to torchelastic

* fix memory mixed precision

* setstate on trainer for pickling in ddp spawn

* add predict method

* add back commented accelerator code

* adapt test for sync_batch_norm to new plugin

* fix deprecated tests

* fix ddp cpu choice when no num_processes are given

* yapf format

* skip a memory test that cannot pass anymore

* update on comments

* fix pickle error in spawn plugin

* x

* avoid

* x

* fix cyclic import in docs build

* add support for sharded

* update typing

* add sharded and sharded_spawn to distributed types

* make unwrap model default

* refactor LightningShardedDataParallel similar to LightningDistributedDataParallel

* update sharded spawn to reflect changes

* update sharded to reflect changes

* Merge 1.1.5 changes

* fix merge

* fix merge

* yapf isort

* fix merge

* yapf isort

* fix indentation in test

* copy over reinit scheduler implementation from dev1.2

* fix apex tracking calls with dev_debugger

* reduce diff to dev1.2, clean up

* fix trainer config test  when gpus>0 and num_processes >0 and ddp_cpu

* sort plugin tests legacy/new

* fix error handling for amp on cpu

* fix merge


fix merge


fix merge

* [Feat] Resolve manual_backward (#5837)

* resolve manual_backward

* resolve flake8

* update

* resolve for ddp_spawn

* resolve flake8

* resolve flake8

* resolve flake8

Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>

* fix tests/accelerator tests on cpu

* [BugFix] Resolve manual optimization (#5852)

* resolve manual_optimization

* update

* update

Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>

* Remove copy trainer parameters to happen earlier within the loop and add safe guard to get ref model (#5856)

* resolve a bug

* Accelerator refactor sharded rpc (#5854)

* rpc branch

* merge

* update handling of rpc

* make devices etc. Optional in RPC

* set devices etc. later if necessary

* remove devices from sequential

* make devices optional in rpc

* fix import

* uncomment everything

* fix cluster selection

Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>

* resolve bug

* fix assert in rpc test

* resolve a test

* fix docs compilation

* accelerator refactor - fix for sharded parity test (#5866)

* fix memory issue with ddp_spawn

* x


x


x


x


x


x


x


x


x

* x

* Remove DDP2 as this does not apply

* Add missing pre optimizer hook to ensure lambda closure is called

* fix apex docstring

* [accelerator][BugFix] Resolve some test for 1 gpu (#5863)

* update

* revert init

* resolve a bug

* update

* resolve flake8

* update

* update

* update

* revert init

* resolve a bug

* update

* resolve flake8

* update

* update

* update

* update

* update

* revert init

* resolve a bug

* update

* resolve flake8

* update

* update

* update

* revert init

* update

* resolve flake8

* update

* update

* update

* update

* update

* all_gather

* update

* make plugins work, add misconfig for RPC

* update

* update

* remove breaking test

* resolve some tests

* resolve flake8

* revert to ddp_spawn

Co-authored-by: root <root@ip-172-31-88-60.ec2.internal>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>
Co-authored-by: Justus Schock <justus.schock@rwth-aachen.de>

* yapf isort

* resolve flake8

* fix apex doctests

* fix apex doctests 2

* resolve docs

* update drone

* clean env

* update

* update

* update

* update

* merge

* Fix RPC related tests, clean out old API, update for new accelerator API [skip ci] (#5881)

* Fix RPC related tests, clean out old API, update for new accelerator API

* Move tests out of legacy folder, update paths and names

* Update test_remove_1-4.py

* Expose properties for tpu cores/gpus/num_gpus

* Add root GPU property

* Move properties to properties.py

* move tests that were previously in drone

* Fix root GPU property (#5908)

* Move root GPU to property, remove horovod set as this is handled in horovod plugin, ensure we mock correctly to set GPU accelerator

* Add missing tests back

* fix best model path transfer when no checkpoint callback available

* Fix setup hook order [wip] (#5858)

* Call trainer setup hook before accelerator setup

* Add test case

* add new test

* typo

* fix callback order in test

Co-authored-by: tchaton <thomas@grid.ai>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* rename ddp sequential -> rpc sequential for special test

* revert

* fix stupid merge problem

* Use property in connector for sampler (#5913)

* merge the import conflicts

* fix spawning of processes in slurm

* [wip] Fix some bugs for TPU [skip ci] (#5878)

* fixed for single tpu

* fixed spawn

* fixed spawn

* update

* update

* wip

* resolve bugs

* resolve bug

* update on comment

* removed decorator

* resolve comments

* set to 4

* update

* update

* need cleaning

* update

* update

* update

* resolve flake8

* resolve bugs

* exclude broadcast

* resolve bugs

* change test

* update

* update

* skip if meet fails

* properly raise trace

* update

* add catch

* wrap test

* resolve typo

* update

* typo

Co-authored-by: Lezwon Castelino <lezwon@gmail.com>
Co-authored-by: Your Name <you@example.com>

* resolve some tests

* update

* fix imports

* update

* resolve flake8

* update azure pipeline

* skip a sharded test on cpu that requires a gpu

* resolve tpus

* resolve bug

* resolve flake8

* update

* update utils

* revert permission change on files

* suggestions from carlos

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* remove unrelated formatting changes

* remove incomplete comment

* Update pytorch_lightning/accelerators/__init__.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* remove unrelated formatting change

* add types

* warn 1.7 ddp manual backward only if ddp kwarg unset

* yapf + isort

* pep8 unused imports

* fix cyclic import in docs

* Apply suggestions from code review

* typer in accelerator.py

* typo

* Apply suggestions from code review

* formatting

* update on comments

* update typo

* Update pytorch_lightning/trainer/properties.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* update

* update on comments

* resolve some comments

* update on comments

* resolve test

* add toggle_model

* update

* update on comments

* update doc

* typo

* update

* typo

* remove space

* update

* update on comments

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: justusschock <justus.schock@posteo.de>
Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
Co-authored-by: Justus Schock <justus.schock@rwth-aachen.de>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: root <root@ip-172-31-88-60.ec2.internal>
Co-authored-by: Lezwon Castelino <lezwon@gmail.com>
Co-authored-by: Your Name <you@example.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
2021-02-16 16:00:35 -05:00
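
For orientation, a sketch of the manual-optimization flow that #5771 above reworks: the module opts out of automatic optimization, fetches its optimizer via self.optimizers(), and drives backward/step itself. Property and hook names reflect the 1.2-era API as best recalled and should be treated as assumptions.

    import torch
    from pytorch_lightning import LightningModule


    class ManualOptModel(LightningModule):
        """Manual optimization sketch for #5771 (names assumed)."""

        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(8, 1)
            self.automatic_optimization = False  # opt out of automatic optimization

        def forward(self, x):
            return self.layer(x)

        def training_step(self, batch, batch_idx):
            opt = self.optimizers()            # LightningOptimizer wrapper
            x, y = batch
            loss = torch.nn.functional.mse_loss(self(x), y)
            opt.zero_grad()
            self.manual_backward(loss)         # replaces loss.backward()
            opt.step()

        def configure_optimizers(self):
            return torch.optim.SGD(self.parameters(), lr=0.1)
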
Jirka Borovec 1c87f1f6cd
remove legacy plugins (#5950)
* remove legacy plugins

* imports

* formatting

* fix docs references

* fix cluster environment inheritance

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-02-16 19:20:58 +00:00
Sean Naren b40d414463
Fix error in pre-optim logic (#5995) 2021-02-16 19:05:44 +00:00
Justus Schock da6dbc8d1d
PoC: Accelerator refactor (#5743)
* restoring the result from subprocess

* fix queue.get() order for results

* add missing "block_backward_sync" context manager

* add missing "block_backward_sync" context manager

* fix sync_batchnorm

* fix supported gpu-ids for tuple

* fix clip gradients and inf recursion

* accelerator selection: added cluster_environment plugin

* fix torchelastic test

* fix reduce early stopping decision for DDP

* fix tests: callbacks, conversion to lightning optimizer

* fix lightning optimizer does not pickle

* fix setting benchmark and deterministic option

* fix slurm amp test

* fix prepare_data test and determine node_rank

* fix retrieving last path when testing

* remove obsolete plugin argument

* fix test: test_trainer_config

* fix torchscript tests

* fix trainer.model access

* move properties

* fix test_transfer_batch_hook

* fix auto_select_gpus

* fix omegaconf test

* fix test that needs to simulate slurm ddp

* add horovod plugin

* fix test with named arguments

* clean up whitespace

* fix datamodules test

* remove old accelerators

* fix naming

* move old plugins

* move to plugins

* create precision subpackage

* create training_type subpackage

* fix all new import errors

* fix wrong arguments order passed to test

* fix LR finder

* Added sharded training type and amp plugin

* Move clip grad to precision plugin

* Added sharded spawn, select accelerators based on distributed_backend + enable custom fp16 plugin automatically

* Fix import issue, attempting to fix tests

* Fix initial test

* Reflect hook logic from master, should wrap model after move to device

* Optional state consolidation, since master has optimizers not wrapped

* change attribute for instance test

* reset optimizers

optimizers are not used in main process, so state would be wrong.

* legacy

* imports in accel

* legacy2

* trainer imports

* fix import errors after rebase

* move hook to new setup location

* provide unwrapping logic

* fix trainer callback system

* added ddp2 implementation

* fix imports .legacy

* move plugins

* restore legacy

* drop test.py from root

* add tpu accelerator and plugins

* fixes

* fix lightning optimizer merge

* reset bugreportmodel

* unwrapping

* step routing forward

* model access

* unwrap

* opt

* integrate distrib_type

* sync changes

* sync

* fixes

* add forgotten generators

* add missing logic

* update

* import

* missed imports

* import fixes

* isort

* mv f

* changelog

* format

* move helper to parallel plugin

* d

* add world size

* clean up

* duplicate

* activate ddp_sharded and tpu

* set nvidia flags

* remove unused colab var

* use_tpu <-> on_tpu attrs

* make some ddp_cpu and clusterplugin tests pass

* Ref/accelerator connector (#5742)

* final cleanup

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* connector cleanup

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* trainer cleanup

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* accelerator cleanup + missing logic in accelerator connector

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* add missing changes to callbacks

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* reflect accelerator changes to lightning module

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* clean cluster envs

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* cleanup plugins

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* add broadcasting

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* yapf

* remove plugin connector

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* plugins

* manual optimization

* update optimizer routing

* add rank to torchelastic

* fix memory mixed precision

* setstate on trainer for pickling in ddp spawn

* add predict method

* add back commented accelerator code

* adapt test for sync_batch_norm to new plugin

* fix deprecated tests

* fix ddp cpu choice when no num_processes are given

* yapf format

* skip a memory test that cannot pass anymore

* fix pickle error in spawn plugin

* x

* avoid

* x

* fix cyclic import in docs build

* add support for sharded

* update typing

* add sharded and sharded_spawn to distributed types

* make unwrap model default

* refactor LightningShardedDataParallel similar to LightningDistributedDataParallel

* update sharded spawn to reflect changes

* update sharded to reflect changes

* Merge 1.1.5 changes

* fix merge

* fix merge

* yapf isort

* fix merge

* yapf isort

* fix indentation in test

* copy over reinit scheduler implementation from dev1.2

* fix apex tracking calls with dev_debugger

* reduce diff to dev1.2, clean up

* fix trainer config test  when gpus>0 and num_processes >0 and ddp_cpu

* sort plugin tests legacy/new

* fix error handling for amp on cpu

* fix merge


fix merge


fix merge

* [Feat] Resolve manual_backward (#5837)

* resolve manual_backward

* resolve flake8

* update

* resolve for ddp_spawn

* resolve flake8

* resolve flake8

* resolve flake8

Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>

* fix tests/accelerator tests on cpu

* [BugFix] Resolve manual optimization (#5852)

* resolve manual_optimization

* update

* update

Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>

* Remove copy trainer parameters to happen earlier within the loop and add safe guard to get ref model (#5856)

* resolve a bug

* Accelerator refactor sharded rpc (#5854)

* rpc branch

* merge

* update handling of rpc

* make devices etc. Optional in RPC

* set devices etc. later if necessary

* remove devices from sequential

* make devices optional in rpc

* fix import

* uncomment everything

* fix cluster selection

Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>

* resolve bug

* fix assert in rpc test

* resolve a test

* fix docs compilation

* accelerator refactor - fix for sharded parity test (#5866)

* fix memory issue with ddp_spawn

* x


x


x


x


x


x


x


x


x

* x

* Remove DDP2 as this does not apply

* Add missing pre optimizer hook to ensure lambda closure is called

* fix apex docstring

* [accelerator][BugFix] Resolve some test for 1 gpu (#5863)

* update

* revert init

* resolve a bug

* update

* resolve flake8

* update

* update

* update

* revert init

* resolve a bug

* update

* resolve flake8

* update

* update

* update

* update

* update

* revert init

* resolve a bug

* update

* resolve flake8

* update

* update

* update

* revert init

* update

* resolve flake8

* update

* update

* update

* update

* update

* all_gather

* update

* make plugins work, add misconfig for RPC

* update

* update

* remove breaking test

* resolve some tests

* resolve flake8

* revert to ddp_spawn

Co-authored-by: root <root@ip-172-31-88-60.ec2.internal>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>
Co-authored-by: Justus Schock <justus.schock@rwth-aachen.de>

* yapf isort

* resolve flake8

* fix apex doctests

* fix apex doctests 2

* resolve docs

* update drone

* clean env

* update

* update

* update

* update

* merge

* Fix RPC related tests, clean out old API, update for new accelerator API [skip ci] (#5881)

* Fix RPC related tests, clean out old API, update for new accelerator API

* Move tests out of legacy folder, update paths and names

* Update test_remove_1-4.py

* Expose properties for tpu cores/gpus/num_gpus

* Add root GPU property

* Move properties to properties.py

* move tests that were previously in drone

* Fix root GPU property (#5908)

* Move root GPU to property, remove horovod set as this is handled in horovod plugin, ensure we mock correctly to set GPU accelerator

* Add missing tests back

* fix best model path transfer when no checkpoint callback available

* Fix setup hook order [wip] (#5858)

* Call trainer setup hook before accelerator setup

* Add test case

* add new test

* typo

* fix callback order in test

Co-authored-by: tchaton <thomas@grid.ai>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* rename ddp sequential -> rpc sequential for special test

* revert

* fix stupid merge problem

* Use property in connector for sampler (#5913)

* merge the import conflicts

* fix spawning of processes in slurm

* [wip] Fix some bugs for TPU [skip ci] (#5878)

* fixed for single tpu

* fixed spawn

* fixed spawn

* update

* update

* wip

* resolve bugs

* resolve bug

* update on comment

* removed decorator

* resolve comments

* set to 4

* update

* update

* need cleaning

* update

* update

* update

* resolve flake8

* resolve bugs

* exclude broadcast

* resolve bugs

* change test

* update

* update

* skip if meet fails

* properly raise trace

* update

* add catch

* wrap test

* resolve typo

* update

* typo

Co-authored-by: Lezwon Castelino <lezwon@gmail.com>
Co-authored-by: Your Name <you@example.com>

* resolve some tests

* update

* fix imports

* update

* resolve flake8

* update azure pipeline

* skip a sharded test on cpu that requires a gpu

* resolve tpus

* resolve bug

* resolve flake8

* update

* update utils

* revert permission change on files

* suggestions from carlos

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* remove unrelated formatting changes

* remove incomplete comment

* Update pytorch_lightning/accelerators/__init__.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* remove unrelated formatting change

* add types

* warn 1.7 ddp manual backward only if ddp kwarg unset

* yapf + isort

* pep8 unused imports

* fix cyclic import in docs

* Apply suggestions from code review

* typer in accelerator.py

* typo

* Apply suggestions from code review

* formatting

* update on comments

* update typo

* Update pytorch_lightning/trainer/properties.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* update

* suggestion from code review

* suggestion from code review

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
Co-authored-by: chaton <thomas@grid.ai>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: root <root@ip-172-31-88-60.ec2.internal>
Co-authored-by: Lezwon Castelino <lezwon@gmail.com>
Co-authored-by: Your Name <you@example.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
2021-02-12 15:48:56 -05:00
Gyeongjae Choi f20c1d1cdb Fix typo in docstring (#5835)
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
(cherry picked from commit 0a50bb406f)
2021-02-08 20:22:39 +01:00
Tobias Maier 1d99530c66 Fix root node resolution in slurm environment
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>
Co-authored-by: chaton <thomas@grid.ai>
2021-02-03 19:41:46 +01:00
Adrian Wälchli 8943d8bca0
add missing logic to new plugins and accelerator (#5734)
* add missing logic

* missed imports

* import fixes

* isort

* mv f

* changelog

* format

* move helper to parallel plugin

* d
2021-02-01 13:23:53 -05:00
Justus Schock 9064b83caf
Accelerator Refactor/RPC + Sharded (#5732)
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
2021-02-01 14:49:15 +01:00
Justus Schock b3ebc18bcb
Hardware specific parts of Accelerator Refactoring (#5719)
* add basic accelerator class.
Co-Authored with @awaelchi

* pep8

Co-authored-by: @awaelchi

* add cpu accelerator

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* add gpu accelerator

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* add tpu accelerator

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* add accelerator connector

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* add single device training

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* add single tpu

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* add tpu spawn

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* make on_colab_kaggle utility func

* add basic accelerator class.
Co-Authored with @awaelchi

* pep8

Co-authored-by: @awaelchi

* add cpu accelerator

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* add gpu accelerator

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* add tpu accelerator

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* add accelerator connector

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* add single device training

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* add single tpu

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* add tpu spawn

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* make on_colab_kaggle utility func

* fixes

* move

* yapf

* .

* .

* .

* flake8

* sync accelerator connector changes from dev1.2

* changelog

* fix tpu handling

* tpu

* aval

* yapf

* Update pytorch_lightning/plugins/training_type/tpu_spawn.py

Co-authored-by: chaton <thomas@grid.ai>

* Update pytorch_lightning/accelerators/accelerator_connector.py

Co-authored-by: chaton <thomas@grid.ai>

* Update pytorch_lightning/plugins/training_type/tpu_spawn.py

Co-authored-by: chaton <thomas@grid.ai>

* Update tpu_spawn.py

* Update pytorch_lightning/accelerators/accelerator_connector.py

Co-authored-by: chaton <thomas@grid.ai>

* indentation

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
Co-authored-by: chaton <thomas@grid.ai>
2021-02-01 08:34:59 -05:00
Justus Schock 069ae27cef
Accelerator Refactor: Precision Plugins (#5718)
* add basic accelerator class.
Co-Authored with @awaelchi

* add basic training type plugin.
Co-Authored with @awaelchi

* pep8

Co-authored-by: @awaelchi

* update copyright

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* add apex_amp

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* add mixed base class

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* add native amp

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* add native amp sharded

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* add tpu bfloat

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* add inits

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Update precision_plugin.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-01-31 13:12:02 -05:00
Adrian Wälchli 3bacac7734
accelerator refactor - add parallel plugins (#5714)
Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
2021-01-31 07:48:14 -05:00
Adrian Wälchli 692f77b8a7
Refactor LightningDataParallel (#5670)
* module

* fix model access

* scalar conversion

* refactor

* kwargs

* auto unsqueeze

* refactor code duplication

* clean up

* docs

* update dp docs

* changelog

* generalize test

* test

* rename

* warning cache

* isort

* unsqueezing test

* device

* device

* scalar test

* device

* device

* include coverage of overrides

* clear

* add deprecation test

* docs

* improve coverage

* increase coverage

* fix merge

* extend test

* rename base class

* mention the predict method in docs

* combine iteration over collection

* remove override

* move

* line

* Apply suggestions from code review

* fix running stage

* f401

* fix cyclic import

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2021-01-31 06:08:16 -05:00
Justus Schock 5d239ccd70
Base classes for accelerator refactoring (#5715)
* add basic accelerator class.
Co-Authored with @awaelchi

* Add base plugin class.
Co-authored with @awaelchi

* add basic training type plugin.
Co-Authored with @awaelchi

* add basic precision plugin.
Co-Authored with @awaelchi

* Add missing inits.
Co-authored with @awaelchi

* pep8

Co-authored-by: @awaelchi

* ignore  flake8

* coverage omit

* imports in init

* lost

* imports

* flake8

* .

* .

* chlog

* Update pytorch_lightning/plugins/training_type/training_type_plugin.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/plugins/training_type/training_type_plugin.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/plugins/training_type/training_type_plugin.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/plugins/training_type/training_type_plugin.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/plugins/training_type/training_type_plugin.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/plugins/training_type/training_type_plugin.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/plugins/training_type/training_type_plugin.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2021-01-30 14:55:28 -05:00
Arnaud Gelas 8dfcc078c4
Fix isort failures in core (#5526)
Remove from skipped module in pyproject.toml and fix failures on:
- pytorch_lightning/core/*.py
2021-01-27 05:02:16 -05:00
Jirka Borovec 7e2e874d95
Refactor: legacy accelerators and plugins (#5645)
* tests: legacy

* legacy: accel

* legacy: plug

* fix imports

* mypy

* flake8
2021-01-26 20:04:36 -05:00
Jirka Borovec 53b0ae49b9 fix imports / isort / flake8 2021-01-26 14:57:34 +01:00
chaton 0435e23a64 deprecate enable_pl_optimizer as it is not restored properly (#5244)
* update

* clean test

* still in progress

* update test

* update

* update

* resolve flake

* add test for zero_grad

* update

* works without accumulated_grad

* update

* update

* resolve amp

* revert back to True

* update

* clean tests

* cleaned out

* typo

* update test

* git repare bug

* remove print

* update

* Fix formatting/optimizer imports

* Refactor the test for cleanliness

* Add vanilla model to the test, better var names

* Fixed var names, let's clean up these mock tests

* repare test

* update test

* resolve flake8

* add manual_optimization

* update tests

* resolve flake8

* add random accumulate_grad_batches

* improve test

* Update tests/trainer/optimization/test_parity_automatic_optimization.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update tests/trainer/optimization/test_parity_automatic_optimization.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* update

* clean tests

* correct bug

* Apply suggestions from code review

* format

* address comments

* update on comments

* wip

* typo

* deprecate enable_pl_optimizer

* resolve latest bugs

* update

* resolve merge

* add comment

* Update pytorch_lightning/core/lightning.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update tests/deprecated_api/test_remove_1-3.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/trainer/connectors/optimizer_connector.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/trainer/trainer.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/trainer/trainer.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update tests/trainer/optimization/test_parity_automatic_optimization.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* update on comments

* update restore

* add a property

* remove setstate as not needed anymore

* update test

* provide optimizer to on_before_zero_grad

* update on comments

* update on comments

* Update pytorch_lightning/trainer/trainer.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Update tests/trainer/optimization/test_parity_automatic_optimization.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Update tests/trainer/optimization/test_parity_automatic_optimization.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Update tests/trainer/optimization/test_parity_automatic_optimization.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* modify import

* update changelog

* resolve flake8

* update

* update

* clean doc

Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-62-109.ec2.internal>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>

(cherry picked from commit f2e99d617f)
2021-01-26 14:29:46 +01:00
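
In practice the deprecation in #5244 above means the flag should simply be dropped, as sketched below; whether a given release still accepts it with a warning is not stated in the commit.

    from pytorch_lightning import Trainer

    # Deprecated after #5244 -- previously toggled explicit wrapping of
    # optimizers in LightningOptimizer:
    # trainer = Trainer(enable_pl_optimizer=True)

    # Preferred: omit the flag and let Lightning manage the wrapping.
    trainer = Trainer()
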
chaton f2f4a49271 [bug-fix] Call transfer_batch_to_device in DDPPlugin (#5195)
* hacking out

* update

* remove useless on_before_forward

* update

* remove overriden

* iremove os

* use on_before_forward

* resolve flake8

* add test

* update

* add single_process_per_device

* resolve flake8

* update

* resolve

* update

* update

* update

* add comment

* resolve bug with sharded

* update

* remove property

* update

* resolve test

* resolve bug

* update on comments

* update doc

* Update pytorch_lightning/core/hooks.py

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>

* update on comments

* Update pytorch_lightning/plugins/ddp_plugin.py

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>

* Update pytorch_lightning/plugins/ddp_plugin.py

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>

* resolve pep8

* add device_ids to pipe

* update on comments

* update

* resolve

* update

* update

* update

Co-authored-by: Ubuntu <ubuntu@ip-172-31-62-109.ec2.internal>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>

(cherry picked from commit d510707bc9)
2021-01-26 14:28:45 +01:00
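
For context on #5195 above, a sketch of the hook that the DDP plugin now routes batches through; overriding it lets custom batch structures be moved to the right device. The two-argument signature matches the hook of that era as best recalled and is an assumption.

    from pytorch_lightning import LightningModule


    class CustomTransferModel(LightningModule):
        """Sketch: custom device transfer honoured under DDP after #5195."""

        def transfer_batch_to_device(self, batch, device):
            # Move a dict-of-tensors batch by hand; defer to the default
            # implementation for everything else.
            if isinstance(batch, dict):
                return {k: v.to(device) for k, v in batch.items()}
            return super().transfer_batch_to_device(batch, device)
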
Jirka Borovec f782230412
docs cleaning - testcode (#5595)
* testcode - python

* revert

* simple

* testcode @rst

* pl

* fix

* pip

* update

* conf

* conf

* nn.

* typo
2021-01-26 04:44:54 -05:00
Jirka Borovec 2846322f60
fix docs render (#5610) 2021-01-25 20:21:00 -05:00
Jirka Borovec 6386f45de7
clarify Trainer running state attributes (#5589)
* update Trainer is_ attributes

* tests

* more

* isort

* split

* rename

* check

* fix
2021-01-24 10:45:42 +00:00
Adrian Wälchli e806bb77fa
Refactor LightningDistributedDataParallel (#5185)
* add wrapper

* add squeeze

* replace LightningDistributedDP

* update import

* module access

* inputs

* refactor warning

* update

* resolve flake8

* remove old class

* set find unused params to False

* update docstrings

* update docs

* update docs

* add changelog

* deprecation

* rename wrapper -> module

* rename pl_module

* add unit tests

* Revert "add changelog"

This reverts commit 02ec0a6864f4ba2ace3bb6fc6ebc364e1a80ffd7.

* Revert "set find unused params to False"

This reverts commit 8e451515e6ba3227d00f4a5cb63f332cfedb7b30.

Co-authored-by: Ubuntu <thomas@grid.ai>
2021-01-13 14:35:42 -05:00
Adrian Wälchli 61308138c3
set find_unused_parameters=False in DDP as in pytorch (#5435)
* set find unused params to False

* add changelog

* fix changelog

* fix test

* update docs

* update changelog

Co-authored-by: chaton <thomas@grid.ai>
2021-01-13 10:13:40 -05:00
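
The user-visible effect of #5435 above, sketched under the assumption that DDPPlugin forwards the keyword to torch's DistributedDataParallel: the default is now False, so models with genuinely unused parameters must opt back in.

    from pytorch_lightning import Trainer
    from pytorch_lightning.plugins import DDPPlugin

    # After #5435 Lightning matches PyTorch's default of
    # find_unused_parameters=False; opt back in only when the model really
    # has parameters that receive no gradient.
    trainer = Trainer(
        gpus=2,
        accelerator="ddp",
        plugins=DDPPlugin(find_unused_parameters=True),
    )
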
Jirka Borovec 54d20dc596
Refactor: clean trainer device & distrib getters (#5300)
* warnings

* .

* .

* flake8

* .

* .

* .

* use_tpu

* use_dp

* .

* use_ddp

* .

* use_horovod

* .

* .

* .
2021-01-12 05:22:37 -05:00
Carlos Mocholí 3ee3c42035 Prepare 1.1.3 release (#5365)
* Prepare 1.1.3 release

* Fix flake8 error

* suppress

* Remove 1.1.4 section

* Add missing commits to CHANGELOG

* Update PR template

* Add missing commit

* fix

* Update CHANGELOG.md

* Apply suggestions from code review

* Apply suggestions from code review

Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

(cherry picked from commit 4d9db866a1)
2021-01-06 15:17:27 +01:00
chaton 56437e98a6 [bug-fix] Trainer.test points to latest best_model_path (#5161)
* resolve bug

* update code

* add set -e

* Update pytorch_lightning/callbacks/model_checkpoint.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* update test

* Update tests/checkpointing/test_trainer_checkpoint.py

Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>

* Update tests/checkpointing/test_trainer_checkpoint.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* update on comments

* resolve test

* convert to set

* update

* add error triggering

* update

* update on comments

* update

* resolve import

* update

* update

* Update pytorch_lightning/plugins/rpc_plugin.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* update

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-62-109.ec2.internal>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

(cherry picked from commit d5b367871f)
2021-01-06 15:14:10 +01:00
Heewon Jeon(gogamza) c0e9a78db4 supports --num-nodes on DDPSequentialPlugin() (#5327)
(cherry picked from commit d20fd8e5ab)
2021-01-06 12:48:17 +01:00
Jirka Borovec b72ed71d4e
Refactor: clean trainer device & distrib setters (#5297)
* naive replace

* simplify

* clean

* .

* fix

* .

* fix

* fix
2021-01-04 17:10:13 +00:00
Jirka Borovec 0f36525e8f
fix/enable - check F401 (#5201)
* refactor - check F401

* missed

* fix
2020-12-21 10:15:04 +01:00
Jirka Borovec 2d54116baa
annotat unused vars (#5017)
* annotate all unused vars

* rank_zero_warn

* Apply suggestions from code review

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* f1 fixed

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2020-12-19 13:53:06 +01:00
Jirka Borovec 059eaecbb4
set xxx_AVAILABLE as protected (#5082)
* set xxx_AVAILABLE as protected

* docs
2020-12-14 20:19:05 +05:30
chaton ef8ef12fd0
[feat] pp 2/n (#5026)
* Added changes for RPC plugin

* Add missing kwargs

* Fix code format

* Loading refactors by introducing is_distributed var, fix optimizer step flow

* Add rpc guard

* Added docstrings and typing

* resolve comments

* Add additional rpc hook, refactor name of exit process hook for clarity

* remove annotation

* Modify behaviour to allow optional return, add test for rpc plugin

* resolve tests

* rename is_ddp_based

* update

* update for windows

* update

* resolve test

* code smell

* Added sequential plugin

* resolve bug

* update

* cleanup

* add Exception

* resolve docs

* Remove ddp support

* Revert distributed -> ddp

* Update pl_examples/basic_examples/conv_sequential_example.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update pl_examples/basic_examples/conv_sequential_example.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/plugins/ddp_sequential_plugin.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Address code review points

* Update pytorch_lightning/plugins/ddp_sequential_plugin.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/plugins/ddp_sequential_plugin.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Add missing return

* Fix formatting, add datamodule args

* add small comment

* resolve comments

* resolve comments

* update source for fairscale

* update extras

* remove staticmethod

* resolve flake8

* Skip tests that are failing due to bug upstream with multiple optimizers and shard

* update

* update on comments

* clean test

* latest comments

* remove old comments

* add todo

* Update version

* update

* resolve bugs

* resolve bugs

* update test

* remove hanging test

* Update pytorch_lightning/plugins/ddp_sequential_plugin.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* resolve on comments

* Update pytorch_lightning/plugins/ddp_sequential_plugin.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* resolve on comments

* Update pytorch_lightning/plugins/ddp_sequential_plugin.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Update pytorch_lightning/plugins/ddp_sequential_plugin.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Update pytorch_lightning/plugins/ddp_sequential_plugin.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Update pytorch_lightning/plugins/ddp_sequential_plugin.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* remove ImportError

Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2020-12-09 12:56:51 +00:00