Commit Graph

148 Commits

Author SHA1 Message Date
chaton f2e99d617f
deprecate enable_pl_optimizer as it is not restored properly (#5244)
* update

* clean test

* still in progress

* udpdate test

* update

* update

* resolve flake

* add test for zero_grad

* update

* works without accumulated_grad

* update

* update

* resolve amp

* revert back to True

* update

* clean tests

* cleaned out

* typo

* update test

* git repare bug

* remove print

* udpate

* Fix formatting/optimizer imports

* Refactor the test for cleanliness

* Add vanilla model to the test, better var names

* Fixed var names, let's clean up these mock tests

* repare test

* update test

* resolve flake8

* add manual_optimization

* update tests

* resolve flake8

* add random accumulate_grad_batches

* improve test

* Update tests/trainer/optimization/test_parity_automatic_optimization.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update tests/trainer/optimization/test_parity_automatic_optimization.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* update

* clean tests

* correct bug

* Apply suggestions from code review

* format

* adress comments

* update on comments

* wip

* typo

* depreceate enable_pl_optimizer

* resolve latest bugs

* update

* resolve merge

* add comment

* Update pytorch_lightning/core/lightning.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update tests/deprecated_api/test_remove_1-3.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/trainer/connectors/optimizer_connector.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/trainer/trainer.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/trainer/trainer.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update tests/trainer/optimization/test_parity_automatic_optimization.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* update on comments

* update restore

* add a property

* remove setstate as not needed anymore

* update test

* provide optimizer to on_before_zero_grad

* update on comments

* update on comments

* Update pytorch_lightning/trainer/trainer.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Update tests/trainer/optimization/test_parity_automatic_optimization.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Update tests/trainer/optimization/test_parity_automatic_optimization.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Update tests/trainer/optimization/test_parity_automatic_optimization.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* mofidy import

* update changelog

* resolve flake8

* update

* update

* clean doc

Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-62-109.ec2.internal>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
2021-01-08 16:13:12 -05:00
chaton d510707bc9
[bug-fix] Call transfer_batch_to_device in DDPlugin (#5195)
* hacking out

* update

* remove useless on_before_forward

* update

* remove overriden

* iremove os

* use on_before_forward

* resolve flake8

* add test

* update

* add single_process_per_device

* resolve flake8

* update

* resolve

* update

* update

* update

* add comment

* resolve bug with sharded

* update

* remove property

* update

* resolve test

* resolve bug

* update on comments

* update doc

* Update pytorch_lightning/core/hooks.py

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>

* update on comments

* Update pytorch_lightning/plugins/ddp_plugin.py

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>

* Update pytorch_lightning/plugins/ddp_plugin.py

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>

* resolve pep8

* add device_ids to pipe

* update on comments

* update

* resolve

* update

* update

* update

Co-authored-by: Ubuntu <ubuntu@ip-172-31-62-109.ec2.internal>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
2021-01-08 12:05:22 -05:00
ananthsub 94838d3be2
Fix hang in DDP HPC accelerators (#5157)
* Fix hang in DDP HPC accelerators

init_device was never called

* Update CHANGELOG.md
2020-12-16 21:07:11 +01:00
chaton 2c3d43dcb5
Initialize trainer with None in DDPAccelerator (#4915)
* Initialize trainer with None

* add typing to all accelerators

* resolve imports

* update

* add typing

* removed typo

* update

* Fix formatting and imports in accelerator

Co-authored-by: maxjeblick <maxjeblick@users.noreply.github.com>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Roger Shieh <sh.rog@protonmail.ch>
2020-12-10 15:24:44 +01:00
Jirka Borovec d5fa02e798
simplify accelerator steps (#5015)
* simplify accelerator steps

* Apply suggestions from code review

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
2020-12-10 18:36:13 +05:30
Jirka Borovec cdbddbe99f
release 1.1.0 (#5048)
* release 1.1.0

* pep8
2020-12-10 00:52:39 +00:00
Jirka Borovec ce9179591d
ref: clean config [1/n] add intermediate setters (#4990)
* add intermediate setters

* show inputs

* fix options

* move

* fix

* less talk

* fix

* talk less

* str

* cases

* rename

Co-authored-by: chaton <thomas@grid.ai>
2020-12-09 14:13:57 -05:00
Rohit Gupta bcbba3b702
Simplify GPU and TPU accelerator (#5024) 2020-12-09 14:12:44 -05:00
Jirka Borovec 53d7c9555c
drop usage of deprecated distributed_backend (#5009)
Co-authored-by: chaton <thomas@grid.ai>
Co-authored-by: Roger Shieh <sh.rog@protonmail.ch>
2020-12-09 09:18:23 +01:00
Ananya Harsh Jha 127454ade2
All gatherwith grads (#5012)
* all_gather

* ddp

* horovod

* grad tests

* fixed ddp

* ddp fixed, removed tpu, horovod for now

* changelog

* windows fix

* windows fix

* removed batch from ctx

* all_gather

* ddp

* horovod

* grad tests

* fixed ddp

* ddp fixed, removed tpu, horovod for now

* changelog

* windows fix

* windows fix

* removed batch from ctx

* removed code duplication

* merge

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-12-08 23:20:01 +00:00
Sean Naren ee9b3fe574
[feat] pp 1/n (#5016)
* Added changes for RPC plugin

* Add missing kwargs

* Fix code format

* Loading refactors by introducing is_distributed var, fix optimizer step flow

* Add rpc guard

* Added docstrings and typing

* resolve comments

* Add additional rpc hook, refactor name of exit process hook for clarity

* remove annotation

* Modify behaviour to allow optional return, add test for rpc plugin

* resolve tests

* rename is_ddp_based

* update

* update for windows

* update

* resolve test

* code smell

* Revert back to init_ddp_connection for backwards compat

* Swap to explicit name for property

* Add missing speed parity increase for CI variability, fix call counts for child process

Co-authored-by: tchaton <thomas@grid.ai>
2020-12-08 22:02:10 +00:00
maxjeblick 79ae66d026
Initialize trainer with None (#4847)
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: chaton <thomas@grid.ai>
Co-authored-by: edenlightning <66261195+edenlightning@users.noreply.github.com>
2020-12-08 22:49:55 +05:30
chaton 2393474350
[hotfix] ddp + manual_optimisation (#4976)
* Rely on ddp plugin for blocking sync behaviour, and skip if we're using manual optimization

* debug

* Revert "debug"

This reverts commit ccca6b6b

* Expose manual reduce for automatic optimization

* Add input arguments

* Enable parity test

* clean imports

* Expose hook after to ensure we reset

* Fix naming

* add

* fix test

* resolve on comments

* typo

* Update tests/trainer/optimization/test_manual_optimization.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update tests/trainer/optimization/test_manual_optimization.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* update on comments

* resolve comments

Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-12-07 19:31:54 +00:00
chaton 02152c1729
Simplify optimization Logic (#4984)
* Rely on ddp plugin for blocking sync behaviour, and skip if we're using manual optimization

* debug

* Revert "debug"

This reverts commit ccca6b6b

* Expose manual reduce for automatic optimization

* Add input arguments

* Enable parity test

* clean imports

* Expose hook after to ensure we reset

* Fix naming

* add

* fix test

* uniformize optimizer logic

* resolve test

* resovle flake8

* resolve amp bug

* update tests

* remove bug

* remove optimizer_step in accelerators

* typo

* update lightning optimizer

* set doesn't work with ddp_spawn

* resolve flake8

* update threshold

* ignore pyright

* correct codeFactor

* remove useless if

* remove zer_grad function

* simplify step

* remove typo

* resolve bug

* Apply suggestions from code review

* update on comments

* resolve bugs

* remove tests

* Update pytorch_lightning/trainer/configuration_validator.py

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>

* simplify testing

* add more tests

Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
2020-12-07 12:55:49 +00:00
Jirka Borovec 3976db597d
refactor imports of optional dependencies (#4859)
* refactor imports of optional dependencies

* fix

* fix

* fix

* fix

* fix

* flake8

* flake8

Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: chaton <thomas@grid.ai>
2020-12-04 10:26:10 +01:00
Lezwon Castelino 12cb9942a1
Tpu save (#4309)
* convert xla tensor to cpu before save

* move_to_cpu

* updated CHANGELOG.md

* added on_save to accelerators

* if accelerator is not None

* refactors

* change filename to run test

* run test_tpu_backend

* added xla_device_utils to tests

* added xla_device_utils to test

* removed tests

* Revert "added xla_device_utils to test"

This reverts commit 0c9316bb

* fixed pep

* increase timeout and print traceback

* lazy check tpu exists

* increased timeout
removed barrier for tpu during test
reduced epochs

* fixed torch_xla imports

* fix tests

* define xla utils

* fix test

* aval

* chlog

* docs

* aval

* Apply suggestions from code review

Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: chaton <thomas@grid.ai>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
2020-12-02 13:05:11 +00:00
chaton c2e6e68c7e
optimizer clean up (#4658)
* add LightningOptimizer

* typo

* add mock closure

* typo

* remove logic in optimizer_step

* update

* update

* update

* desactivate LightningOptimizer for hovorod

* resolve flake

* typo

* check optimizer name

* change name

* added backward to LightningOptimizer

* remove use_lightning_optimizer

* move update

* simplify init

* resolve comments

* resolve bug

* update

* update

* resolve bugs

* resolve flake8

* set state

* work manual_optimizer_step

* add doc

* add enable_pl_optimizer

* make optimizer_step

* add make_optimizer_step

* add examples

* resolve test

* add test_optimizer_return_options_enable_pl_optimizer

* add enable_pl_optimizer=True

* update

* update tests

* resolve bugs

* update

* set Trainer to False

* update

* resolve bugs

* update

* remove from doc

* resolve bug

* typo

* update

* set to True

* simplification

* typo

* resolve horovod

* unwrap horovod

* remove Optimizer

* resolve horovod

* move logic to amp_backend

* doesn't seem to be pickable

* update

* add again

* resolve some bugs

* cleanup

* resolve bug with AMP

* change __repr__

* round at -12

* udpate

* update

* update

* remove from horovod

* typo

* add convert_to_lightning_optimizers in each accelerators

* typo

* forgot

* forgot a convert_to_lightning_optimizers

* update

* update

* update

* increase coverage

* update

* resolve flake8

* update

* remove useless code

* resolve comments + add support for LightningOptimizer base class

* resolve flake

* check optimizer get wrapped back

* resolve DDPSharded

* reduce code

* lightningoptimizer

* Update pytorch_lightning/core/optimizer.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Update pytorch_lightning/core/lightning.py

* remove reference to step function

* Apply suggestions from code review

* update on comments

* resolve

* Update CHANGELOG.md

* add back training_step in apex and native_amp

* rename optimizer_step

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: William Falcon <waf2107@columbia.edu>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
2020-12-01 00:09:46 +00:00
Jirka Borovec 217650320e
simplify imports Omegaconf (#4873)
* hydra

* omegaconf
2020-11-27 01:00:56 +01:00
Jirka Borovec 442d57f1e9
simplify imports xla / TPU (#4872)
* xla

* tpu

* fix

* fix

* flake8
2020-11-27 00:37:48 +01:00
Sean Naren 404af43cde
5/n: Extract reference model call to plugins/accelerators (#4773)
* Encapsulate extracting reference model within the plugin to allow custom wrapper logic to live within the plugin/accelerators

* Add missing new lines

* Fix call to accelerator

* Removed double blank

* Use accelerator backend

* Handle case where wrapper has not been initialized within the plugin

* Added basic get model tests, add better typing

* Change model name

* Split GPU/DDP test

* Add stronger typing, skip ddp test on windows

* Fix import

* Fix import in dp

* Fixed PEP8 definition

* Add ddp launcher for ddp testing

* Modify accelerator reference model to property, change name to reflect func

* Revert property as this is incorrect.=

* Revert across accelerators

* Modified name to get_model_from_plugin

* Code review changes, fix issue with dp

* Add verb to function getter

Co-authored-by: chaton <thomas@grid.ai>
2020-11-23 17:21:47 +00:00
ananthsub 45c57600af
Move init_ddp_connection to DDP Plugin (#4407)
* Move init_ddp_connection to DDP Plugin

* cluster-env

* trainer?

* imports

* Update ddp_plugin.py

Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
2020-11-18 15:49:22 -05:00
Sean Naren e7134a9135
Sharded Plugin 2/n: Allow ddp plugin to modify optimizer state saving (#4675)
* Allow ddp plugin to modify optimizer state saving

* Rely on the accelerator for optimizer states

* Ensure we init the accelerator for the saving function

* Better comment for optim state dump

* Revert "Ensure we init the accelerator for the saving function"

This reverts commit af65effa

* Added accelerator check to initialize tuner before saving model checkpoint

* Simplify comment

* Revert "Added accelerator check to initialize tuner before saving model checkpoint"

This reverts commit f9929c0c

* Return single optimizer state to reduce duplication

* Fixed docstring

* Fixed typing

* Fixed comment

* Added CHANGELOG.md

Co-authored-by: chaton <thomas@grid.ai>
2020-11-18 16:38:35 +00:00
Sean Naren 8283680aa0
Sharded Plugin 3/n: Expose step input to DDP plugin (#4686)
* Allow ddp plugin to move the input to a different device if needed

* Swapped name to on_before_forward to align with hooks in the future

* Update pytorch_lightning/plugins/ddp_plugin.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Pass variable arg type to hook, add example

* Remove blank space (pep check)

* Added blank line

Co-authored-by: William Falcon <waf2107@columbia.edu>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-11-18 15:45:30 +00:00
chaton 4018237c30
[FEAT] Add lambda closure to manual_optimizer_step (#4618)
* added lambda_closure

* move to types

* add 2 new tests

* make example more complex

* add complex example to doc

* added more tests

* resolve doc

* typo

* update

* update tpu optimizer_step

* Apply suggestions from code review

* Update pytorch_lightning/core/lightning.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* update

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-11-12 19:22:06 +00:00
Sean Naren bacabaebaf
Sharded Accelerator 1/n: Expose clip gradients to plugins via abstract class (#4639)
* Added abstract precision plugin to expose clip_gradients function, use within accelerator to clip gradients

* Exclude model from override, keep optimizer (needed for sharded clip gradients), add override for O2 support apex

* Fix doc

* Applied codereview changes

* Refactored clip function to encapsulate tpu changes with tpu accelerator. Default to standard clip function for vanilla torch

* Pass correct grad clip val

* Moved var to property

* Apply code review suggestions
2020-11-12 17:18:09 +00:00
Sean Naren 33470ba605
Prevent crash if sync_dist=True on CPU (#4626)
* Added test/fix for sync_dist raising NotImplementedError

* Fixed comments/formatting

* Revert base class change, enforce sync tensors across accelerators, added GPU test
2020-11-11 22:04:05 +00:00
chaton 7e08b0d710
[bug-fix] DDP and automatic_optimization=False (#4485)
* resolve bug

* add self._running_manual_optim

* update

* update tests

* update lightning module

* resolve bug

* update tests

* update

* resolve pep8

* update

* replace by `ddp_spawn`

* temporary fix

* update

* update

* move update to training_loop

* make both ddp_spawn

* introduce `manual_optimizer_step`

* update changelog

* added changelog wrong place

* add force_optimizer_step

* update docstring for tests

* update optimizer_step

* update zero_grad

* resolve flake8

* move update into manual_optimizer_step

* add zero_grad

* remove zero_grad tests

* remove manual_backward in AMP, it doesn't help

* update

* loosen tests

* update

* update doc

* add TODO

* Removed unnecessary get model from native amp

* Remove try except with pytest raise

* Add seed, clean up imports, remove try catch to reproduce error

* update code

* update test

* revert back

* formatting

* Update pytorch_lightning/core/lightning.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-11-10 19:44:51 +00:00
William Falcon ee35907170
Accelerator docs (#4583)
* accelerator docs

* accelerator docs
2020-11-08 17:24:41 -05:00
William Falcon 3ba48d3bc4
ref: unify slurm and TE under backendPlugin 5/n" (#4582)
* ref: unify slurm and TE under backendPlugin 4/n

* ref: unify slurm and TE under backendPlugin 5/n
2020-11-08 16:20:19 -05:00
William Falcon 624f5b5938
ref: unify slurm and TE under backendPlugin 3/n (#4581) 2020-11-08 15:32:37 -05:00
William Falcon bfaf014096
ref: unify slurm and TE under backendPlugin 2/n (#4580) 2020-11-08 15:07:16 -05:00
William Falcon 0f64f15f52
ref: unify slurm and TE under backendPlugin 1/n (#4578)
* ref: unify slurm and TE under backendPlugin

* ref: unify slurm and TE under backendPlugin
2020-11-08 14:28:55 -05:00
cool425589 5e09fd31e9
show progressbar only on progress_rank 0 on ddp_slurm (#4437)
Co-authored-by: chaton <thomas@grid.ai>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2020-11-06 01:36:22 +01:00
Travis Addair 51cc7a89ee
Horovod: fixed early stopping and added metrics aggregation (#3775)
* Fixed early stopping for Horovod

* Refactored to sync_dist_if_available

* Bump min Horovod version to support hvd.is_initialized

* Changelog

* Added back change for Horovod

* Removed redundant checks for initialization

* Implement metrics gathering for Horovod

* Added test for EvalResult

* Renamed ddp_sync_on_step -> dist_sync_on_step

* Added metric test for Horovod

* Added option pass callable allgather function to metric base class

* Added dist_sync_fn

* Fixed calls to private _sync_dist

* Fixed Horovod test

* Added sync_tensor to the distributed backend

* Skip Windows

* Insert test path

* Removed redundant import

* Updated drone

* Unset HOROVOD_GPU_ALLREDUCE

* Unset

* No cache dir

* No uninstall

* Unset variables

* Uninstall Horovod during initialization

* Replaced more references to ddp_sync_on_step

* Fixed imports

* Fixed attribute

* Added back default

* Lint

* Added back docstring

* Made gather_all_tensors default

* Added whitespace

* Update tests/models/test_horovod.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/metrics/metric.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update CHANGELOG.md

Co-authored-by: Teddy Koker <teddy.koker@gmail.com>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-11-05 12:52:02 -05:00
Ananya Harsh Jha 01ab2a933d
[bug] [docs] Clearer optimizer_step override instructions (#4455)
* fix

* flags

* remove defaults
2020-11-02 22:13:34 +00:00
chaton 102fa9ee7d
[BUGFIX] AMP + Precision unscale grad (#4441)
* move unscale within Native plugin

* remove gradient tracking from lightning backward

* forgot trainer.fit

* typo

* update

* cleanup

* set to 1.6

* typo

* skip if below 1.6 strict

* update changelog

* remove useless code

* Update tests/plugins/test_amp_plugin.py

Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>

* Update tests/plugins/test_amp_plugin.py

Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>

* update changelog

* Update CHANGELOG.md

Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: Jeff Yang <ydcjeff@outlook.com>
2020-11-02 16:36:48 +00:00
Adrian Wälchli 28d45a26a3
Set correct device ids in DDP [wip] (#4297)
* repro


debug


c


d


dd


d


d


d


ads


d


d


d


f


rank


f


v


d


d


d


d


d


d


d


d


d


d


d


set 


drop PL_DDP_PID


clean up


keep set gpus


revert


Revert "drop PL_DDP_PID"

This reverts commit 7d88cae469541ef19128f9c20919fd3a6f863039.
d


pid


gpus


clean up


clean up 


misconfig?


misconfig


clean


clean

* fix pep

* changelog

* remove script

Co-authored-by: chaton <thomas@grid.ai>
Co-authored-by: William Falcon <waf2107@columbia.edu>
2020-10-24 17:33:47 -04:00
Sean Naren 5641b266d5
Bug/4319 ddp checkpoint (#4323)
* Broadcast best model path to ensure we sync with main process + wait for main process to save

* Add barrier call to ensure all processes are in sync

* Added changelog commit

* Move sync of best model path/score to model checkpoint, keep barrier to ensure all processes complete

* Ensure we broadcast as tuple

* Add init check

* Update pytorch_lightning/callbacks/model_checkpoint.py

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>

* Update pytorch_lightning/callbacks/model_checkpoint.py

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>

* Removed model checkpoint code, added barrier to trainer to enforce we syncronize and wait for all processes to finish before completing training

* Add barrier within teardown call, removed horovod teardown to inherit from base accelerator

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
2020-10-24 16:55:49 -04:00
William Falcon 753362d0a4
enable ddp as a plugin (#4285)
* enable custom ddp plugin

* enable custom ddp plugin

* enable custom ddp plugin

* enable custom ddp plugin

* enable custom ddp plugin

* enable custom ddp plugin

* enable custom ddp plugin

* enable custom ddp plugin

* enable custom ddp plugin

* enable custom ddp plugin

* enable custom ddp plugin

Co-authored-by: chaton <thomas@grid.ai>
2020-10-22 05:15:51 -04:00
Justus Schock 0ec4107697
Optimizer closure (#4190)
* closure for all optimizers

* rename hook and take care of alternating backwards

* add comment

* training_loop_fix

* closure whenever possible

* training_loop

* simple tests that count backward calls

* fix test to work with closure

* remove debugging statement

* better place

* check grads after backward

* start fixing manual optimization

* skip step when result returned by closure was None

* fix gradient clipping test to work with closure

* attribute dict result only for automatic optimization

* adjust backward calls in accelerator

* adjust where to call gradient clipping

* adjust backward calls in tests

* Apply suggestions from code review

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* pass kwargs to xla optimizer

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2020-10-21 19:34:29 +01:00
Akihiro Nitta d27ee8b5bf
docs: Add empty lines in docstring [ci skip] (#4232)
* Add empty lines in docstring for proper docs

* Remove Returns:

* Remove unnecessary Returns:

* Update pytorch_lightning/accelerators/ddp2_accelerator.py

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>

* fix returns

Co-authored-by: William Falcon <waf2107@columbia.edu>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
2020-10-21 09:00:39 -04:00
Jirka Borovec f37444fa3e
CI: add flake8 (#4239) 2020-10-19 21:20:17 +01:00
Akihiro Nitta b45b57cc58
Use `Optional` for arguments set to `None` by default (#4164)
* Use `Optional` for variables set to `None` by default

* Use `Optional` instead of `Union[None, ...]` for consistency
2020-10-15 23:02:50 +02:00
Sean Naren 98eb736496
Added getstate/setstate method for torch.save serialization (#4127)
* Added getstate/setstate method for torch.save serialization, added additional Optional Typing to results object

* Added tests to ensure torch.save does not fail

* Added flags to ensure compatible ddp cpu environment

* Removed torch version check due to minimum already being 1.3, reduced epochs for speed

* Moved tests to separate file

* Update to accelerator, move to ddp_spawn to prevent hanging ddp
2020-10-13 16:47:23 -04:00
William Falcon 09c2020a93
notices (#4118) 2020-10-13 07:18:07 -04:00
William Falcon 4c4b090c66
depre (#4088) 2020-10-12 05:58:31 -04:00
William Falcon b9f2682b7d
clean docs, enable grad clip in manual mode (#4078)
* docs

* docs
2020-10-11 13:12:35 -04:00
William Falcon 7ffe05a3d1
ref: accelerator names (#4066)
* ref: accelerator names

* docs
2020-10-11 01:05:14 -04:00
William Falcon a4b9221fc5
ref: decouple apex second attemp part n/n (#4065)
* ref: decouple apex second attemp part n/n

* ref: decouple apex second attemp part n/n
2020-10-10 22:04:50 -04:00
William Falcon 0281b077d8
ref: decouple apex second attemp part 10/n (#4064)
* ref: decouple apex second attemp part 9/n

* ref: decouple apex second attemp part 9/n

* ref: decouple apex second attemp part 9/n
2020-10-10 20:05:05 -04:00