Commit Graph

310 Commits

Author SHA1 Message Date
Jirka Borovec 53b0ae49b9 fix imports / isort / flake8 2021-01-26 14:57:34 +01:00
SeanNaren 127e04124d Fix merge issue 2021-01-26 14:29:47 +01:00
chaton 0435e23a64 deprecate enable_pl_optimizer as it is not restored properly (#5244)
* update

* clean test

* still in progress

* update test

* update

* update

* resolve flake

* add test for zero_grad

* update

* works without accumulated_grad

* update

* update

* resolve amp

* revert back to True

* update

* clean tests

* cleaned out

* typo

* update test

* git repair bug

* remove print

* update

* Fix formatting/optimizer imports

* Refactor the test for cleanliness

* Add vanilla model to the test, better var names

* Fixed var names, let's clean up these mock tests

* repair test

* update test

* resolve flake8

* add manual_optimization

* update tests

* resolve flake8

* add random accumulate_grad_batches

* improve test

* Update tests/trainer/optimization/test_parity_automatic_optimization.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update tests/trainer/optimization/test_parity_automatic_optimization.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* update

* clean tests

* correct bug

* Apply suggestions from code review

* format

* address comments

* update on comments

* wip

* typo

* deprecate enable_pl_optimizer

* resolve latest bugs

* update

* resolve merge

* add comment

* Update pytorch_lightning/core/lightning.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update tests/deprecated_api/test_remove_1-3.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/trainer/connectors/optimizer_connector.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/trainer/trainer.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/trainer/trainer.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update tests/trainer/optimization/test_parity_automatic_optimization.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* update on comments

* update restore

* add a property

* remove setstate as not needed anymore

* update test

* provide optimizer to on_before_zero_grad

* update on comments

* update on comments

* Update pytorch_lightning/trainer/trainer.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Update tests/trainer/optimization/test_parity_automatic_optimization.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Update tests/trainer/optimization/test_parity_automatic_optimization.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Update tests/trainer/optimization/test_parity_automatic_optimization.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* modify import

* update changelog

* resolve flake8

* update

* update

* clean doc

Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-62-109.ec2.internal>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>

(cherry picked from commit f2e99d617f)
2021-01-26 14:29:46 +01:00
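A minimal sketch of the deprecated flag in user code, assuming the ~1.1-era Trainer API (`BoringModel` here is illustrative): `enable_pl_optimizer` controlled whether optimizers returned from `configure_optimizers` were wrapped in `LightningOptimizer`, and this change deprecates it because the wrapped optimizers were not restored properly from checkpoints.

```python
import torch
import pytorch_lightning as pl

class BoringModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        return self.layer(batch).sum()

    def on_before_zero_grad(self, optimizer):
        # this PR also threads the optimizer into this hook
        pass

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

# previously opt-in wrapping; after this change the flag is deprecated
trainer = pl.Trainer(enable_pl_optimizer=True)
```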
chaton f2f4a49271 [bug-fix] Call transfer_batch_to_device in DDPlugin (#5195)
* hacking out

* update

* remove useless on_before_forward

* update

* remove overridden

* remove os

* use on_before_forward

* resolve flake8

* add test

* update

* add single_process_per_device

* resolve flake8

* update

* resolve

* update

* update

* update

* add comment

* resolve bug with sharded

* update

* remove property

* update

* resolve test

* resolve bug

* update on comments

* update doc

* Update pytorch_lightning/core/hooks.py

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>

* update on comments

* Update pytorch_lightning/plugins/ddp_plugin.py

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>

* Update pytorch_lightning/plugins/ddp_plugin.py

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>

* resolve pep8

* add device_ids to pipe

* update on comments

* update

* resolve

* update

* update

* update

Co-authored-by: Ubuntu <ubuntu@ip-172-31-62-109.ec2.internal>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>

(cherry picked from commit d510707bc9)
2021-01-26 14:28:45 +01:00
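For context, a hedged sketch of the `transfer_batch_to_device` override that this fix makes the DDP plugin honor, assuming the ~1.1-era hook signature:

```python
from pytorch_lightning import LightningModule

class CustomBatchModule(LightningModule):
    def transfer_batch_to_device(self, batch, device):
        # called when moving data to the target device; before this fix
        # the DDP code path could bypass it for custom batch types
        if isinstance(batch, dict):
            return {k: v.to(device) for k, v in batch.items()}
        return super().transfer_batch_to_device(batch, device)
```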
Jirka Borovec 2846322f60
fix docs render (#5610) 2021-01-25 20:21:00 -05:00
Arnaud Gelas 1ff6b18e8a
Fix pre-commit isort failure on pytorch_lightning/accelerators (#5503)
Remove from skipped module in pyproject.toml and fix failures on:
- pytorch_lightning/accelerators/*.py
2021-01-16 14:10:56 -05:00
Adrian Wälchli e806bb77fa
Refactor LightningDistributedDataParallel (#5185)
* add wrapper

* add squeeze

* replace LightningDistributedDP

* update import

* module access

* inputs

* refactor warning

* update

* resolve flake8

* remove old class

* set find unused params to False

* update docstrings

* update docs

* update docs

* add changelog

* deprecation

* rename wrapper -> module

* rename pl_module

* add unit tests

* Revert "add changelog"

This reverts commit 02ec0a6864f4ba2ace3bb6fc6ebc364e1a80ffd7.

* Revert "set find unused params to False"

This reverts commit 8e451515e6ba3227d00f4a5cb63f332cfedb7b30.

Co-authored-by: Ubuntu <thomas@grid.ai>
2021-01-13 14:35:42 -05:00
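The refactor swaps the subclassed `LightningDistributedDataParallel` for a thin wrapper module handed to vanilla `DistributedDataParallel`; a hedged sketch of the idea (names here are illustrative, not the exact ones from the PR):

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel

class _LightningWrapper(nn.Module):
    def __init__(self, pl_module: nn.Module):
        super().__init__()
        self.module = pl_module

    def forward(self, *args, **kwargs):
        # route DDP's forward to the appropriate LightningModule step
        if self.module.training:
            return self.module.training_step(*args, **kwargs)
        return self.module.validation_step(*args, **kwargs)

# wrap once, then let the stock DDP class handle gradient syncing:
# ddp_model = DistributedDataParallel(_LightningWrapper(model), device_ids=[rank])
```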
Jirka Borovec 54d20dc596
Refactor: clean trainer device & distrib getters (#5300)
* warnings

* flake8

* use_tpu

* use_dp

* use_ddp

* use_horovod

(interspersed single-character "." commits omitted)
2021-01-12 05:22:37 -05:00
Jirka Borovec 5ae6926a52
fix some minor typos in docs (#5369)
* fix docs typos

* Apply suggestions from code review

Co-authored-by: Wansoo Kim <rladhkstn8@gmail.com>

* flake8

Co-authored-by: Wansoo Kim <rladhkstn8@gmail.com>
2021-01-07 08:01:52 -05:00
ananthsub a7fe24e9a1 Fix hang in DDP HPC accelerators (#5157)
* Fix hang in DDP HPC accelerators

init_device was never called

* Update CHANGELOG.md
2021-01-05 09:58:36 +01:00
Jirka Borovec b72ed71d4e
Refactor: clean trainer device & distrib setters (#5297)
* naive replace

* simplify

* clean

* .

* fix

* .

* fix

* fix
2021-01-04 17:10:13 +00:00
Jirka Borovec 957583544a
mark todo exceptions (#5320)
* mark todo exceptions

* try

(interspersed single-character "." commits omitted)
2021-01-04 09:07:56 +01:00
Jirka Borovec 0f36525e8f
fix/enable - check F401 (#5201)
* refactor - check F401

* missed

* fix
2020-12-21 10:15:04 +01:00
Jirka Borovec 2d54116baa
annotate unused vars (#5017)
* annotate all unused vars

* rank_zero_warn

* Apply suggestions from code review

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* f1 fixed

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2020-12-19 13:53:06 +01:00
Jirka Borovec 059eaecbb4
set xxx_AVAILABLE as protected (#5082)
* set xxx_AVAILABLE as protected

* docs
2020-12-14 20:19:05 +05:30
chaton 2c3d43dcb5
Initialize trainer with None in DDPAccelerator (#4915)
* Initialize trainer with None

* add typing to all accelerators

* resolve imports

* update

* add typing

* removed typo

* update

* Fix formatting and imports in accelerator

Co-authored-by: maxjeblick <maxjeblick@users.noreply.github.com>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Roger Shieh <sh.rog@protonmail.ch>
2020-12-10 15:24:44 +01:00
Jirka Borovec d5fa02e798
simplify accelerator steps (#5015)
* simplify accelerator steps

* Apply suggestions from code review

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
2020-12-10 18:36:13 +05:30
Jirka Borovec cdbddbe99f
release 1.1.0 (#5048)
* release 1.1.0

* pep8
2020-12-10 00:52:39 +00:00
Jirka Borovec ce9179591d
ref: clean config [1/n] add intermediate setters (#4990)
* add intermediate setters

* show inputs

* fix options

* move

* fix

* less talk

* fix

* talk less

* str

* cases

* rename

Co-authored-by: chaton <thomas@grid.ai>
2020-12-09 14:13:57 -05:00
Rohit Gupta bcbba3b702
Simplify GPU and TPU accelerator (#5024) 2020-12-09 14:12:44 -05:00
Jirka Borovec 53d7c9555c
drop usage of deprecated distributed_backend (#5009)
Co-authored-by: chaton <thomas@grid.ai>
Co-authored-by: Roger Shieh <sh.rog@protonmail.ch>
2020-12-09 09:18:23 +01:00
Ananya Harsh Jha 127454ade2
All gather with grads (#5012)
* all_gather

* ddp

* horovod

* grad tests

* fixed ddp

* ddp fixed, removed tpu, horovod for now

* changelog

* windows fix

* windows fix

* removed batch from ctx

* all_gather

* ddp

* horovod

* grad tests

* fixed ddp

* ddp fixed, removed tpu, horovod for now

* changelog

* windows fix

* windows fix

* removed batch from ctx

* removed code duplication

* merge

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-12-08 23:20:01 +00:00
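A hedged usage sketch of the differentiable all-gather this PR adds; `compute_loss` is a hypothetical helper, and the `sync_grads` keyword is assumed from this release line rather than quoted from the diff:

```python
from pytorch_lightning import LightningModule

class GatherModule(LightningModule):
    def training_step(self, batch, batch_idx):
        local_loss = self.compute_loss(batch)  # hypothetical helper
        # one value per process; with gradient support, backward() still
        # reaches the local tensor instead of being cut off at the gather
        all_losses = self.all_gather(local_loss, sync_grads=True)
        return all_losses.mean()
```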
Sean Naren ee9b3fe574
[feat] pp 1/n (#5016)
* Added changes for RPC plugin

* Add missing kwargs

* Fix code format

* Loading refactors by introducing is_distributed var, fix optimizer step flow

* Add rpc guard

* Added docstrings and typing

* resolve comments

* Add additional rpc hook, refactor name of exit process hook for clarity

* remove annotation

* Modify behaviour to allow optional return, add test for rpc plugin

* resolve tests

* rename is_ddp_based

* update

* update for windows

* update

* resolve test

* code smell

* Revert back to init_ddp_connection for backwards compat

* Swap to explicit name for property

* Add missing speed parity increase for CI variability, fix call counts for child process

Co-authored-by: tchaton <thomas@grid.ai>
2020-12-08 22:02:10 +00:00
maxjeblick 79ae66d026
Initialize trainer with None (#4847)
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: chaton <thomas@grid.ai>
Co-authored-by: edenlightning <66261195+edenlightning@users.noreply.github.com>
2020-12-08 22:49:55 +05:30
chaton 2393474350
[hotfix] ddp + manual_optimisation (#4976)
* Rely on ddp plugin for blocking sync behaviour, and skip if we're using manual optimization

* debug

* Revert "debug"

This reverts commit ccca6b6b

* Expose manual reduce for automatic optimization

* Add input arguments

* Enable parity test

* clean imports

* Expose hook after to ensure we reset

* Fix naming

* add

* fix test

* resolve on comments

* typo

* Update tests/trainer/optimization/test_manual_optimization.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update tests/trainer/optimization/test_manual_optimization.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* update on comments

* resolve comments

Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-12-07 19:31:54 +00:00
chaton 02152c1729
Simplify optimization Logic (#4984)
* Rely on ddp plugin for blocking sync behaviour, and skip if we're using manual optimization

* debug

* Revert "debug"

This reverts commit ccca6b6b

* Expose manual reduce for automatic optimization

* Add input arguments

* Enable parity test

* clean imports

* Expose hook after to ensure we reset

* Fix naming

* add

* fix test

* uniformize optimizer logic

* resolve test

* resolve flake8

* resolve amp bug

* update tests

* remove bug

* remove optimizer_step in accelerators

* typo

* update lightning optimizer

* set doesn't work with ddp_spawn

* resolve flake8

* update threshold

* ignore pyright

* correct codeFactor

* remove useless if

* remove zero_grad function

* simplify step

* remove typo

* resolve bug

* Apply suggestions from code review

* update on comments

* resolve bugs

* remove tests

* Update pytorch_lightning/trainer/configuration_validator.py

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>

* simplify testing

* add more tests

Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
2020-12-07 12:55:49 +00:00
Jirka Borovec 3976db597d
refactor imports of optional dependencies (#4859)
* refactor imports of optional dependencies

* fix

* fix

* fix

* fix

* fix

* flake8

* flake8

Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: chaton <thomas@grid.ai>
2020-12-04 10:26:10 +01:00
Lezwon Castelino 12cb9942a1
Tpu save (#4309)
* convert xla tensor to cpu before save

* move_to_cpu

* updated CHANGELOG.md

* added on_save to accelerators

* if accelerator is not None

* refactors

* change filename to run test

* run test_tpu_backend

* added xla_device_utils to tests

* added xla_device_utils to test

* removed tests

* Revert "added xla_device_utils to test"

This reverts commit 0c9316bb

* fixed pep

* increase timeout and print traceback

* lazy check tpu exists

* increased timeout
removed barrier for tpu during test
reduced epochs

* fixed torch_xla imports

* fix tests

* define xla utils

* fix test

* aval

* chlog

* docs

* aval

* Apply suggestions from code review

Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: chaton <thomas@grid.ai>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
2020-12-02 13:05:11 +00:00
chaton c2e6e68c7e
optimizer clean up (#4658)
* add LightningOptimizer

* typo

* add mock closure

* typo

* remove logic in optimizer_step

* update

* update

* update

* deactivate LightningOptimizer for horovod

* resolve flake

* typo

* check optimizer name

* change name

* added backward to LightningOptimizer

* remove use_lightning_optimizer

* move update

* simplify init

* resolve comments

* resolve bug

* update

* update

* resolve bugs

* resolve flake8

* set state

* make manual_optimizer_step work

* add doc

* add enable_pl_optimizer

* make optimizer_step

* add make_optimizer_step

* add examples

* resolve test

* add test_optimizer_return_options_enable_pl_optimizer

* add enable_pl_optimizer=True

* update

* update tests

* resolve bugs

* update

* set Trainer to False

* update

* resolve bugs

* update

* remove from doc

* resolve bug

* typo

* update

* set to True

* simplification

* typo

* resolve horovod

* unwrap horovod

* remove Optimizer

* resolve horovod

* move logic to amp_backend

* doesn't seem to be picklable

* update

* add again

* resolve some bugs

* cleanup

* resolve bug with AMP

* change __repr__

* round at -12

* update

* update

* update

* remove from horovod

* typo

* add convert_to_lightning_optimizers in each accelerator

* typo

* forgot

* forgot a convert_to_lightning_optimizers

* update

* update

* update

* increase coverage

* update

* resolve flake8

* update

* remove useless code

* resolve comments + add support for LightningOptimizer base class

* resolve flake

* check optimizer get wrapped back

* resolve DDPSharded

* reduce code

* lightningoptimizer

* Update pytorch_lightning/core/optimizer.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Update pytorch_lightning/core/lightning.py

* remove reference to step function

* Apply suggestions from code review

* update on comments

* resolve

* Update CHANGELOG.md

* add back training_step in apex and native_amp

* rename optimizer_step

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: William Falcon <waf2107@columbia.edu>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
2020-12-01 00:09:46 +00:00
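A hedged sketch of the concept behind `LightningOptimizer`, not the real class: wrap the user's optimizer so trainer bookkeeping can run around every `.step()` call while everything else delegates through.

```python
class LightningOptimizer:
    def __init__(self, optimizer):
        self._optimizer = optimizer

    def step(self, closure=None, **kwargs):
        # in the real class, hooks such as on_before_zero_grad and the
        # accumulation/AMP logic would run around this call
        if closure is None:
            self._optimizer.step(**kwargs)
        else:
            self._optimizer.step(closure=closure, **kwargs)

    def __getattr__(self, name):
        # delegate param_groups, state_dict, zero_grad, ... unchanged
        return getattr(self._optimizer, name)
```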
Jirka Borovec 217650320e
simplify imports Omegaconf (#4873)
* hydra

* omegaconf
2020-11-27 01:00:56 +01:00
Jirka Borovec 442d57f1e9
simplify imports xla / TPU (#4872)
* xla

* tpu

* fix

* fix

* flake8
2020-11-27 00:37:48 +01:00
Sean Naren 404af43cde
5/n: Extract reference model call to plugins/accelerators (#4773)
* Encapsulate extracting reference model within the plugin to allow custom wrapper logic to live within the plugin/accelerators

* Add missing new lines

* Fix call to accelerator

* Removed double blank

* Use accelerator backend

* Handle case where wrapper has not been initialized within the plugin

* Added basic get model tests, add better typing

* Change model name

* Split GPU/DDP test

* Add stronger typing, skip ddp test on windows

* Fix import

* Fix import in dp

* Fixed PEP8 definition

* Add ddp launcher for ddp testing

* Modify accelerator reference model to property, change name to reflect func

* Revert property as this is incorrect.

* Revert across accelerators

* Modified name to get_model_from_plugin

* Code review changes, fix issue with dp

* Add verb to function getter

Co-authored-by: chaton <thomas@grid.ai>
2020-11-23 17:21:47 +00:00
ananthsub 45c57600af
Move init_ddp_connection to DDP Plugin (#4407)
* Move init_ddp_connection to DDP Plugin

* cluster-env

* trainer?

* imports

* Update ddp_plugin.py

Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
2020-11-18 15:49:22 -05:00
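A hedged sketch of what the relocated hook does, assuming the standard env-var based process-group setup (the exact plugin signature may differ):

```python
import os
import torch.distributed as torch_distrib

def init_ddp_connection(global_rank: int, world_size: int) -> None:
    # initialize the process group from MASTER_ADDR/MASTER_PORT,
    # now owned by the DDP plugin instead of the trainer
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    if not torch_distrib.is_initialized():
        torch_distrib.init_process_group(
            backend="nccl", rank=global_rank, world_size=world_size
        )
```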
Sean Naren e7134a9135
Sharded Plugin 2/n: Allow ddp plugin to modify optimizer state saving (#4675)
* Allow ddp plugin to modify optimizer state saving

* Rely on the accelerator for optimizer states

* Ensure we init the accelerator for the saving function

* Better comment for optim state dump

* Revert "Ensure we init the accelerator for the saving function"

This reverts commit af65effa

* Added accelerator check to initialize tuner before saving model checkpoint

* Simplify comment

* Revert "Added accelerator check to initialize tuner before saving model checkpoint"

This reverts commit f9929c0c

* Return single optimizer state to reduce duplication

* Fixed docstring

* Fixed typing

* Fixed comment

* Added CHANGELOG.md

Co-authored-by: chaton <thomas@grid.ai>
2020-11-18 16:38:35 +00:00
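A hedged sketch of the override point this PR adds, assuming the hook is named `optimizer_state` (the class body is illustrative):

```python
class DDPPlugin:
    def optimizer_state(self, optimizer):
        # default: a plain state dict; a sharded plugin overrides this to
        # consolidate per-rank optimizer shards before checkpointing
        return optimizer.state_dict()
```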
Sean Naren 8283680aa0
Sharded Plugin 3/n: Expose step input to DDP plugin (#4686)
* Allow ddp plugin to move the input to a different device if needed

* Swapped name to on_before_forward to align with hooks in the future

* Update pytorch_lightning/plugins/ddp_plugin.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Pass variable arg type to hook, add example

* Remove blank space (pep check)

* Added blank line

Co-authored-by: William Falcon <waf2107@columbia.edu>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-11-18 15:45:30 +00:00
chaton 4018237c30
[FEAT] Add lambda closure to manual_optimizer_step (#4618)
* added lambda_closure

* move to types

* add 2 new tests

* make example more complex

* add complex example to doc

* added more tests

* resolve doc

* typo

* update

* update tpu optimizer_step

* Apply suggestions from code review

* Update pytorch_lightning/core/lightning.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* update

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-11-12 19:22:06 +00:00
Sean Naren bacabaebaf
Sharded Accelerator 1/n: Expose clip gradients to plugins via abstract class (#4639)
* Added abstract precision plugin to expose clip_gradients function, use within accelerator to clip gradients

* Exclude model from override, keep optimizer (needed for sharded clip gradients), add override for O2 support apex

* Fix doc

* Applied codereview changes

* Refactored clip function to encapsulate tpu changes with tpu accelerator. Default to standard clip function for vanilla torch

* Pass correct grad clip val

* Moved var to property

* Apply code review suggestions
2020-11-12 17:18:09 +00:00
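A hedged sketch of the abstract surface described here, with a vanilla-torch default; the names are assumptions, not quotes from the diff:

```python
from abc import ABC, abstractmethod
import torch

class PrecisionPlugin(ABC):
    @abstractmethod
    def clip_gradients(self, grad_clip_val: float, optimizer) -> None:
        ...

class NativePrecisionPlugin(PrecisionPlugin):
    def clip_gradients(self, grad_clip_val: float, optimizer) -> None:
        # standard clip for vanilla torch; sharded/apex variants override
        for group in optimizer.param_groups:
            torch.nn.utils.clip_grad_norm_(group["params"], grad_clip_val)
```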
Sean Naren 33470ba605
Prevent crash if sync_dist=True on CPU (#4626)
* Added test/fix for sync_dist raising NotImplementedError

* Fixed comments/formatting

* Revert base class change, enforce sync tensors across accelerators, added GPU test
2020-11-11 22:04:05 +00:00
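What the fix enables in user code, sketched with a hypothetical `compute_loss` helper:

```python
from pytorch_lightning import LightningModule

class EvalModule(LightningModule):
    def validation_step(self, batch, batch_idx):
        loss = self.compute_loss(batch)  # hypothetical helper
        # before this fix, sync_dist=True could raise NotImplementedError
        # on CPU; the value is now reduced across whatever backend is in use
        self.log("val_loss", loss, sync_dist=True)
```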
chaton 7e08b0d710
[bug-fix] DDP and automatic_optimization=False (#4485)
* resolve bug

* add self._running_manual_optim

* update

* update tests

* update lightning module

* resolve bug

* update tests

* update

* resolve pep8

* update

* replace by `ddp_spawn`

* temporary fix

* update

* update

* move update to training_loop

* make both ddp_spawn

* introduce `manual_optimizer_step`

* update changelog

* added changelog wrong place

* add force_optimizer_step

* update docstring for tests

* update optimizer_step

* update zero_grad

* resolve flake8

* move update into manual_optimizer_step

* add zero_grad

* remove zero_grad tests

* remove manual_backward in AMP, it doesn't help

* update

* loosen tests

* update

* update doc

* add TODO

* Removed unnecessary get model from native amp

* Remove try except with pytest raise

* Add seed, clean up imports, remove try catch to reproduce error

* update code

* update test

* revert back

* formatting

* Update pytorch_lightning/core/lightning.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-11-10 19:44:51 +00:00
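A hedged sketch of the manual-optimization flow this fix targets, using the era's hook names as listed in the bullets above (they were renamed in later releases); `compute_loss` is hypothetical:

```python
from pytorch_lightning import LightningModule

class ManualModule(LightningModule):
    def training_step(self, batch, batch_idx):
        (opt,) = self.trainer.optimizers
        loss = self.compute_loss(batch)   # hypothetical helper
        self.manual_backward(loss, opt)   # replaces loss.backward()
        self.manual_optimizer_step(opt)   # the hook introduced here

# paired with the era-specific Trainer(automatic_optimization=False) flag
```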
William Falcon ee35907170
Accelerator docs (#4583)
* accelerator docs

* accelerator docs
2020-11-08 17:24:41 -05:00
William Falcon 3ba48d3bc4
ref: unify slurm and TE under backendPlugin 5/n (#4582)
* ref: unify slurm and TE under backendPlugin 4/n

* ref: unify slurm and TE under backendPlugin 5/n
2020-11-08 16:20:19 -05:00
William Falcon 624f5b5938
ref: unify slurm and TE under backendPlugin 3/n (#4581) 2020-11-08 15:32:37 -05:00
William Falcon bfaf014096
ref: unify slurm and TE under backendPlugin 2/n (#4580) 2020-11-08 15:07:16 -05:00
William Falcon 0f64f15f52
ref: unify slurm and TE under backendPlugin 1/n (#4578)
* ref: unify slurm and TE under backendPlugin

* ref: unify slurm and TE under backendPlugin
2020-11-08 14:28:55 -05:00
cool425589 5e09fd31e9
show progressbar only on progress_rank 0 on ddp_slurm (#4437)
Co-authored-by: chaton <thomas@grid.ai>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2020-11-06 01:36:22 +01:00
Travis Addair 51cc7a89ee
Horovod: fixed early stopping and added metrics aggregation (#3775)
* Fixed early stopping for Horovod

* Refactored to sync_dist_if_available

* Bump min Horovod version to support hvd.is_initialized

* Changelog

* Added back change for Horovod

* Removed redundant checks for initialization

* Implement metrics gathering for Horovod

* Added test for EvalResult

* Renamed ddp_sync_on_step -> dist_sync_on_step

* Added metric test for Horovod

* Added option pass callable allgather function to metric base class

* Added dist_sync_fn

* Fixed calls to private _sync_dist

* Fixed Horovod test

* Added sync_tensor to the distributed backend

* Skip Windows

* Insert test path

* Removed redundant import

* Updated drone

* Unset HOROVOD_GPU_ALLREDUCE

* Unset

* No cache dir

* No uninstall

* Unset variables

* Uninstall Horovod during initialization

* Replaced more references to ddp_sync_on_step

* Fixed imports

* Fixed attribute

* Added back default

* Lint

* Added back docstring

* Made gather_all_tensors default

* Added whitespace

* Update tests/models/test_horovod.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/metrics/metric.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update CHANGELOG.md

Co-authored-by: Teddy Koker <teddy.koker@gmail.com>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-11-05 12:52:02 -05:00
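A hedged sketch of the aggregation guard, showing why the minimum Horovod version was bumped to get `hvd.is_initialized`:

```python
import horovod.torch as hvd

def sync_dist_if_available(tensor):
    # aggregate a metric across Horovod workers when a job is running;
    # hvd.allreduce averages across workers by default
    if hvd.is_initialized():
        return hvd.allreduce(tensor)
    return tensor
```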
Ananya Harsh Jha 01ab2a933d
[bug] [docs] Clearer optimizer_step override instructions (#4455)
* fix

* flags

* remove defaults
2020-11-02 22:13:34 +00:00
chaton 102fa9ee7d
[BUGFIX] AMP + Precision unscale grad (#4441)
* move unscale within Native plugin

* remove gradient tracking from lightning backward

* forgot trainer.fit

* typo

* update

* cleanup

* set to 1.6

* typo

* skip if below 1.6 strict

* update changelog

* remove useless code

* Update tests/plugins/test_amp_plugin.py

Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>

* Update tests/plugins/test_amp_plugin.py

Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>

* update changelog

* Update CHANGELOG.md

Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: Jeff Yang <ydcjeff@outlook.com>
2020-11-02 16:36:48 +00:00
Adrian Wälchli 28d45a26a3
Set correct device ids in DDP [wip] (#4297)
* repro

(a long run of single-letter WIP commit messages from debugging is omitted here)

set

drop PL_DDP_PID

clean up

keep set gpus

revert

Revert "drop PL_DDP_PID"

This reverts commit 7d88cae469541ef19128f9c20919fd3a6f863039.

pid

gpus

clean up

misconfig

clean
* fix pep

* changelog

* remove script

Co-authored-by: chaton <thomas@grid.ai>
Co-authored-by: William Falcon <waf2107@columbia.edu>
2020-10-24 17:33:47 -04:00
Sean Naren 5641b266d5
Bug/4319 ddp checkpoint (#4323)
* Broadcast best model path to ensure we sync with main process + wait for main process to save

* Add barrier call to ensure all processes are in sync

* Added changelog commit

* Move sync of best model path/score to model checkpoint, keep barrier to ensure all processes complete

* Ensure we broadcast as tuple

* Add init check

* Update pytorch_lightning/callbacks/model_checkpoint.py

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>

* Update pytorch_lightning/callbacks/model_checkpoint.py

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>

* Removed model checkpoint code, added barrier to trainer to enforce we synchronize and wait for all processes to finish before completing training

* Add barrier within teardown call, removed horovod teardown to inherit from base accelerator

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
2020-10-24 16:55:49 -04:00
William Falcon 753362d0a4
enable ddp as a plugin (#4285)
* enable custom ddp plugin (same message repeated across 11 squashed commits)

Co-authored-by: chaton <thomas@grid.ai>
2020-10-22 05:15:51 -04:00
Justus Schock 0ec4107697
Optimizer closure (#4190)
* closure for all optimizers

* rename hook and take care of alternating backwards

* add comment

* training_loop_fix

* closure whenever possible

* training_loop

* simple tests that count backward calls

* fix test to work with closure

* remove debugging statement

* better place

* check grads after backward

* start fixing manual optimization

* skip step when result returned by closure was None

* fix gradient clipping test to work with closure

* attribute dict result only for automatic optimization

* adjust backward calls in accelerator

* adjust where to call gradient clipping

* adjust backward calls in tests

* Apply suggestions from code review

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* pass kwargs to xla optimizer

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2020-10-21 19:34:29 +01:00
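The core idea of the refactor, sketched under assumed names: forward and backward live inside a closure, so optimizers such as `torch.optim.LBFGS` that need to re-evaluate the loss receive a callable instead of a precomputed value.

```python
def run_optimizer_step(optimizer, training_step_and_backward):
    def closure():
        # re-runs forward + backward and hands the loss to the optimizer
        return training_step_and_backward()

    optimizer.step(closure)
```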
Akihiro Nitta d27ee8b5bf
docs: Add empty lines in docstring [ci skip] (#4232)
* Add empty lines in docstring for proper docs

* Remove Returns:

* Remove unnecessary Returns:

* Update pytorch_lightning/accelerators/ddp2_accelerator.py

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>

* fix returns

Co-authored-by: William Falcon <waf2107@columbia.edu>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
2020-10-21 09:00:39 -04:00
Jirka Borovec f37444fa3e
CI: add flake8 (#4239) 2020-10-19 21:20:17 +01:00
Akihiro Nitta b45b57cc58
Use `Optional` for arguments set to `None` by default (#4164)
* Use `Optional` for variables set to `None` by default

* Use `Optional` instead of `Union[None, ...]` for consistency
2020-10-15 23:02:50 +02:00
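Illustrative only (not taken from the diff), the convention this PR enforces:

```python
from typing import Optional

def resolve_ckpt_path(ckpt_path: Optional[str] = None) -> Optional[str]:
    # Optional[str] rather than Union[None, str] or an unannotated default
    return ckpt_path
```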
Sean Naren 98eb736496
Added getstate/setstate method for torch.save serialization (#4127)
* Added getstate/setstate method for torch.save serialization, added additional Optional Typing to results object

* Added tests to ensure torch.save does not fail

* Added flags to ensure compatible ddp cpu environment

* Removed torch version check due to minimum already being 1.3, reduced epochs for speed

* Moved tests to separate file

* Update to accelerator, move to ddp_spawn to prevent hanging ddp
2020-10-13 16:47:23 -04:00
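A generic, hedged sketch of the mechanism (the real change targeted the results object): `__getstate__`/`__setstate__` let `torch.save`, which pickles, skip unpicklable attributes.

```python
import torch

class SerializableResult:
    def __init__(self, minimize=None):
        self.minimize = minimize
        self._callback = lambda: None  # unpicklable member

    def __getstate__(self):
        state = self.__dict__.copy()
        state.pop("_callback", None)  # drop what pickle cannot handle
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._callback = lambda: None  # restore a placeholder

torch.save(SerializableResult(minimize=0.5), "result.pt")  # no longer fails
```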
William Falcon 09c2020a93
notices (#4118) 2020-10-13 07:18:07 -04:00
William Falcon 4c4b090c66
depre (#4088) 2020-10-12 05:58:31 -04:00
William Falcon b9f2682b7d
clean docs, enable grad clip in manual mode (#4078)
* docs

* docs
2020-10-11 13:12:35 -04:00
William Falcon 7ffe05a3d1
ref: accelerator names (#4066)
* ref: accelerator names

* docs
2020-10-11 01:05:14 -04:00
William Falcon a4b9221fc5
ref: decouple apex second attempt part n/n (#4065)
* ref: decouple apex second attempt part n/n

* ref: decouple apex second attempt part n/n
2020-10-10 22:04:50 -04:00
William Falcon 0281b077d8
ref: decouple apex second attempt part 10/n (#4064)
* ref: decouple apex second attempt part 9/n

* ref: decouple apex second attempt part 9/n

* ref: decouple apex second attempt part 9/n
2020-10-10 20:05:05 -04:00
William Falcon dca86c310e
ref: decouple apex second attempt part 6/n (#4060)
* ref: decouple apex second attempt part 6/n

* ref: decouple apex second attempt part 6/n
2020-10-10 15:28:25 -04:00
William Falcon ce2edf1192
ref: decouple apex second attempt part 4/n (#4056)
* ref: decouple apex second attempt part 4/n

* ref: decouple apex second attempt part 4/n

* Update lightning.py

* ref: decouple apex second attempt part 4/n
2020-10-10 12:19:22 -04:00
William Falcon 3a6717ca34
ref: decouple apex second attempt part 3/n (#4055) 2020-10-10 11:05:57 -04:00
William Falcon 7285613974
ref: decouple apex second attempt part 2/n (#4054)
* ref: decouple apex second attempt part 2/n

* ref: decouple apex second attempt part 2/n
2020-10-10 10:24:20 -04:00
William Falcon e854d3744c
ref: decouple apex second attempt part 1/n (#4052) 2020-10-10 09:53:02 -04:00
William Falcon 5b261a230e
enable passing in custom accelerators (#4050)
* enable custom accelerators

* ref: finish decoupling apex, LM and backward

* ref: finish decoupling apex, LM and backward

* ref: finish decoupling apex, LM and backward
2020-10-10 09:21:08 -04:00
William Falcon 2b255a3df4
ref: enable custom clusters (1/n) (#4048)
* enable cluster plugins

* enable cluster plugins + test backend choices (same message repeated across 6 squashed commits)
2020-10-10 08:09:29 -04:00
William Falcon 0c42aa03fd
enables plugins (#4041)
* plugin hardware

* plugin hardware

* plugin hardware
2020-10-09 22:03:46 -04:00
William Falcon 048a816be3
added tests for the training epoch end (#3967) 2020-10-07 22:27:36 -04:00
William Falcon b922409624
clean and organize fit (#3938)
* clean and organize fit (same message repeated across 5 squashed commits)
2020-10-07 11:04:10 -04:00
William Falcon 9c415d2c71
moves configure ddp to each backend (#3924)
* moves configure ddp to each backend

* moves configure ddp to each backend

* moves configure ddp to each backend

* added torch manual seed in test_mean_error

* test for complicated batch structure

* test for complicated batch structure

* test for complicated batch structure

Co-authored-by: ananyahjha93 <ananya@pytorchlightning.ai>
2020-10-07 00:50:16 -04:00
William Falcon e3007ffe0c
moves sync bn to each backend (#3925) 2020-10-06 22:42:33 -04:00
William Falcon af5887c0aa
fixed ddp flag crash (#3927) 2020-10-06 22:41:08 -04:00
Lezwon Castelino 69833dad5b
Added check to verify xla device is TPU (#3274)
* tpu device check

* replaced with xmp spawn

* Revert "replaced with xmp spawn"

This reverts commit 6835380f

* replaced all instances of XLA_AVAILABLE

* moved inner_f to global scope

* made refactors

* added changelog

* added TPU_AVAILABLE variable

* fix codefactor issues

* removed from trainer and early stopping

* add TORCHXLA_AVAILABLE check

* added tests

* refactoring

* Update pytorch_lightning/utilities/xla_device_utils.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* updated function names

* fixed bug

* updated CHANGELOG.md

* added todo

* added type hints

* isort and black

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: William Falcon <waf2107@columbia.edu>
2020-10-06 19:54:37 +02:00
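A hedged sketch of the probe pattern the bullets describe (including `inner_f` at global scope): run the XLA device check in a child process so a hanging call cannot block the trainer forever.

```python
import functools
import queue
from multiprocessing import Process, Queue

def _inner_f(q, func):
    # run the probe and report its result (None on any failure)
    try:
        q.put(func())
    except Exception:
        q.put(None)

def pl_multi_process(func, timeout: int = 20):
    @functools.wraps(func)
    def wrapper():
        q = Queue()
        proc = Process(target=_inner_f, args=(q, func))
        proc.start()
        proc.join(timeout)
        try:
            return q.get_nowait()
        except queue.Empty:
            return None  # probe hung: assume no usable TPU
    return wrapper
```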
Sean Naren e4a56fa5cf
Ensure global seed exists before passing into env subprocess.Popen call (#3904) 2020-10-06 12:31:49 -04:00
William Falcon 70e792344a
test selecting the correct backend. temp backends while slurm and TE are decoupled (#3848)
* test selecting the correct backend. temp backends while slurm and TE are decoupled

* test selecting the correct backend. temp backends while slurm and TE are decoupled
2020-10-04 15:44:50 -04:00
William Falcon 2c21f7d7e2
ref: adding compute environments (2/n) (#3842)
* ref: adding compute environments (2/n)

* ref: adding compute environments (2/n)

* ref: adding compute environments (2/n)

* ref: adding compute environments (2/n)
2020-10-04 08:48:46 -04:00
Lezwon Castelino 4da240ea1b
added broadcast option to tpu (#3814)
* added broadcast option to tpu

* add device

* moved tpu broadcast to tpu_backend

* removed Lightning dist

* decode bytes

* pep8 fix

* fix bug

* test for broadcast

* updated changelog
2020-10-04 07:47:33 -04:00
William Falcon 1f8ff7c48c
ref: callback system and init ddp (1/n) (#3836)
* refactored callback system and init ddp

* refactored callback system and init ddp

* refactored callback system and init ddp

* refactored callback system and init ddp
2020-10-03 23:39:17 -04:00
William Falcon 35d1111994
[WIP] ref: decoupled ddp, ddp spawn (finish 3733) (#3819)
* ref: finish #3733

* remove deprecated test (same message repeated across 13 squashed commits)

* Update pytorch_lightning/accelerators/ddp_backend.py

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>

* remove deprecated test (same message repeated across 3 more squashed commits)

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
2020-10-03 14:05:31 -04:00
William Falcon ed1450a293
ref: clean up ddp before final fix (#3817)
* ref: clean up ddp before final fix (same message repeated across 5 squashed commits)
2020-10-03 12:01:02 -04:00
William Falcon 0838c6bfce
ref: decoupled ddp2 (#3816) 2020-10-03 09:02:35 -04:00
William Falcon a677833f84
ref: separate slurm from ddp (#3809)
* ref: separate slurm from ddp

* ref: separate te from ddp

* ref: merge

* ref: merge

* ref: merge
2020-10-02 23:08:34 -04:00
William Falcon 74484edecd
ref: separate te from ddp (#3810)
* ref: separate te from ddp

* ref: separate te from ddp

* ref: separate te from ddp
2020-10-02 21:00:51 -04:00
William Falcon a28528cc8b
ref: remove weight loading hack for ddp_cpu (#3808) 2020-10-02 19:28:50 -04:00
William Falcon afa43837a4
ref: part 8 of #3733 (#3806) 2020-10-02 18:46:18 -04:00
ananthsub 3ab730e316
Swap torch.load for fsspec load in ddp spawn backend (#3787)
* Update ddp_spawn_backend.py

* Update ddp_cpu_spawn_backend.py

* log

Co-authored-by: Jirka Borovec <jirka@pytorchlightning.ai>
2020-10-02 21:00:01 +02:00
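A hedged sketch of the swap (helper name assumed): `fsspec` understands remote URLs such as s3:// or gs://, where a bare `torch.load(path)` assumes a local file.

```python
import fsspec
import torch

def load_checkpoint(path: str, map_location=None):
    with fsspec.open(path, "rb") as f:
        return torch.load(f, map_location=map_location)
```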
William Falcon 7c6ed1fa28
ref: part 7 of #3733 (#3802)
* ref: part 7 of #3733

* ref: part 7 of #3733
2020-10-02 14:23:27 -04:00
Jirka Borovec 62eabdd535
revert backend types (#3788)
* revert backend types

* todo

* todo
2020-10-02 06:18:44 -04:00
Akihiro Nitta ebc1b23fa3
Use `raise .. from ..` to explicitly chain exceptions (#3750)
* Fix exception chaining

* names

* Change exception names for consistency

Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>

* Change exception names for consistency

Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>

Co-authored-by: Jirka Borovec <jirka@pytorchlightning.ai>
Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>
2020-10-01 21:45:44 +02:00
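The pattern this PR applies, sketched with a generic helper:

```python
def _import_backend(name: str):
    try:
        return __import__(name)
    except ImportError as err:
        # `raise ... from err` chains the exceptions explicitly instead of
        # the implicit "During handling of the above exception ..." form
        raise RuntimeError(f"backend '{name}' is not installed") from err
```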
William Falcon 622c5c3982
ref: part 4 of #3733 (#3773)
* ref: part 4 of #3733

* ref: part 4 of #3733

* ref: part 4 of #3733

* ref: part 4 of #3733
2020-10-01 11:26:58 -04:00
William Falcon 440f837f6d
ref: part a of #3733 (#3766)
* ref: part a of #3733

* ref: part a of #3733
2020-10-01 08:15:23 -04:00
Lezwon Castelino 8be002ccc7
skip best_model_path if checkpoint_callback is None (#2962)
* skip best_model_path if checkpoint_callback is None

* removed test
2020-10-01 06:57:26 -04:00
William Falcon a38d108a68
add dist lib to enable syncing anything across devices (#3762)
* add dist lib to enable syncing anything across devices
2020-10-01 01:21:38 -04:00
Jirka Borovec 31a36f04df
define distributed as a type (#3740)
* define type

* miss

* Apply suggestions from code review

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>

* miss

* warn

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
2020-09-30 08:33:01 -04:00
William Falcon c41ea86b35
ref: move backends back to individual files (1/5) (ddp_cpu) (#3712)
* ref: make each backend independent for easier, independent debugging (same message repeated across 6 squashed commits)

* ref: test val epoch end

* ref: test val epoch end
2020-09-29 01:59:18 -04:00
Rohit Gupta 783750547d
disable optimizers setup during testing (#3059)
* disable configure_optimizers during testing

* minor changes

* hvd and ddp

* fix precision during testing

* fix ddp

* fix amp

* fix cpu

* update dp

* simplify optimizers

* add test

* codefactor

* ref optimizer setup

* chlog

* suggestions

* isort

* rebased with master
2020-09-29 01:09:04 +02:00
William Falcon 931995b55b
remove flake 8 (#3687) 2020-09-27 20:40:02 -04:00
William Falcon 031274c25d
fix dp issues + update examples and test examples (#3618)
* fix dp (same message repeated across 4 squashed commits)

* fix examples (same message repeated across 18 squashed commits)
2020-09-23 00:19:46 -04:00
Adrian Wälchli a71d62d840
Fix deterministic behavior in ddp_spawn (#3573)
* docs

* set env variable

* fix

* changelog
2020-09-20 19:42:58 -04:00
William Falcon 890588a9ee
ref: precision plugins 1/n (#3504)
* ref: precision plugins 1/n

* ref: precision plugins 1/n
2020-09-15 09:56:12 -04:00
William Falcon 810b445097
ref: apex plugin (#3502)
* ref: apex plugin

* ref: apex plugin

* ref: apex plugin
2020-09-15 06:02:42 -04:00
William Falcon 6bcfa8b068
ref: merge backends x/n (#3482) 2020-09-12 16:28:29 -04:00
William Falcon 518a0c0e92
ref: merge backends x/n (#3480) 2020-09-12 15:27:11 -04:00
William Falcon 0045119b3f
ref: merge backends x/n (#3478)
* ref: merge backends x/n

* ref: merge backends x/n

* ref: merge backends x/n

* ref: merge backends x/n
2020-09-12 13:55:55 -04:00
William Falcon 00d155ae01
ref: merge backends x/n (#3477) 2020-09-12 12:36:55 -04:00
William Falcon 59d8472548
ref: slurm connector 1/n (#3476)
* ref: slurm connector 1/n

* ref: slurm connector 1/n

* ref: slurm connector 1/n

* ref: slurm connector 1/n
2020-09-12 11:07:15 -04:00
William Falcon ff0064f956
ref: group connectors (#3472)
* ref: accelerator connector methods 3/n

* ref: accelerator connector methods 3/n
2020-09-11 23:33:09 -04:00
William Falcon dd324e4086
ref: accelerator connector methods x/n (#3470) 2020-09-11 22:25:48 -04:00
William Falcon de99222834
ref: accelerator connector methods x/n (#3469)
* ref: accelerator connector methods x/n

* ref: accelerator connector methods x/n
2020-09-11 21:52:22 -04:00
William Falcon ef20310873
ref: move specific accelerator code x/n (#3457)
* ref: organize args x/n

* ref: move specific accelerator code x/n

* ref: move specific accelerator code x/n

* ref: move specific accelerator code x/n
2020-09-11 10:56:21 -04:00
William Falcon 70af47db84
ref: organize args 4/n (#3456) 2020-09-10 21:58:47 -04:00
William Falcon 3281586ab4
ref: organize args 3/n (#3449)
* ref: organize args 3/n (same message repeated across 6 squashed commits)
2020-09-10 13:21:04 -04:00
William Falcon a208d6da46
ref: organize args 2/n (#3448)
* ref: organize args 2/n

* ref: organize args 2/n

* ref: organize args 2/n
2020-09-10 10:51:35 -04:00
William Falcon 541c4ab01d
ref: organize args 3/n (#3447)
* ref: organize args 2/n

* ref: organize args 2/n

* ref: organize args 2/n

* ref: organize args 2/n
2020-09-10 08:55:30 -04:00
William Falcon deb82d9c08
ref: organize args 2/n (#3442)
* ref: organize args 2/n

* ref: organize args 2/n
2020-09-10 08:07:55 -04:00
William Falcon 49290a569b
ref: organize args 1/n (#3435)
* ref: organize args 1/n

* ref: organize args 1/n
2020-09-10 07:24:42 -04:00
William Falcon 8f6b115511
ref: added model connector (#3407)
* ref: added model connector

* ref: added model connector

* ref: added model connector
2020-09-09 00:24:20 -04:00
Travis Addair 091d37f968
Added check for apex AMP and unit tests for Horovod + AMP (#3404)
* Added check for apex AMP and unit tests for Horovod + AMP

* Changelog

* Fixed order of Horovod and Apex optimizer wrapping
2020-09-08 20:30:57 -04:00
William Falcon 9939f53b7c
ref: inner train loop (intermediate step) 12/n (#3372)
* ref: inner train loop (intermediate step) 12/n (same message repeated across 6 squashed commits)
2020-09-06 17:50:47 -04:00
William Falcon 38b9677638
ref: inner train loop (intermediate step) 5/n (#3365) 2020-09-05 18:27:28 -04:00
William Falcon c7ef5ee874
ref: inner train loop (intermediate step) 3/n (#3363) 2020-09-05 17:01:46 -04:00
William Falcon f55efb7616
ref: inner train loop (intermediate step) 1/n (#3361) 2020-09-05 10:10:49 -04:00
William Falcon 5a474c452c
ref: inner train loop (intermediate step) 1/n (#3359) 2020-09-05 08:55:22 -04:00
William Falcon 0a119403d6
ref: moved accelerator router (#3309)
* ref: moved accelerator

* ref: moved accelerator

* ref: moved accelerator

* ref: moved accelerator
2020-09-01 15:48:28 -04:00
William Falcon b0298cead8
ref: move train outside of setup training (#3297)
* ref: move train outside of setup training

* ref: move train outside of setup training

* ref: move train outside of setup training

* ref: move train outside of setup training
2020-08-31 20:36:52 -04:00
William Falcon bcd13f70b8
ref: run_pretrain_routine -> setup_training (#3294)
* ref: .tune()

* ref: run_pretrain_routine -> setup_training
2020-08-31 18:06:11 -04:00
Philipp Singer 0aee137ba7
DP device fix (#3196) 2020-08-27 09:01:29 -04:00
William Falcon 4272360076
ddp backend refactor (#3210) 2020-08-26 21:02:15 -04:00
William Falcon 3a26b4ff5c
ddp backend refactor (#3209) 2020-08-26 20:31:09 -04:00
William Falcon 6bae404bed
ref: ddp backend refactor (3) (#3208)
* ddp backend refactor

* ddp backend refactor
2020-08-26 20:03:09 -04:00
William Falcon a8daf914f8
ddp backend refactor (#3207) 2020-08-26 19:10:24 -04:00
William Falcon ff3c2f4cff
ddp backend refactor (#3204) 2020-08-26 18:43:28 -04:00
William Falcon f3384d0cbb
ref: ddps train hooks (#3203)
* ddps train

* ddps train
2020-08-26 15:37:40 -04:00
William Falcon ef07b0c4b3
accelerator fit 1 (#3200) 2020-08-26 14:20:38 -04:00
William Falcon f064d74be8
refactored dataloader process hook (#3139) 2020-08-24 21:53:56 -04:00
William Falcon 82d1128966
eval step scaling factor (#3136) 2020-08-24 20:26:39 -04:00
William Falcon 6c3cec3a3c
training amp scaling refactor (#3135) 2020-08-24 19:59:46 -04:00
William Falcon 0b3cb3c955
ref: moved ___step_end hooks (#3130)
* moved eval hooks (same message repeated across 7 squashed commits)
2020-08-24 17:50:47 -04:00
William Falcon 6068b29d29
ref: remove obscure forward call in eval + CPU backend ___step (#3123)
* remove obscure forward call in eval (same message repeated across 6 squashed commits)
2020-08-24 12:31:40 -04:00
William Falcon 18160b81b5
refactored horovod backend (#3122) 2020-08-24 11:13:49 -04:00
William Falcon 8ebf4fe173
ref: refactored horovod backend (#3121)
* refactored horovod backend

* refactored horovod backend
2020-08-24 10:35:32 -04:00
William Falcon 8d7ca5cd2c
ref: refactored gpu backend __step (#3120)
* refactored gpu backend __step

* refactored gpu backend __step

* refactored gpu backend __step

* refactored gpu backend __step
2020-08-24 09:22:05 -04:00
William Falcon 527b9dca36
refactored ddp backend forward (#3119) 2020-08-24 07:33:14 -04:00
William Falcon 3c88b0dd83
Refactor 1: moved tpu xxx_step to backend (#3118)
* moved tpu training_step

* refactored eval step

* refactored eval step

* refactored eval step
2020-08-24 07:02:06 -04:00
Ananya Harsh Jha 9445c800b0
set device to root gpu (#3042) 2020-08-18 19:28:35 -04:00
Adrian Wälchli 188e06c261
ddp fix for trainer.test() + add basic ddp tests (#2997)
* add ddp script variations

* add ddp test

* rename

* shell

* test

* test

* try call

* try without subprocess

* test

* display the error

* list all variations

* try string

* try copy env

* debug

* pythonpath

* path

* update test

* change

* simple ddp test

* replace

* remove random port

* random port

* str

* clean up

* check run spawn

* clean up

* docs

* docs

* update test

* docs

* changelog

* changelog
2020-08-16 11:19:57 -04:00
William Falcon e7794eb79a
Fixes #2407 (#2981)
* fix gpus index error
2020-08-14 16:22:48 -04:00
Jirka Borovec 5bce06c050
nb. devices (#2973) 2020-08-14 11:37:21 +02:00
William Falcon 0c264689cb
Fixes #2942 (#2969)
* Fixes #2942

* doc fix
2020-08-13 21:54:57 -04:00
Jirka Borovec 4354690e55
add apex test (#2921)
* add apex test

* rename

* level

* events

* wrap

* evt

* miss

* apex (same message repeated across 6 squashed commits)

* Update tests/models/test_amp.py

Co-authored-by: William Falcon <waf2107@columbia.edu>

* notes

* notes

Co-authored-by: William Falcon <waf2107@columbia.edu>
2020-08-13 10:03:13 -04:00
Phil e3528afae3
Move optimizer creation after device placement for ddp backends. (#2904) 2020-08-12 06:34:59 -04:00
Jirka Borovec a6e7aa7796
allow using apex with any PT version (#2865)
* wip

* setup

* type

* name

* wip

* docs

* imports

* fix if

* fix if

* use_amp

* Apply suggestions from code review

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Apply suggestions from code review

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* fix tests

* Apply suggestions from code review

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* fix tests

* todos

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2020-08-08 11:07:32 +02:00
Jirka Borovec b7d72706c3
clean imports (#2867)
* clean imports

* miss
2020-08-08 00:33:51 +02:00
Jirka Borovec f8c058215f
simplify tests & cleaning (#2588)
* simplify

* tmpdir

* revert

* clean

* accel

* types

* test

* edit test acc

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Update test acc

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2020-08-07 23:22:05 +02:00
William Falcon 4dbd761a1c
refactor 3/n (#2709)
* refactor into gpu accelerator (same message repeated across 21 squashed commits)
2020-07-25 20:56:50 -04:00
William Falcon b34217e410
Refactor 2/n (#2708)
* refactor into gpu accelerator (same message repeated across 7 squashed commits)
2020-07-25 17:31:34 -04:00
William Falcon 071e09fe38
refactor 1/n for v1.0.0 (#2704)
* refactor into gpu accelerator (same message repeated across 12 squashed commits)
2020-07-25 14:38:51 -04:00