Jirka Borovec
53b0ae49b9
fix imports / isort / flake8
2021-01-26 14:57:34 +01:00
SeanNaren
127e04124d
Fix merge issue
2021-01-26 14:29:47 +01:00
chaton
0435e23a64
deprecate enable_pl_optimizer as it is not restored properly ( #5244 )
...
* update
* clean test
* still in progress
* update test
* update
* update
* resolve flake
* add test for zero_grad
* update
* works without accumulated_grad
* update
* update
* resolve amp
* revert back to True
* update
* clean tests
* cleaned out
* typo
* update test
* repair git bug
* remove print
* update
* Fix formatting/optimizer imports
* Refactor the test for cleanliness
* Add vanilla model to the test, better var names
* Fixed var names, let's clean up these mock tests
* repair test
* update test
* resolve flake8
* add manual_optimization
* update tests
* resolve flake8
* add random accumulate_grad_batches
* improve test
* Update tests/trainer/optimization/test_parity_automatic_optimization.py
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
* Update tests/trainer/optimization/test_parity_automatic_optimization.py
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
* update
* clean tests
* correct bug
* Apply suggestions from code review
* format
* address comments
* update on comments
* wip
* typo
* deprecate enable_pl_optimizer
* resolve latest bugs
* update
* resolve merge
* add comment
* Update pytorch_lightning/core/lightning.py
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
* Update tests/deprecated_api/test_remove_1-3.py
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
* Update pytorch_lightning/trainer/connectors/optimizer_connector.py
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
* Update pytorch_lightning/trainer/trainer.py
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
* Update pytorch_lightning/trainer/trainer.py
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
* Update tests/trainer/optimization/test_parity_automatic_optimization.py
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
* update on comments
* update restore
* add a property
* remove setstate as not needed anymore
* update test
* provide optimizer to on_before_zero_grad
* update on comments
* update on comments
* Update pytorch_lightning/trainer/trainer.py
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* Update tests/trainer/optimization/test_parity_automatic_optimization.py
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* Update tests/trainer/optimization/test_parity_automatic_optimization.py
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* Update tests/trainer/optimization/test_parity_automatic_optimization.py
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* modify import
* update changelog
* resolve flake8
* update
* update
* clean doc
Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-62-109.ec2.internal>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
(cherry picked from commit f2e99d617f)
2021-01-26 14:29:46 +01:00
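For context on what this deprecation touches: in the 1.1-era API, Trainer(enable_pl_optimizer=...) toggled whether optimizers returned from configure_optimizers were handed back wrapped in LightningOptimizer; #5244 makes the wrapping consistent and turns the flag into a deprecated no-op (removal targeted for 1.3, per the tests/deprecated_api/test_remove_1-3.py file touched above). A minimal hedged sketch; the model is illustrative:

    import torch
    from pytorch_lightning import LightningModule, Trainer

    class BoringModel(LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(2, 2)

        def configure_optimizers(self):
            # wrapped in LightningOptimizer by the trainer regardless of the flag
            return torch.optim.SGD(self.parameters(), lr=0.1)

    # Deprecated by #5244: no longer changes behaviour, only emits a deprecation warning.
    trainer = Trainer(enable_pl_optimizer=True)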
chaton
f2f4a49271
[bug-fix] Call transfer_batch_to_device in DDPPlugin ( #5195 )
...
* hacking out
* update
* remove useless on_before_forward
* update
* remove overridden
* remove os
* use on_before_forward
* resolve flake8
* add test
* update
* add single_process_per_device
* resolve flake8
* update
* resolve
* update
* update
* update
* add comment
* resolve bug with sharded
* update
* remove property
* update
* resolve test
* resolve bug
* update on comments
* update doc
* Update pytorch_lightning/core/hooks.py
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
* update on comments
* Update pytorch_lightning/plugins/ddp_plugin.py
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
* Update pytorch_lightning/plugins/ddp_plugin.py
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
* resolve pep8
* add device_ids to pipe
* update on comments
* update
* resolve
* update
* update
* update
Co-authored-by: Ubuntu <ubuntu@ip-172-31-62-109.ec2.internal>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
(cherry picked from commit d510707bc9)
2021-01-26 14:28:45 +01:00
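A minimal sketch of the hook this fix makes the DDP plugin honour, assuming the ~1.1 signature from pytorch_lightning/core/hooks.py; the custom batch type is illustrative:

    import torch
    from pytorch_lightning import LightningModule

    class CustomBatch:
        def __init__(self, inputs: torch.Tensor):
            self.inputs = inputs

    class MyModel(LightningModule):
        def transfer_batch_to_device(self, batch, device):
            # after #5195, DDP routes batches through this hook instead of a bare .to(device)
            if isinstance(batch, CustomBatch):
                batch.inputs = batch.inputs.to(device)
                return batch
            return super().transfer_batch_to_device(batch, device)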
Jirka Borovec
2846322f60
fix docs render ( #5610 )
2021-01-25 20:21:00 -05:00
Arnaud Gelas
1ff6b18e8a
Fix pre-commit isort failure on pytorch_lightning/accelerators ( #5503 )
...
Remove from skipped module in pyproject.toml and fix failures on:
- pytorch_lightning/accelerators/*.py
2021-01-16 14:10:56 -05:00
Adrian Wälchli
e806bb77fa
Refactor LightningDistributedDataParallel ( #5185 )
...
* add wrapper
* add squeeze
* replace LightningDistributedDP
* update import
* module access
* inputs
* refactor warning
* update
* resolve flake8
* remove old class
* set find unused params to False
* update docstrings
* update docs
* update docs
* add changelog
* deprecation
* rename wrapper -> module
* rename pl_module
* add unit tests
* Revert "add changelog"
This reverts commit 02ec0a6864f4ba2ace3bb6fc6ebc364e1a80ffd7.
* Revert "set find unused params to False"
This reverts commit 8e451515e6ba3227d00f4a5cb63f332cfedb7b30.
Co-authored-by: Ubuntu <thomas@grid.ai>
2021-01-13 14:35:42 -05:00
Jirka Borovec
54d20dc596
Refactor: clean trainer device & distrib getters ( #5300 )
...
* warnings
* flake8
* use_tpu
* use_dp
* use_ddp
* use_horovod
2021-01-12 05:22:37 -05:00
Jirka Borovec
5ae6926a52
fix some minor typos in docs ( #5369 )
...
* fix docs typos
* Apply suggestions from code review
Co-authored-by: Wansoo Kim <rladhkstn8@gmail.com>
* flake8
Co-authored-by: Wansoo Kim <rladhkstn8@gmail.com>
2021-01-07 08:01:52 -05:00
ananthsub
a7fe24e9a1
Fix hang in DDP HPC accelerators ( #5157 )
...
* Fix hang in DDP HPC accelerators
init_device was never called
* Update CHANGELOG.md
2021-01-05 09:58:36 +01:00
Jirka Borovec
b72ed71d4e
Refactor: clean trainer device & distrib setters ( #5297 )
...
* naive replace
* simplify
* clean
* fix
2021-01-04 17:10:13 +00:00
Jirka Borovec
957583544a
mark todo exceptions ( #5320 )
...
* mark todo exceptions
* try
2021-01-04 09:07:56 +01:00
Jirka Borovec
0f36525e8f
fix/enable - check F401 ( #5201 )
...
* refactor - check F401
* missed
* fix
2020-12-21 10:15:04 +01:00
Jirka Borovec
2d54116baa
annotat unused vars ( #5017 )
...
* annotate all unused vars
* rank_zero_warn
* Apply suggestions from code review
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
* f1 fixed
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2020-12-19 13:53:06 +01:00
Jirka Borovec
059eaecbb4
set xxx_AVAILABLE as protected ( #5082 )
...
* set xxx_AVAILABLE as protected
* docs
2020-12-14 20:19:05 +05:30
chaton
2c3d43dcb5
Initialize trainer with None in DDPAccelerator ( #4915 )
...
* Initialize trainer with None
* add typing to all accelerators
* resolve imports
* update
* add typing
* removed typo
* update
* Fix formatting and imports in accelerator
Co-authored-by: maxjeblick <maxjeblick@users.noreply.github.com>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Roger Shieh <sh.rog@protonmail.ch>
2020-12-10 15:24:44 +01:00
Jirka Borovec
d5fa02e798
simplify accelerator steps ( #5015 )
...
* simplify accelerator steps
* Apply suggestions from code review
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
2020-12-10 18:36:13 +05:30
Jirka Borovec
cdbddbe99f
release 1.1.0 ( #5048 )
...
* release 1.1.0
* pep8
2020-12-10 00:52:39 +00:00
Jirka Borovec
ce9179591d
ref: clean config [1/n] add intermediate setters ( #4990 )
...
* add intermediate setters
* show inputs
* fix options
* move
* fix
* less talk
* fix
* talk less
* str
* cases
* rename
Co-authored-by: chaton <thomas@grid.ai>
2020-12-09 14:13:57 -05:00
Rohit Gupta
bcbba3b702
Simplify GPU and TPU accelerator ( #5024 )
2020-12-09 14:12:44 -05:00
Jirka Borovec
53d7c9555c
drop usage of deprecated distributed_backend ( #5009 )
...
Co-authored-by: chaton <thomas@grid.ai>
Co-authored-by: Roger Shieh <sh.rog@protonmail.ch>
2020-12-09 09:18:23 +01:00
Ananya Harsh Jha
127454ade2
All gather with grads ( #5012 )
...
* all_gather
* ddp
* horovod
* grad tests
* fixed ddp
* ddp fixed, removed tpu, horovod for now
* changelog
* windows fix
* windows fix
* removed batch from ctx
* removed code duplication
* merge
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-12-08 23:20:01 +00:00
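The gradient-preserving all_gather added here follows the standard autograd-function pattern; a hedged sketch, assuming an initialized torch.distributed process group (the class name is illustrative, not the shipped one):

    import torch
    import torch.distributed as dist

    class AllGatherWithGrad(torch.autograd.Function):
        @staticmethod
        def forward(ctx, tensor):
            gathered = [torch.zeros_like(tensor) for _ in range(dist.get_world_size())]
            dist.all_gather(gathered, tensor)  # the collective itself does not track gradients
            return torch.stack(gathered)

        @staticmethod
        def backward(ctx, grad_output):
            grad = grad_output.clone()
            dist.all_reduce(grad, op=dist.ReduceOp.SUM)  # sum every rank's contribution
            return grad[dist.get_rank()]  # hand back only this rank's slice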
Sean Naren
ee9b3fe574
[feat] pp 1/n ( #5016 )
...
* Added changes for RPC plugin
* Add missing kwargs
* Fix code format
* Loading refactors by introducing is_distributed var, fix optimizer step flow
* Add rpc guard
* Added docstrings and typing
* resolve comments
* Add additional rpc hook, refactor name of exit process hook for clarity
* remove annotation
* Modify behaviour to allow optional return, add test for rpc plugin
* resolve tests
* rename is_ddp_based
* update
* update for windows
* update
* resolve test
* code smell
* Revert back to init_ddp_connection for backwards compat
* Swap to explicit name for property
* Add missing speed parity increase for CI variability, fix call counts for child process
Co-authored-by: tchaton <thomas@grid.ai>
2020-12-08 22:02:10 +00:00
maxjeblick
79ae66d026
Initialize trainer with None ( #4847 )
...
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: chaton <thomas@grid.ai>
Co-authored-by: edenlightning <66261195+edenlightning@users.noreply.github.com>
2020-12-08 22:49:55 +05:30
chaton
2393474350
[hotfix] ddp + manual_optimisation ( #4976 )
...
* Rely on ddp plugin for blocking sync behaviour, and skip if we're using manual optimization
* debug
* Revert "debug"
This reverts commit ccca6b6b
* Expose manual reduce for automatic optimization
* Add input arguments
* Enable parity test
* clean imports
* Expose hook after to ensure we reset
* Fix naming
* add
* fix test
* resolve on comments
* typo
* Update tests/trainer/optimization/test_manual_optimization.py
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
* Update tests/trainer/optimization/test_manual_optimization.py
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
* update on comments
* resolve comments
Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-12-07 19:31:54 +00:00
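For reference, a hedged sketch of the manual-optimization flow this hotfix exercises under DDP, assuming the 1.0/1.1-era API in which manual_backward takes the optimizer; the loss is illustrative:

    import torch
    from pytorch_lightning import LightningModule

    class ManualOptModel(LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(2, 2)

        @property
        def automatic_optimization(self) -> bool:
            return False  # opt out of the automatic loop

        def training_step(self, batch, batch_idx):
            opt = self.optimizers()
            loss = self.layer(batch).sum()  # illustrative
            self.manual_backward(loss, opt)  # routes backward through the accelerator/AMP
            opt.step()
            opt.zero_grad()

        def configure_optimizers(self):
            return torch.optim.SGD(self.parameters(), lr=0.1)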
chaton
02152c1729
Simplify optimization Logic ( #4984 )
...
* Rely on ddp plugin for blocking sync behaviour, and skip if we're using manual optimization
* debug
* Revert "debug"
This reverts commit ccca6b6b
* Expose manual reduce for automatic optimization
* Add input arguments
* Enable parity test
* clean imports
* Expose hook after to ensure we reset
* Fix naming
* add
* fix test
* uniformize optimizer logic
* resolve test
* resolve flake8
* resolve amp bug
* update tests
* remove bug
* remove optimizer_step in accelerators
* typo
* update lightning optimizer
* set doesn't work with ddp_spawn
* resolve flake8
* update threshold
* ignore pyright
* correct codeFactor
* remove useless if
* remove zero_grad function
* simplify step
* remove typo
* resolve bug
* Apply suggestions from code review
* update on comments
* resolve bugs
* remove tests
* Update pytorch_lightning/trainer/configuration_validator.py
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
* simplify testing
* add more tests
Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
2020-12-07 12:55:49 +00:00
Jirka Borovec
3976db597d
refactor imports of optional dependencies ( #4859 )
...
* refactor imports of optional dependencies
* fix
* fix
* fix
* fix
* fix
* flake8
* flake8
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: chaton <thomas@grid.ai>
2020-12-04 10:26:10 +01:00
Lezwon Castelino
12cb9942a1
Tpu save ( #4309 )
...
* convert xla tensor to cpu before save
* move_to_cpu
* updated CHANGELOG.md
* added on_save to accelerators
* if accelerator is not None
* refactors
* change filename to run test
* run test_tpu_backend
* added xla_device_utils to tests
* added xla_device_utils to test
* removed tests
* Revert "added xla_device_utils to test"
This reverts commit 0c9316bb
* fixed pep
* increase timeout and print traceback
* lazy check tpu exists
* increased timeout
removed barrier for tpu during test
reduced epochs
* fixed torch_xla imports
* fix tests
* define xla utils
* fix test
* aval
* chlog
* docs
* aval
* Apply suggestions from code review
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: chaton <thomas@grid.ai>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
2020-12-02 13:05:11 +00:00
chaton
c2e6e68c7e
optimizer clean up ( #4658 )
...
* add LightningOptimizer
* typo
* add mock closure
* typo
* remove logic in optimizer_step
* update
* update
* update
* deactivate LightningOptimizer for horovod
* resolve flake
* typo
* check optimizer name
* change name
* added backward to LightningOptimizer
* remove use_lightning_optimizer
* move update
* simplify init
* resolve comments
* resolve bug
* update
* update
* resolve bugs
* resolve flake8
* set state
* make manual_optimizer_step work
* add doc
* add enable_pl_optimizer
* make optimizer_step
* add make_optimizer_step
* add examples
* resolve test
* add test_optimizer_return_options_enable_pl_optimizer
* add enable_pl_optimizer=True
* update
* update tests
* resolve bugs
* update
* set Trainer to False
* update
* resolve bugs
* update
* remove from doc
* resolve bug
* typo
* update
* set to True
* simplification
* typo
* resolve horovod
* unwrap horovod
* remove Optimizer
* resolve horovod
* move logic to amp_backend
* doesn't seem to be picklable
* update
* add again
* resolve some bugs
* cleanup
* resolve bug with AMP
* change __repr__
* round at -12
* update
* update
* update
* remove from horovod
* typo
* add convert_to_lightning_optimizers in each accelerators
* typo
* forgot
* forgot a convert_to_lightning_optimizers
* update
* update
* update
* increase coverage
* update
* resolve flake8
* update
* remove useless code
* resolve comments + add support for LightningOptimizer base class
* resolve flake
* check optimizer get wrapped back
* resolve DDPSharded
* reduce code
* lightningoptimizer
* Update pytorch_lightning/core/optimizer.py
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
* Update pytorch_lightning/core/lightning.py
* remove reference to step function
* Apply suggestions from code review
* update on comments
* resolve
* Update CHANGELOG.md
* add back training_step in apex and native_amp
* rename optimizer_step
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: William Falcon <waf2107@columbia.edu>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
2020-12-01 00:09:46 +00:00
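The LightningOptimizer this PR introduces is, at heart, a transparent proxy around the user's optimizer so the trainer can intercept step(); a much-simplified sketch of the idea, not the shipped class:

    import torch

    class LightningOptimizerSketch:
        def __init__(self, optimizer: torch.optim.Optimizer):
            self._optimizer = optimizer

        def __getattr__(self, name):
            # delegate param_groups, state_dict(), zero_grad(), ... to the wrapped optimizer
            return getattr(self._optimizer, name)

        def step(self, closure=None):
            # the real class defers to trainer/accelerator logic here (AMP scaling, hooks, TPU)
            if closure is not None:
                self._optimizer.step(closure)
            else:
                self._optimizer.step()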
Jirka Borovec
217650320e
simplify imports Omegaconf ( #4873 )
...
* hydra
* omegaconf
2020-11-27 01:00:56 +01:00
Jirka Borovec
442d57f1e9
simplify imports xla / TPU ( #4872 )
...
* xla
* tpu
* fix
* fix
* flake8
2020-11-27 00:37:48 +01:00
Sean Naren
404af43cde
5/n: Extract reference model call to plugins/accelerators ( #4773 )
...
* Encapsulate extracting reference model within the plugin to allow custom wrapper logic to live within the plugin/accelerators
* Add missing new lines
* Fix call to accelerator
* Removed double blank
* Use accelerator backend
* Handle case where wrapper has not been initialized within the plugin
* Added basic get model tests, add better typing
* Change model name
* Split GPU/DDP test
* Add stronger typing, skip ddp test on windows
* Fix import
* Fix import in dp
* Fixed PEP8 definition
* Add ddp launcher for ddp testing
* Modify accelerator reference model to property, change name to reflect func
* Revert property as this is incorrect.
* Revert across accelerators
* Modified name to get_model_from_plugin
* Code review changes, fix issue with dp
* Add verb to function getter
Co-authored-by: chaton <thomas@grid.ai>
2020-11-23 17:21:47 +00:00
ananthsub
45c57600af
Move init_ddp_connection to DDP Plugin ( #4407 )
...
* Move init_ddp_connection to DDP Plugin
* cluster-env
* trainer?
* imports
* Update ddp_plugin.py
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
2020-11-18 15:49:22 -05:00
Sean Naren
e7134a9135
Sharded Plugin 2/n: Allow ddp plugin to modify optimizer state saving ( #4675 )
...
* Allow ddp plugin to modify optimizer state saving
* Rely on the accelerator for optimizer states
* Ensure we init the accelerator for the saving function
* Better comment for optim state dump
* Revert "Ensure we init the accelerator for the saving function"
This reverts commit af65effa
* Added accelerator check to initialize tuner before saving model checkpoint
* Simplify comment
* Revert "Added accelerator check to initialize tuner before saving model checkpoint"
This reverts commit f9929c0c
* Return single optimizer state to reduce duplication
* Fixed docstring
* Fixed typing
* Fixed comment
* Added CHANGELOG.md
Co-authored-by: chaton <thomas@grid.ai>
2020-11-18 16:38:35 +00:00
Sean Naren
8283680aa0
Sharded Plugin 3/n: Expose step input to DDP plugin ( #4686 )
...
* Allow ddp plugin to move the input to a different device if needed
* Swapped name to on_before_forward to align with hooks in the future
* Update pytorch_lightning/plugins/ddp_plugin.py
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
* Pass variable arg type to hook, add example
* Remove blank space (pep check)
* Added blank line
Co-authored-by: William Falcon <waf2107@columbia.edu>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-11-18 15:45:30 +00:00
chaton
4018237c30
[FEAT] Add lambda closure to manual_optimizer_step ( #4618 )
...
* added lambda_closure
* move to types
* add 2 new tests
* make example more complex
* add complex example to doc
* added more tests
* resolve doc
* typo
* update
* update tpu optimizer_step
* Apply suggestions from code review
* Update pytorch_lightning/core/lightning.py
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
* update
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-11-12 19:22:06 +00:00
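The reason a closure argument matters: optimizers such as LBFGS re-evaluate the loss several times per step, so the step call must receive a callable that redoes forward and backward. A plain-torch illustration (model and batch are placeholders):

    import torch

    model = torch.nn.Linear(4, 1)
    optimizer = torch.optim.LBFGS(model.parameters())
    batch = torch.randn(8, 4)

    def closure():
        optimizer.zero_grad()
        loss = model(batch).pow(2).mean()
        loss.backward()
        return loss

    optimizer.step(closure)  # LBFGS may invoke closure() multiple times within one step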
Sean Naren
bacabaebaf
Sharded Accelerator 1/n: Expose clip gradients to plugins via abstract class ( #4639 )
...
* Added abstract precision plugin to expose clip_gradients function, use within accelerator to clip gradients
* Exclude model from override, keep optimizer (needed for sharded clip gradients), add override for O2 support apex
* Fix doc
* Applied codereview changes
* Refactored clip function to encapsulate tpu changes with tpu accelerator. Default to standard clip function for vanilla torch
* Pass correct grad clip val
* Moved var to property
* Apply code review suggestions
2020-11-12 17:18:09 +00:00
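A minimal sketch of the vanilla-torch fallback described in the bullets above; the function name and signature are assumptions, only the clipping call is the real API:

    import torch

    def clip_gradients(optimizer: torch.optim.Optimizer, grad_clip_val: float) -> None:
        # default (non-TPU, non-sharded) path: clip the global grad norm of all optimized params
        parameters = [p for group in optimizer.param_groups for p in group["params"]]
        torch.nn.utils.clip_grad_norm_(parameters, grad_clip_val)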
Sean Naren
33470ba605
Prevent crash if sync_dist=True on CPU ( #4626 )
...
* Added test/fix for sync_dist raising NotImplementedError
* Fixed comments/formatting
* Revert base class change, enforce sync tensors across accelerators, added GPU test
2020-11-11 22:04:05 +00:00
chaton
7e08b0d710
[bug-fix] DDP and automatic_optimization=False ( #4485 )
...
* resolve bug
* add self._running_manual_optim
* update
* update tests
* update lightning module
* resolve bug
* update tests
* update
* resolve pep8
* update
* replace by `ddp_spawn`
* temporary fix
* update
* update
* move update to training_loop
* make both ddp_spawn
* introduce `manual_optimizer_step`
* update changelog
* added changelog wrong place
* add force_optimizer_step
* update docstring for tests
* update optimizer_step
* update zero_grad
* resolve flake8
* move update into manual_optimizer_step
* add zero_grad
* remove zero_grad tests
* remove manual_backward in AMP, it doesn't help
* update
* loosen tests
* update
* update doc
* add TODO
* Removed unnecessary get model from native amp
* Remove try except with pytest raise
* Add seed, clean up imports, remove try catch to reproduce error
* update code
* update test
* revert back
* formatting
* Update pytorch_lightning/core/lightning.py
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-11-10 19:44:51 +00:00
William Falcon
ee35907170
Accelerator docs ( #4583 )
...
* accelerator docs
* accelerator docs
2020-11-08 17:24:41 -05:00
William Falcon
3ba48d3bc4
ref: unify slurm and TE under backendPlugin 5/n" ( #4582 )
...
* ref: unify slurm and TE under backendPlugin 4/n
* ref: unify slurm and TE under backendPlugin 5/n
2020-11-08 16:20:19 -05:00
William Falcon
624f5b5938
ref: unify slurm and TE under backendPlugin 3/n ( #4581 )
2020-11-08 15:32:37 -05:00
William Falcon
bfaf014096
ref: unify slurm and TE under backendPlugin 2/n ( #4580 )
2020-11-08 15:07:16 -05:00
William Falcon
0f64f15f52
ref: unify slurm and TE under backendPlugin 1/n ( #4578 )
...
* ref: unify slurm and TE under backendPlugin
* ref: unify slurm and TE under backendPlugin
2020-11-08 14:28:55 -05:00
cool425589
5e09fd31e9
show progressbar only on progress_rank 0 on ddp_slurm ( #4437 )
...
Co-authored-by: chaton <thomas@grid.ai>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2020-11-06 01:36:22 +01:00
Travis Addair
51cc7a89ee
Horovod: fixed early stopping and added metrics aggregation ( #3775 )
...
* Fixed early stopping for Horovod
* Refactored to sync_dist_if_available
* Bump min Horovod version to support hvd.is_initialized
* Changelog
* Added back change for Horovod
* Removed redundant checks for initialization
* Implement metrics gathering for Horovod
* Added test for EvalResult
* Renamed ddp_sync_on_step -> dist_sync_on_step
* Added metric test for Horovod
* Added option pass callable allgather function to metric base class
* Added dist_sync_fn
* Fixed calls to private _sync_dist
* Fixed Horovod test
* Added sync_tensor to the distributed backend
* Skip Windows
* Insert test path
* Removed redundant import
* Updated drone
* Unset HOROVOD_GPU_ALLREDUCE
* Unset
* No cache dir
* No uninstall
* Unset variables
* Uninstall Horovod during initialization
* Replaced more references to ddp_sync_on_step
* Fixed imports
* Fixed attribute
* Added back default
* Lint
* Added back docstring
* Made gather_all_tensors default
* Added whitespace
* Update tests/models/test_horovod.py
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
* Update pytorch_lightning/metrics/metric.py
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
* Update CHANGELOG.md
Co-authored-by: Teddy Koker <teddy.koker@gmail.com>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-11-05 12:52:02 -05:00
Ananya Harsh Jha
01ab2a933d
[bug] [docs] Clearer optimizer_step override instructions ( #4455 )
...
* fix
* flags
* remove defaults
2020-11-02 22:13:34 +00:00
chaton
102fa9ee7d
[BUGFIX] AMP + Precision unscale grad ( #4441 )
...
* move unscale within Native plugin
* remove gradient tracking from lightning backward
* forgot trainer.fit
* typo
* update
* cleanup
* set to 1.6
* typo
* skip if below 1.6 strict
* update changelog
* remove useless code
* Update tests/plugins/test_amp_plugin.py
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
* Update tests/plugins/test_amp_plugin.py
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
* update changelog
* Update CHANGELOG.md
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: Jeff Yang <ydcjeff@outlook.com>
2020-11-02 16:36:48 +00:00
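The underlying torch >= 1.6 pattern this fix aligns with: unscale gradients inside the native AMP plugin before any clipping, so clipping sees true magnitudes. Illustrative snippet (requires a CUDA device; model and loss are placeholders):

    import torch

    model = torch.nn.Linear(4, 1).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    scaler = torch.cuda.amp.GradScaler()

    with torch.cuda.amp.autocast():
        loss = model(torch.randn(8, 4, device="cuda")).pow(2).mean()

    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)  # unscale first, so clipping operates on real gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(optimizer)      # skips the update if inf/nan gradients were found
    scaler.update()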
Adrian Wälchli
28d45a26a3
Set correct device ids in DDP [wip] ( #4297 )
...
* repro
debug
rank
set
drop PL_DDP_PID
clean up
keep set gpus
revert
Revert "drop PL_DDP_PID"
This reverts commit 7d88cae469541ef19128f9c20919fd3a6f863039.
pid
gpus
clean up
misconfig?
misconfig
clean
* fix pep
* changelog
* remove script
Co-authored-by: chaton <thomas@grid.ai>
Co-authored-by: William Falcon <waf2107@columbia.edu>
2020-10-24 17:33:47 -04:00
Sean Naren
5641b266d5
Bug/4319 ddp checkpoint ( #4323 )
...
* Broadcast best model path to ensure we sync with main process + wait for main process to save
* Add barrier call to ensure all processes are in sync
* Added changelog commit
* Move sync of best model path/score to model checkpoint, keep barrier to ensure all processes complete
* Ensure we broadcast as tuple
* Add init check
* Update pytorch_lightning/callbacks/model_checkpoint.py
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
* Update pytorch_lightning/callbacks/model_checkpoint.py
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
* Removed model checkpoint code, added barrier to trainer to enforce we synchronize and wait for all processes to finish before completing training
* Add barrier within teardown call, removed horovod teardown to inherit from base accelerator
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
2020-10-24 16:55:49 -04:00
William Falcon
753362d0a4
enable ddp as a plugin ( #4285 )
...
* enable custom ddp plugin
Co-authored-by: chaton <thomas@grid.ai>
2020-10-22 05:15:51 -04:00
Justus Schock
0ec4107697
Optimizer closure ( #4190 )
...
* closure for all optimizers
* rename hook and take care of alternating backwards
* add comment
* training_loop_fix
* closure whenever possible
* training_loop
* simple tests that count backward calls
* fix test to work with closure
* remove debugging statement
* better place
* check grads after backward
* start fixing manual optimization
* skip step when result returned by closure was None
* fix gradient clipping test to work with closure
* attribute dict result only for automatic optimization
* adjust backward calls in accelerator
* adjust where to call gradient clipping
* adjust backward calls in tests
* Apply suggestions from code review
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* pass kwargs to xla optimizer
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2020-10-21 19:34:29 +01:00
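What "closure for all optimizers" means in practice: the training loop wraps training_step plus backward in a callable and always passes it to optimizer.step(), the path LBFGS-style optimizers require anyway. A plain-torch sketch of the equivalent shape (model and batch are placeholders):

    import torch

    model = torch.nn.Linear(4, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    batch = torch.randn(8, 4)

    def optimizer_closure():
        optimizer.zero_grad()
        loss = model(batch).pow(2).mean()  # stands in for training_step
        loss.backward()
        return loss

    optimizer.step(optimizer_closure)  # plain torch: step() accepts an optional closure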
Akihiro Nitta
d27ee8b5bf
docs: Add empty lines in docstring [ci skip] ( #4232 )
...
* Add empty lines in docstring for proper docs
* Remove Returns:
* Remove unnecessary Returns:
* Update pytorch_lightning/accelerators/ddp2_accelerator.py
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
* fix returns
Co-authored-by: William Falcon <waf2107@columbia.edu>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
2020-10-21 09:00:39 -04:00
Jirka Borovec
f37444fa3e
CI: add flake8 ( #4239 )
2020-10-19 21:20:17 +01:00
Akihiro Nitta
b45b57cc58
Use `Optional` for arguments set to `None` by default ( #4164 )
...
* Use `Optional` for variables set to `None` by default
* Use `Optional` instead of `Union[None, ...]` for consistency
2020-10-15 23:02:50 +02:00
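The convention in one line: a parameter that defaults to None is annotated Optional[X] rather than bare X, and Optional[X] is preferred over the equivalent Union[None, X]:

    from typing import Optional, Union

    def fit(max_steps: Optional[int] = None) -> None:  # preferred
        ...

    def fit_verbose(max_steps: Union[None, int] = None) -> None:  # equivalent, discouraged
        ...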
Sean Naren
98eb736496
Added getstate/setstate method for torch.save serialization ( #4127 )
...
* Added getstate/setstate method for torch.save serialization, added additional Optional Typing to results object
* Added tests to ensure torch.save does not fail
* Added flags to ensure compatible ddp cpu environment
* Removed torch version check due to minimum already being 1.3, reduced epochs for speed
* Moved tests to separate file
* Update to accelerator, move to ddp_spawn to prevent hanging ddp
2020-10-13 16:47:23 -04:00
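The hooks named in the title follow the standard pickle protocol; a minimal sketch of the pattern (the non-serializable attribute is illustrative):

    class Results:
        def __init__(self):
            self.values = {}
            self._handle = open(__file__)  # illustrative state that pickle/torch.save rejects

        def __getstate__(self):
            state = self.__dict__.copy()
            state.pop("_handle", None)  # drop unpicklable state before serialization
            return state

        def __setstate__(self, state):
            self.__dict__.update(state)
            self._handle = None  # restore a safe default on load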
William Falcon
09c2020a93
notices ( #4118 )
2020-10-13 07:18:07 -04:00
William Falcon
4c4b090c66
depre ( #4088 )
2020-10-12 05:58:31 -04:00
William Falcon
b9f2682b7d
clean docs, enable grad clip in manual mode ( #4078 )
...
* docs
* docs
2020-10-11 13:12:35 -04:00
William Falcon
7ffe05a3d1
ref: accelerator names ( #4066 )
...
* ref: accelerator names
* docs
2020-10-11 01:05:14 -04:00
William Falcon
a4b9221fc5
ref: decouple apex second attemp part n/n ( #4065 )
...
* ref: decouple apex second attemp part n/n
* ref: decouple apex second attemp part n/n
2020-10-10 22:04:50 -04:00
William Falcon
0281b077d8
ref: decouple apex second attemp part 10/n ( #4064 )
...
* ref: decouple apex second attemp part 9/n
* ref: decouple apex second attemp part 9/n
* ref: decouple apex second attemp part 9/n
2020-10-10 20:05:05 -04:00
William Falcon
dca86c310e
ref: decouple apex second attemp part 6/n ( #4060 )
...
* ref: decouple apex second attemp part 6/n
* ref: decouple apex second attemp part 6/n
2020-10-10 15:28:25 -04:00
William Falcon
ce2edf1192
ref: decouple apex second attemp part 4/n ( #4056 )
...
* ref: decouple apex second attemp part 4/n
* ref: decouple apex second attemp part 4/n
* Update lightning.py
* ref: decouple apex second attemp part 4/n
2020-10-10 12:19:22 -04:00
William Falcon
3a6717ca34
ref: decouple apex second attemp part 3/n ( #4055 )
2020-10-10 11:05:57 -04:00
William Falcon
7285613974
ref: decouple apex second attemp part 2/n ( #4054 )
...
* ref: decouple apex second attemp part 2/n
* ref: decouple apex second attemp part 2/n
2020-10-10 10:24:20 -04:00
William Falcon
e854d3744c
ref: decouple apex second attemp part 1/n ( #4052 )
2020-10-10 09:53:02 -04:00
William Falcon
5b261a230e
enable passing in custom accelerators ( #4050 )
...
* enable custom accelerators
* ref: finish decoupling apex, LM and backward
* ref: finish decoupling apex, LM and backward
* ref: finish decoupling apex, LM and backward
2020-10-10 09:21:08 -04:00
William Falcon
2b255a3df4
ref: enable custom clusters (1/n) ( #4048 )
...
* enable cluster plugins
* enable cluster plugins + test backend choices
* enable cluster plugins + test backend choices
* enable cluster plugins + test backend choices
* enable cluster plugins + test backend choices
* enable cluster plugins + test backend choices
* enable cluster plugins + test backend choices
2020-10-10 08:09:29 -04:00
William Falcon
0c42aa03fd
enables plugins ( #4041 )
...
* plugin hardware
* plugin hardware
* plugin hardware
2020-10-09 22:03:46 -04:00
William Falcon
048a816be3
added tests for the training epoch end ( #3967 )
2020-10-07 22:27:36 -04:00
William Falcon
b922409624
clean and organize fit ( #3938 )
...
* clean and organize fit
2020-10-07 11:04:10 -04:00
William Falcon
9c415d2c71
moves configure ddp to each backend ( #3924 )
...
* moves configure ddp to each backend
* moves configure ddp to each backend
* moves configure ddp to each backend
* added torch manual seed in test_mean_error
* test for complicated batch structure
* test for complicated batch structure
* test for complicated batch structure
Co-authored-by: ananyahjha93 <ananya@pytorchlightning.ai>
2020-10-07 00:50:16 -04:00
William Falcon
e3007ffe0c
moves sync bn to each backend ( #3925 )
2020-10-06 22:42:33 -04:00
William Falcon
af5887c0aa
fixed ddp flag crash ( #3927 )
2020-10-06 22:41:08 -04:00
Lezwon Castelino
69833dad5b
Added check to verify xla device is TPU ( #3274 )
...
* tpu device check
* replaced with xmp spawn
* Revert "replaced with xmp spawn"
This reverts commit 6835380f
* replaced all instances of XLA_AVAILABLE
* moved inner_f to global scope
* made refactors
* added changelog
* added TPU_AVAILABLE variable
* fix codefactor issues
* removed form trainer and early stopping
* add TORCHXLA_AVAILABLE check
* added tests
* refactoring
* Update pytorch_lightning/utilities/xla_device_utils.py
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* updated function names
* fixed bug
* updated CHANGELOG.md
* added todo
* added type hints
* isort and black
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: William Falcon <waf2107@columbia.edu>
2020-10-06 19:54:37 +02:00
Sean Naren
e4a56fa5cf
Ensure global seed exists before passing into env subprocess.Popen call ( #3904 )
2020-10-06 12:31:49 -04:00
William Falcon
70e792344a
test selecting the correct backend. temp backends while slurm and TE are decoupled ( #3848 )
...
* test selecting the correct backend. temp backends while slurm and TE are decoupled
2020-10-04 15:44:50 -04:00
William Falcon
2c21f7d7e2
ref: adding compute environments (2/n) ( #3842 )
...
* ref: adding compute environments (2/n)
* ref: adding compute environments (2/n)
* ref: adding compute environments (2/n)
* ref: adding compute environments (2/n)
2020-10-04 08:48:46 -04:00
Lezwon Castelino
4da240ea1b
added broadcast option to tpu ( #3814 )
...
* added broadcast option to tpu
* add device
* moved tpu broadcast to tpu_backend
* removed Lightning dist
* decode bytes
* pep8 fix
* fix bug
* test for broadcast
* updated changelog
2020-10-04 07:47:33 -04:00
William Falcon
1f8ff7c48c
ref: callback system and init ddp (1/n) ( #3836 )
...
* refactored callback system and init ddp
* refactored callback system and init ddp
* refactored callback system and init ddp
* refactored callback system and init ddp
2020-10-03 23:39:17 -04:00
William Falcon
35d1111994
[WIP] ref: decoupled ddp, ddp spawn (finish 3733) ( #3819 )
...
* ref: finish #3733
* remove deprecated test
* Update pytorch_lightning/accelerators/ddp_backend.py
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
* remove deprecated test
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
2020-10-03 14:05:31 -04:00
William Falcon
ed1450a293
ref: clean up ddp before final fix ( #3817 )
...
* ref: clean up ddp before final fix
2020-10-03 12:01:02 -04:00
William Falcon
0838c6bfce
ref: decoupled ddp2 ( #3816 )
2020-10-03 09:02:35 -04:00
William Falcon
a677833f84
ref: separate slurm from ddp ( #3809 )
...
* ref: separate slurm from ddp
* ref: separate te from ddp
* ref: merge
* ref: merge
* ref: merge
2020-10-02 23:08:34 -04:00
William Falcon
74484edecd
ref: separate te from ddp ( #3810 )
...
* ref: separate te from ddp
* ref: separate te from ddp
* ref: separate te from ddp
2020-10-02 21:00:51 -04:00
William Falcon
a28528cc8b
ref: remove weight loading hack for ddp_cpu ( #3808 )
2020-10-02 19:28:50 -04:00
William Falcon
afa43837a4
ref: part 8 of #3733 ( #3806 )
2020-10-02 18:46:18 -04:00
ananthsub
3ab730e316
Swap torch.load for fsspec load in ddp spawn backend ( #3787 )
...
* Update ddp_spawn_backend.py
* Update ddp_cpu_spawn_backend.py
* log
Co-authored-by: Jirka Borovec <jirka@pytorchlightning.ai>
2020-10-02 21:00:01 +02:00
William Falcon
7c6ed1fa28
ref: part 7 of #3733 ( #3802 )
...
* ref: part 7 of #3733
* ref: part 7 of #3733
2020-10-02 14:23:27 -04:00
Jirka Borovec
62eabdd535
revert backend types ( #3788 )
...
* revert backend types
* todo
* todo
2020-10-02 06:18:44 -04:00
Akihiro Nitta
ebc1b23fa3
Use `raise .. from ..` to explicitly chain exceptions ( #3750 )
...
* Fix exception chaining
* names
* Change exception names for consistency
Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>
* Change exception names for consistency
Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>
Co-authored-by: Jirka Borovec <jirka@pytorchlightning.ai>
Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>
2020-10-01 21:45:44 +02:00
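For reference, the chaining form this PR standardizes on, with an illustrative optional-import guard:

    from pytorch_lightning.utilities.exceptions import MisconfigurationException

    try:
        import torch_xla  # noqa: F401
    except ImportError as err:
        # `from err` keeps the original traceback attached as the explicit cause
        raise MisconfigurationException("TPU support requires torch_xla") from err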
William Falcon
622c5c3982
ref: part 4 of #3733 ( #3773 )
...
* ref: part 4 of #3733
* ref: part 4 of #3733
* ref: part 4 of #3733
* ref: part 4 of #3733
2020-10-01 11:26:58 -04:00
William Falcon
440f837f6d
ref: part a of #3733 ( #3766 )
...
* ref: part a of #3733
* ref: part a of #3733
2020-10-01 08:15:23 -04:00
Lezwon Castelino
8be002ccc7
skip best_model_path if checkpoint_callback is None ( #2962 )
...
* skip best_model_path if checkpoint_callback is None
* removed test
2020-10-01 06:57:26 -04:00
William Falcon
a38d108a68
add dist lib to enable syncing anything across devices ( #3762 )
...
* add dist lib to enable syncing anything across devices
2020-10-01 01:21:38 -04:00
Jirka Borovec
31a36f04df
define distributed as a type ( #3740 )
...
* define type
* miss
* Apply suggestions from code review
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
* miss
* warn
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
2020-09-30 08:33:01 -04:00
William Falcon
c41ea86b35
ref: move backends back to individual files (1/5) (ddp_cpu) ( #3712 )
...
* ref: make each backend independent for easier debugging and independent debugging
* ref: test val epoch end
* ref: test val epoch end
2020-09-29 01:59:18 -04:00
Rohit Gupta
783750547d
disable optimizers setup during testing ( #3059 )
...
* disable configure_optimizers during testing
* minor changes
* hvd and ddp
* fix precision during testing
* fix ddp
* fix amp
* fix cpu
* update dp
* simplify optimizers
* add test
* codefactor
* ref optimizer setup
* chlog
* suggestions
* isort
* rebased with master
2020-09-29 01:09:04 +02:00
William Falcon
931995b55b
remove flake 8 ( #3687 )
2020-09-27 20:40:02 -04:00
William Falcon
031274c25d
fix dp issues + update examples and test examples ( #3618 )
...
* fix dp
* fix examples
2020-09-23 00:19:46 -04:00
Adrian Wälchli
a71d62d840
Fix deterministic behavior in ddp_spawn ( #3573 )
...
* docs
* set env variable
* fix
* changelog
2020-09-20 19:42:58 -04:00
William Falcon
890588a9ee
ref: precision plugins 1/n ( #3504 )
...
* ref: precision plugins 1/n
* ref: precision plugins 1/n
2020-09-15 09:56:12 -04:00
William Falcon
810b445097
ref: apex plugin ( #3502 )
...
* ref: apex plugin
* ref: apex plugin
* ref: apex plugin
2020-09-15 06:02:42 -04:00
William Falcon
6bcfa8b068
ref: merge backends x/n ( #3482 )
2020-09-12 16:28:29 -04:00
William Falcon
518a0c0e92
ref: merge backends x/n ( #3480 )
2020-09-12 15:27:11 -04:00
William Falcon
0045119b3f
ref: merge backends x/n ( #3478 )
...
* ref: merge backends x/n
* ref: merge backends x/n
* ref: merge backends x/n
* ref: merge backends x/n
2020-09-12 13:55:55 -04:00
William Falcon
00d155ae01
ref: merge backends x/n ( #3477 )
2020-09-12 12:36:55 -04:00
William Falcon
59d8472548
ref: slurm connector 1/n ( #3476 )
...
* ref: slurm connector 1/n
* ref: slurm connector 1/n
* ref: slurm connector 1/n
* ref: slurm connector 1/n
2020-09-12 11:07:15 -04:00
William Falcon
ff0064f956
ref: group connectors ( #3472 )
...
* ref: accelerator connector methods 3/n
* ref: accelerator connector methods 3/n
2020-09-11 23:33:09 -04:00
William Falcon
dd324e4086
ref: accelerator connector methods x/n ( #3470 )
2020-09-11 22:25:48 -04:00
William Falcon
de99222834
ref: accelerator connector methods x/n ( #3469 )
...
* ref: accelerator connector methods x/n
* ref: accelerator connector methods x/n
2020-09-11 21:52:22 -04:00
William Falcon
ef20310873
ref: move specific accelerator code x/n ( #3457 )
...
* ref: organize args x/n
* ref: move specific accelerator code x/n
* ref: move specific accelerator code x/n
* ref: move specific accelerator code x/n
2020-09-11 10:56:21 -04:00
William Falcon
70af47db84
ref: organize args 4/n ( #3456 )
2020-09-10 21:58:47 -04:00
William Falcon
3281586ab4
ref: organize args 3/n ( #3449 )
...
* ref: organize args 3/n
2020-09-10 13:21:04 -04:00
William Falcon
a208d6da46
ref: organize args 2/n ( #3448 )
...
* ref: organize args 2/n
* ref: organize args 2/n
* ref: organize args 2/n
2020-09-10 10:51:35 -04:00
William Falcon
541c4ab01d
ref: organize args 3/n ( #3447 )
...
* ref: organize args 2/n
* ref: organize args 2/n
* ref: organize args 2/n
* ref: organize args 2/n
2020-09-10 08:55:30 -04:00
William Falcon
deb82d9c08
ref: organize args 2/n ( #3442 )
...
* ref: organize args 2/n
* ref: organize args 2/n
2020-09-10 08:07:55 -04:00
William Falcon
49290a569b
ref: organize args 1/n ( #3435 )
...
* ref: organize args 1/n
* ref: organize args 1/n
2020-09-10 07:24:42 -04:00
William Falcon
8f6b115511
ref: added model connector ( #3407 )
...
* ref: added model connector
* ref: added model connector
* ref: added model connector
2020-09-09 00:24:20 -04:00
Travis Addair
091d37f968
Added check for apex AMP and unit tests for Horovod + AMP ( #3404 )
...
* Added check for apex AMP and unit tests for Horovod + AMP
* Changelog
* Fixed order of Horovod and Apex optimizer wrapping
2020-09-08 20:30:57 -04:00
William Falcon
9939f53b7c
ref: inner train loop (intermediate step) 12/n ( #3372 )
...
* ref: inner train loop (intermediate step) 12/n
2020-09-06 17:50:47 -04:00
William Falcon
38b9677638
ref: inner train loop (intermediate step) 5/n ( #3365 )
2020-09-05 18:27:28 -04:00
William Falcon
c7ef5ee874
ref: inner train loop (intermediate step) 3/n ( #3363 )
2020-09-05 17:01:46 -04:00
William Falcon
f55efb7616
ref: inner train loop (intermediate step) 1/n ( #3361 )
2020-09-05 10:10:49 -04:00
William Falcon
5a474c452c
ref: inner train loop (intermediate step) 1/n ( #3359 )
2020-09-05 08:55:22 -04:00
William Falcon
0a119403d6
ref: moved accelerator router ( #3309 )
...
* ref: moved accelerator
* ref: moved accelerator
* ref: moved accelerator
* ref: moved accelerator
2020-09-01 15:48:28 -04:00
William Falcon
b0298cead8
ref: move train outside of setup training ( #3297 )
...
* ref: move train outside of setup training
* ref: move train outside of setup training
* ref: move train outside of setup training
* ref: move train outside of setup training
2020-08-31 20:36:52 -04:00
William Falcon
bcd13f70b8
ref: run_pretrain_routine -> setup_training ( #3294 )
...
* ref: .tune()
* ref: run_pretrain_routine -> setup_training
2020-08-31 18:06:11 -04:00
Philipp Singer
0aee137ba7
DP device fix ( #3196 )
2020-08-27 09:01:29 -04:00
William Falcon
4272360076
ddp backend refactor ( #3210 )
2020-08-26 21:02:15 -04:00
William Falcon
3a26b4ff5c
ddp backend refactor ( #3209 )
2020-08-26 20:31:09 -04:00
William Falcon
6bae404bed
ref: ddp backend refactor (3) ( #3208 )
...
* ddp backend refactor
* ddp backend refactor
2020-08-26 20:03:09 -04:00
William Falcon
a8daf914f8
ddp backend refactor ( #3207 )
2020-08-26 19:10:24 -04:00
William Falcon
ff3c2f4cff
ddp backend refactor ( #3204 )
2020-08-26 18:43:28 -04:00
William Falcon
f3384d0cbb
ref: ddps train hooks ( #3203 )
...
* ddps train
* ddps train
2020-08-26 15:37:40 -04:00
William Falcon
ef07b0c4b3
accelerator fit 1 ( #3200 )
2020-08-26 14:20:38 -04:00
William Falcon
f064d74be8
refactored dataloader process hook ( #3139 )
2020-08-24 21:53:56 -04:00
William Falcon
82d1128966
eval step scaling factor ( #3136 )
2020-08-24 20:26:39 -04:00
William Falcon
6c3cec3a3c
training amp scaling refactor ( #3135 )
2020-08-24 19:59:46 -04:00
William Falcon
0b3cb3c955
ref: moved ___step_end hooks ( #3130 )
...
* moved eval hooks
2020-08-24 17:50:47 -04:00
William Falcon
6068b29d29
ref: remove obscure forward call in eval + CPU backend ___step ( #3123 )
...
* remove obscure forward call in eval
2020-08-24 12:31:40 -04:00
William Falcon
18160b81b5
refactored horovod backend ( #3122 )
2020-08-24 11:13:49 -04:00
William Falcon
8ebf4fe173
ref: refactored horovod backend ( #3121 )
...
* refactored horovod backend
* refactored horovod backend
2020-08-24 10:35:32 -04:00
William Falcon
8d7ca5cd2c
ref: refactored gpu backend __step ( #3120 )
...
* refactored gpu backend __step
* refactored gpu backend __step
* refactored gpu backend __step
* refactored gpu backend __step
2020-08-24 09:22:05 -04:00
William Falcon
527b9dca36
refactored ddp backend forward ( #3119 )
2020-08-24 07:33:14 -04:00
William Falcon
3c88b0dd83
Refactor 1: moved tpu xxx_step to backend ( #3118 )
...
* moved tpu training_step
* refactored eval step
* refactored eval step
* refactored eval step
2020-08-24 07:02:06 -04:00
Ananya Harsh Jha
9445c800b0
set device to root gpu ( #3042 )
2020-08-18 19:28:35 -04:00
Adrian Wälchli
188e06c261
ddp fix for trainer.test() + add basic ddp tests ( #2997 )
...
* add ddp script variations
* add ddp test
* rename
* shell
* test
* test
* try call
* try without subprocess
* test
* display the error
* list all variations
* try string
* try copy env
* debug
* pythonpath
* path
* update test
* change
* simple ddp test
* replace
* remove random port
* random port
* str
* clean up
* check run spawn
* clean up
* docs
* docs
* update test
* docs
* changelog
* changelog
2020-08-16 11:19:57 -04:00
William Falcon
e7794eb79a
Fixes #2407 ( #2981 )
...
* fix gpus index error
2020-08-14 16:22:48 -04:00
Jirka Borovec
5bce06c050
nb. devices ( #2973 )
2020-08-14 11:37:21 +02:00
William Falcon
0c264689cb
Fixes #2942 ( #2969 )
...
* Fixes #2942
* doc fix
2020-08-13 21:54:57 -04:00
Jirka Borovec
4354690e55
add apex test ( #2921 )
...
* add apex test
* rename
* level
* events
* wrap
* evt
* miss
* apex
* Update tests/models/test_amp.py
Co-authored-by: William Falcon <waf2107@columbia.edu>
* notes
* notes
Co-authored-by: William Falcon <waf2107@columbia.edu>
2020-08-13 10:03:13 -04:00
Phil
e3528afae3
Move optimizer creation after device placement for ddp backends. ( #2904 )
2020-08-12 06:34:59 -04:00
Jirka Borovec
a6e7aa7796
allow using apex with any PT version ( #2865 )
...
* wip
* setup
* type
* name
* wip
* docs
* imports
* fix if
* fix if
* use_amp
* Apply suggestions from code review
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* Apply suggestions from code review
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* fix tests
* Apply suggestions from code review
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* fix tests
* todos
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2020-08-08 11:07:32 +02:00
Jirka Borovec
b7d72706c3
clean imports ( #2867 )
...
* clean imports
* miss
2020-08-08 00:33:51 +02:00
Jirka Borovec
f8c058215f
simplify tests & cleaning ( #2588 )
...
* simplify
* tmpdir
* revert
* clean
* accel
* types
* test
* edit test acc
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* Update test acc
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2020-08-07 23:22:05 +02:00
William Falcon
4dbd761a1c
refactor 3/n ( #2709 )
...
* refactor into gpu accelerator
2020-07-25 20:56:50 -04:00
William Falcon
b34217e410
Refactor 2/n ( #2708 )
...
* refactor into gpu accelerator
2020-07-25 17:31:34 -04:00
William Falcon
071e09fe38
refactor 1/n for v1.0.0 ( #2704 )
...
* refactor into gpu accelerator
2020-07-25 14:38:51 -04:00