chaton
f2e99d617f
deprecate enable_pl_optimizer as it is not restored properly ( #5244 )
...
* update
* clean test
* still in progress
* update test
* update
* update
* resolve flake
* add test for zero_grad
* update
* works without accumulated_grad
* update
* update
* resolve amp
* revert back to True
* update
* clean tests
* cleaned out
* typo
* update test
* git repair bug
* remove print
* update
* Fix formatting/optimizer imports
* Refactor the test for cleanliness
* Add vanilla model to the test, better var names
* Fixed var names, let's clean up these mock tests
* repair test
* update test
* resolve flake8
* add manual_optimization
* update tests
* resolve flake8
* add random accumulate_grad_batches
* improve test
* Update tests/trainer/optimization/test_parity_automatic_optimization.py
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
* Update tests/trainer/optimization/test_parity_automatic_optimization.py
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
* update
* clean tests
* correct bug
* Apply suggestions from code review
* format
* address comments
* update on comments
* wip
* typo
* deprecate enable_pl_optimizer
* resolve latest bugs
* update
* resolve merge
* add comment
* Update pytorch_lightning/core/lightning.py
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
* Update tests/deprecated_api/test_remove_1-3.py
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
* Update pytorch_lightning/trainer/connectors/optimizer_connector.py
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
* Update pytorch_lightning/trainer/trainer.py
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
* Update pytorch_lightning/trainer/trainer.py
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
* Update tests/trainer/optimization/test_parity_automatic_optimization.py
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
* update on comments
* update restore
* add a property
* remove setstate as not needed anymore
* update test
* provide optimizer to on_before_zero_grad
* update on comments
* update on comments
* Update pytorch_lightning/trainer/trainer.py
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* Update tests/trainer/optimization/test_parity_automatic_optimization.py
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* Update tests/trainer/optimization/test_parity_automatic_optimization.py
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* Update tests/trainer/optimization/test_parity_automatic_optimization.py
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* modify import
* update changelog
* resolve flake8
* update
* update
* clean doc
Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-62-109.ec2.internal>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
2021-01-08 16:13:12 -05:00
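For context, the change in #5244 amounts to warning when the flag is still passed and no longer persisting it, since restoring it was unreliable. A minimal sketch of that deprecation pattern (the helper name is illustrative, not the actual Lightning code):

```python
import warnings


def _handle_enable_pl_optimizer(enable_pl_optimizer=None):
    """Sketch: warn when the deprecated flag is passed explicitly; ignore it on restore."""
    if enable_pl_optimizer is not None:
        warnings.warn(
            "`enable_pl_optimizer` is deprecated; optimizers are wrapped automatically "
            "and the flag is no longer restored properly from saved state.",
            DeprecationWarning,
        )
```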
chaton
d510707bc9
[bug-fix] Call transfer_batch_to_device in DDPPlugin ( #5195 )
...
* hacking out
* update
* remove useless on_before_forward
* update
* remove overridden
* remove os
* use on_before_forward
* resolve flake8
* add test
* update
* add single_process_per_device
* resolve flake8
* update
* resolve
* update
* update
* update
* add comment
* resolve bug with sharded
* update
* remove property
* update
* resolve test
* resolve bug
* update on comments
* update doc
* Update pytorch_lightning/core/hooks.py
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
* update on comments
* Update pytorch_lightning/plugins/ddp_plugin.py
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
* Update pytorch_lightning/plugins/ddp_plugin.py
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
* resolve pep8
* add device_ids to pipe
* update on comments
* update
* resolve
* update
* update
* update
Co-authored-by: Ubuntu <ubuntu@ip-172-31-62-109.ec2.internal>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
2021-01-08 12:05:22 -05:00
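The gist of #5195: the DDP plugin's `on_before_forward` hook is where the batch gets moved to the right device (via `transfer_batch_to_device`) before the wrapped forward runs. A rough sketch of that idea, with an illustrative plugin class and hook signature:

```python
import torch


class SketchDDPPlugin:
    """Illustrative only: move the batch onto the module's device before forward."""

    def on_before_forward(self, model, batch):
        device = next(model.parameters()).device
        # LightningModule.transfer_batch_to_device knows how to walk nested containers;
        # a plain-tensor fallback keeps this sketch self-contained.
        if hasattr(model, "transfer_batch_to_device"):
            return model.transfer_batch_to_device(batch, device)
        return batch.to(device) if torch.is_tensor(batch) else batch
```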
ananthsub
94838d3be2
Fix hang in DDP HPC accelerators ( #5157 )
...
* Fix hang in DDP HPC accelerators
init_device was never called
* Update CHANGELOG.md
2020-12-16 21:07:11 +01:00
chaton
2c3d43dcb5
Initialize trainer with None in DDPAccelerator ( #4915 )
...
* Initialize trainer with None
* add typing to all accelerators
* resolve imports
* update
* add typing
* removed typo
* update
* Fix formatting and imports in accelerator
Co-authored-by: maxjeblick <maxjeblick@users.noreply.github.com>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Roger Shieh <sh.rog@protonmail.ch>
2020-12-10 15:24:44 +01:00
Jirka Borovec
d5fa02e798
simplify accelerator steps ( #5015 )
...
* simplify accelerator steps
* Apply suggestions from code review
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
2020-12-10 18:36:13 +05:30
Jirka Borovec
cdbddbe99f
release 1.1.0 ( #5048 )
...
* release 1.1.0
* pep8
2020-12-10 00:52:39 +00:00
Jirka Borovec
ce9179591d
ref: clean config [1/n] add intermediate setters ( #4990 )
...
* add intermediate setters
* show inputs
* fix options
* move
* fix
* less talk
* fix
* talk less
* str
* cases
* rename
Co-authored-by: chaton <thomas@grid.ai>
2020-12-09 14:13:57 -05:00
Rohit Gupta
bcbba3b702
Simplify GPU and TPU accelerator ( #5024 )
2020-12-09 14:12:44 -05:00
Jirka Borovec
53d7c9555c
drop usage of deprecated distributed_backend ( #5009 )
...
Co-authored-by: chaton <thomas@grid.ai>
Co-authored-by: Roger Shieh <sh.rog@protonmail.ch>
2020-12-09 09:18:23 +01:00
Ananya Harsh Jha
127454ade2
All gather with grads ( #5012 )
...
* all_gather
* ddp
* horovod
* grad tests
* fixed ddp
* ddp fixed, removed tpu, horovod for now
* changelog
* windows fix
* windows fix
* removed batch from ctx
* all_gather
* ddp
* horovod
* grad tests
* fixed ddp
* ddp fixed, removed tpu, horovod for now
* changelog
* windows fix
* windows fix
* removed batch from ctx
* removed code duplication
* merge
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-12-08 23:20:01 +00:00
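The core trick behind #5012 is a custom autograd `Function`: all-gather in the forward pass, then all-reduce the incoming gradient and hand each rank back its own slice in the backward pass. A self-contained sketch of that pattern (not the exact Lightning implementation; requires an initialized process group):

```python
import torch
import torch.distributed as dist


class AllGatherWithGrad(torch.autograd.Function):
    """All-gather that keeps the autograd graph intact across ranks."""

    @staticmethod
    def forward(ctx, tensor):
        gathered = [torch.zeros_like(tensor) for _ in range(dist.get_world_size())]
        dist.all_gather(gathered, tensor)
        return torch.stack(gathered)  # shape: (world_size, *tensor.shape)

    @staticmethod
    def backward(ctx, grad_output):
        grad_output = grad_output.contiguous()
        # Sum the gradient contributions from every rank, then return the slice
        # corresponding to this rank's original input.
        dist.all_reduce(grad_output, op=dist.ReduceOp.SUM)
        return grad_output[dist.get_rank()]


def all_gather_with_grad(tensor):
    return AllGatherWithGrad.apply(tensor)
```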
Sean Naren
ee9b3fe574
[feat] pp 1/n ( #5016 )
...
* Added changes for RPC plugin
* Add missing kwargs
* Fix code format
* Loading refactors by introducing is_distributed var, fix optimizer step flow
* Add rpc guard
* Added docstrings and typing
* resolve comments
* Add additional rpc hook, refactor name of exit process hook for clarity
* remove annotation
* Modify behaviour to allow optional return, add test for rpc plugin
* resolve tests
* rename is_ddp_based
* update
* update for windows
* update
* resolve test
* code smell
* Revert back to init_ddp_connection for backwards compat
* Swap to explicit name for property
* Add missing speed parity increase for CI variability, fix call counts for child process
Co-authored-by: tchaton <thomas@grid.ai>
2020-12-08 22:02:10 +00:00
maxjeblick
79ae66d026
Initialize trainer with None ( #4847 )
...
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: chaton <thomas@grid.ai>
Co-authored-by: edenlightning <66261195+edenlightning@users.noreply.github.com>
2020-12-08 22:49:55 +05:30
chaton
2393474350
[hotfix] ddp + manual_optimisation ( #4976 )
...
* Rely on ddp plugin for blocking sync behaviour, and skip if we're using manual optimization
* debug
* Revert "debug"
This reverts commit ccca6b6b
* Expose manual reduce for automatic optimization
* Add input arguments
* Enable parity test
* clean imports
* Expose hook after to ensure we reset
* Fix naming
* add
* fix test
* resolve on comments
* typo
* Update tests/trainer/optimization/test_manual_optimization.py
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
* Update tests/trainer/optimization/test_manual_optimization.py
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
* update on comments
* resolve comments
Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-12-07 19:31:54 +00:00
chaton
02152c1729
Simplify optimization Logic ( #4984 )
...
* Rely on ddp plugin for blocking sync behaviour, and skip if we're using manual optimization
* debug
* Revert "debug"
This reverts commit ccca6b6b
* Expose manual reduce for automatic optimization
* Add input arguments
* Enable parity test
* clean imports
* Expose hook after to ensure we reset
* Fix naming
* add
* fix test
* uniformize optimizer logic
* resolve test
* resolve flake8
* resolve amp bug
* update tests
* remove bug
* remove optimizer_step in accelerators
* typo
* update lightning optimizer
* set doesn't work with ddp_spawn
* resolve flake8
* update threshold
* ignore pyright
* correct codeFactor
* remove useless if
* remove zero_grad function
* simplify step
* remove typo
* resolve bug
* Apply suggestions from code review
* update on comments
* resolve bugs
* remove tests
* Update pytorch_lightning/trainer/configuration_validator.py
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
* simplify testing
* add more tests
Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
2020-12-07 12:55:49 +00:00
Jirka Borovec
3976db597d
refactor imports of optional dependencies ( #4859 )
...
* refactor imports of optional dependencies
* fix
* fix
* fix
* fix
* fix
* flake8
* flake8
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: chaton <thomas@grid.ai>
2020-12-04 10:26:10 +01:00
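#4859 centralizes the "is this optional package installed?" checks instead of scattering try/except imports across modules. A minimal version of that guard pattern (the constant names are illustrative):

```python
import importlib.util

# One flag per optional dependency, evaluated once at import time.
_OMEGACONF_AVAILABLE = importlib.util.find_spec("omegaconf") is not None
_XLA_AVAILABLE = importlib.util.find_spec("torch_xla") is not None

if _OMEGACONF_AVAILABLE:
    from omegaconf import OmegaConf  # safe: the package is known to be present
```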
Lezwon Castelino
12cb9942a1
TPU save ( #4309 )
...
* convert xla tensor to cpu before save
* move_to_cpu
* updated CHANGELOG.md
* added on_save to accelerators
* if accelerator is not None
* refactors
* change filename to run test
* run test_tpu_backend
* added xla_device_utils to tests
* added xla_device_utils to test
* removed tests
* Revert "added xla_device_utils to test"
This reverts commit 0c9316bb
* fixed pep
* increase timeout and print traceback
* lazy check tpu exists
* increased timeout
removed barrier for tpu during test
reduced epochs
* fixed torch_xla imports
* fix tests
* define xla utils
* fix test
* aval
* chlog
* docs
* aval
* Apply suggestions from code review
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: chaton <thomas@grid.ai>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
2020-12-02 13:05:11 +00:00
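The fix in #4309 is essentially "bring XLA tensors back to CPU before `torch.save`". A rough, framework-agnostic sketch of that step (helper names are made up):

```python
import torch


def move_to_cpu(obj):
    """Recursively detach tensors (including XLA tensors) to CPU so they serialize cleanly."""
    if torch.is_tensor(obj):
        return obj.detach().cpu()
    if isinstance(obj, dict):
        return {k: move_to_cpu(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(move_to_cpu(v) for v in obj)
    return obj


def save_checkpoint(checkpoint, path):
    torch.save(move_to_cpu(checkpoint), path)
```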
chaton
c2e6e68c7e
optimizer clean up ( #4658 )
...
* add LightningOptimizer
* typo
* add mock closure
* typo
* remove logic in optimizer_step
* update
* update
* update
* deactivate LightningOptimizer for horovod
* resolve flake
* typo
* check optimizer name
* change name
* added backward to LightningOptimizer
* remove use_lightning_optimizer
* move update
* simplify init
* resolve comments
* resolve bug
* update
* update
* resolve bugs
* resolve flake8
* set state
* work manual_optimizer_step
* add doc
* add enable_pl_optimizer
* make optimizer_step
* add make_optimizer_step
* add examples
* resolve test
* add test_optimizer_return_options_enable_pl_optimizer
* add enable_pl_optimizer=True
* update
* update tests
* resolve bugs
* update
* set Trainer to False
* update
* resolve bugs
* update
* remove from doc
* resolve bug
* typo
* update
* set to True
* simplification
* typo
* resolve horovod
* unwrap horovod
* remove Optimizer
* resolve horovod
* move logic to amp_backend
* doesn't seem to be picklable
* update
* add again
* resolve some bugs
* cleanup
* resolve bug with AMP
* change __repr__
* round at -12
* update
* update
* update
* remove from horovod
* typo
* add convert_to_lightning_optimizers in each accelerator
* typo
* forgot
* forgot a convert_to_lightning_optimizers
* update
* update
* update
* increase coverage
* update
* resolve flake8
* update
* remove useless code
* resolve comments + add support for LightningOptimizer base class
* resolve flake
* check optimizer get wrapped back
* resolve DDPSharded
* reduce code
* lightningoptimizer
* Update pytorch_lightning/core/optimizer.py
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
* Update pytorch_lightning/core/lightning.py
* remove reference to step function
* Apply suggestions from code review
* update on comments
* resolve
* Update CHANGELOG.md
* add back training_step in apex and native_amp
* rename optimizer_step
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: William Falcon <waf2107@columbia.edu>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
2020-12-01 00:09:46 +00:00
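#4658 introduces `LightningOptimizer`, a thin wrapper around the user's optimizer so the trainer can route `step()` through its own closure/precision logic while everything else passes through untouched. A much-simplified sketch of such a delegating wrapper (not the real class):

```python
class WrappedOptimizer:
    """Delegates to the wrapped optimizer, but intercepts step()."""

    def __init__(self, optimizer, on_step=None):
        self._optimizer = optimizer
        self._on_step = on_step  # e.g. a trainer hook to run around each step

    def __getattr__(self, name):
        # zero_grad, param_groups, state_dict, ... fall through untouched.
        return getattr(self._optimizer, name)

    def step(self, closure=None):
        if self._on_step is not None:
            self._on_step()
        if closure is None:
            return self._optimizer.step()
        return self._optimizer.step(closure)
```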
Jirka Borovec
217650320e
simplify imports Omegaconf ( #4873 )
...
* hydra
* omegaconf
2020-11-27 01:00:56 +01:00
Jirka Borovec
442d57f1e9
simplify imports xla / TPU ( #4872 )
...
* xla
* tpu
* fix
* fix
* flake8
2020-11-27 00:37:48 +01:00
Sean Naren
404af43cde
5/n: Extract reference model call to plugins/accelerators ( #4773 )
...
* Encapsulate extracting reference model within the plugin to allow custom wrapper logic to live within the plugin/accelerators
* Add missing new lines
* Fix call to accelerator
* Removed double blank
* Use accelerator backend
* Handle case where wrapper has not been initialized within the plugin
* Added basic get model tests, add better typing
* Change model name
* Split GPU/DDP test
* Add stronger typing, skip ddp test on windows
* Fix import
* Fix import in dp
* Fixed PEP8 definition
* Add ddp launcher for ddp testing
* Modify accelerator reference model to property, change name to reflect func
* Revert property as this is incorrect.
* Revert across accelerators
* Modified name to get_model_from_plugin
* Code review changes, fix issue with dp
* Add verb to function getter
Co-authored-by: chaton <thomas@grid.ai>
2020-11-23 17:21:47 +00:00
ananthsub
45c57600af
Move init_ddp_connection to DDP Plugin ( #4407 )
...
* Move init_ddp_connection to DDP Plugin
* cluster-env
* trainer?
* imports
* Update ddp_plugin.py
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
2020-11-18 15:49:22 -05:00
Sean Naren
e7134a9135
Sharded Plugin 2/n: Allow ddp plugin to modify optimizer state saving ( #4675 )
...
* Allow ddp plugin to modify optimizer state saving
* Rely on the accelerator for optimizer states
* Ensure we init the accelerator for the saving function
* Better comment for optim state dump
* Revert "Ensure we init the accelerator for the saving function"
This reverts commit af65effa
* Added accelerator check to initialize tuner before saving model checkpoint
* Simplify comment
* Revert "Added accelerator check to initialize tuner before saving model checkpoint"
This reverts commit f9929c0c
* Return single optimizer state to reduce duplication
* Fixed docstring
* Fixed typing
* Fixed comment
* Added CHANGELOG.md
Co-authored-by: chaton <thomas@grid.ai>
2020-11-18 16:38:35 +00:00
Sean Naren
8283680aa0
Sharded Plugin 3/n: Expose step input to DDP plugin ( #4686 )
...
* Allow ddp plugin to move the input to a different device if needed
* Swapped name to on_before_forward to align with hooks in the future
* Update pytorch_lightning/plugins/ddp_plugin.py
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
* Pass variable arg type to hook, add example
* Remove blank space (pep check)
* Added blank line
Co-authored-by: William Falcon <waf2107@columbia.edu>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-11-18 15:45:30 +00:00
chaton
4018237c30
[FEAT] Add lambda closure to manual_optimizer_step ( #4618 )
...
* added lambda_closure
* move to types
* add 2 new tests
* make example more complex
* add complex example to doc
* added more tests
* resolve doc
* typo
* update
* update tpu optimizer_step
* Apply suggestions from code review
* Update pytorch_lightning/core/lightning.py
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
* update
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-11-12 19:22:06 +00:00
Sean Naren
bacabaebaf
Sharded Accelerator 1/n: Expose clip gradients to plugins via abstract class ( #4639 )
...
* Added abstract precision plugin to expose clip_gradients function, use within accelerator to clip gradients
* Exclude model from override, keep optimizer (needed for sharded clip gradients), add override for O2 support apex
* Fix doc
* Applied codereview changes
* Refactored clip function to encapsulate tpu changes with tpu accelerator. Default to standard clip function for vanilla torch
* Pass correct grad clip val
* Moved var to property
* Apply code review suggestions
2020-11-12 17:18:09 +00:00
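The precision-plugin abstraction in #4639 ultimately reaches the standard PyTorch clipping utility; what varies per backend is when it runs (after unscaling for AMP, via sharded-aware utilities for sharded optimizers). The vanilla path looks roughly like this:

```python
from torch.nn.utils import clip_grad_norm_


def clip_gradients(optimizer, clip_val: float):
    """Default (non-AMP, non-sharded) gradient clipping by global norm."""
    if clip_val <= 0:
        return
    parameters = [p for group in optimizer.param_groups for p in group["params"]]
    clip_grad_norm_(parameters, max_norm=clip_val)
```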
Sean Naren
33470ba605
Prevent crash if sync_dist=True on CPU ( #4626 )
...
* Added test/fix for sync_dist raising NotImplementedError
* Fixed comments/formatting
* Revert base class change, enforce sync tensors across accelerators, added GPU test
2020-11-11 22:04:05 +00:00
chaton
7e08b0d710
[bug-fix] DDP and automatic_optimization=False ( #4485 )
...
* resolve bug
* add self._running_manual_optim
* update
* update tests
* update lightning module
* resolve bug
* update tests
* update
* resolve pep8
* update
* replace by `ddp_spawn`
* temporary fix
* update
* update
* move update to training_loop
* make both ddp_spawn
* introduce `manual_optimizer_step`
* update changelog
* added changelog wrong place
* add force_optimizer_step
* update docstring for tests
* update optimizer_step
* update zero_grad
* resolve flake8
* move update into manual_optimizer_step
* add zero_grad
* remove zero_grad tests
* remove manual_backward in AMP, it doesn't help
* update
* loosen tests
* update
* update doc
* add TODO
* Removed unnecessary get model from native amp
* Remove try except with pytest raise
* Add seed, clean up imports, remove try catch to reproduce error
* update code
* update test
* revert back
* formatting
* Update pytorch_lightning/core/lightning.py
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-11-10 19:44:51 +00:00
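Around this release, turning off automatic optimization meant the LightningModule drove backward and stepping itself via `manual_backward` and the `manual_optimizer_step` hook introduced here. A rough sketch of the shape of such a `training_step` (hook signatures changed in later releases, so treat this strictly as illustrative):

```python
import pytorch_lightning as pl


class ManualOptimModule(pl.LightningModule):
    """Fragment only: assumes configure_optimizers() etc. are defined elsewhere."""

    def training_step(self, batch, batch_idx):
        opt = self.optimizers()                # single optimizer assumed
        loss = self.compute_loss(batch)        # user-defined helper, assumed here
        self.manual_backward(loss, opt)        # keeps AMP / DDP hooks in the loop
        self.manual_optimizer_step(opt)        # the hook added by this change
```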
William Falcon
ee35907170
Accelerator docs ( #4583 )
...
* accelerator docs
* accelerator docs
2020-11-08 17:24:41 -05:00
William Falcon
3ba48d3bc4
ref: unify slurm and TE under backendPlugin 5/n" ( #4582 )
...
* ref: unify slurm and TE under backendPlugin 4/n
* ref: unify slurm and TE under backendPlugin 5/n
2020-11-08 16:20:19 -05:00
William Falcon
624f5b5938
ref: unify slurm and TE under backendPlugin 3/n ( #4581 )
2020-11-08 15:32:37 -05:00
William Falcon
bfaf014096
ref: unify slurm and TE under backendPlugin 2/n ( #4580 )
2020-11-08 15:07:16 -05:00
William Falcon
0f64f15f52
ref: unify slurm and TE under backendPlugin 1/n ( #4578 )
...
* ref: unify slurm and TE under backendPlugin
* ref: unify slurm and TE under backendPlugin
2020-11-08 14:28:55 -05:00
cool425589
5e09fd31e9
show progressbar only on progress_rank 0 on ddp_slurm ( #4437 )
...
Co-authored-by: chaton <thomas@grid.ai>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2020-11-06 01:36:22 +01:00
Travis Addair
51cc7a89ee
Horovod: fixed early stopping and added metrics aggregation ( #3775 )
...
* Fixed early stopping for Horovod
* Refactored to sync_dist_if_available
* Bump min Horovod version to support hvd.is_initialized
* Changelog
* Added back change for Horovod
* Removed redundant checks for initialization
* Implement metrics gathering for Horovod
* Added test for EvalResult
* Renamed ddp_sync_on_step -> dist_sync_on_step
* Added metric test for Horovod
* Added option pass callable allgather function to metric base class
* Added dist_sync_fn
* Fixed calls to private _sync_dist
* Fixed Horovod test
* Added sync_tensor to the distributed backend
* Skip Windows
* Insert test path
* Removed redundant import
* Updated drone
* Unset HOROVOD_GPU_ALLREDUCE
* Unset
* No cache dir
* No uninstall
* Unset variables
* Uninstall Horovod during initialization
* Replaced more references to ddp_sync_on_step
* Fixed imports
* Fixed attribute
* Added back default
* Lint
* Added back docstring
* Made gather_all_tensors default
* Added whitespace
* Update tests/models/test_horovod.py
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
* Update pytorch_lightning/metrics/metric.py
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
* Update CHANGELOG.md
Co-authored-by: Teddy Koker <teddy.koker@gmail.com>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-11-05 12:52:02 -05:00
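#3775's metric aggregation boils down to reducing metric tensors across Horovod workers rather than a DDP process group. A tiny sketch of such a sync function (assumes Horovod is installed and, when distributed, initialized):

```python
import horovod.torch as hvd
import torch


def sync_tensor(tensor: torch.Tensor) -> torch.Tensor:
    """Average a metric tensor across all Horovod workers when running distributed."""
    if hvd.is_initialized() and hvd.size() > 1:
        return hvd.allreduce(tensor)  # averages across workers by default
    return tensor
```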
Ananya Harsh Jha
01ab2a933d
[bug] [docs] Clearer optimizer_step override instructions ( #4455 )
...
* fix
* flags
* remove defaults
2020-11-02 22:13:34 +00:00
chaton
102fa9ee7d
[BUGFIX] AMP + Precision unscale grad ( #4441 )
...
* move unscale within Native plugin
* remove gradient tracking from lightning backward
* forgot trainer.fit
* typo
* update
* cleanup
* set to 1.6
* typo
* skip if below 1.6 strict
* update changelog
* remove useless code
* Update tests/plugins/test_amp_plugin.py
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
* Update tests/plugins/test_amp_plugin.py
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
* update changelog
* Update CHANGELOG.md
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: Jeff Yang <ydcjeff@outlook.com>
2020-11-02 16:36:48 +00:00
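The AMP fix in #4441 hinges on the standard `torch.cuda.amp` recipe: unscale the gradients inside the native plugin before anything (such as clipping) inspects them, then let the scaler step and update. In plain PyTorch the sequence is:

```python
import torch

scaler = torch.cuda.amp.GradScaler()


def amp_step(model, optimizer, loss_fn, batch, clip_val=1.0):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = loss_fn(model, batch)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)                                    # gradients are now in real units
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip_val)  # safe to clip after unscaling
    scaler.step(optimizer)                                        # skips the step if inf/nan was found
    scaler.update()
```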
Adrian Wälchli
28d45a26a3
Set correct device ids in DDP [wip] ( #4297 )
...
* repro
debug
c
d
dd
d
d
d
ads
d
d
d
f
rank
f
v
d
d
d
d
d
d
d
d
d
d
d
set
drop PL_DDP_PID
clean up
keep set gpus
revert
Revert "drop PL_DDP_PID"
This reverts commit 7d88cae469541ef19128f9c20919fd3a6f863039.
d
pid
gpus
clean up
clean up
misconfig?
misconfig
clean
clean
* fix pep
* changelog
* remove script
Co-authored-by: chaton <thomas@grid.ai>
Co-authored-by: William Falcon <waf2107@columbia.edu>
2020-10-24 17:33:47 -04:00
Sean Naren
5641b266d5
Bug/4319 ddp checkpoint ( #4323 )
...
* Broadcast best model path to ensure we sync with main process + wait for main process to save
* Add barrier call to ensure all processes are in sync
* Added changelog commit
* Move sync of best model path/score to model checkpoint, keep barrier to ensure all processes complete
* Ensure we broadcast as tuple
* Add init check
* Update pytorch_lightning/callbacks/model_checkpoint.py
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
* Update pytorch_lightning/callbacks/model_checkpoint.py
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
* Removed model checkpoint code, added barrier to trainer to enforce we synchronize and wait for all processes to finish before completing training
* Add barrier within teardown call, removed horovod teardown to inherit from base accelerator
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
2020-10-24 16:55:49 -04:00
William Falcon
753362d0a4
enable ddp as a plugin ( #4285 )
...
* enable custom ddp plugin
* enable custom ddp plugin
* enable custom ddp plugin
* enable custom ddp plugin
* enable custom ddp plugin
* enable custom ddp plugin
* enable custom ddp plugin
* enable custom ddp plugin
* enable custom ddp plugin
* enable custom ddp plugin
* enable custom ddp plugin
Co-authored-by: chaton <thomas@grid.ai>
2020-10-22 05:15:51 -04:00
Justus Schock
0ec4107697
Optimizer closure ( #4190 )
...
* closure for all optimizers
* rename hook and take care of alternating backwards
* add comment
* training_loop_fix
* closure whenever possible
* training_loop
* simple tests that count backward calls
* fix test to work with closure
* remove debugging statement
* better place
* check grads after backward
* start fixing manual optimization
* skip step when result returned by closure was None
* fix gradient clipping test to work with closure
* attribute dict result only for automatic optimization
* adjust backward calls in accelerator
* adjust where to call gradient clipping
* adjust backward calls in tests
* Apply suggestions from code review
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* pass kwargs to xla optimizer
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2020-10-21 19:34:29 +01:00
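The refactor in #4190 has the training loop hand `optimizer.step()` a closure, the form optimizers such as LBFGS already require: the closure recomputes the loss and runs backward so the optimizer can re-evaluate it when needed. The plain PyTorch shape of that call:

```python
def training_step_with_closure(model, optimizer, loss_fn, batch):
    def closure():
        optimizer.zero_grad()
        loss = loss_fn(model(batch["x"]), batch["y"])
        loss.backward()
        return loss

    # Optimizers that need to re-evaluate the loss (e.g. LBFGS) call the closure
    # themselves, possibly several times; plain optimizers call it exactly once.
    return optimizer.step(closure)
```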
Akihiro Nitta
d27ee8b5bf
docs: Add empty lines in docstring [ci skip] ( #4232 )
...
* Add empty lines in docstring for proper docs
* Remove Returns:
* Remove unnecessary Returns:
* Update pytorch_lightning/accelerators/ddp2_accelerator.py
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
* fix returns
Co-authored-by: William Falcon <waf2107@columbia.edu>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
2020-10-21 09:00:39 -04:00
Jirka Borovec
f37444fa3e
CI: add flake8 ( #4239 )
2020-10-19 21:20:17 +01:00
Akihiro Nitta
b45b57cc58
Use `Optional` for arguments set to `None` by default ( #4164 )
...
* Use `Optional` for variables set to `None` by default
* Use `Optional` instead of `Union[None, ...]` for consistency
2020-10-15 23:02:50 +02:00
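The convention adopted in #4164, in one line: an argument whose default is `None` is annotated `Optional[...]` rather than the bare type or `Union[None, ...]`. For example:

```python
from typing import Optional


def fit(max_epochs: Optional[int] = None) -> None:
    """`Optional[int]` makes the `None` default explicit in the signature."""
    max_epochs = 1000 if max_epochs is None else max_epochs
```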
Sean Naren
98eb736496
Added getstate/setstate method for torch.save serialization ( #4127 )
...
* Added getstate/setstate method for torch.save serialization, added additional Optional Typing to results object
* Added tests to ensure torch.save does not fail
* Added flags to ensure compatible ddp cpu environment
* Removed torch version check due to minimum already being 1.3, reduced epochs for speed
* Moved tests to separate file
* Update to accelerator, move to ddp_spawn to prevent hanging ddp
2020-10-13 16:47:23 -04:00
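#4127 makes result objects survive `torch.save` via the standard `__getstate__`/`__setstate__` pickle protocol, dropping state that should not be serialized. A generic sketch of that pattern (the attribute names are hypothetical):

```python
import torch


class StepResult:
    """Sketch: strip runtime-only references before torch.save, restore placeholders after load."""

    def __init__(self, loss):
        self.loss = loss
        self._trainer = None  # runtime back-reference; not worth serializing

    def __getstate__(self):
        state = self.__dict__.copy()
        state.pop("_trainer", None)
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._trainer = None


torch.save(StepResult(torch.tensor(0.5)), "result.pt")
```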
William Falcon
09c2020a93
notices ( #4118 )
2020-10-13 07:18:07 -04:00
William Falcon
4c4b090c66
depre ( #4088 )
2020-10-12 05:58:31 -04:00
William Falcon
b9f2682b7d
clean docs, enable grad clip in manual mode ( #4078 )
...
* docs
* docs
2020-10-11 13:12:35 -04:00
William Falcon
7ffe05a3d1
ref: accelerator names ( #4066 )
...
* ref: accelerator names
* docs
2020-10-11 01:05:14 -04:00
William Falcon
a4b9221fc5
ref: decouple apex second attempt part n/n ( #4065 )
...
* ref: decouple apex second attempt part n/n
* ref: decouple apex second attempt part n/n
2020-10-10 22:04:50 -04:00
William Falcon
0281b077d8
ref: decouple apex second attempt part 10/n ( #4064 )
...
* ref: decouple apex second attempt part 9/n
* ref: decouple apex second attempt part 9/n
* ref: decouple apex second attempt part 9/n
2020-10-10 20:05:05 -04:00