Carlos Mocholí
ed13040729
Connect the model to the training type plugin at the start of run ( #8536 )
2021-08-04 17:43:34 +02:00
Caleb Robinson
9ca02f58ae
Fix an import deprecation warning ( #8687 )
...
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-08-03 22:17:28 +00:00
Jirka Borovec
f67892ea96
CI: yesqa ( #8564 )
...
* add yesqa
* fix flake8
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-08-02 16:05:56 +00:00
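For context, yesqa is a pre-commit hook that deletes `# noqa` comments that no longer suppress a live flake8 warning. A toy illustration (not code from the PR):

```python
import os  # noqa: F401

# The import is actually used, so F401 (unused import) never fires here;
# running yesqa would strip the now-pointless `# noqa: F401` comment.
print(os.getcwd())
```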
Sean Naren
07b7dc9c17
[Fix] Add delay property for checkpointing, refactor loading checkpoint (DeepSpeed Checkpointing Fix 1/n) ( #8627 )
...
* Add property to delay checkpointing, move loading checkpoint file into the run function to allow deepspeed engine to be loaded
* Add a small test
* Apply suggestions from code review
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* Update pytorch_lightning/accelerators/accelerator.py
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* Address review
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-07-30 11:31:08 +01:00
Santiago Castro
b256d6acd3
Avoid unnecessary list creation ( #8595 )
2021-07-28 13:36:45 +05:30
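The idiom behind #8595, shown on a toy example rather than the touched call sites: passing a generator expression to `any()` avoids building an intermediate list and lets the call short-circuit.

```python
values = range(1_000_000)

with_list = any([v > 10 for v in values])  # materialises the whole list first
lazy = any(v > 10 for v in values)         # stops at the first match
assert with_list == lazy
```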
Carlos Mocholí
a64cc37394
Replace `yapf` with `black` ( #7783 )
...
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-07-26 13:37:35 +02:00
thomas chaton
c9af1a7aec
[bugfix] Reduce memory leaks ( #8490 )
...
* reduce memory leak
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* update changelog
* Apply suggestions from code review
Co-authored-by: Ethan Harris <ewah1g13@soton.ac.uk>
* resolve flake8
* update on comments
* resolve bug
* update
* Undo whitespace changes
* remove bug
* resolve flake8
* revert change
* update on comments
* delete the ddp wrapper as it holds memory
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* resolve flake8
* update on comments
* update changelog
* resolve test
* Update CHANGELOG
* Refactor teardown
* Fix comment
* Do it for non-gpu too
* remove ref when the model is not a lightning_module
* Fix import error
* move down
* resolve bug
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* resolve assignment
* update
* move above
* Fix device calls to support tpu training
* Update todo
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Ethan Harris <ewah1g13@soton.ac.uk>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: Kaushik B <kaushikbokka@gmail.com>
2021-07-21 11:37:05 +02:00
Carlos Mocholí
6ce77a102b
Set minimum PyTorch version to 1.6 ( #8288 )
...
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
2021-07-13 17:12:49 +00:00
Carlos Mocholí
c5a120ed9d
Update to Mypy>0.9 ( #8386 )
2021-07-13 08:23:36 +02:00
Carlos Mocholí
eb6d991218
Refactor plugins backward ( #8328 )
2021-07-08 16:02:09 +02:00
Adrian Wälchli
d73c32ab51
move `torch.cuda.set_device()` to enable collective calls earlier in setup ( #8312 )
2021-07-07 13:15:41 +02:00
Adrian Wälchli
ea5cfd2005
move batch to device before sending it to hooks ( #7378 )
...
* update train step
* test
* x
* limits
* val
* typo
* x
* x
* step
* min gpus
* run all loops
* x
* limit test
* profiler
* clean up accelerator code
* move files
* rename
* move tests
* changelog
* reorder callbacks and model hooks
* add test description
* replace unnecessary method
* fix chlog
* adjust batch_to_device for DP Plugin
* update tests for dataloader idx
* unused imports
* hook change
* switch None
* clear memory
* change to None
* None
* None
* memory savings
* remove redundant todo
* hack
* cheat
* Revert "cheat"
This reverts commit a8433bd0b4
.
* Revert "hack"
This reverts commit 43a6d1edeb
.
* update new epoch loop
* remove from old loop code
* update chlog
* update hook test
* changelog
* teardown
* integrate changes in new eval loop
* fix hook calls
* add prediction step
* bad merge
* Revert "bad merge"
This reverts commit 488080863c.
* fix train batch hook test
* rm -rf _notebooks
* update chlog
* release memory
* fix type
* notebooks mess
* debug
* Revert "debug"
This reverts commit eec4ee2f77.
* teardown
* fix teardown bug
* debug
* x
* debug
* Revert "debug"
This reverts commit a6e6101946.
Revert "debug"
This reverts commit 5ddeaec069.
debug
debug
Revert "debug"
This reverts commit 605be746f7daedf265b2c05a1c153ce543394435.
Revert "Revert "debug""
This reverts commit a7612d5410409ed886cfb609457349ecf44cbfa8.
debug
x
x
x
s
tol
x
tol
* Fix changelog
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-07-05 09:31:39 +01:00
Carlos Mocholí
74eb6cc7e9
Clean `cuda.empty_cache` usage ( #8199 )
2021-06-30 13:04:24 +02:00
deepsource-autofix[bot]
03154eb30a
Refactor unnecessary `else` / `elif` when `if` block has a `return` statement ( #8156 )
...
Co-authored-by: deepsource-autofix[bot] <62050782+deepsource-autofix[bot]@users.noreply.github.com>
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
2021-06-28 15:27:41 +05:30
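The shape of the refactor in #8156, on a hypothetical function: once the `if` branch returns, the trailing `else` is redundant and its body can be dedented.

```python
def describe(x: int) -> str:
    if x < 0:
        return "negative"
    else:  # redundant: the branch above has already returned
        return "non-negative"


def describe_refactored(x: int) -> str:
    if x < 0:
        return "negative"
    return "non-negative"  # identical control flow, one less indent level
```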
Carlos Mocholí
4d9b72b8a9
Nuke RPC ( #8101 )
2021-06-23 18:31:13 +00:00
Sean Naren
96433d03ea
IPU Integration 5/5 ( #7867 )
...
* Initial changes
* Add broken example for now
* Fix reference
* Fix format
* Code runs
* Fixes
* Clear up files
* Add tests, helpers, fixes
* Small cleanups
* Refactors based on review
* Swap to special tests
* Add special tests
* Add source
* Cleanups
* Add logic to attach/detach model from devices
* Fixes for tests
* Fixes for tests
* Move earlier
* Cleanups
* Add check for nvcc
* Add tests, cleanups
* Fix errors
* fix
* Try condition
* Add missing annotation
* Clearer
* Clearer message
* Fix variable
* Cleanups
* Add comment
* CHANGELOG.md
* Add simple selection test
* Remove special=True to see what happens
* Fix test
* Update tests/accelerators/test_ipu.py
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
* Convert ipu_cores -> ipus
* Add typing, fail earlier
* simplify precision
* Add test, add helper
* fix accum
* Update pytorch_lightning/plugins/training_type/ipu.py
Co-authored-by: thomas chaton <thomas@grid.ai>
* Use stages
* Make sure warning message returned
* throw error
* Add more tests, use fs
* add comment
* Clean
* Address feedback, add IPU tests
* Fixes
* Fix signature
* Add types
* Remove autoround
* Add docstring
* ipu_cores -> ipus
* Add test, remove unnecessary precision set
* Add optimizer test
* Add precision back with test
* Address code review
* Change to probs
* Move some of the asserts earlier
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
Co-authored-by: thomas chaton <thomas@grid.ai>
2021-06-11 15:07:04 +00:00
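A sketch of how the finished integration is selected, assuming a Graphcore machine with poptorch installed (the `ipus` Trainer argument is the `ipu_cores` rename mentioned in the commit list above):

```python
import pytorch_lightning as pl

# Request 8 IPU devices; the IPU accelerator and plugin are selected from this.
trainer = pl.Trainer(ipus=8)
```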
Sean Naren
41be61c6f2
[IPU] Add hooks for IPU lifecycle 4/5 ( #7864 )
2021-06-07 12:06:41 +00:00
Sean Naren
6388c29e87
[IPU] Add reset dataloader hooks to training type plugin 3/n ( #7861 )
...
* Add hooks
* Add tests for hooks
* Add changelog
* Test changes, add typing
2021-06-07 10:37:09 +00:00
shuyingsunshine21
2242423b75
refactor accelerator teardown -> training type plugin teardown ( #7579 )
2021-05-22 13:19:24 -07:00
Rohit Gupta
7ca41734da
Add `dataloader_idx` to batch transfer hooks ( #6241 )
...
* replace with kwargs
* chlog
* fix
* add test
* fix
* device
* deepspeed
* pep
* optional
* docs
* bc
* comments
* pep
* mypy
* pep
* Apply suggestions from code review
* kwargs
* docs
* .
* .
* 1.3 -> 1.4
* kwargs -> step_kwargs
2021-05-13 23:03:55 +05:30
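After #6241, the batch-transfer hooks receive the index of the dataloader that produced the batch. A minimal sketch of an override (the class name is hypothetical):

```python
import pytorch_lightning as pl

class MultiLoaderModel(pl.LightningModule):
    def transfer_batch_to_device(self, batch, device, dataloader_idx):
        # With several val/test dataloaders, the index identifies the source
        # loader; e.g. keep batches from loader 1 on the CPU.
        if dataloader_idx == 1:
            return batch
        return super().transfer_batch_to_device(batch, device, dataloader_idx)
```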
shuyingsunshine21
8538c1f61e
Accelerator model state dict ( #7474 )
...
* Fix some test errors
Summary:
Test Plan:
Reviewers:
Subscribers:
Tasks:
Tags:
* checkpoint consolidation
* Update ddp_spawn.py
* Update test_metric_result_integration.py
* Update test_results.py
* Update utils.py
* Update utils.py
* Update test_all_gather_grad.py
* Update test_all_gather_grad.py
* Update test_results.py
* Revert "Update test_results.py"
This reverts commit 9d4a2b891d
.
* Revert "Merge pull request #1 from shuyingsunshine21/shuyingsunshine21-checkpoint_consolidate"
This reverts commit c5053da789
, reversing
changes made to 0d23d75bc9
.
* Revert "Update test_all_gather_grad.py"
This reverts commit 0d23d75bc9
.
* Revert "Update utils.py"
This reverts commit 70fe5da9c6
.
* Revert "Update utils.py"
This reverts commit a9aae99f6e
.
* Revert "Update test_results.py"
This reverts commit ea74906878
.
* Revert "Update test_metric_result_integration.py"
This reverts commit bf70e431b3
.
* Revert "Update ddp_spawn.py"
This reverts commit f17210183b
.
* Revert "checkpoint consolidation"
This reverts commit 536c1323b0
.
* Revert "Revert "checkpoint consolidation""
This reverts commit 3a9fde915a
.
* Revert "Revert "Revert "checkpoint consolidation"""
This reverts commit 7a369f47e1
.
* Revert "Revert "Update ddp_spawn.py""
This reverts commit 8222dc98ea
.
* Revert "Revert "Update test_metric_result_integration.py""
This reverts commit 6c095b2370
.
* Revert "Revert "Update test_results.py""
This reverts commit 250d0aaaa2
.
* Revert "Revert "Update utils.py""
This reverts commit 8651d54d79
.
* Revert "Revert "Update test_all_gather_grad.py""
This reverts commit dcdcd29731
.
* modify distributed environment to make test pass
* modify model state dict to training type plugin
* remove changes
* add changelog
* fixing isort for pre-commit failure
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Address code review
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: SeanNaren <sean@grid.ai>
2021-05-11 16:39:04 +01:00
Adrian Wälchli
6bc616d78f
fix display bug ( #7395 )
2021-05-10 11:26:15 +08:00
Carlos Mocholí
8208c330eb
Use `torch.nn.utils.clip_grad_norm_` and add `clip_grad_by_value` support for TPU ( #7025 )
...
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
2021-05-07 16:41:39 +00:00
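For reference, the two stock PyTorch utilities involved: the PR swaps Lightning's hand-rolled norm clipping for `clip_grad_norm_`, and `clip_grad_by_value` maps onto element-wise clipping. A toy example:

```python
import torch
from torch import nn

model = nn.Linear(4, 2)
model(torch.randn(8, 4)).sum().backward()

# Clip by total gradient norm (the utility the PR switches to):
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Clip each gradient element to [-0.5, 0.5] (value-based clipping):
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)
```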
Jirka Borovec
b181b8c646
release 1.3.0 ( #7404 )
...
* v1.3.0
* ci event
* chlog
* badge
* formatting
2021-05-06 15:05:35 -04:00
Martin Kristiansen
c3fc0313ef
Updating docs and error message: half precision not available on CPU ( #7384 )
...
* Updating docs and error message to specify that half precision is not available on CPU
* update messages
Co-authored-by: Martin Kristiansen <martinkristiansen@sixgill.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: jirka <jirka.borovec@seznam.cz>
2021-05-06 09:05:50 +00:00
ananthsub
6104a6316a
[1/2] Deprecate `outputs` in `on_train_epoch_end` hooks ( #7339 )
...
* Remove outputs from on_train_epoch_end
* iterate
* Update callback_hook.py
* update
* Update training_loop.py
* Update test_training_loop.py
* early stop?
* fix
* update tests
* Update test_hooks.py
* Update pytorch_lightning/trainer/callback_hook.py
Co-authored-by: Ethan Harris <ewah1g13@soton.ac.uk>
* Update pytorch_lightning/trainer/training_loop.py
Co-authored-by: Ethan Harris <ewah1g13@soton.ac.uk>
* Update trainer.py
* Update pytorch_lightning/trainer/trainer.py
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Ethan Harris <ewah1g13@soton.ac.uk>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2021-05-05 17:18:16 +02:00
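What the deprecation means for user code, as a sketch (names hypothetical): the hook drops its `outputs` parameter, and anything needed at epoch end is cached on the module during `training_step` instead.

```python
import pytorch_lightning as pl

class MyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.epoch_losses = []

    def training_step(self, batch, batch_idx):
        loss = ...  # compute the loss as usual
        self.epoch_losses.append(loss)
        return loss

    def on_train_epoch_end(self):  # new-style hook: no `outputs` argument
        self.epoch_losses.clear()
```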
ananthsub
98670c83a9
Deprecate`truncated_bptt_steps` flag on Trainer in favor of same setting on the LightningModule ( #7323 )
...
* deprecate-tbptt-trainer
* Update CHANGELOG.md
* Update lightning.py
* test
* Update lightning.py
* Update training_loop.py
* Update training_loop.py
* Update lightning.py
* Update training_loop.py
* Update training_loop.py
* update docs
* Update accelerator.py
* Update accelerator.py
* more docs
* tweaks
* chlog
* comments
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2021-05-05 11:21:00 +01:00
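The migration path, sketched: instead of passing the flag to the Trainer, set it as an attribute on the LightningModule.

```python
import pytorch_lightning as pl

class SequenceModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # Replaces the deprecated Trainer(truncated_bptt_steps=2); splitting
        # of time-dimension batches is now driven by this module attribute.
        self.truncated_bptt_steps = 2
```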
Carlos Mocholí
8c0ea92af2
`TrainerState` refactor [5/5] ( #7173 )
...
* `TrainerState` refactor
* flake8
* Update finished check
* Test cleanup
* Fix tests
* Fixes
* Reorder
* flake8
* Update CHANGELOG
* Better docs
* Better docs
* Remove default
* Update tests
* Bad merge
2021-05-04 12:50:56 +02:00
ananthsub
39274273a4
Update accelerator.py ( #7318 )
2021-05-03 11:17:26 -04:00
Adrian Wälchli
e0c64f0ef6
Fix Adagrad optimizer not working with DDP/GPU ( #7277 )
...
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: thomas chaton <thomas@grid.ai>
2021-05-03 03:57:17 +05:30
thomas chaton
16d6c9828d
[bugfix] Apex never instantiated. ( #7274 )
...
* update
* update
* update apex
* update
* update
* update
* remove test.py
* update
* update
* update on comments
* update changelog
* update
* update
* typo
2021-04-30 13:16:28 -04:00
Kaushik B
94fcaaf5d7
Add `debug` flag to TPU Training Plugins (PT_XLA_DEBUG) ( #7219 )
2021-04-27 20:34:25 +00:00
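A sketch of enabling the new flag, assuming TPU hardware is available; the plugin exports the `PT_XLA_DEBUG` environment variable so the XLA runtime reports compilation and metrics issues:

```python
import pytorch_lightning as pl
from pytorch_lightning.plugins import TPUSpawnPlugin

# debug=True is the flag added by this PR (requires a TPU machine to run).
trainer = pl.Trainer(tpu_cores=8, plugins=TPUSpawnPlugin(debug=True))
```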
Carlos Mocholí
ca6c87ffbe
Add back `clip_gradients(model)` ( #7231 )
2021-04-27 11:34:02 +00:00
Kaushik B
6be0a859db
Update teardown for TPU acc ( #7211 )
2021-04-26 13:30:46 +01:00
ananthsub
bc3f08b0e3
[fix] Add barrier to accelerator's teardown ( #6814 )
2021-04-26 09:23:29 +00:00
thomas chaton
f58865aada
Properly set `LightningModule.device` after model replacement ( #7188 )
...
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-04-23 16:36:52 +02:00
thomas chaton
013756404b
[bugfix] Add set_default_tensor_type to torch.DoubleTensor with precision=64 ( #7108 )
...
* update
* Update pytorch_lightning/plugins/precision/double.py
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
* Update pytorch_lightning/plugins/precision/double.py
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
* Update pytorch_lightning/plugins/precision/double.py
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
* resolve tests
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-04-20 15:25:37 +00:00
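The user-facing effect of the fix, as a sketch: with `precision=64` selected, the double-precision plugin now calls `torch.set_default_tensor_type(torch.DoubleTensor)`, so tensors created inside the model default to float64 as well.

```python
import pytorch_lightning as pl

# Parameters and freshly created tensors inside the module run in float64.
trainer = pl.Trainer(precision=64)
```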
Kaushik B
f168a535ca
Add MpModelWrapper in TPU Spawn ( #7045 )
...
Co-authored-by: tchaton <thomas@grid.ai>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-04-20 13:05:27 +00:00
Carlos Mocholí
898ec8a94a
Create pytorch_lightning/utilities/types.py ( #7048 )
2021-04-19 14:43:16 +02:00
thomas chaton
7b0b0d2844
update ( #7056 )
2021-04-16 21:22:19 +01:00
Carlos Mocholí
f29ecbfd90
Typing for accelerators and plugins ( #7022 )
2021-04-15 16:48:16 +00:00
Ethan Harris
f645df5e9a
Add typings for evaluation_loop.py and remove some dead code ( #7015 )
2021-04-15 07:36:04 +00:00
CeShine Lee
24d0295ff1
Fix the `gradient_clip_algorithm` has no effect issue. ( #6928 )
2021-04-14 14:17:06 +05:30
Kaushik B
1b3e4f9fb9
Fix sync_dist for tpus ( #6950 )
2021-04-13 14:17:15 +05:30
Adrian Wälchli
20ff50caa6
Accelerator API docs ( #6936 )
...
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: William Falcon <waf2107@columbia.edu>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2021-04-10 12:25:07 +05:30
Kaushik B
55525031c6
Fix TPU Spawn gather ( #6896 )
2021-04-09 18:30:59 +05:30
Anthony Kim
7f6154fcad
Add `Trainer(gradient_clip_algorithm='value'|'norm')` ( #6123 )
...
* add changelog
* add clip by value
* fix bug in training tricks.rst
* fix bug in trainer.rst
* Update trainer.rst
* Update trainer.rst
* Update CHANGELOG.md
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
* Update pytorch_lightning/plugins/precision/deepspeed_precision.py
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
* Update pytorch_lightning/utilities/enums.py
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
* yapf formatting
* update training tricks
* update based on comment
* update based on comment
* Update pytorch_lightning/trainer/trainer.py
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
* update based on comment
* pep8
* mypy
* mypy
* Update docs/source/advanced/training_tricks.rst
Co-authored-by: thomas chaton <thomas@grid.ai>
* Update sharded_native_amp.py
* Update test_sharded_parity.py
* update test codes
* Update test_tpu.py
* Update pytorch_lightning/trainer/connectors/training_trick_connector.py
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
* Update test_trainer.py
* Update enums.py
* Update enums.py
* add super-class initialization to precision plugins.
* add clip_grad horovod cpu test
* add clip_grad horovod cpu test
* use subprocess check_call
* change order of horovod tests
* set max_epochs 2 in horovod test
* remove clip_grad_val test from horovod-cpu
* remove "type: ignore"
* divide clip grad val test in horovod
* update based on comments
* add super-class initialization to precision plugins.
* bugfix
* bugfix
* revert some changes
* revert some changes
* Update tests/models/test_horovod.py
* merge master
* Delete signature test
No point in testing a signature
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: thomas chaton <thomas@grid.ai>
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
2021-04-06 08:27:37 -05:00
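The resulting API, for reference; only the clipping strategy differs between the two calls:

```python
import pytorch_lightning as pl

# Element-wise clipping of every gradient to [-0.5, 0.5]:
trainer = pl.Trainer(gradient_clip_val=0.5, gradient_clip_algorithm="value")

# The default remains clipping by total gradient norm:
trainer = pl.Trainer(gradient_clip_val=0.5, gradient_clip_algorithm="norm")
```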
ananthsub
bb9ace4333
[typing] Add typehint for broadcast in training type plugin ( #6777 )
...
* Update training_type_plugin.py
* Update accelerator.py
* Update pytorch_lightning/plugins/training_type/training_type_plugin.py
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
2021-04-02 20:55:34 +02:00
Kaushik B
a72a7992a2
Update clip gradients signature for precision plugins ( #6764 )
2021-03-31 17:06:48 +05:30
thomas chaton
1302766f83
DeepSpeed ZeRO Update ( #6546 )
...
* Add context to call hook to handle all modules defined within the hook
* Expose some additional parameters
* Added docs, exposed parameters
* Make sure we only configure if necessary
* Setup activation checkpointing regardless, saves the user having to do it manually
* Add some tests that fail currently
* update
* update
* update
* add tests
* change docstring
* resolve accumulate_grad_batches
* resolve flake8
* Update DeepSpeed to use latest version, add some comments
* add metrics
* update
* Small formatting fixes, clean up some code
* Few cleanups
* No need for default state
* Fix tests, add some boilerplate that should move eventually
* Add hook removal
* Add a context manager to handle hook
* Small naming cleanup
* wip
* move save_checkpoint responsability to accelerator
* resolve flake8
* add BC
* Change recommended scale to 16
* resolve flake8
* update test
* update install
* update
* update test
* update
* update
* update test
* resolve flake8
* update
* update
* update on comments
* Push
* pull
* Update pytorch_lightning/plugins/training_type/deepspeed.py
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* Update pytorch_lightning/plugins/training_type/deepspeed.py
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* update
* Apply suggestions from code review
* Swap to using world size defined by plugin
* update
* update todo
* Remove deepspeed from extra, keep it in the base cuda docker install
* Push
* pull
* update
* update
* update
* update
* Minor changes
* duplicate
* format
* format2
Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
2021-03-30 13:39:02 -04:00
Kaushik B
f79a13e495
[Model Parallel] Add configure sharded model hook ( #6679 )
...
* Add base hook for model parallel
* fix callback signature
* Simplify hook
* Add hook logic
* add tests
* add property setter
* add logic for being called once
* Update changelog
* Fix
* fix return type
* fix lambda callback test
* Fix tests
* Apply code suggestions
* add logic for setup_optimizers_predispatch
* add common dummy model
* Swap call order
* Remove test that isn't needed anymore
* Update tests
* Add a bit more doc
* Few code review fixes
* Update pytorch_lightning/accelerators/accelerator.py
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
* Change hook name
* Fix test
* Test setup hook, refactor names
* Swap call order of callbacks and model initialization
* Change name of context manager
Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-03-29 14:50:51 -06:00
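A sketch of the hook this PR introduces (class and layer are hypothetical): layers created inside it can be set up under the training type plugin's sharding context rather than materialised on a single device first.

```python
import torch.nn as nn
import pytorch_lightning as pl

class BigModel(pl.LightningModule):
    def configure_sharded_model(self):
        # Instantiated within the plugin's sharded context when supported.
        self.block = nn.Sequential(nn.Linear(32, 32), nn.ReLU())
```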
thomas chaton
646cf2f7d4
[refactor] Move save_function to accelerator 1/n [DeepSpeed] ( #6689 )
...
* move save_checkpoint responsability to accelerator
* update
2021-03-29 21:02:37 +02:00
Jirka Borovec
d471fa30b3
add copyr ( #6661 )
2021-03-24 14:29:46 +01:00
thomas chaton
fd5cb7fcc3
Add PyTorch 1.8 Profiler 5/5 ( #6618 )
...
* Refactor profilers
* Update PassThrough
* WIP - This is broken and will change
* Update pytorch_lightning/profiler/pytorch.py
Co-authored-by: thomas chaton <thomas@grid.ai>
* resolve tests
* resolve tests
* find output
* try something
* update
* add support for test and predict
* update
* update
* use getattr
* test
* test
* update
* tests
* update
* update
* update
* update
* update
* remove file
* update
* update
* update
* update
* update
* test
* update
* update
* update tests
* update
* add support for 1.8
* rename records
* add support for 1.8
* update
* resolve flake8
* resolve test
* Refactor basic profilers
* Fixes
* Unused import
* Introduce setup
* Profile on all ranks. Print to stdout on 0
* Introduce dirpath + filename
* CHANGELOG
* Add tests. Address comments
* add `on_run_stage_setup`
* add on_run_stage_setup function
* update
* add test for RegisterRecordFunction
* update lightning flow direction
* move variable to private
* remove trace
* Undo code that should be in 3/4
* Multi-stage multi-rank
* 2/5 changes
* Pass stage in __del__
* Remove TODOs
* Describe on_evaluation_end. Add tests
* Typo
* Address comments
* deepcopy tests
* Advanced teardown
* Fix teardown test
* Fix tests
* Minor change
* Update CHANGELOG.md
* Fix test
* Quick fixes
* Fix 6522
* resolve ddp tests
* resolve tests
* resolve some tests
* update tests
* resolve tests
* update
* resolve tests
* resolve some tests
* Missed fixes from 3/5
* Fixes
* resolve some tests
* resolve test for 1.7.1
* Broken refactor
* Missed stage
* Minor changes
* resolve tests
* Update CHANGELOG
* resolve bug
* remove print
* Typo
* Cleanup
* resolve ddp test
* remove barrier
* update profiler
* update
* Smaller model
* update
* resolve tests
* update
* Minor changes. CHANGELOG
* Minimize diff
* update to 1.8.1
* RunIf. Extra code. Check segfault
* resolve tests
* Typo. Bad merge
* Fixing a bad merge
* replace for kineto
* Update pytorch_lightning/profiler/pytorch.py
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
* Update pytorch_lightning/profiler/pytorch.py
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
* Minor changes
* Bad merge
* Use lists for flexibility
* Use sets
* predict_step
* Ananth's suggestion
* update
* Docs
* Update pl_examples/basic_examples/profiler_example.py
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
* update example
* update example
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2021-03-23 20:43:21 +00:00
Jirka Borovec
3cf0c3117a
fix back-compatibility for Accel ( #6655 )
2021-03-23 17:41:36 +01:00
thomas chaton
0995d30fab
Flash predict step ( #6577 )
...
* add predict_step
* Update predict_loop.py
* Update trainer.py
* Update trainer.py
* resolve bugs
* update
* update
* update
* resolve bug
* resolve some failing tests
* update tests
* update
* resolve tests
* add a test
* remove typo
* add a test for attachement
* update
* changed to on_train_dataloader
* remove __flash_special_attr__
* resolve tests
* update
* update
* update
* update on comments
* Update pytorch_lightning/trainer/data_loading.py
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2021-03-23 11:13:13 -04:00
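A sketch of the `predict_step` API this PR adds (model and data are toys):

```python
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader

class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 2)

    def predict_step(self, batch, batch_idx, dataloader_idx=None):
        # Called once per batch by trainer.predict().
        return self.layer(batch)

trainer = pl.Trainer()
predictions = trainer.predict(LitModel(), DataLoader(torch.randn(8, 4)))
```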
Sean Naren
58c9fa7edb
Allow training type plugin to delay optimizer creation (FSDP 2/n) ( #6331 )
...
* Allow training_type_plugin to delay optimizer configuration
* Add missing references to trainer, add a CPU accelerator based test
2021-03-22 11:43:53 +00:00
Kaushik B
87c03b1038
Update Gradient Clipping for TPU Accelerator ( #6576 )
2021-03-20 01:02:57 +05:30
Ethan Harris
983a888f49
Fix all_gather for tpu_cores=8 ( #6587 )
2021-03-19 21:56:58 +05:30
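The call site the fix targets, sketched: `self.all_gather` inside a step should return a correctly stacked result from all cores.

```python
import torch
import pytorch_lightning as pl

class GatheringModel(pl.LightningModule):
    def validation_step(self, batch, batch_idx):
        metric = torch.tensor(1.0, device=self.device)
        # Collects the value from every process; the fix makes the result
        # correctly shaped when running with tpu_cores=8.
        return self.all_gather(metric)
```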
Sean Naren
4e9b453854
[Fix] Move init dist connection into the setup function ( #6506 )
...
* Move connection setup into the setup function. Call setup hook after we set up the accelerator
* Added CHANGELOG.md
* fix setup order in callback test
* fix input arguments in test
* Mock distributed function, remove protection to turn into training type hook
* Remove import
* Add missing mock, ensure custom plugin does not create children process
* Skip test on windows
* Update deepspeed to init connection in setup
* Do not initialize distributed module
* Move DeepSpeed tests to special tests since dist communication is being set up
* Special the test to see if this fixes CI
* Delete accelerator connector test to see if its causing build to fail
* Delete deepspeed test
* Revert "Delete accelerator connector test to see if its causing build to fail"
This reverts commit edde60b8
* Revert "Delete deepspeed test"
This reverts commit 9d317429
* Reverse hook
* Reverse setup hooks to debug again
* Add todo so i know where i left off
* For single device move in pre_dispatch after setup function
* Add additional model to device hook if any additional parameters have been set
* See if we can enable deepspeed tests
* Revert "See if we can enable deepspeed tests"
This reverts commit b5450def
* See if this hook approach works
* Introduce new granular hooks
* Remove import, fix tpu spawn by moving the function to setup
* Added missing special test
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-03-18 14:33:39 -07:00
Jirka Borovec
6453091b8a
Prune metrics base classes 2/n ( #6530 )
...
* base class
* extensions
* chlog
* _stable_1d_sort
* _check_same_shape
* _input_format_classification_one_hot
* utils
* to_onehot
* select_topk
* to_categorical
* get_num_classes
* reduce
* class_reduce
* tests
2021-03-15 19:28:18 +00:00
thomas chaton
0544efd453
[bug] Update broadcast + reduce decision [ModelCheckpoint] ( #6410 )
...
* resolve bug
* update
* update changelog
* update PR
* Update pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
* add todo
* resolve issues
* resolve flake8
* update
* add coverage for reduce
* wip
* restore back to broadcast
* remove test.py
* resolve flake8
* update
* check world size
* resolve test
* update
* use pytorch version when defined
* update on comments
* update on comments
* flake8
* resolve bugs
* Update CHANGELOG.md
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
* update
* update
* update
* update
* remove test
* update
* resolve flake8
* update
* update
* update
* proxy
* update
* update
* resolve typo
* prune
* update parallel
* update
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-03-14 17:14:27 +00:00
Rohit Gupta
c53edce1a1
Disable batch transfer in DP mode ( #6098 )
...
* add exceptions and test
* hook
* fix
* clean up
* clean up
* regex
* regex
* docs
* rev
* comment and docs
* chlog
* Apply suggestions from code review
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
* Apply suggestions from code review
Co-authored-by: chaton <thomas@grid.ai>
* Monkey-patch device count
* docs
* pep
* api_change
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: chaton <thomas@grid.ai>
2021-03-11 10:51:10 -05:00
Elia Cereda
d0596fac94
Refactor RunningStage usage in advance of implementing Trainer.validate() ( #4945 )
...
* Update code
Co-authored-by: EliaCereda
* More property updates
* Move properties. Introduce trainer._fitting
* Use trainer.fitting
* Fix reset dataloaders
* Unused code
* RunningStage.SANITY_CHECKING
* Use setters
* Fix bugs
* Fix bugs
* TrainerState.{FITTING,VALIDATING,TESTING,PREDICTING,TUNING}
* Fix bugs
* Fix bugs
* Fix tests
* Update CHANGELOG. Add deprecation warning. Fix tests
* Unused imports
* Optional trainer
* More deprecation. More refactoring
* Correct version
* Use properties
* Address comments
* flake8
* Missed renamings
* Typo
* is -> ==
`is` is recommended for Enums since they are singletons; however, since LightningEnum subclasses str, it's not a good idea here in case a user sets the state/stage with a plain str
* Also for tests
* Typo
* Address @tchaton's comments
* PEP8
* Correct property
* Update CHANGELOG
* Apply suggestions from code review
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* Update pytorch_lightning/trainer/trainer.py
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* Remove called sanity check
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-03-06 12:40:19 +00:00
Kunal Mundada
49c579f1f0
docstring changes in accelerators ( #6327 )
...
* docstring changes in accelerators
* docstrings moved
* whitespaces removed
* PEP8 correction[1]
2021-03-04 20:21:53 +00:00
Sean Naren
d01e8fdc86
[fix] Use training type plugin hook when saving (FSDP 1/n) ( #6321 )
...
* Rely on training type plugin when saving
* Add better typing to training type plugin
2021-03-04 18:09:33 +00:00
thomas chaton
484dce11ec
[bugfix] TPU + all_gather + SingleTPU shouldn't call xm.all_gather ( #6296 )
...
* resolve an issue with TPU
* update
* add changelog
2021-03-03 08:54:20 -05:00
thomas chaton
1aac481957
[bugfix] TPU test hangs to barrier on 1 process ( #6272 )
...
* update
* resolve flake8
* update
* update
* update changelog
* update
* resolve flake8
Co-authored-by: Your Name <you@example.com>
2021-03-02 18:01:35 -05:00
Jirka Borovec
58a6d59784
simplify skip-if tests >> 0/n ( #5920 )
...
* skipif + yapf + isort
* tests
* docs
* pp
2021-03-01 12:17:09 +00:00
Justus Schock
0647340f3b
Add mypy typing to precision plugins. ( #6149 )
...
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
2021-02-26 14:27:16 +01:00
Justus Schock
3ed8ef8af9
type accelerators ( #6148 )
2021-02-25 06:42:23 +00:00
Adrian Wälchli
ae6ce17598
fix amp/apex misconfiguration error for cpu ( #6107 )
...
* fix weird test
* fix apex plugin test
* fix raise
* cpu test
* fix type
* add changelog
2021-02-22 01:02:31 +01:00
Adrian Wälchli
fc9bb53e13
fix flake8 for new plugins ( #5951 )
...
* flake8
* fix cyclic import
* isort
2021-02-18 18:28:23 +00:00
Rohit Gupta
bcc0004955
Add before_batch_transfer and after_batch_transfer hooks ( #3671 )
...
* add hooks
* comment
* docs
* add tests
* make it private
* fix tests
* docs
* chlog
* testcode
* codefactor
* fix doctest
* fix doctest
* suggestions
* is always overridden
* pep and BoringModel
* BoringModel
* docs
* docs
* docs
* fix
* rebase
* rebase
* suggestions
* docs
* suggestions
* try fix docs
* docs
* update name
* yapf
* docs
* rebase
* yapf
2021-02-18 06:58:12 -05:00
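A sketch of the two hooks as they landed here (the `dataloader_idx` argument was only added later, in #6241 above):

```python
import pytorch_lightning as pl

class AugmentingModel(pl.LightningModule):
    def on_before_batch_transfer(self, batch):
        # Runs while the batch is still on the CPU, before the device move.
        return batch

    def on_after_batch_transfer(self, batch):
        # Runs right after the move; a typical spot for GPU-side augmentation.
        return batch
```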
Sean Naren
b019c25dde
Add descriptions to accelerator broadcast function/clean up all_gather ( #6044 )
...
* Add descriptions to accelerator broadcast function/clean up all_gather
* Remove todo
2021-02-18 06:12:52 -05:00
Sean Naren
b7c2e0a80e
Trainer only references accelerator ( #6039 )
...
* Trainer only references accelerator where it can
* Move teardown to the trainer, as it is responsible for the accelerator
2021-02-17 21:41:18 +01:00
Sean Naren
7189d673f6
DeepSpeed Integration ( #5954 )
...
* Add initial deepspeed changes
* Address code review
* Move static method outside of function
* Fixes
* Add missing annotation
* Remove seed setting
* Doc changes
* Doc changes, add address reviews
* Fix docs
* Try fixing issue by moving to torch adam
* Clean up check
* Changes, better APIs!
* Add wrapper, swap to git install revision
* Add special test
* Add warning
* Address review
* Add better disclaimer
* Turn off ZeRO for testing due to compilation
* Add description on modifying parameters via the plugin
* Doc strings clear
* Small doc fixes
* Fix hash, reduce test
* Added CI change
* Move to azure pipeline
* Fix test name
* Add missing flag
* Remove sudo...
* Try conda instead
* Swap to conda base
* Try suggested install
* Apply suggestions from code review
* Apply suggestions from code review
* Revert "Apply suggestions from code review"
This reverts commit 41cca05a
* Revert "Apply suggestions from code review"
This reverts commit e06ec29e
* Remove setter
* Address most review
* Move out function, remove DeepSpeed from requirements
* Install deepspeed/mpi4py within container
* Use special tests, move to master commit for deepspeed
* Export path
* Force compile to happen first
* Remove!
* Debugging ninja
* Fix error in optimizer step logic
* Attempt to fix symbolic link
* Reverse to aid debugging
* Export path again
* Clean up mess
* var
* Revert "var"
This reverts commit 3450eaca
* Address review, add todo
* Add note about unsupported functionality
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: tchaton <thomas@grid.ai>
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
2021-02-17 15:23:42 -05:00
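How the integration is selected, as a sketch (requires the deepspeed package, NVIDIA GPUs, and a distributed launch):

```python
import pytorch_lightning as pl

# The string shortcut picks the DeepSpeed training type plugin.
trainer = pl.Trainer(gpus=4, precision=16, plugins="deepspeed")
```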
Adrian Wälchli
e0bb33c8cc
move accelerator_connector.py to the connectors subfolder ( #6033 )
...
* move accelerator connector
* rename BackendConnector -> AcceleratorConnector
2021-02-17 18:38:08 +00:00
chaton
6bc4490d01
[HotFix] Resolve TPU Training ( #6027 )
...
* fix tpus
* update
* add back reduction in val_loss
* resolve some bugs with TPUs
* update changelog
* update on comments
* forgot status
* Fix train_bn arg
* resolve comments
* update on comments
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-02-17 16:40:13 +00:00
chaton
e982800b81
Add PredictLoop ( #5752 )
...
* integrate distrib_type
* sync changes
* sync
* fixes
* add forgotten generators
* add missing logic
* update
* import
* missed imports
* import fixes
* isort
* mv f
* changelog
* format
* move helper to parallel plugin
* d
* add world size
* clean up
* duplicate
* activate ddp_sharded and tpu
* set nvidia flags
* remove unused colab var
* use_tpu <-> on_tpu attrs
* make some ddp_cpu and clusterplugin tests pass
* Ref/accelerator connector (#5742 )
* final cleanup
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* connector cleanup
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* trainer cleanup
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* accelerator cleanup + missing logic in accelerator connector
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* add missing changes to callbacks
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* reflect accelerator changes to lightning module
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* clean cluster envs
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* cleanup plugins
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* add broadcasting
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* yapf
* remove plugin connector
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* plugins
* add predict_loop
* manual optimization
* clean predictloop
* update optimizer routing
* add predict loop on new accelerator
* resolve a bug
* add rank to torchelastic
* add predict_loop
* add predict loop on new accelerator
* resolve a bug
* fix memory mixed precision
* update
* setstate on trainer for pickling in ddp spawn
* add predict_loop
* clean predictloop
* add predict loop on new accelerator
* resolve a bug
* add predict_loop
* add predict loop on new accelerator
* resolve a bug
* add predict_loop
* add predict loop on new accelerator
* resolve a bug
* add predict_loop
* add predict loop on new accelerator
* resolve a bug
* add predict_loop
* clean predictloop
* add predict loop on new accelerator
* resolve a bug
* add predict_loop
* add predict loop on new accelerator
* resolve a bug
* resolve tests
* add predict method
* add back commented accelerator code
* adapt test for sync_batch_norm to new plugin
* fix deprecated tests
* fix ddp cpu choice when no num_processes are given
* yapf format
* skip a memory test that cannot pass anymore
* remove sanitize
* rename train to run_train
* remove useless hooks
* add misconfigurationException
* remove wrong naming
* resolve some legacy
* update docstring
* fix pickle error in spawn plugin
* x
* avoid
* x
* fix cyclic import in docs build
* add support for sharded
* update typing
* add sharded and sharded_spawn to distributed types
* make unwrap model default
* refactor LightningShardedDataParallel similar to LightningDistributedDataParallel
* update sharded spawn to reflect changes
* update sharded to reflect changes
* Merge 1.1.5 changes
* fix merge
* fix merge
* yapf isort
* fix merge
* yapf isort
* fix indentation in test
* copy over reinit scheduler implementation from dev1.2
* fix apex tracking calls with dev_debugger
* reduce diff to dev1.2, clean up
* fix trainer config test when gpus>0 and num_processes >0 and ddp_cpu
* sort plugin tests legacy/new
* fix error handling for amp on cpu
* fix merge
fix merge
fix merge
* [Feat] Resolve manual_backward (#5837 )
* resolve manual_backward
* resolve flake8
* update
* resolve for ddp_spawn
* resolve flake8
* resolve flake8
* resolve flake8
Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>
* fix tests/accelerator tests on cpu
* [BugFix] Resolve manual optimization (#5852 )
* resolve manual_optimization
* update
* update
Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>
* Remove copy trainer parameters to happen earlier within the loop and add safe guard to get ref model (#5856 )
* resolve a bug
* Accelerator refactor sharded rpc (#5854 )
* rpc branch
* merge
* update handling of rpc
* make devices etc. Optional in RPC
* set devices etc. later if necessary
* remove devices from sequential
* make devices optional in rpc
* fix import
* uncomment everything
* fix cluster selection
Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>
* resolve bug
* fix assert in rpc test
* resolve a test
* fix docs compilation
* accelerator refactor - fix for sharded parity test (#5866 )
* fix memory issue with ddp_spawn
* x
x
x
x
x
x
x
x
x
* x
* Remove DDP2 as this does not apply
* Add missing pre optimizer hook to ensure lambda closure is called
* fix apex docstring
* [accelerator][BugFix] Resolve some test for 1 gpu (#5863 )
* update
* revert init
* resolve a bug
* update
* resolve flake8
* update
* update
* update
* revert init
* resolve a bug
* update
* resolve flake8
* update
* update
* update
* update
* update
* revert init
* resolve a bug
* update
* resolve flake8
* update
* update
* update
* revert init
* update
* resolve flake8
* update
* update
* update
* update
* update
* all_gather
* update
* make plugins work, add misconfig for RPC
* update
* update
* remove breaking test
* resolve some tests
* resolve flake8
* revert to ddp_spawn
Co-authored-by: root <root@ip-172-31-88-60.ec2.internal>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>
Co-authored-by: Justus Schock <justus.schock@rwth-aachen.de>
* yapf isort
* resolve flake8
* fix apex doctests
* fix apex doctests 2
* resolve docs
* update drone
* clean env
* update
* update
* update
* update
* merge
* Fix RPC related tests, clean out old API, update for new accelerator API [skip ci] (#5881 )
* Fix RPC related tests, clean out old API, update for new accelerator API
* Move tests out of legacy folder, update paths and names
* Update test_remove_1-4.py
* Expose properties for tpu cores/gpus/num_gpus
* Add root GPU property
* Move properties to properties.py
* move tests that were previously in drone
* Fix root GPU property (#5908 )
* Move root GPU to property, remove horovod set as this is handled in horovod plugin, ensure we mock correctly to set GPU accelerator
* Add missing tests back
* fix best model path transfer when no checkpoint callback available
* Fix setup hook order [wip] (#5858 )
* Call trainer setup hook before accelerator setup
* Add test case
* add new test
* typo
* fix callback order in test
Co-authored-by: tchaton <thomas@grid.ai>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* rename ddp sequential -> rpc sequential for special test
* revert
* fix stupid merge problem
* Use property in connector for sampler (#5913 )
* merge the import conflicts
* fix spawning of processes in slurm
* [wip] Fix some bugs for TPU [skip ci] (#5878 )
* fixed for single tpu
* fixed spawn
* fixed spawn
* update
* update
* wip
* resolve bugs
* resolve bug
* update on comment
* removed decorator
* resolve comments
* set to 4
* update
* update
* need cleaning
* update
* update
* update
* resolve flake8
* resolve bugs
* exclude broadcast
* resolve bugs
* change test
* update
* update
* skip if meet fails
* properly raise trace
* update
* add catch
* wrap test
* resolve typo
* update
* typo
Co-authored-by: Lezwon Castelino <lezwon@gmail.com>
Co-authored-by: Your Name <you@example.com>
* resolve some tests
* update
* fix imports
* update
* resolve flake8
* update azure pipeline
* skip a sharded test on cpu that requires a gpu
* resolve tpus
* resolve bug
* resolve flake8
* update
* update utils
* revert permission change on files
* suggestions from carlos
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
* remove unrelated formatting changes
* remove incomplete comment
* Update pytorch_lightning/accelerators/__init__.py
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
* remove unrelated formatting change
* add types
* warn 1.7 ddp manual backward only if ddp kwarg unset
* yapf + isort
* pep8 unused imports
* fix cyclic import in docs
* Apply suggestions from code review
* typo in accelerator.py
* typo
* resolve flake8
* update code
* update
* Update pytorch_lightning/trainer/predict_loop.py
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
* Update pytorch_lightning/trainer/predict_loop.py
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
* fix merge
* fix merge
* reset legacy accelerator
* add missing rename dispatch
* rename post training
* update code
* resolved comments
* typo
* typo
* add flow description
* resolve comments
* update on comments
* update flow
* add backticks
* resolve tpu
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
Co-authored-by: justusschock <justus.schock@posteo.de>
Co-authored-by: Justus Schock <justus.schock@rwth-aachen.de>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: root <root@ip-172-31-88-60.ec2.internal>
Co-authored-by: Lezwon Castelino <lezwon@gmail.com>
Co-authored-by: Your Name <you@example.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
2021-02-16 17:11:56 -05:00
chaton
a52be5bb07
[Hot Fix] Ensure process_dataloader is called when tpu_cores > 1 to use Parallel DataLoader ( #6015 )
...
* hotfix for tpu
* update changelog
* Update CHANGELOG.md
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
2021-02-16 17:02:25 -05:00
chaton
6e79bef996
[accelerator][FeatBugFix] Improve manual optimization API ( #5771 )
...
* fix trainer.model access
* move properties
* fix test_transfer_batch_hook
* fix auto_select_gpus
* fix omegaconf test
* fix test that needs to simulate slurm ddp
* add horovod plugin
* fix test with named arguments
* clean up whitespace
* fix datamodules test
* remove old accelerators
* fix naming
* move old plugins
* move to plugins
* create precision subpackage
* create training_type subpackage
* fix all new import errors
* fix wrong arguments order passed to test
* fix LR finder
* Added sharded training type and amp plugin
* Move clip grad to precision plugin
* Added sharded spawn, select accelerators based on distributed_backend + enable custom fp16 plugin automatically
* Fix import issue, attempting to fix tests
* Fix initial test
* Reflect hook logic from master, should wrap model after move to device
* Optional state consolidation, since master has optimizers not wrapped
* change attribute for instance test
* reset optimizers
optimizers are not used in main process, so state would be wrong.
* legacy
* imports in accel
* legacy2
* trainer imports
* fix import errors after rebase
* move hook to new setup location
* provide unwrapping logic
* fix trainer callback system
* added ddp2 implementation
* fix imports .legacy
* move plugins
* restore legacy
* drop test.py from root
* add tpu accelerator and plugins
* fixes
* fix lightning optimizer merge
* reset bugreportmodel
* unwrapping
* step routing forward
* model access
* unwrap
* opt
* integrate distrib_type
* sync changes
* sync
* fixes
* add forgotten generators
* add missing logic
* update
* import
* missed imports
* import fixes
* isort
* mv f
* changelog
* format
* move helper to parallel plugin
* d
* add world size
* clean up
* duplicate
* activate ddp_sharded and tpu
* set nvidia flags
* remove unused colab var
* use_tpu <-> on_tpu attrs
* make some ddp_cpu and clusterplugin tests pass
* Ref/accelerator connector (#5742 )
* final cleanup
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* connector cleanup
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* trainer cleanup
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* accelerator cleanup + missing logic in accelerator connector
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* add missing changes to callbacks
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* reflect accelerator changes to lightning module
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* clean cluster envs
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* cleanup plugins
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* add broadcasting
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* yapf
* remove plugin connector
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* plugins
* manual optimization
* update optimizer routing
* add rank to torchelastic
* fix memory mixed precision
* setstate on trainer for pickling in ddp spawn
* add predict method
* add back commented accelerator code
* adapt test for sync_batch_norm to new plugin
* fix deprecated tests
* fix ddp cpu choice when no num_processes are given
* yapf format
* skip a memory test that cannot pass anymore
* update on comments
* fix pickle error in spawn plugin
* x
* avoid
* x
* fix cyclic import in docs build
* add support for sharded
* update typing
* add sharded and sharded_spawn to distributed types
* make unwrap model default
* refactor LightningShardedDataParallel similar to LightningDistributedDataParallel
* update sharded spawn to reflect changes
* update sharded to reflect changes
* Merge 1.1.5 changes
* fix merge
* fix merge
* yapf isort
* fix merge
* yapf isort
* fix indentation in test
* copy over reinit scheduler implementation from dev1.2
* fix apex tracking calls with dev_debugger
* reduce diff to dev1.2, clean up
* fix trainer config test when gpus>0 and num_processes >0 and ddp_cpu
* sort plugin tests legacy/new
* fix error handling for amp on cpu
* fix merge
fix merge
fix merge
* [Feat] Resolve manual_backward (#5837 )
* resolve manual_backward
* resolve flake8
* update
* resolve for ddp_spawn
* resolve flake8
* resolve flake8
* resolve flake8
Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>
* fix tests/accelerator tests on cpu
* [BugFix] Resolve manual optimization (#5852 )
* resolve manual_optimization
* update
* update
Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>
* Remove copy trainer parameters to happen earlier within the loop and add safe guard to get ref model (#5856 )
* resolve a bug
* Accelerator refactor sharded rpc (#5854 )
* rpc branch
* merge
* update handling of rpc
* make devices etc. Optional in RPC
* set devices etc. later if necessary
* remove devices from sequential
* make devices optional in rpc
* fix import
* uncomment everything
* fix cluster selection
Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>
* resolve bug
* fix assert in rpc test
* resolve a test
* fix docs compilation
* accelerator refactor - fix for sharded parity test (#5866 )
* fix memory issue with ddp_spawn
* x
x
x
x
x
x
x
x
x
* x
* Remove DDP2 as this does not apply
* Add missing pre optimizer hook to ensure lambda closure is called
* fix apex docstring
* [accelerator][BugFix] Resolve some test for 1 gpu (#5863 )
* update
* revert init
* resolve a bug
* update
* resolve flake8
* update
* update
* update
* revert init
* resolve a bug
* update
* resolve flake8
* update
* update
* update
* update
* update
* revert init
* resolve a bug
* update
* resolve flake8
* update
* update
* update
* revert init
* update
* resolve flake8
* update
* update
* update
* update
* update
* all_gather
* update
* make plugins work, add misconfig for RPC
* update
* update
* remove breaking test
* resolve some tests
* resolve flake8
* revert to ddp_spawn
Co-authored-by: root <root@ip-172-31-88-60.ec2.internal>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>
Co-authored-by: Justus Schock <justus.schock@rwth-aachen.de>
* yapf isort
* resolve flake8
* fix apex doctests
* fix apex doctests 2
* resolve docs
* update drone
* clean env
* update
* update
* update
* update
* merge
* Fix RPC related tests, clean out old API, update for new accelerator API [skip ci] (#5881 )
* Fix RPC related tests, clean out old API, update for new accelerator API
* Move tests out of legacy folder, update paths and names
* Update test_remove_1-4.py
* Expose properties for tpu cores/gpus/num_gpus
* Add root GPU property
* Move properties to properties.py
* move tests that were previously in drone
* Fix root GPU property (#5908 )
* Move root GPU to property, remove horovod set as this is handled in horovod plugin, ensure we mock correctly to set GPU accelerator
* Add missing tests back
* fix best model path transfer when no checkpoint callback available
* Fix setup hook order [wip] (#5858 )
* Call trainer setup hook before accelerator setup
* Add test case
* add new test
* typo
* fix callback order in test
Co-authored-by: tchaton <thomas@grid.ai>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* rename ddp sequential -> rpc sequential for special test
* revert
* fix stupid merge problem
* Use property in connector for sampler (#5913 )
* merge the import conflicts
* fix spawning of processes in slurm
* [wip] Fix some bugs for TPU [skip ci] (#5878 )
* fixed for single tpu
* fixed spawn
* fixed spawn
* update
* update
* wip
* resolve bugs
* resolve bug
* update on comment
* removed decorator
* resolve comments
* set to 4
* update
* update
* need cleaning
* update
* update
* update
* resolve flake8
* resolve bugs
* exclude broadcast
* resolve bugs
* change test
* update
* update
* skip if meet fails
* properly raise trace
* update
* add catch
* wrap test
* resolve typo
* update
* typo
Co-authored-by: Lezwon Castelino <lezwon@gmail.com>
Co-authored-by: Your Name <you@example.com>
* resolve some tests
* update
* fix imports
* update
* resolve flake8
* update azure pipeline
* skip a sharded test on cpu that requires a gpu
* resolve tpus
* resolve bug
* resolve flake8
* update
* update utils
* revert permission change on files
* suggestions from carlos
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
* remove unrelated formatting changes
* remove incomplete comment
* Update pytorch_lightning/accelerators/__init__.py
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
* remove unrelated formatting change
* add types
* warn 1.7 ddp manual backward only if ddp kwarg unset
* yapf + isort
* pep8 unused imports
* fix cyclic import in docs
* Apply suggestions from code review
* typo in accelerator.py
* typo
* Apply suggestions from code review
* formatting
* update on comments
* update typo
* Update pytorch_lightning/trainer/properties.py
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* update
* update on comments
* resolve some comments
* update on comments
* resolve test
* add toggle_model
* update
* update on comments
* update doc
* typo
* update
* typo
* remove space
* update
* update on comments
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: justusschock <justus.schock@posteo.de>
Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
Co-authored-by: Justus Schock <justus.schock@rwth-aachen.de>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: root <root@ip-172-31-88-60.ec2.internal>
Co-authored-by: Lezwon Castelino <lezwon@gmail.com>
Co-authored-by: Your Name <you@example.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
2021-02-16 16:00:35 -05:00
Jirka Borovec
936f42aa1c
clean AMP logic ( #5994 )
...
* clean AMP logic
* cleaning
* ...
* ...
* Even apex
2021-02-16 19:06:47 +00:00
Adrian Wälchli
aa60c08641
move device-specific teardown logic from training loop to accelerator ( #5973 )
...
* on train end
* switch order
2021-02-15 17:38:03 -05:00
Adrian Wälchli
c912c4b729
remove legacy accelerators ( #5949 )
...
* remove legacy accelerators
* update imports
* formatting
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
2021-02-14 16:03:45 +00:00
Justus Schock
da6dbc8d1d
PoC: Accelerator refactor ( #5743 )
...
* restoring the result from subprocess
* fix queue.get() order for results
* add missing "block_backward_sync" context manager
* add missing "block_backward_sync" context manager
* fix sync_batchnorm
* fix supported gpu-ids for tuple
* fix clip gradients and inf recursion
* accelerator selection: added cluster_environment plugin
* fix torchelastic test
* fix reduce early stopping decision for DDP
* fix tests: callbacks, conversion to lightning optimizer
* fix lightning optimizer does not pickle
* fix setting benchmark and deterministic option
* fix slurm amp test
* fix prepare_data test and determine node_rank
* fix retrieving last path when testing
* remove obsolete plugin argument
* fix test: test_trainer_config
* fix torchscript tests
* fix trainer.model access
* move properties
* fix test_transfer_batch_hook
* fix auto_select_gpus
* fix omegaconf test
* fix test that needs to simulate slurm ddp
* add horovod plugin
* fix test with named arguments
* clean up whitespace
* fix datamodules test
* remove old accelerators
* fix naming
* move old plugins
* move to plugins
* create precision subpackage
* create training_type subpackage
* fix all new import errors
* fix wrong arguments order passed to test
* fix LR finder
* Added sharded training type and amp plugin
* Move clip grad to precision plugin
* Added sharded spawn, select accelerators based on distributed_backend + enable custom fp16 plugin automatically
* Fix import issue, attempting to fix tests
* Fix initial test
* Reflect hook logic from master, should wrap model after move to device
* Optional state consolidation, since master has optimizers not wrapped
* change attribute for instance test
* reset optimizers
optimizers are not used in main process, so state would be wrong.
* legacy
* imports in accel
* legacy2
* trainer imports
* fix import errors after rebase
* move hook to new setup location
* provide unwrapping logic
* fix trainer callback system
* added ddp2 implementation
* fix imports .legacy
* move plugins
* restore legacy
* drop test.py from root
* add tpu accelerator and plugins
* fixes
* fix lightning optimizer merge
* reset bugreportmodel
* unwrapping
* step routing forward
* model access
* unwrap
* opt
* integrate distrib_type
* sync changes
* sync
* fixes
* add forgotten generators
* add missing logic
* update
* import
* missed imports
* import fixes
* isort
* mv f
* changelog
* format
* move helper to parallel plugin
* d
* add world size
* clean up
* duplicate
* activate ddp_sharded and tpu
* set nvidia flags
* remove unused colab var
* use_tpu <-> on_tpu attrs
* make some ddp_cpu and clusterplugin tests pass
* Ref/accelerator connector (#5742 )
* final cleanup
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* connector cleanup
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* trainer cleanup
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* accelerator cleanup + missing logic in accelerator connector
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* add missing changes to callbacks
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* reflect accelerator changes to lightning module
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* clean cluster envs
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* cleanup plugins
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* add broadcasting
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* yapf
* remove plugin connector
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* plugins
* manual optimization
* update optimizer routing
* add rank to torchelastic
* fix memory mixed precision
* setstate on trainer for pickling in ddp spawn
* add predict method
* add back commented accelerator code
* adapt test for sync_batch_norm to new plugin
* fix deprecated tests
* fix ddp cpu choice when num_processes is not given
* yapf format
* skip a memory test that cannot pass anymore
* fix pickle error in spawn plugin
* x
* avoid
* x
* fix cyclic import in docs build
* add support for sharded
* update typing
* add sharded and sharded_spawn to distributed types
* make unwrap model default
* refactor LightningShardedDataParallel similar to LightningDistributedDataParallel
* update sharded spawn to reflect changes
* update sharded to reflect changes
* Merge 1.1.5 changes
* fix merge
* fix merge
* yapf isort
* fix merge
* yapf isort
* fix indentation in test
* copy over reinit scheduler implementation from dev1.2
* fix apex tracking calls with dev_debugger
* reduce diff to dev1.2, clean up
* fix trainer config test when gpus>0 and num_processes >0 and ddp_cpu
* sort plugin tests legacy/new
* fix error handling for amp on cpu
* fix merge
fix merge
fix merge
* [Feat] Resolve manual_backward (#5837 )
* resolve manual_backward
* resolve flake8
* update
* resolve for ddp_spawn
* resolve flake8
* resolve flake8
* resolve flake8
Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>
* fix tests/accelerator tests on cpu
* [BugFix] Resolve manual optimization (#5852 )
* resolve manual_optimization
* update
* update
Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>
* Move copying of trainer parameters to happen earlier within the loop and add a safeguard to get the ref model (#5856 )
* resolve a bug
* Accelerator refactor sharded rpc (#5854 )
* rpc branch
* merge
* update handling of rpc
* make devices etc. Optional in RPC
* set devices etc. later if necessary
* remove devices from sequential
* make devices optional in rpc
* fix import
* uncomment everything
* fix cluster selection
Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>
* resolve bug
* fix assert in rpc test
* resolve a test
* fix docs compilation
* accelerator refactor - fix for sharded parity test (#5866 )
* fix memory issue with ddp_spawn
* x
x
x
x
x
x
x
x
x
* x
* Remove DDP2 as this does not apply
* Add missing pre-optimizer hook to ensure lambda closure is called
* fix apex docstring
* [accelerator][BugFix] Resolve some test for 1 gpu (#5863 )
* update
* revert init
* resolve a bug
* update
* resolve flake8
* update
* update
* update
* revert init
* resolve a bug
* update
* resolve flake8
* update
* update
* update
* update
* update
* revert init
* resolve a bug
* update
* resolve flake8
* update
* update
* update
* revert init
* update
* resolve flake8
* update
* update
* update
* update
* update
* all_gather
* update
* make plugins work, add misconfig for RPC
* update
* update
* remove breaking test
* resolve some tests
* resolve flake8
* revert to ddp_spawn
Co-authored-by: root <root@ip-172-31-88-60.ec2.internal>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>
Co-authored-by: Justus Schock <justus.schock@rwth-aachen.de>
* yapf isort
* resolve flake8
* fix apex doctests
* fix apex doctests 2
* resolve docs
* update drone
* clean env
* update
* update
* update
* update
* merge
* Fix RPC related tests, clean out old API, update for new accelerator API [skip ci] (#5881 )
* Fix RPC related tests, clean out old API, update for new accelerator API
* Move tests out of legacy folder, update paths and names
* Update test_remove_1-4.py
* Expose properties for tpu cores/gpus/num_gpus
* Add root GPU property
* Move properties to properties.py
* move tests that were previously in drone
* Fix root GPU property (#5908 )
* Move root GPU to property, remove horovod set as this is handled in horovod plugin, ensure we mock correctly to set GPU accelerator
* Add missing tests back
* fix best model path transfer when no checkpoint callback available
* Fix setup hook order [wip] (#5858 )
* Call trainer setup hook before accelerator setup
* Add test case
* add new test
* typo
* fix callback order in test
Co-authored-by: tchaton <thomas@grid.ai>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* rename ddp sequential -> rpc sequential for special test
* revert
* fix stupid merge problem
* Use property in connector for sampler (#5913 )
* merge the import conflicts
* fix spawning of processes in slurm
* [wip] Fix some bugs for TPU [skip ci] (#5878 )
* fixed for single tpu
* fixed spawn
* fixed spawn
* update
* update
* wip
* resolve bugs
* resolve bug
* update on comment
* removed decorator
* resolve comments
* set to 4
* update
* update
* need cleaning
* update
* update
* update
* resolve flake8
* resolve bugs
* exclude broadcast
* resolve bugs
* change test
* update
* update
* skip if meet fails
* properly raise trace
* update
* add catch
* wrap test
* resolve typo
* update
* typo
Co-authored-by: Lezwon Castelino <lezwon@gmail.com>
* resolve some tests
* update
* fix imports
* update
* resolve flake8
* update azure pipeline
* skip a sharded test on cpu that requires a gpu
* resolve tpus
* resolve bug
* resolve flake8
* update
* update utils
* revert permission change on files
* suggestions from carlos
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
* remove unrelated formatting changes
* remove incomplete comment
* Update pytorch_lightning/accelerators/__init__.py
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
* remove unrelated formatting change
* add types
* warn 1.7 ddp manual backward only if ddp kwarg unset
* yapf + isort
* pep8 unused imports
* fix cyclic import in docs
* Apply suggestions from code review
* typo in accelerator.py
* typo
* Apply suggestions from code review
* formatting
* update on comments
* update typo
* Update pytorch_lightning/trainer/properties.py
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* update
* suggestion from code review
* suggestion from code review
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
Co-authored-by: chaton <thomas@grid.ai>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: root <root@ip-172-31-88-60.ec2.internal>
Co-authored-by: Lezwon Castelino <lezwon@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
2021-02-12 15:48:56 -05:00
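A rough sketch, assumed from the bullets above, of the composition this refactor converges on: the accelerator routes each step through a training-type plugin and wraps the computation in a precision plugin. Method names are illustrative.

```python
class Accelerator:
    """Sketch only; the real interface is much larger."""

    def __init__(self, precision_plugin, training_type_plugin):
        self.precision_plugin = precision_plugin          # AMP, grad scaling
        self.training_type_plugin = training_type_plugin  # DDP/TPU/sharded, devices

    def training_step(self, args):
        # numeric concerns are a context around the step; distribution
        # concerns execute the step itself
        with self.precision_plugin.train_step_context():
            return self.training_type_plugin.training_step(*args)
```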
Adrian Wälchli
bb7d188318
Fix ModelCheckpoint race condition in file existence check ( #5155 )
...
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>
2021-02-05 21:40:39 +01:00
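A hedged illustration of the race being fixed: a check-then-write loop over the filesystem is not atomic, so two processes can both see a file as missing and choose the same version number. The helper below demonstrates only the unsafe pattern, not the fix.

```python
import os

def next_version_path(dirpath: str, name: str) -> str:
    """Illustrative, racy versioning: do not use as-is."""
    version = 0
    path = os.path.join(dirpath, f"{name}.ckpt")
    while os.path.exists(path):  # another process can create the file between
        version += 1             # this check and our eventual write
        path = os.path.join(dirpath, f"{name}-v{version}.ckpt")
    return path
```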
Sean Naren
524675705a
Sync working directory with all processes when using Hydra ( #5629 )
...
* Append current work dir (hydra dir) to ensure all processes reference the same directory
* Add changelog
* Add different job name for DDP child processes to log to
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
(cherry picked from commit ca68cac57a)
2021-02-04 20:55:41 +01:00
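A sketch of the idea behind this fix, assuming Hydra command-line overrides: DDP child processes are launched with the parent's run directory and a per-rank job name, so all ranks share one directory and their logs do not collide. The command construction is illustrative.

```python
import os

def ddp_child_command(base_command: list, local_rank: int) -> list:
    command = list(base_command)
    cwd = f'"{os.getcwd()}"'  # the Hydra run dir of the parent process
    command += [
        f"hydra.run.dir={cwd}",                            # same working directory
        f"hydra.job.name=train_ddp_process_{local_rank}",  # distinct log name
    ]
    return command
```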
rohitgr7
a37416843b
Fix sync
...
resolve wrong merge
tpu
yapf
2021-02-03 20:11:35 +01:00
ananthsub
ad4e25fde5
[bugfix] Fix signature mismatch in DDPCPUHPCAccelerator's model_to_device ( #5505 )
...
* Update ddp_cpu_hpc_accelerator.py
* Update CHANGELOG.md
2021-02-03 19:41:45 +01:00
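A hedged illustration of the bug class: when an override's signature drifts from its base class, any caller passing the extra argument fails. The class and argument names below are illustrative.

```python
class BaseAccelerator:
    def model_to_device(self, model, process_idx):
        """Callers pass both arguments."""

class HPCAccelerator(BaseAccelerator):
    def model_to_device(self, model):
        """Raises TypeError as soon as a caller supplies process_idx."""
```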
Rohit Gupta
293984bbc0
fix reinit_schedulers with correct optimizer ( #5519 )
...
* update test
* syntax
* fix
* update test
* scheduler
* only apex
* fix
* rev drone
* chlog
2021-02-03 19:41:45 +01:00
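To show the pairing that re-initialisation must preserve: a minimal two-optimizer `configure_optimizers`, where each scheduler is bound to its own optimizer and must be rebuilt against that optimizer, not the first in the list. The module and its fields are illustrative.

```python
import torch
import pytorch_lightning as pl

class GAN(pl.LightningModule):  # illustrative two-optimizer module
    def __init__(self):
        super().__init__()
        self.generator = torch.nn.Linear(8, 8)
        self.discriminator = torch.nn.Linear(8, 1)

    def configure_optimizers(self):
        opt_g = torch.optim.Adam(self.generator.parameters(), lr=2e-4)
        opt_d = torch.optim.Adam(self.discriminator.parameters(), lr=2e-4)
        # sch_g belongs to opt_g, sch_d to opt_d; the pairing must survive
        # any re-initialisation of the schedulers
        sch_g = torch.optim.lr_scheduler.StepLR(opt_g, step_size=10)
        sch_d = torch.optim.lr_scheduler.StepLR(opt_d, step_size=10)
        return [opt_g, opt_d], [sch_g, sch_d]
```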
lacrosse91
43e926b880
Fix wrong exception message ( #5492 )
...
* Update accelerator_connector.py
* Apply suggestions from code review
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
2021-02-03 19:41:44 +01:00
Rohit Gupta
10c7dbe6a1
Refactor setup_training and remove test_mode ( #5388 )
...
* ref and fix call for on_pretrained_routine
* avoid failing tests
* unnecessary_call
* unnecessary call in accelerators
* tmpdir
* rm test_mode
* pep
* updates
* more ref
* Revert "more ref"
This reverts commit 5d9e95f873.
* more refac
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-02-03 19:41:42 +01:00
Adrian Wälchli
8943d8bca0
add missing logic to new plugins and accelerator ( #5734 )
...
* add missing logic
* missed imports
* import fixes
* isort
* mv f
* changelog
* format
* move helper to parallel plugin
* d
2021-02-01 13:23:53 -05:00
Justus Schock
b3ebc18bcb
Hardware specific parts of Accelerator Refactoring ( #5719 )
...
* add basic accelerator class.
Co-Authored with @awaelchi
* pep8
Co-authored-by: @awaelchi
* add cpu accelerator
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* add gpu accelerator
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* add tpu accelerator
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* add accelerator connector
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* add single device training
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* add single tpu
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* add tpu spawn
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* make on_colab_kaggle utility func
* fixes
* move
* yapf
* .
* .
* .
* flake8
* sync accelerator connector changes from dev1.2
* changelog
* fix tpu handling
* tpu
* aval
* yapf
* Update pytorch_lightning/plugins/training_type/tpu_spawn.py
Co-authored-by: chaton <thomas@grid.ai>
* Update pytorch_lightning/accelerators/accelerator_connector.py
Co-authored-by: chaton <thomas@grid.ai>
* Update pytorch_lightning/plugins/training_type/tpu_spawn.py
Co-authored-by: chaton <thomas@grid.ai>
* Update tpu_spawn.py
* Update pytorch_lightning/accelerators/accelerator_connector.py
Co-authored-by: chaton <thomas@grid.ai>
* indentation
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
Co-authored-by: chaton <thomas@grid.ai>
2021-02-01 08:34:59 -05:00
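A rough sketch, assumed from the bullets above, of the hardware-specific split: each accelerator owns only the placement logic for its device. Method names and the root-GPU choice are illustrative.

```python
import torch

class CPUAccelerator:
    def setup_device(self, model: torch.nn.Module) -> torch.nn.Module:
        return model.to(torch.device("cpu"))

class GPUAccelerator:
    def setup_device(self, model: torch.nn.Module) -> torch.nn.Module:
        torch.cuda.set_device(0)  # illustrative: bind this process to a root GPU
        return model.to(torch.device("cuda", 0))
```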
Justus Schock
069ae27cef
Accelerator Refactor: Precision Plugins ( #5718 )
...
* add basic accelerator class.
Co-Authored with @awaelchi
* add basic training type plugin.
Co-Authored with @awaelchi
* pep8
Co-authored-by: @awaelchi
* update copyright
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* add apex_amp
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* add mixed base class
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* add native amp
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* add native amp sharded
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* add tpu bfloat
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* add inits
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* Update precision_plugin.py
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-01-31 13:12:02 -05:00
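A hedged skeleton of the precision hierarchy added here: a base plugin that owns backward, with native-AMP scaling layered on top. Signatures are illustrative.

```python
import torch

class PrecisionPlugin:
    precision = 32

    def backward(self, loss: torch.Tensor) -> None:
        loss.backward()

class NativeMixedPrecisionPlugin(PrecisionPlugin):
    precision = 16

    def __init__(self):
        self.scaler = torch.cuda.amp.GradScaler()

    def backward(self, loss: torch.Tensor) -> None:
        # scale before backward so fp16 gradients do not underflow
        self.scaler.scale(loss).backward()
```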
Adrian Wälchli
692f77b8a7
Refactor LightningDataParallel ( #5670 )
...
* module
* fix model access
* scalar conversion
* refactor
* kwargs
* auto unsqueeze
* refactor code duplication
* clean up
* docs
* update dp docs
* changelog
* generalize test
* test
* rename
* warning cache
* isort
* unsqueezing test
* device
* device
* scalar test
* device
* device
* include coverage of overrides
* clear
* add deprecation test
* docs
* improve coverage
* increase coverage
* fix merge
* extend test
* rename base class
* mention the predict method in docs
* combine iteration over collection
* remove override
* move
* line
* Apply suggestions from code review
* fix running stage
* f401
* fix cyclic import
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2021-01-31 06:08:16 -05:00
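A sketch of the "scalar conversion / auto unsqueeze" steps mentioned in the bullets: DataParallel gathers replica outputs along a batch dimension, so 0-dim results such as a bare loss must gain one. The helper name is hypothetical.

```python
import torch

def unsqueeze_scalar_outputs(output):
    if isinstance(output, (int, float)):
        output = torch.tensor(output)  # plain numbers become tensors
    if isinstance(output, torch.Tensor) and output.dim() == 0:
        output = output.unsqueeze(0)   # give gather a dim to concatenate on
    return output
```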
Justus Schock
5d239ccd70
Base classes for accelerator refactoring ( #5715 )
...
* add basic accelerator class.
Co-Authored with @awaelchi
* Add base plugin class.
Co-authored with @awaelchi
* add basic training type plugin.
Co-Authored with @awaelchi
* add basic precision plugin.
Co-Authored with @awaelchi
* Add missing inits.
Co-authored with @awaelchi
* pep8
Co-authored-by: @awaelchi
* ignore flake8
* coverage omit
* imports in init
* lost
* imports
* flake8
* .
* .
* chlog
* Update pytorch_lightning/plugins/training_type/training_type_plugin.py
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
* Update pytorch_lightning/plugins/training_type/training_type_plugin.py
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
* Update pytorch_lightning/plugins/training_type/training_type_plugin.py
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
* Update pytorch_lightning/plugins/training_type/training_type_plugin.py
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
* Update pytorch_lightning/plugins/training_type/training_type_plugin.py
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
* Update pytorch_lightning/plugins/training_type/training_type_plugin.py
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
* Update pytorch_lightning/plugins/training_type/training_type_plugin.py
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2021-01-30 14:55:28 -05:00
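A rough skeleton, assumed from the bullets above, of the base classes introduced in this PR; the real interfaces are larger than shown.

```python
class Plugin:
    """Shared base for all plugins."""

class TrainingTypePlugin(Plugin):
    """Owns process/device topology: setup, model wrapping, reduction."""

    def setup(self, model):
        raise NotImplementedError

class PrecisionPlugin(Plugin):
    """Owns numeric precision: backward, casting, gradient clipping."""

    def backward(self, loss):
        loss.backward()
```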
chaton
3da28fd634
[feat] 1/2 Add trainer.predict ( #5579 )
...
* start adding predict
* add predict
* resolve test
* add predict
* remove limit_predict
* update
* add test for predict
* typo
* update on comments
* remove predict_step
* update ddp_sharded
* check ddp_sharded
* resolve on comments
* resolve isort
* update dp
* add test dp 1 gpu
* made default forward
* resolve path
* resolve bug
* update on comments
* resolve doc
* resolve bug
* update
* resolve bug
* update on comments
* resolve pep8
* update test doc
* update on comments
* solve special tests
* resolve bug
* resolve flake8
* Update pytorch_lightning/callbacks/progress.py
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
* Update pytorch_lightning/trainer/trainer.py
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
* add predict to LightningModule
* missing predict
* typo
* rename is_prediction to _predicting
* add
* update
* update
* update doc
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
2021-01-27 11:38:14 -05:00
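A minimal usage sketch of the new API; the module and loader are placeholders, and signature details from this era may differ slightly.

```python
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

class LitModel(pl.LightningModule):  # placeholder module
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 2)

    def forward(self, batch):
        # a default-collated TensorDataset yields batches as [tensor]
        x = batch[0] if isinstance(batch, (list, tuple)) else batch
        return self.layer(x)

loader = DataLoader(TensorDataset(torch.randn(16, 4)), batch_size=8)
trainer = pl.Trainer()
predictions = trainer.predict(LitModel(), dataloaders=loader)  # per-batch outputs
```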
Jirka Borovec
7e2e874d95
Refactor: legacy accelerators and plugins ( #5645 )
...
* tests: legacy
* legacy: accel
* legacy: plug
* fix imports
* mypy
* flake8
2021-01-26 20:04:36 -05:00