Commit Graph

263 Commits

Author SHA1 Message Date
ananthsub fa41c588f4
Remove ProfilerConnector class (#7654)
* Remove ProfilerConnector class

* Update trainer.py

* Update CHANGELOG.md

* Update trainer.py

* Update trainer.py

* tests
2021-05-24 08:58:15 -07:00
shuyingsunshine21 299f2c481b
FSDP with full state dict (#7487)
* Fix some test errors

* checkpoint consolidation

* Update ddp_spawn.py

* Update test_metric_result_integration.py

* Update test_results.py

* Update utils.py

* Update utils.py

* Update test_all_gather_grad.py

* Update test_all_gather_grad.py

* Update test_results.py

* Revert "Update test_results.py"

This reverts commit 9d4a2b891d.

* Revert "Merge pull request #1 from shuyingsunshine21/shuyingsunshine21-checkpoint_consolidate"

This reverts commit c5053da789, reversing
changes made to 0d23d75bc9.

* Revert "Update test_all_gather_grad.py"

This reverts commit 0d23d75bc9.

* Revert "Update utils.py"

This reverts commit 70fe5da9c6.

* Revert "Update utils.py"

This reverts commit a9aae99f6e.

* Revert "Update test_results.py"

This reverts commit ea74906878.

* Revert "Update test_metric_result_integration.py"

This reverts commit bf70e431b3.

* Revert "Update ddp_spawn.py"

This reverts commit f17210183b.

* Revert "checkpoint consolidation"

This reverts commit 536c1323b0.

* Revert "Revert "checkpoint consolidation""

This reverts commit 3a9fde915a.

* Revert "Revert "Revert "checkpoint consolidation"""

This reverts commit 7a369f47e1.

* Revert "Revert "Update ddp_spawn.py""

This reverts commit 8222dc98ea.

* Revert "Revert "Update test_metric_result_integration.py""

This reverts commit 6c095b2370.

* Revert "Revert "Update test_results.py""

This reverts commit 250d0aaaa2.

* Revert "Revert "Update utils.py""

This reverts commit 8651d54d79.

* Revert "Revert "Update test_all_gather_grad.py""

This reverts commit dcdcd29731.

* modify distributed environment to make test pass

* fix version for ddp plugin test

* fix

* fix

* changelog

* Update CHANGELOG.md

* fsdp with full state dict

* fix missing import

* modify unit test

* fix

* fix

* fix typo

* modify test and add changelog

* fix

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* limit max_epoch to 1 for testing

* test

* fix

* update

* testing remove special for multi gpu

* assert gpu

* add assertion for gpu

* fix

* Re-enable special test, use ModelCheckpoint

* Fix paths

* Fix path passing

* test

* test

* fix test

* fix

* pre-commit format

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: SeanNaren <sean@grid.ai>
2021-05-24 08:11:45 +01:00
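For context on "full state dict" here, a minimal sketch, assuming fairscale's `FullyShardedDataParallel` (the library backing Lightning's fully sharded plugin at this time) and an already-initialized process group: calling `state_dict()` on the wrapped module gathers the full, unsharded parameters so an ordinary checkpoint can be saved.

```python
# Sketch only: assumes torch.distributed is initialized (e.g. via a
# distributed launcher) and fairscale is installed; the path is illustrative.
import torch
from fairscale.nn import FullyShardedDataParallel as FSDP

model = FSDP(torch.nn.Linear(32, 2).cuda())
full_state = model.state_dict()  # gathers the full (unsharded) parameters
if torch.distributed.get_rank() == 0:
    torch.save(full_state, "full_checkpoint.pt")
```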
Carlos Mocholí 3d4dd28bec
Replace `CallbackHookNameValidator` with `FxValidator` [3/n] (#7627)
* Refactor FxValidator

* Fix tests

* Fix tests

* Class attribute

* Fix tests

* Better error message

* Fix tests

* Update pytorch_lightning/trainer/connectors/logger_connector/fx_validator.py
2021-05-21 11:54:16 +01:00
Carlos Mocholí 901b2bac98
Unify `current_fx_name` and `current_hook_fx_name` [2/n] (#7594)
* Minor logger connector cleanup [1/n]

* Missing line

* Address comments

* Rely on validator

* Unify `current_fx_name` and `current_hook_fx_name`

* Fix test
2021-05-19 20:31:06 +00:00
ananthsub b4e28e7169
[feat] Add stronger validation for checkpoint_callback argument (#7539)
* [feat] Add stronger validation for checkpoint_callback configuration

* chlog

* Update callback_connector.py

* Update test_model_checkpoint.py

* Update pytorch_lightning/trainer/connectors/callback_connector.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Update pytorch_lightning/trainer/connectors/callback_connector.py

* Update tests/checkpointing/test_model_checkpoint.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Update CHANGELOG.md

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-05-19 19:38:08 +00:00
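The stronger validation targets contradictory configurations like the one sketched below; per the PR title this should now fail fast (presumably with Lightning's usual `MisconfigurationException`) rather than being silently ignored:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# checkpointing disabled, yet a ModelCheckpoint callback is passed:
# a contradictory setup the new validation is meant to reject
trainer = Trainer(checkpoint_callback=False, callbacks=[ModelCheckpoint()])
```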
Carlos Mocholí 76ff600898
Minor logger connector cleanup [1/n] (#7590)
* Minor logger connector cleanup [1/n]

* Missing line

* Address comments

* Rely on validator
2021-05-19 19:25:32 +00:00
Nic Eggert f4f51e0dcf
Add kubeflow cluster environment (#7300)
* Add kubeflow cluster environment

* Add KubeflowEnvironment to docs

* Add KubeflowEnvironment to the changelog

* break up a long line

* Add method to detect kubeflow environment

* Select Kubeflow environment when available

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Run pre-commit

* task_idx == 0

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-05-17 09:05:24 +01:00
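A rough sketch (not the actual implementation) of what a cluster environment like the new `KubeflowEnvironment` does: derive the rendezvous settings from the environment variables a Kubeflow `PyTorchJob` sets on each replica. Method names below only loosely mirror Lightning's `ClusterEnvironment` interface.

```python
import os

class KubeflowLikeEnvironment:
    # illustrative only; not guaranteed to match the real class exactly
    def master_address(self) -> str:
        return os.environ["MASTER_ADDR"]

    def master_port(self) -> int:
        return int(os.environ["MASTER_PORT"])

    def world_size(self) -> int:
        return int(os.environ["WORLD_SIZE"])

    def global_rank(self) -> int:
        return int(os.environ["RANK"])
```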
Adrian Wälchli 6e6e29af49
remove trainer hidden state | sanity refactor [2 / n] (#7507) 2021-05-17 08:57:15 +01:00
Alan Du 6ac16ff348
Fix DistribType for `ddp_cpu` (spawn) (#7492) 2021-05-14 20:53:26 +01:00
Rohit Gupta 7ca41734da
Add `dataloader_idx` to batch transfer hooks (#6241)
* replace with kwargs

* chlog

* fix

* add test

* fix

* device

* deepspeed

* pep

* optional

* docs

* bc

* comments

* pep

* mypy

* pep

* Apply suggestions from code review

* kwargs

* docs

* .

* .

* 1.3 -> 1.4

* kwargs -> step_kwargs
2021-05-13 23:03:55 +05:30
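Per the PR title, each batch transfer hook now also receives the index of the dataloader the batch came from; a sketch of the updated signatures (argument order assumed):

```python
import pytorch_lightning as pl

class MultiLoaderModel(pl.LightningModule):
    def on_before_batch_transfer(self, batch, dataloader_idx):
        return batch  # e.g. per-dataloader CPU-side augmentation

    def transfer_batch_to_device(self, batch, device, dataloader_idx):
        return super().transfer_batch_to_device(batch, device, dataloader_idx)

    def on_after_batch_transfer(self, batch, dataloader_idx):
        return batch  # e.g. per-dataloader GPU-side normalization
```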
Adrian Wälchli dd1a17b071
Refactor result handling in training loop (#7506)
* refactor results

* rename dic -> dict

* simplify

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* changelog

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix None check

* chlog wording

* move process_closure_result to the end

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-05-13 09:30:34 +01:00
shuyingsunshine21 8538c1f61e
Accelerator model state dict (#7474)
* Fix some test errors

* checkpoint consolidation

* Update ddp_spawn.py

* Update test_metric_result_integration.py

* Update test_results.py

* Update utils.py

* Update utils.py

* Update test_all_gather_grad.py

* Update test_all_gather_grad.py

* Update test_results.py

* Revert "Update test_results.py"

This reverts commit 9d4a2b891d.

* Revert "Merge pull request #1 from shuyingsunshine21/shuyingsunshine21-checkpoint_consolidate"

This reverts commit c5053da789, reversing
changes made to 0d23d75bc9.

* Revert "Update test_all_gather_grad.py"

This reverts commit 0d23d75bc9.

* Revert "Update utils.py"

This reverts commit 70fe5da9c6.

* Revert "Update utils.py"

This reverts commit a9aae99f6e.

* Revert "Update test_results.py"

This reverts commit ea74906878.

* Revert "Update test_metric_result_integration.py"

This reverts commit bf70e431b3.

* Revert "Update ddp_spawn.py"

This reverts commit f17210183b.

* Revert "checkpoint consolidation"

This reverts commit 536c1323b0.

* Revert "Revert "checkpoint consolidation""

This reverts commit 3a9fde915a.

* Revert "Revert "Revert "checkpoint consolidation"""

This reverts commit 7a369f47e1.

* Revert "Revert "Update ddp_spawn.py""

This reverts commit 8222dc98ea.

* Revert "Revert "Update test_metric_result_integration.py""

This reverts commit 6c095b2370.

* Revert "Revert "Update test_results.py""

This reverts commit 250d0aaaa2.

* Revert "Revert "Update utils.py""

This reverts commit 8651d54d79.

* Revert "Revert "Update test_all_gather_grad.py""

This reverts commit dcdcd29731.

* modify distributed environment to make test pass

* move model state dict to training type plugin

* remove changes

* add changelog

* fixing isort for pre-commit failure

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Address code review

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: SeanNaren <sean@grid.ai>
2021-05-11 16:39:04 +01:00
Adrian Wälchli ad9118f04a
remove trainer hidden state | sanity refactor [1 / n] (#7437)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-05-11 11:09:08 +02:00
shuyingsunshine21 987530cd38
Set `num_nodes` and `sync_batchnorm` from Trainer for manually passed training type plugin (#7026)
Co-authored-by: thomas chaton <thomas@grid.ai>
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-05-08 11:25:51 +00:00
Ethan Harris 45143fd825
Improve val step logging (#7351)
* Fix val step logging

* Add a type

* Fix

* Update CHANGELOG.md
2021-05-07 22:58:03 +00:00
ananthsub 98670c83a9
Deprecate `truncated_bptt_steps` flag on Trainer in favor of same setting on the LightningModule (#7323)
* deprecate-tbptt-trainer

* Update CHANGELOG.md

* Update lightning.py

* test

* Update lightning.py

* Update training_loop.py

* Update training_loop.py

* Update lightning.py

* Update training_loop.py

* Update training_loop.py

* update docs

* Update accelerator.py

* Update accelerator.py

* more docs

* tweaks

* chlog

* comments

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2021-05-05 11:21:00 +01:00
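A sketch of the replacement API implied by the deprecation: the setting moves from the `Trainer` flag onto the `LightningModule` itself.

```python
import pytorch_lightning as pl

class SequenceModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # replaces Trainer(truncated_bptt_steps=2); each batch is split
        # into chunks of 2 time steps for truncated backpropagation
        self.truncated_bptt_steps = 2
```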
Carlos Mocholí 374ff750f5
Pass `current_epoch`/`global_step` as monitor candidates [1/2] (#7344)
* Pass `current_epoch`/`global_step` as monitor candidates

* Formatting

* Fix deprecated test

* Update CHANGELOG
2021-05-04 16:05:40 -04:00
Carlos Mocholí 8c0ea92af2
`TrainerState` refactor [5/5] (#7173)
* `TrainerState` refactor

* flake8

* Update finished check

* Test cleanup

* Fix tests

* Fixes

* Reorder

* flake8

* Update CHANGELOG

* Better docs

* Better docs

* Remove default

* Update tests

* Bad merge
2021-05-04 12:50:56 +02:00
Hemil Desai 82c19e1444
Update LR schedulers only when their corresponding Optimizer is being used (#4868)
* Update LR schedulers only when their corresponding Optimizer is being used.

In the case when optimizer frequencies are specified,
the LR scheduler corresponding to a particular optimizer is updated
only when that optimizer is being used in the training loop or epoch.

* pep8speaks fixes

* Fix failing tests

* Add docs

* PR Feedback

* Apply suggestions from code review

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>

* formatting fix

* PR Feedback - part 2

* More PR feedback

* Apply suggestions from code review

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>

* Add typing imports

* Stronger tests and fixes related to that

* Add more tests plus PR feedback

* Make optimizer_freq_cumsum a cached property

@cached_property is only available from Python 3.8 onward, so the caching had to be implemented manually.

* Fix tests

* Apply suggestions from code review

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Avoid mutable defaults

* Parametrize lr scheduling tests

* PR feedback

* Apply suggestions from code review

* spell

* Apply suggestions from code review

* flake8

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
Co-authored-by: chaton <thomas@grid.ai>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
2021-05-04 09:37:40 +00:00
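A sketch of the configuration this change affects, assuming a GAN-style module with placeholder `generator`/`discriminator` sub-modules: when `"frequency"` entries are given, each scheduler should now step only on the batches where its own optimizer is active.

```python
import torch
import pytorch_lightning as pl

class GAN(pl.LightningModule):
    def configure_optimizers(self):
        gen_opt = torch.optim.Adam(self.generator.parameters(), lr=1e-3)
        dis_opt = torch.optim.Adam(self.discriminator.parameters(), lr=1e-3)
        gen_sched = torch.optim.lr_scheduler.StepLR(gen_opt, step_size=10)
        dis_sched = torch.optim.lr_scheduler.StepLR(dis_opt, step_size=10)
        # gen_opt runs on 1 of every 6 batches, dis_opt on the other 5;
        # gen_sched now advances only when gen_opt actually stepped
        return (
            {"optimizer": gen_opt, "lr_scheduler": gen_sched, "frequency": 1},
            {"optimizer": dis_opt, "lr_scheduler": dis_sched, "frequency": 5},
        )
```

The `optimizer_freq_cumsum` bullet above refers to caching the cumulative sum by hand; a minimal pre-3.8-compatible sketch of that pattern (the surrounding class and attribute names are illustrative):

```python
import numpy as np

class LoopWithFrequencies:
    def __init__(self, optimizer_frequencies):
        self.optimizer_frequencies = optimizer_frequencies
        self._optimizer_freq_cumsum = None

    @property
    def optimizer_freq_cumsum(self):
        # manual stand-in for functools.cached_property (Python 3.8+ only)
        if self._optimizer_freq_cumsum is None:
            self._optimizer_freq_cumsum = np.cumsum(self.optimizer_frequencies)
        return self._optimizer_freq_cumsum
```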
Kaushik B 6d7c6d6403
Update Accelerator Connector for Registry (#7214) 2021-05-03 21:03:21 +00:00
Carlos Mocholí 5af086ab9f
Attach data refactor and tuner bugs [4/n] (#7258)
Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
2021-04-30 13:54:58 +00:00
Adrian Wälchli b9b3fa371f
fix case where an IterableDataset doesn't produce a batch for an epoch (#7294)
* wip

* fix

* add test

* refactor + test

* rm

* formatting

* update changelog

* doc

* docstring

* remove unused import

* Update CHANGELOG.md

Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>

Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
2021-04-30 12:45:55 +00:00
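A minimal reproduction sketch of the fixed edge case: an `IterableDataset` whose iterator can be exhausted immediately, yielding zero batches for an epoch.

```python
from torch.utils.data import IterableDataset

class PossiblyEmpty(IterableDataset):
    def __init__(self, items):
        self.items = items  # may be empty on some epochs/ranks

    def __iter__(self):
        return iter(self.items)  # iter([]) produces no batches at all
```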
Carlos Mocholí a5ac3f8a16
Code cleaning in preparation for #7258 [3/n] (#7262) 2021-04-29 14:40:51 +02:00
Carlos Mocholí bdc4272e99
`_launch` refactor and types [1/n] (#7232) 2021-04-28 17:41:08 +02:00
Vaibhav Balloli ccd87cadfc
Changes resume_from_checkpoint warning to error (#7075)
Co-authored-by: thomas chaton <thomas@grid.ai>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-04-28 15:03:29 +02:00
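Illustration of the behavior change: pointing `resume_from_checkpoint` at a missing file previously only warned and trained from scratch; after this PR it is expected to raise instead.

```python
from pytorch_lightning import Trainer

# a path that does not exist now triggers an error up front rather
# than a warning followed by silent training from scratch
trainer = Trainer(resume_from_checkpoint="lightning_logs/missing.ckpt")
```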
thomas chaton e147127c0e
[feat] Add better support for predict + ddp 2/3 (#7215)
* wip

* update

* update

* update

* update

* update

* typo

* update on comments

* update

* update

* update

* update

* update changelog

* update

* Fix merge

* Fix merge

* move code

* resolve test

* add extra test

* add an extra test

* update on comments

* add typing

* resolve flake8

* Refactor and Docs

* Fix tests

* Fix tests

* Fix tests

* Duplicate

* Fix tests

* resolve bug

* update

* update on comments

* update

* update changelog

* update

* update

* remove tpu

* resolve flake8

* update on comments

* update on comments

* update on comment

* resolve flake8

* add a cpu test for predict

* add None test

* update

* Update CHANGELOG.md

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* resolve tests

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-04-27 08:46:45 -04:00
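A usage sketch for the distributed predict support, where `model` and `predict_loader` are placeholders for a `LightningModule` and a `DataLoader`:

```python
import pytorch_lightning as pl

trainer = pl.Trainer(gpus=2, accelerator="ddp")
predictions = trainer.predict(model, dataloaders=predict_loader)
```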
ananthsub 68eac4d948
Enforce Lightning module as source of truth for automatic optimization (#7130)
* make lightning module source of truth for automatic optimization

* Update configuration_validator.py

* Update model_connector.py

* rm-references

* Update CHANGELOG.md

* Update CHANGELOG.md

Co-authored-by: jirka <jirka.borovec@seznam.cz>
2021-04-26 05:36:26 +00:00
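With the module as the source of truth, the toggle lives on the `LightningModule` rather than on the `Trainer`; a minimal sketch:

```python
import pytorch_lightning as pl

class ManualOptModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # the module property, not a Trainer argument, now decides this
        self.automatic_optimization = False
```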
Kaushik B 44d775fccf
Update error message for ProfilerConnector (#7204)
* Update error message for ProfilerConnector

* Update test
2021-04-25 11:37:21 -07:00
ananthsub b3fe836656
Move metrics_to_scalars to a dedicated utilities file (#7180)
* rm-trainer-logging

* Update CHANGELOG.md

* Update metrics.py

* Update logging.py

* Update metrics.py
2021-04-24 10:25:33 +01:00
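A rough sketch of what the relocated `metrics_to_scalars` utility does; the real implementation may differ (e.g. recursing into nested dicts and validating tensor sizes):

```python
import torch

def metrics_to_scalars_sketch(metrics: dict) -> dict:
    # assumed behavior: unwrap 0-d tensors into plain Python numbers
    return {
        key: value.item() if isinstance(value, torch.Tensor) else value
        for key, value in metrics.items()
    }
```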
Jirka Borovec aa7d3dc6cc
Fix `torchmetrics` compatibility (#7131)
* get_num_classes

* tmp

* fix one test

* fix deprecated tests

* fix deprecate

* pep8

* deprecate 0.3

* wip

* wip

* hack

* branch

* branch

* format

* Apply suggestions from code review

* prune

* rev

* multilabel

* Apply suggestions from code review

* master

* rev

* .

Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>
2021-04-22 20:45:46 +00:00
Carlos Mocholí 33066f8fd9
Add `on_predict_{batch,epoch}_{start,end}` and `Callback.on_predict_{start,end}` (#7141)
* Update hooks typing and predict hooks

* Update CHANGELOG

* Progress

* Progress

* Add back `on_predict_{start,end}`

* Typing and fix

* Update tests/trainer/logging_/test_logger_connector.py

* Update tests/callbacks/test_lambda_function.py
2021-04-22 10:05:28 -04:00
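A sketch of the new callback hooks named in the title; the extra arguments on the batch-level hook are assumed to mirror the existing train/validation variants:

```python
from pytorch_lightning.callbacks import Callback

class PredictMonitor(Callback):
    def on_predict_start(self, trainer, pl_module):
        print("prediction started")

    def on_predict_batch_end(self, trainer, pl_module, outputs, batch,
                             batch_idx, dataloader_idx):
        print(f"finished predict batch {batch_idx}")

    def on_predict_end(self, trainer, pl_module):
        print("prediction finished")
```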
thomas chaton 99b9dfa883
[bugfix] Remove warning for distributed values (#7132)
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: jirka <jirka.borovec@seznam.cz>
2021-04-22 02:14:46 +02:00
Akihiro Nitta 0302b8be32
Disable `lr_scheduler.step()` in manual optimization (#6825)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: thomas chaton <thomas@grid.ai>
2021-04-20 13:00:45 +02:00
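In manual optimization, Lightning no longer steps schedulers for you; both the optimizer and the scheduler are stepped explicitly in `training_step`. A sketch, with the loss computation left as a placeholder helper:

```python
import pytorch_lightning as pl

class ManualModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.automatic_optimization = False

    def training_step(self, batch, batch_idx):
        opt = self.optimizers()
        loss = self.compute_loss(batch)  # placeholder helper
        opt.zero_grad()
        self.manual_backward(loss)
        opt.step()
        self.lr_schedulers().step()  # now entirely user-controlled
```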
Nicki Skafte fbee5a86e7
Correctly reset metric objects in self.log (#7055)
* reset

* fix tests

* fix tests

* Apply suggestions from code review

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>

* move logic

* chglog

* pep8

* Add test

* Improve test

Co-authored-by: Ethan Harris <ewah1g13@soton.ac.uk>
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
2021-04-19 14:48:48 +01:00
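The pattern this fix concerns: logging a `torchmetrics` metric object directly through `self.log`, which Lightning must compute and reset at the epoch boundary (sketch written against the torchmetrics API of this era):

```python
import torchmetrics
import pytorch_lightning as pl

class LitClassifier(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.val_acc = torchmetrics.Accuracy()

    def validation_step(self, batch, batch_idx):
        x, y = batch
        self.val_acc(self(x), y)
        # the Metric object itself is logged; correct epoch-boundary
        # resets of this object are what the fix addresses
        self.log("val_acc", self.val_acc, on_epoch=True)
```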
Carlos Mocholí 898ec8a94a
Create pytorch_lightning/utilities/types.py (#7048) 2021-04-19 14:43:16 +02:00
Kaushik B 832a03af7c
Add Training Type Plugins Registry (#6982)
Co-authored-by: Sean Naren <sean@grid.ai>
Co-authored-by: thomas chaton <thomas@grid.ai>
2021-04-16 18:01:56 +05:30
Adrian Wälchli 67d21609c9
Add Trainer max_time argument + Callback (#6823)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
2021-04-16 13:38:57 +02:00
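Usage sketch for the new stopping condition:

```python
from datetime import timedelta
from pytorch_lightning import Trainer

trainer = Trainer(max_time="00:12:00:00")        # "DD:HH:MM:SS" string form
trainer = Trainer(max_time=timedelta(hours=12))  # equivalent timedelta form
```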
Ethan Harris f645df5e9a
Add typings for evaluation_loop.py and remove some dead code (#7015) 2021-04-15 07:36:04 +00:00
CeShine Lee 24d0295ff1
Fix the issue where `gradient_clip_algorithm` had no effect (#6928) 2021-04-14 14:17:06 +05:30
Adrian Wälchli 33cc9fe138
Clean up environment access in plugins (#6941)
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
2021-04-13 20:07:40 +02:00
Ethan Harris b9bc77293b
Fix inconsistent outputs in `on_*_end` and `*_end` (#6969)
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
2021-04-13 15:16:21 +01:00
ananthsub e891ceb836
Remove evaluation loop legacy dict returns for `*_epoch_end` hooks (#6973)
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-04-13 12:37:54 +01:00
ananthsub 968ac091c0
Remove hardcoding of rank_zero_only.rank in accelerator connector (#6878) 2021-04-08 12:56:59 +05:30
Kaushik B a17c027ea1
Update sync_dist warning for multiple processes (#6790) 2021-04-06 16:57:43 +02:00
Anthony Kim 7f6154fcad
Add `Trainer(gradient_clip_algorithm='value'|'norm')` (#6123)
* add changelog

* add clip by value

* fix bug in training tricks.rst

* fix bug in trainer.rst

* Update trainer.rst

* Update trainer.rst

* Update CHANGELOG.md

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/plugins/precision/deepspeed_precision.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/utilities/enums.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* yapf formatting

* update training tricks

* update based on comment

* update based on comment

* Update pytorch_lightning/trainer/trainer.py

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>

* update based on comment

* pep8

* mypy

* mypy

* Update docs/source/advanced/training_tricks.rst

Co-authored-by: thomas chaton <thomas@grid.ai>

* Update sharded_native_amp.py

* Update test_sharded_parity.py

* update test codes

* Update test_tpu.py

* Update pytorch_lightning/trainer/connectors/training_trick_connector.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Update test_trainer.py

* Update enums.py

* Update enums.py

* add super-class initialization to precision plugins.

* add clip_grad horovod cpu test

* add clip_grad horovod cpu test

* use subprocess check_call

* change order of horovod tests

* set max_epochs 2 in horovod test

* remove clip_grad_val test from horovod-cpu

* remove "type: ignore"

* divide clip grad val test in horovod

* update based on comments

* add super-class initialization to precision plugins.

* bugfix

* bugfix

* revert some changes

* revert some changes

* Update tests/models/test_horovod.py

* merge master

* Delete signature test

No point in testing a signature

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: thomas chaton <thomas@grid.ai>
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
2021-04-06 08:27:37 -05:00
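Usage sketch combining the new flag with the existing clip value:

```python
from pytorch_lightning import Trainer

# the default clips gradients by norm; the new flag selects clip-by-value
trainer = Trainer(gradient_clip_val=0.5, gradient_clip_algorithm="value")
```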
Kaushik B cf8e828559
[Fix] TPU Training Type Plugin (#6816) 2021-04-06 15:02:44 +05:30
Carlos Mocholí 0dd2deebea
Remove legacy support for the magic `log`/`progress_bar` keys in dict returns (#6734) 2021-03-31 00:28:04 +02:00
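The supported replacement for the removed dict keys is `self.log`; a sketch inside a `LightningModule`, with the loss computation as a placeholder:

```python
def training_step(self, batch, batch_idx):
    loss = self.compute_loss(batch)  # placeholder helper
    # replaces the removed `return {"loss": loss, "log": {...},
    # "progress_bar": {...}}` pattern
    self.log("train_loss", loss, prog_bar=True)
    return loss
```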
thomas chaton 1302766f83
DeepSpeed ZeRO Update (#6546)
* Add context to call hook to handle all modules defined within the hook

* Expose some additional parameters

* Added docs, exposed parameters

* Make sure we only configure if necessary

* Set up activation checkpointing regardless, saving the user from having to do it manually

* Add some tests that fail currently

* update

* update

* update

* add tests

* change docstring

* resolve accumulate_grad_batches

* resolve flake8

* Update DeepSpeed to use latest version, add some comments

* add metrics

* update

* Small formatting fixes, clean up some code

* Few cleanups

* No need for default state

* Fix tests, add some boilerplate that should move eventually

* Add hook removal

* Add a context manager to handle hook

* Small naming cleanup

* wip

* move save_checkpoint responsibility to accelerator

* resolve flake8

* add BC

* Change recommended scale to 16

* resolve flake8

* update test

* update install

* update

* update test

* update

* update

* update test

* resolve flake8

* update

* update

* update on comments

* Push

* pull

* Update pytorch_lightning/plugins/training_type/deepspeed.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Update pytorch_lightning/plugins/training_type/deepspeed.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* update

* Apply suggestions from code review

* Swap to using world size defined by plugin

* update

* update todo

* Remove deepspeed from extra, keep it in the base cuda docker install

* Push

* pull

* update

* update

* update

* update

* Minor changes

* duplicate

* format

* format2

Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
2021-03-30 13:39:02 -04:00
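A hedged usage sketch for the updated DeepSpeed integration; the argument names (`stage`, `cpu_offload`) reflect the plugin of this era and may have changed since:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.plugins import DeepSpeedPlugin

trainer = Trainer(
    gpus=4,
    precision=16,  # ZeRO offloading is typically paired with 16-bit precision
    plugins=DeepSpeedPlugin(stage=2, cpu_offload=True),
)
```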
Carlos Mocholí 90444706b2
Remove logger_connector legacy code (#6733) 2021-03-30 12:33:33 +02:00
Kaushik B f79a13e495
[Model Parallel] Add configure sharded model hook (#6679)
* Add base hook for model parallel

* fix callback signature

* Simplify hook

* Add hook logic

* add tests

* add property setter

* add logic for being called once

* Update changelog

* Fix

* fix return type

* fix lambda callback test

* Fix tests

* Apply code suggestions

* add logic for setup_optimizers_predispatch

* add common dummy model

* Swap call order

* Remove test that isn't needed anymore

* Update tests

* Add a bit more doc

* Few code review fixes

* Update pytorch_lightning/accelerators/accelerator.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Change hook name

* Fix test

* Test setup hook, refactor names

* Swap call order of callbacks and model initialization

* Change name of context manager

Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-03-29 14:50:51 -06:00
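A sketch of the new hook: layer construction moves out of `__init__` into `configure_sharded_model`, so a model-parallel plugin can instantiate the layers directly under its sharding context.

```python
import torch
import pytorch_lightning as pl

class BigModel(pl.LightningModule):
    def configure_sharded_model(self):
        # created here (rather than in __init__) so the plugin can shard
        # the layers as they are instantiated
        self.block = torch.nn.Sequential(
            torch.nn.Linear(32, 32),
            torch.nn.ReLU(),
            torch.nn.Linear(32, 2),
        )
```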