Commit Graph

2959 Commits

Author SHA1 Message Date
Kaushik B 27eb0035ca
Increase TPU Check timeout (#7706)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-05-26 01:44:29 +00:00
Carlos Mocholí d26953c8bc
Add `ModelPruning(prune_on_train_epoch_end)` to choose when to apply pruning (#7704)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-05-26 00:57:56 +02:00
Xinyao(Alvin) Sun 7e2f7e956b
fix: improve UserWarning message (#7685)
* fix: improve UserWarning message
when both overfit and training dataloader shuffling are enabled

fixes issue: #7656

* chore: update changelog

* Polish userwarning msg in pytorch_lightning/trainer/data_loading.py

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>

* shuffling typo

* Update CHANGELOG.md

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-05-25 17:35:15 +00:00
Kaushik B e7057d5898
Add `should_rank_save_checkpoint` property to Training Plugins (#7684)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-05-25 23:02:05 +05:30
Carlos Mocholí a1c40f3207
Remove on epoch guard from the should stop validation check (#7701)
* Remove on epoch guard from the should stop validation check

* Formatting
2021-05-25 15:59:42 +01:00
Carlos Mocholí e2ead9abd7
Refactor some loops code and hook tests (#7682) 2021-05-25 13:27:54 +02:00
Carlos Mocholí 8ba6304c73
Increment the total batch idx before the accumulation early exit (#7692)
* Increment the total batch idx before the accumulation early exit

* Update CHANGELOG
2021-05-25 10:23:40 +02:00
Carlos Mocholí 8b01497e42
Fix global step update when the epoch is skipped (#7677)
* Fix global step update when the epoch is skipped

* Update CHANGELOG

* Move test
2021-05-24 17:36:56 +01:00
Kaushik B 3f460b150a
Move parameter validation specific to TPU Training plugins (#7415)
* Move parameter validation specific to TPU Training plugins

* update docstring
2021-05-24 16:02:01 +00:00
ananthsub fa41c588f4
Remove ProfilerConnector class (#7654)
* Remove ProfilerConnector class

* Update trainer.py

* Update CHANGELOG.md

* Update trainer.py

* Update trainer.py

* tests
2021-05-24 08:58:15 -07:00
Gyeongjae Choi a54bc5dba3
Fix progress bar print error when called before training (#7674)
* Check progress bar existence before printing

* Add tests for predict_progress_bar

* Add tests for progress_bar printing without training

* Update changelog
2021-05-24 17:33:28 +02:00
Carlos Mocholí 2103b5efc9
Move sync code from step result to lightning module [6/n] (#7651) 2021-05-24 13:13:55 +01:00
Xinyao(Alvin) Sun 0c958c5a1f
Fix dataloaders are not reset when tuning the model (#7566)
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-05-24 10:21:45 +02:00
shuyingsunshine21 299f2c481b
FSDP with full state dict (#7487)
* Fix some test errors
Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

* checkpoint consolidation

* Update ddp_spawn.py

* Update test_metric_result_integration.py

* Update test_results.py

* Update utils.py

* Update utils.py

* Update test_all_gather_grad.py

* Update test_all_gather_grad.py

* Update test_results.py

* Revert "Update test_results.py"

This reverts commit 9d4a2b891d.

* Revert "Merge pull request #1 from shuyingsunshine21/shuyingsunshine21-checkpoint_consolidate"

This reverts commit c5053da789, reversing
changes made to 0d23d75bc9.

* Revert "Update test_all_gather_grad.py"

This reverts commit 0d23d75bc9.

* Revert "Update utils.py"

This reverts commit 70fe5da9c6.

* Revert "Update utils.py"

This reverts commit a9aae99f6e.

* Revert "Update test_results.py"

This reverts commit ea74906878.

* Revert "Update test_metric_result_integration.py"

This reverts commit bf70e431b3.

* Revert "Update ddp_spawn.py"

This reverts commit f17210183b.

* Revert "checkpoint consolidation"

This reverts commit 536c1323b0.

* Revert "Revert "checkpoint consolidation""

This reverts commit 3a9fde915a.

* Revert "Revert "Revert "checkpoint consolidation"""

This reverts commit 7a369f47e1.

* Revert "Revert "Update ddp_spawn.py""

This reverts commit 8222dc98ea.

* Revert "Revert "Update test_metric_result_integration.py""

This reverts commit 6c095b2370.

* Revert "Revert "Update test_results.py""

This reverts commit 250d0aaaa2.

* Revert "Revert "Update utils.py""

This reverts commit 8651d54d79.

* Revert "Revert "Update test_all_gather_grad.py""

This reverts commit dcdcd29731.

* modify distributed environment to make test pass

* fix version for ddp plugin test

* fix

* fix

* changelog

* Update CHANGELOG.md

* fsdp with full state dict

* fix missing import

* modify unit test

* fix

* fix

* fix typo

* modify test and add changelog

* fix

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* limit max_epoch to 1 for testing

* test

* fix

* update

* testing remove special for multi gpu

* assert gpu

* add assertion for gpu

* fix

* Re-enable special test, use ModelCheckpoint

* Fix paths

* Fix path passing

* test

* test

* fix test

* fix

* pre-commit format

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: SeanNaren <sean@grid.ai>
2021-05-24 08:11:45 +01:00
Xinyao(Alvin) Sun 01109cdf0c
Fix/mismatched toggle optimizer (#7563)
* fix: avoid potential mismatched toggling of optimizer
Refs #7405

chore: update CHANGELOG

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

fix: resolve a conflict

chore: update changelog

* feat: add a test that fails in master

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix typo in tests/trainer/optimization/test_multiple_optimizers.py

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>

* Polish tests/trainer/optimization/test_multiple_optimizers.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Polish tests/trainer/optimization/test_multiple_optimizers.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* fix: change placeholder in optimizer_step from positional args to keyword args

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-05-23 04:30:28 +02:00
shuyingsunshine21 2242423b75
refactor accelerator teardown -> training type plugin teardown (#7579) 2021-05-22 13:19:24 -07:00
Carlos Mocholí a8d9b5f783
Remove tbptt `self.log` flags and other dead code [5/n] (#7644) 2021-05-22 01:13:00 +00:00
Carlos Mocholí 33a1f5271f
[2/N] Define dataclasses for progress tracking (#7574)
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
2021-05-22 03:09:08 +02:00
Yifu Wang 8d6e2ff7b2
Improve argument validation for validate(), test(), and predict() (#7605)
Co-authored-by: Yifu Wang <yifuwang@2012@gmail.com>
2021-05-21 09:03:16 -07:00
ananthsub f6d892ac21
[feat] Support custom filesystems in LightningModule.to_torchscript (#7617)
* [feat] Support custom filesystems in LightningModule.to_torchscript

* Update CHANGELOG.md

* Update test_torchscript.py

* Update test_torchscript.py

* Update CHANGELOG.md

* Update test_torchscript.py
2021-05-21 11:23:15 +00:00
Carlos Mocholí e8a46bee15
Remove `Result(minimize)` parameter [4/n] (#7628) 2021-05-21 12:58:52 +02:00
Carlos Mocholí 603ef2cf7f
Use `trainer.call_hook` in the evaluation loop (#7626) 2021-05-21 11:54:52 +01:00
Carlos Mocholí 3d4dd28bec
Replace `CallbackHookNameValidator` with `FxValidator` [3/n] (#7627)
* Refactor FxValidator

* Fix tests

* Fix tests

* Class attribute

* Fix tests

* Better error message

* Fix tests

* Update pytorch_lightning/trainer/connectors/logger_connector/fx_validator.py
2021-05-21 11:54:16 +01:00
i-aki-y 7eafd8eac6
Add run_name argument to the MLFlowLogger constructor (#7622)
* Add run_name argument to the MLFlowLogger

* Update CHANGELOG

* Fix unnecessary line

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix style by using yapf

* Fix import error when mlflow is not installed

* Update CHANGELOG.md

* Update tests/loggers/test_mlflow.py

Co-authored-by: akiyuki ishikawa <aki.y.ishikwa@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-05-21 09:17:32 +01:00
ananthsub 94ef17ce77
Update model_checkpoint.py (#7625) 2021-05-20 23:16:18 +02:00
Andrew Tritt 92cf396de2
Override `broadcast_object_list` for `torch<1.8` (#7592)
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-05-20 08:29:55 +00:00
Yifu Wang ed271905cf
Clear predict_progress_bar in ProgressBar.__getstate__ (#7608)
Co-authored-by: Yifu Wang <yifuwang@2012@gmail.com>
Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-05-20 01:38:49 +00:00
ananthsub 8266b141ba
[feat] Support time-based checkpointing during training (#7515)
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-05-19 22:14:13 +00:00
ananthsub 9f5d4955b6
[1/N] Define dataclasses for progress tracking (#6603)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-05-19 21:02:20 +00:00
Carlos Mocholí 901b2bac98
Unify `current_fx_name` and `current_hook_fx_name` [2/n] (#7594)
* Minor logger connector cleanup [1/n]

* Missing line

* Address comments

* Rely on validator

* Unify `current_fx_name` and `current_hook_fx_name`

* Fix test
2021-05-19 20:31:06 +00:00
Carlos Mocholí dbea5bb710
Add typing to `ModelPruning` callback (#7529)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-05-19 22:01:42 +02:00
Jan-Henrik Lambrechts 608de6abf4
TensorBoardLogger sub_dir parameter for grouping logs (#6195)
* fixed a small typo

* cleaning up

* added sub_dir argument to tensorboard and wrote test

* sub dir arg exclusively for tensorboard, linted

* resolving merge conflict

* resolved merge conflict

* resolved merge conflict

* resolved merge conflict

* resolve merge conflict before revert

* resolving merge conflict

* reverted to pre-lint

* added tensorboard sub_dir test

* pep8 formatting

* removed sub_dir arg from test_all function:

* updated feature description

* typo in doc description

* updated CHANGELOG

* Update pytorch_lightning/loggers/tensorboard.py

Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>

* swapped argument position

* added expandvars tests

* added expandvars

* removed model init

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix tests

* fix failed test

* Revert "fix failed test"

This reverts commit 50b34c66da.

* add env var to test

* fix typo in tests

* fix tests

* for test consistency

* fix typo

* fix typo 2

Co-authored-by: Ubuntu <azureuser@devhenrik.evuifrmjd4lepbj4relcwwu5va.ax.internal.cloudapp.net>
Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Roger Shieh <sh.rog@protonmail.ch>
2021-05-19 19:50:58 +00:00
ananthsub b4e28e7169
[feat] Add stronger validation for checkpoint_callback argument (#7539)
* [feat] Add stronger validation for checkpoint_callback configuration

* chlog

* Update callback_connector.py

* Update test_model_checkpoint.py

* Update pytorch_lightning/trainer/connectors/callback_connector.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Update pytorch_lightning/trainer/connectors/callback_connector.py

* Update tests/checkpointing/test_model_checkpoint.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Update CHANGELOG.md

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-05-19 19:38:08 +00:00
Carlos Mocholí 76ff600898
Minor logger connector cleanup [1/n] (#7590)
* Minor logger connector cleanup [1/n]

* Missing line

* Address comments

* Rely on validator
2021-05-19 19:25:32 +00:00
TOKUNAGA Hiroyuki 20f63377f8
Fix the condition for calling update_learning_rates (#7032)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-05-17 17:20:42 +02:00
Adrian Wälchli 502adbced3
refactor optimizer loop logic for manual and automatic optimization (#7526)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
2021-05-17 14:42:01 +02:00
Kaushik B bf46730d92
Support TPU Pod Training (n/n) (#7296) 2021-05-17 11:33:44 +00:00
Nic Eggert f4f51e0dcf
Add kubeflow cluster environment (#7300)
* Add kubeflow cluster environment

* Add KubeflowEnvironment to docs

* Add KubeflowEnvironment to the changelog

* break up a long line

* Add method to detect kubeflow environment

* Select Kubeflow environment when available

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Run pre-commit

* task_idx == 0

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-05-17 09:05:24 +01:00
Adrian Wälchli 6e6e29af49
remove trainer hidden state | sanity refactor [2 / n] (#7507) 2021-05-17 08:57:15 +01:00
Mauricio Villegas d0081778f8
Enable fsspec by default for cli config file (#7521)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-05-17 08:53:00 +01:00
Alan Du 6ac16ff348
Fix DistribType for `ddp_cpu` (spawn) (#7492) 2021-05-14 20:53:26 +01:00
Rohit Gupta 7ca41734da
Add `dataloader_idx` to batch transfer hooks (#6241)
* replace with kwargs

* chlog

* fix

* add test

* fix

* device

* deepspeed

* pep

* optional

* docs

* bc

* comments

* pep

* mypy

* pep

* Apply suggestions from code review

* kwargs

* docs

* .

* .

* 1.3 -> 1.4

* kwargs -> step_kwargs
2021-05-13 23:03:55 +05:30
Carlos Mocholí a584196abf
Default `seed_everything(workers=True)` in the `LightningCLI` (#7504) 2021-05-13 12:18:03 +02:00
Adrian Wälchli dd1a17b071
Refactor result handling in training loop (#7506)
* refactor results

* rename dic -> dict

* simplify

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* changelog

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix None check

* chlog wording

* move process_closure_result to the end

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-05-13 09:30:34 +01:00
Jirka Borovec 298f9e5c2d
Prune deprecated utils modules (#7503)
* argparse_utils

* model_utils

* warning_utils

* xla_device_utils

* chlog

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-05-13 07:24:42 +00:00
Jirka Borovec 946aee0c7b
prune data parallel (#7510) 2021-05-13 06:23:02 +01:00
Carlos Mocholí 072ad52b6b
Add `trainer.predict(ckpt_path)` (#7430)
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2021-05-13 01:49:58 +02:00
Jirka Borovec d4ec75164c
Prune deprecated trainer attributes (#7501)
* use_single_gpu

* use_horovod

* use_ddp2

* use_ddp

* use_dp

* on_gpu

* use_tpu

* on_tpu

* on_cpu

* cleaning

* chlog

* Apply suggestions from code review

* Apply suggestions from code review
2021-05-12 20:10:15 +00:00
Jirka Borovec 96981091c7
Prune deprecated classif. metrics (#7499)
* stat_scores_multiple_classes

* precision_recall

* precision

* recall

* auc

* auroc

* multiclass_auroc

* iou

* clean-up

* chlog

* flake8

* imports

* prune
2021-05-12 18:03:34 +00:00
Jirka Borovec 140b0c727e
Prune deprecated trainer attributes 2 (#7502)
* accelerator_backend

* get_model

* clean

* chlog

* flake8
2021-05-12 10:19:30 -07:00
Carlos Mocholí 83283fdb20
Fix yapf-isort conflict (#7500) 2021-05-12 15:44:57 +02:00
Federico Simonetta 8cdbd03d02
MLFlow now uses env variable as default tracking uri (#7457)
* Clarify logger flag

Clarify behavior of boolean values on the logger flag for Trainer.

* Update docs/source/common/trainer.rst

* doc

* MLFlow now uses env variable as default tracking uri

Solves https://github.com/PyTorchLightning/pytorch-lightning/issues/6894

* Update pytorch_lightning/loggers/mlflow.py

Co-authored-by: thomas chaton <thomas@grid.ai>

* changelog

Co-authored-by: SpontaneousDuck <kennywitham4@gmail.com>
Co-authored-by: Ethan Harris <ewah1g13@soton.ac.uk>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: thomas chaton <thomas@grid.ai>
Co-authored-by: jirka <jirka.borovec@seznam.cz>
2021-05-12 11:26:57 +02:00
Christopher Ehmann b9a52fa2ef
added stage param to LightningDataModule.setup example (#7483)
Co-authored-by: Sileadim <christopher@omnius.com>
2021-05-11 23:43:22 +05:30
shuyingsunshine21 8538c1f61e
Accelerator model state dict (#7474)
* Fix some test errors
Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

* checkpoint consolidation

* Update ddp_spawn.py

* Update test_metric_result_integration.py

* Update test_results.py

* Update utils.py

* Update utils.py

* Update test_all_gather_grad.py

* Update test_all_gather_grad.py

* Update test_results.py

* Revert "Update test_results.py"

This reverts commit 9d4a2b891d.

* Revert "Merge pull request #1 from shuyingsunshine21/shuyingsunshine21-checkpoint_consolidate"

This reverts commit c5053da789, reversing
changes made to 0d23d75bc9.

* Revert "Update test_all_gather_grad.py"

This reverts commit 0d23d75bc9.

* Revert "Update utils.py"

This reverts commit 70fe5da9c6.

* Revert "Update utils.py"

This reverts commit a9aae99f6e.

* Revert "Update test_results.py"

This reverts commit ea74906878.

* Revert "Update test_metric_result_integration.py"

This reverts commit bf70e431b3.

* Revert "Update ddp_spawn.py"

This reverts commit f17210183b.

* Revert "checkpoint consolidation"

This reverts commit 536c1323b0.

* Revert "Revert "checkpoint consolidation""

This reverts commit 3a9fde915a.

* Revert "Revert "Revert "checkpoint consolidation"""

This reverts commit 7a369f47e1.

* Revert "Revert "Update ddp_spawn.py""

This reverts commit 8222dc98ea.

* Revert "Revert "Update test_metric_result_integration.py""

This reverts commit 6c095b2370.

* Revert "Revert "Update test_results.py""

This reverts commit 250d0aaaa2.

* Revert "Revert "Update utils.py""

This reverts commit 8651d54d79.

* Revert "Revert "Update test_all_gather_grad.py""

This reverts commit dcdcd29731.

* modify distributed environment to make test pass

* modify model state dict to training type plugin

* remove changes

* add changelog

* fixing isort for pre-commit failure

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Address code review

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: SeanNaren <sean@grid.ai>
2021-05-11 16:39:04 +01:00
Justus Schock 7b283e3c46
Bugfix/Multiple dataloaders (#7433)
* Update supporters.py

* Update apply_func.py

* Update supporters.py

* Update model_train_dataloaders.py

* Update model_train_steps.py

* Update test_dataloaders.py

* Update CHANGELOG.md

* Update model_train_steps.py

* Update test_dataloaders.py

* Update test_dataloaders.py

* Update supporters.py

* Update test_supporters.py

* Apply suggestions from code review

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update tests/trainer/test_dataloaders.py

Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>

* Apply suggestions from code review

Co-authored-by: Edgar Riba <edgar.riba@gmail.com>

* Update supporters.py

* Update supporters.py

* Apply suggestions from code review

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
Co-authored-by: Edgar Riba <edgar.riba@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-05-11 16:33:29 +02:00
ananthsub fdf50a5e4b
Mark certain Trainer APIs as protected (#7420) 2021-05-11 11:53:41 +02:00
Adrian Wälchli ad9118f04a
remove trainer hidden state | sanity refactor [1 / n] (#7437)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-05-11 11:09:08 +02:00
David Fidalgo 4a1134db64
Log epoch metrics before firing the `on_evaluation_end` hook (#7272)
* Log epoch metrics before firing the `on_evaluation_end` hook (addresses #7166)

* test that epoch metrics are logged before `on_evaluation_end` hook

* update CHANGELOG

* Shorter test

Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-05-11 10:54:31 +02:00
Carlos Mocholí b65ae79478
Automatically check `DataModule.has_{setup,teardown,prepare_data}` [2/2] (#7238)
* Automatically check `DataModule.has_{setup,teardown,prepare_data}`

* Use variable

* Spacing

* Docs

* Update CHANGELOG

* Remove `_DataModuleWrapper`

* Add test

* Update docs/source/extensions/datamodules.rst

* Bad merge

* add test for invalid name

* Remove ValueError

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-05-11 10:53:00 +02:00
Adrian Wälchli 6bc616d78f
fix display bug (#7395) 2021-05-10 11:26:15 +08:00
shuyingsunshine21 987530cd38
Set `num_nodes` and `sync_batchnorm` From Trainer for Manually Passed Training Type Plugin (#7026)
Co-authored-by: thomas chaton <thomas@grid.ai>
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-05-08 11:25:51 +00:00
Akihiro Nitta 710b144b9b
Restore `trainer.current_epoch` after tuning (#7434)
* Add a test

* Save and restore current_epoch

* Update CHANGELOG

* alphabetical order
2021-05-08 07:15:52 +02:00
Ethan Harris 45143fd825
Improve val step logging (#7351)
* Fix val step logging

* Add a type

* Fix

* Update CHANGELOG.md
2021-05-07 22:58:03 +00:00
ananthsub f9e050c5e5
Move DP warning suppression to the DataParallel Plugin (#7421) 2021-05-07 23:02:44 +02:00
ananthsub fecce50355
Deprecate TrainerModelHooksMixin (#7422)
* Deprecate TrainerModelHooksMixin

* Update CHANGELOG.md

* Update model_hooks.py

* Update model_hooks.py
2021-05-07 13:19:36 -07:00
Carlos Mocholí 8208c330eb
Use `torch.nn.utils.clip_grad_norm_` and add `clip_grad_by_value` support for TPU (#7025)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
2021-05-07 16:41:39 +00:00
Carlos Mocholí 9ba76ce60c
Unify `configure_optimizers` docs (#7399) 2021-05-07 16:10:24 +02:00
Leonard Lausen 98b94b810c
Fix DeepSpeedPlugin with IterableDataset (#7362)
* deepspeed add train_micro_batch_size_per_gpu argument

* Update naming and doc

* Modify to use auto naming convention, add test

* Add iterable tests

* Fix tests, attempt by mocking

* Import correct package

* Fix comparison

* Set as special test

* Remove import

* Add Changelog

Co-authored-by: SeanNaren <sean@grid.ai>
2021-05-07 10:46:03 +01:00
Jirka Borovec 28103c67c2
show must go on (#7413)
* chlog + version

* readme

* .
2021-05-06 19:06:21 -04:00
Jirka Borovec b181b8c646
release 1.3.0 (#7404)
* v1.3.0

* ci event

* chlog

* badge

* formatting
2021-05-06 15:05:35 -04:00
Gyeongjae Choi d9bdc56b6a
Add _gpus_arg_default in argparse_utils for backward compatibility (#7402) 2021-05-06 13:35:12 +00:00
Jirka Borovec d52e0a8f3e
v0.1.3.0rc3 + changelogs (#7388)
* v0.1.3.0rc3

* spaces

* wip

* wip

* wip

* wip

* prune

* wip

* wip

* Apply suggestions from code review

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-05-06 07:28:10 -04:00
Martin Kristiansen c3fc0313ef
Updating docs and error message: half precision not available on CPU (#7384)
* Updating docs and error message to specify that half precision is not available on CPU

* update messages

Co-authored-by: Martin Kristiansen <martinkristiansen@sixgill.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: jirka <jirka.borovec@seznam.cz>
2021-05-06 09:05:50 +00:00
Carlos Mocholí 6ad05d3338
Update `configure_optimizers` docs (#7390)
* Update `configure_optimizers` docs

* Update pytorch_lightning/core/lightning.py
2021-05-06 10:39:01 +02:00
ananthsub 651f93a69f
Add documentation for ways to access all batch outputs for on_train_epoch_end hook (#7389)
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2021-05-05 22:18:45 +00:00
ananthsub 7b45bcfedb
[2/2] Remove outputs from evaluation epoch end hooks (#7338)
* Remove outputs from on_train_epoch_end

* iterate

* Update callback_hook.py

* update

* early stop?

* fix

* Update pytorch_lightning/trainer/training_loop.py

Co-authored-by: Ethan Harris <ewah1g13@soton.ac.uk>

* Update trainer.py

* update

* Update training_loop.py

* early stop?

* fix

* Remove outputs from evaluation epoch end hooks

* update

* Update test_remove_1-5.py

* fix lints

* Update base.py

* rm-outputs

* Update evaluation_loop.py

* try-save-more-memory

* Update trainer.py

* Update trainer.py

* cache-at-start

* Update evaluation_loop.py

* Update training_loop.py

* Update training_loop.py

Co-authored-by: Ethan Harris <ewah1g13@soton.ac.uk>
2021-05-05 19:50:58 +00:00
ananthsub 6104a6316a
[1/2] Deprecate `outputs` in `on_train_epoch_end` hooks (#7339)
* Remove outputs from on_train_epoch_end

* iterate

* Update callback_hook.py

* update

* Update training_loop.py

* Update test_training_loop.py

* early stop?

* fix

* update tests

* Update test_hooks.py

* Update pytorch_lightning/trainer/callback_hook.py

Co-authored-by: Ethan Harris <ewah1g13@soton.ac.uk>

* Update pytorch_lightning/trainer/training_loop.py

Co-authored-by: Ethan Harris <ewah1g13@soton.ac.uk>

* Update trainer.py

* Update pytorch_lightning/trainer/trainer.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

Co-authored-by: Ethan Harris <ewah1g13@soton.ac.uk>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2021-05-05 17:18:16 +02:00
ananthsub 98670c83a9
Deprecate`truncated_bptt_steps` flag on Trainer in favor of same setting on the LightningModule (#7323)
* deprecate-tbptt-trainer

* Update CHANGELOG.md

* Update lightning.py

* test

* Update lightning.py

* Update training_loop.py

* Update training_loop.py

* Update lightning.py

* Update training_loop.py

* Update training_loop.py

* update docs

* Update accelerator.py

* Update accelerator.py

* more docs

* tweaks

* chlog

* comments

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2021-05-05 11:21:00 +01:00
Kaushik B e21b7a62d7
Add ddp_find_unused_parameters_false to Registry (#7224) 2021-05-04 22:40:00 +00:00
Carlos Mocholí 374ff750f5
Pass `current_epoch`/`global_step` as monitor candidates [1/2] (#7344)
* Pass `current_epoch`/`global_step` as monitor candidates

* Formatting

* Fix deprecated test

* Update CHANGELOG
2021-05-04 16:05:40 -04:00
Ethan Harris 2a740ebe77
Fix support for dataloader with None batches (#7342)
* Fix Dataloader None batch

* Fix Dataloader None batch

* Update CHANGELOG.md

* Fix breaking test

* Address comments
2021-05-04 12:24:03 +00:00
ramonemiliani93 5db832f181
Fix auto scaling mode when calling tune method on trainer. (#7321)
* Add test for non-existing mode, the test should fail if something different from `power` or `binsearch` is passed.

* Add newline.

* Apply fix

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Update tests/tuner/test_scale_batch_size.py

* Update pytorch_lightning/tuner/batch_size_scaling.py

Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
2021-05-04 12:03:51 +00:00
ananthsub 69cf63e2fd
Update trainer.py (#7340) 2021-05-04 11:11:27 +00:00
Carlos Mocholí 8c0ea92af2
`TrainerState` refactor [5/5] (#7173)
* `TrainerState` refactor

* flake8

* Update finished check

* Test cleanup

* Fix tests

* Fixes

* Reorder

* flake8

* Update CHANGELOG

* Better docs

* Better docs

* Remove default

* Update tests

* Bad merge
2021-05-04 12:50:56 +02:00
Adrian Wälchli a6aa1a0f82
make gpus=str in Trainer consistent with command line parsing of string (#6388)
* string gpu input

* update docs

* deprecation warning

* Revert "update docs"

This reverts commit c5f3893413.

* deprecation

* add changelog

* update parser

* update warning

* implement v1.5 behavior ahead of time

* formatting

* set accelerator in test to avoid different warning

* add warning

* remove todo warn

* Update pytorch_lightning/utilities/device_parser.py

Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>

* resolve flake8

Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: tchaton <thomas@grid.ai>
2021-05-04 09:56:27 +00:00
Boris Dayma 2a20102321
fix(wandb): allow custom init args (#6989)
* feat(wandb): allow custom init args

* style: pep8

* fix: get dict args

* refactor: simplify init args

* test: test init args

* style: pep8

* docs: update CHANGELOG

* test: check default resume value

* fix: default value of anonymous

* fix: respect order of parameters

* feat: use look-up table for anonymous

* yapf formatting

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-05-04 09:45:36 +00:00
Hemil Desai 82c19e1444
Update LR schedulers only when their corresponding Optimizer is being… (#4868)
* Update LR schedulers only when their corresponding Optimizer is being used.

In the case when optimizer frequencies are specified,
the LR scheduler corresponding to a particular optimizer is updated
only when that optimizer is being used in the training loop or epoch.

* pep8speak fixes

* Fix failing tests

* Add docs

* PR Feedback

* Apply suggestions from code review

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>

* formatting fix

* PR Feedback - part 2

* More PR feedback

* Apply suggestions from code review

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>

* Add typing imports

* Stronger tests and fixes related to that

* Add more tests plus PR feedback

* Make optimizer_freq_cumsum a cached property

@cached_property is only available after Python 3.8 so had to do it manually.

* Fix tests

* Apply suggestions from code review

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Avoid mutable defaults

* Parametrize lr scheduling tests

* PR feedback

* Apply suggestions from code review

* spell

* Apply suggestions from code review

* flake8

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
Co-authored-by: chaton <thomas@grid.ai>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
2021-05-04 09:37:40 +00:00
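The frequency-based dispatch described in the commit above can be sketched in plain Python. This is a simplified illustration (the function name and structure are mine, not Lightning internals): the cumulative sum of the optimizer frequencies determines which optimizer, and therefore which LR scheduler, is active for a given batch index.

```python
from itertools import accumulate

def active_optimizer_index(batch_idx, frequencies):
    """Return which optimizer (by position) handles `batch_idx`,
    cycling through the optimizers according to their frequencies."""
    bounds = list(accumulate(frequencies))  # e.g. [10, 15] for frequencies [10, 5]
    pos = batch_idx % bounds[-1]            # position within one full cycle
    for index, bound in enumerate(bounds):
        if pos < bound:
            return index
```

With frequencies `[10, 5]`, optimizer 0 (and only its scheduler) is active for batches 0-9, optimizer 1 for batches 10-14, and the cycle repeats.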
Carlos Mocholí 3fdb61ac1b
Replace `_DataModuleWrapper` with `__new__` [1/2] (#7289)
* Remove `_DataModuleWrapper`

* Update pytorch_lightning/core/datamodule.py

* Update pytorch_lightning/core/datamodule.py

* Replace `__reduce__` with `__getstate__`
2021-05-04 08:00:24 +00:00
Leonard Lausen 597b309f2e
Fix `Trainer.plugins` type declaration (#7288)
* Fix trainer.plugins type declaration

* Don't ClusterEnvironment(Plugin)

* fix import error, yapf formatter

* Add test

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-05-04 08:42:57 +02:00
SpontaneousDuck f135debb6a
Clarify logger flag (#7190)
* Clarify logger flag

Clarify behavior of boolean values on the logger flag for Trainer.

* Update docs/source/common/trainer.rst

* doc

Co-authored-by: Ethan Harris <ewah1g13@soton.ac.uk>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
2021-05-04 00:21:28 +00:00
Daniel Mesejo-León 6da747e775
Deprecate `LightningModule.datamodule` reference in favor of the trainer one (#6929) (#7168)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-05-04 00:01:41 +00:00
Adrian Wälchli 3e8db4142b
add forgotten test in #7240 (#7283)
^
2021-05-03 23:56:30 +00:00
Kaushik B 6d7c6d6403
Update Accelerator Connector for Registry (#7214) 2021-05-03 21:03:21 +00:00
ananthsub b7a444883c
Remove model.trainer call inside of dataloading mixin (#7317)
* Update data_loading.py

* Update data_loading.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-05-03 13:53:54 -07:00
Mauricio Villegas 78a6fd5588
Example and documentation for LightningCLI linking model and data arguments (#7299)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-05-03 20:45:46 +00:00
Adrian Wälchli bf1394a472
improve early stopping verbose logging (#6811) 2021-05-03 20:20:48 +00:00
ananthsub 14c552bb92
[bugfix] Fix dataloading for iterable datasets and limit_train_batches (#7306)
* bugfix-dataloading

* rm-logs

* Update CHANGELOG.md

* Update test_dataloaders.py

* Update test_dataloaders.py

* Update training_loop.py

* Update test_dataloaders.py

* Update CHANGELOG.md

* Update CHANGELOG.md

* Update test_dataloaders.py

* Update training_loop.py

* Update training_loop.py

* comments

* address comments

* more tests

* Update progress.py

* Update test_dataloaders.py

* Update test_dataloaders.py

* Update training_loop.py

* Update training_loop.py

* test ckpt fix?

* update again
2021-05-03 19:50:26 +01:00
ananthsub 39274273a4
Update accelerator.py (#7318) 2021-05-03 11:17:26 -04:00
Carlos Mocholí badd0bba30 Move trainer functions (#7295) 2021-05-03 09:26:38 -04:00
Adrian Wälchli e0c64f0ef6
Fix Adagrad optimizer not working with DDP/GPU (#7277)
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: thomas chaton <thomas@grid.ai>
2021-05-03 03:57:17 +05:30
Kaushik B 490cc57809
Device updates for TPU Pod (#7243) 2021-04-30 23:14:06 +05:30
thomas chaton 16d6c9828d
[bugfix] Apex never instantiated. (#7274)
* update

* update

* update apex

* update

* update

* update

* remove test.py

* update

* update

* update on comments

* update changelog

* update

* update

* typo
2021-04-30 13:16:28 -04:00
ananthsub 44fd01734c
Move grad_norm to a dedicated utilities file (#7292)
* rm-grad-norm-mixin

* Update grads.py

* Update CHANGELOG.md

* Apply suggestions from code review

Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Update docstrings

* Update __init__.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-04-30 09:19:22 -07:00
ananthsub e407edba36
[fix] Attach train+val dataloaders to trainer in trainer loop (#7207)
* Update training_loop.py

* Update test_dataloaders.py

* changelog

* delay reload

* go back

* comments

* Update training_loop.py

* Update test_dataloaders.py

* Update tests/trainer/test_dataloaders.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-04-30 09:01:31 -07:00
thomas chaton 80b9ca0e38
[bugfix] Add reloading support using BaseFinetuning (#7253)
* update

* wip

* update

* update

* update

* update

* resolve bug

* update on comments

* update on comments

* update

* update

* formatting

* add comments

* update on comments

* update

* Update pytorch_lightning/callbacks/base.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* update

* update

* Typing and minor changes

* Refactor

* Fix deprecated test

* Broken commit

* Fix broken commit

* flake8

* Update CHANGELOG

* update on comments

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-04-30 11:14:43 -04:00
Carlos Mocholí 5af086ab9f
Attach data refactor and tuner bugs [4/n] (#7258)
Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
2021-04-30 13:54:58 +00:00
Adrian Wälchli ea2287e723
update training type plugin docs regarding result caching (#7261)
* add docs

* typo

* update
2021-04-30 13:03:10 +00:00
Adrian Wälchli b9b3fa371f
fix case where an IterableDataset doesn't produce a batch for an epoch (#7294)
* wip

* fix

* add test

* refactor + test

* rm

* formatting

* update changelog

* doc

* docstring

* remove unused import

* Update CHANGELOG.md

Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>

Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
2021-04-30 12:45:55 +00:00
ananthsub 969e857690
Rename `trainer._launch` to `trainer._run` (#7265)
* rename-run

* fix
2021-04-30 13:39:02 +01:00
Adrian Wälchli 8232de427a
fix save_hyperparameters(container) if container is empty (#7268)
* fix

* add tests

* changelog

* fix test
2021-04-30 13:38:42 +01:00
Kaushik B ac92b57e2b
No need of warning when saved callback_states is None (#7293) 2021-04-30 10:48:53 +00:00
ananthsub 338f5a3311
Remove exp_save_path on the LightningModule (#7266)
* deprecate-exp-save-path

* Update lightning.py

* Update CHANGELOG.md

* remove-not-deprecate
2021-04-29 17:44:04 -04:00
Adrian Wälchli b6706470c1
fix fast_dev_run parsing from cli (#7240) 2021-04-30 01:16:20 +05:30
ananthsub 14b8dd479a
[2/2] Remove training loop force calling early stopping callback (#7069)
* rebase

* doc

* Update training_loop.py

* Update CHANGELOG.md

* Update CHANGELOG.md

* Update CHANGELOG.md

* Update CHANGELOG.md

* Update CHANGELOG.md
2021-04-29 09:14:53 -07:00
Carlos Mocholí a5ac3f8a16
Code cleaning in preparation for #7258 [3/n] (#7262) 2021-04-29 14:40:51 +02:00
thomas chaton 848288c8d8
[warning] Add a warning with missing callback with resume_from_checkpoint (#7254)
* add a warning

* add changelog
2021-04-29 12:39:45 +00:00
George e272bea4dc
Updated `ModelCheckpoint` documentation (#6873)
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-04-28 23:56:58 +00:00
ananthsub 075de9356c
Reset current_fx properties on lightning module in teardown (#7247)
* Update trainer.py

* cleanup module properties in teardown

* Update test_trainer.py

* Update lightning.py

* Formatting

* flake8

* Update pytorch_lightning/trainer/trainer.py

Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-04-28 12:17:20 -07:00
Carlos Mocholí 40f80230fe
Remove `trainer.fit` return value [2/n] (#7237)
* `_fit_impl` refactor and types

* Fix return

* Remove return docstring

* Fixes

* Fixes

* Remove `trainer.fit` return value

* Update CHANGELOG

* flake8

* Undo results change

* Fix test

* Revert changes for a separate PR

* flake8
2021-04-28 19:11:32 +01:00
Carlos Mocholí bdc4272e99
`_launch` refactor and types [1/n] (#7232) 2021-04-28 17:41:08 +02:00
ananthsub 947d1cb757
[1/2] Add support for early stopping during training epoch end (#6944)
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: jirka <jirka.borovec@seznam.cz>
2021-04-28 15:18:56 +02:00
Vaibhav Balloli ccd87cadfc
Changes resume_from_checkpoint warning to error (#7075)
Co-authored-by: thomas chaton <thomas@grid.ai>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-04-28 15:03:29 +02:00
Ethan Harris d123aaa6a1
Update fsspec dependency and remove un-needed code (#7210)
* Update fsspec dep and remove un-needed code

* Remove unused import
2021-04-28 09:10:46 +01:00
Ali Benkassou cbc6e30b5d
Replace 'step' with 'global_step' (#7244) 2021-04-28 06:44:11 +00:00
Kaushik B 94fcaaf5d7
Add `debug` flag to TPU Training Plugins (PT_XLA_DEBUG) (#7219) 2021-04-27 20:34:25 +00:00
thomas chaton e76ebd640e
[feat] Add BasePredictionWriter 3/3 (#7127)
* wip

* update

* update

* update

* update

* update

* typo

* update on comments

* update

* update

* update

* update

* update changelog

* update

* Fix merge

* Fix merge

* move code

* resolve test

* add extra test

* add an extra test

* update on comments

* add typing

* resolve flake8

* Refactor and Docs

* Fix tests

* Fix tests

* Fix tests

* Duplicate

* Fix tests

* resolve bug

* update

* update on comments

* Update pytorch_lightning/utilities/imports.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Update pytorch_lightning/utilities/device_parser.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* update

* update

* update

* update on comments

* resolve flake8

* update test

* Apply suggestions from code review

* update on comments

* Update pytorch_lightning/callbacks/prediction_writer.py

Co-authored-by: Ethan Harris <ewah1g13@soton.ac.uk>

* Update pytorch_lightning/callbacks/prediction_writer.py

Co-authored-by: Ethan Harris <ewah1g13@soton.ac.uk>

* Update pytorch_lightning/callbacks/prediction_writer.py

Co-authored-by: Ethan Harris <ewah1g13@soton.ac.uk>

* update on comments

* update

* update on comment

* Apply suggestions from code review

* update

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
Co-authored-by: Ethan Harris <ewah1g13@soton.ac.uk>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2021-04-27 20:23:55 +00:00
Kaushik B c6d9f52cb3
Add a check for TPU Spawn barrier (#7241) 2021-04-27 19:45:55 +00:00
thomas chaton 5a113a2f05
[bug/feat] Support parameters_to_ignore in DDP (#7239)
* update

* update

* update

* update on comments

* update
2021-04-27 17:49:32 +00:00
Seongmin Park 7fe8d18477
Do not `shuffle` in `LightningDataModule.from_datasets` for `IterableDataset` (#7053)
* Expose shuffle argument in LightningDataModule.from_datasets

* Add test for DataModule initialization with iterable datasets

* Add changelog

* Remove trailing whitespace

* Add more tests for coverage

* Fix sequence dataset coverage

* Fix Sequence dataset tests

* Directly check whether each passed dataset is an IterableDataset

* Expose shuffle argument in LightningDataModule.from_datasets

* Add test for DataModule initialization with iterable datasets

* Add changelog

* Remove trailing whitespace

* Add more tests for coverage

* Fix sequence dataset coverage

* Fix Sequence dataset tests

* Directly check whether each passed dataset is an IterableDataset

* Fix changelog to reflect review direction

* Update CHANGELOG.md

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Fix changelog to reflect review direction (2)

* Add suggested braces

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Reuse isinstance check

* Merged tests with parametrize. Use mocks

Co-authored-by: Seongmin Park <seongmin.park@actionpower.kr>
Co-authored-by: thomas chaton <thomas@grid.ai>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-04-27 12:53:49 -04:00
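The guard described in the commit above boils down to an isinstance check before building the DataLoader. A minimal sketch with stand-in classes (the real code checks against `torch.utils.data.IterableDataset`):

```python
class IterableDataset:
    """Stand-in for torch.utils.data.IterableDataset (illustration only)."""

class MapDataset:
    """Stand-in for a map-style (indexable) dataset."""

def dataloader_kwargs(dataset, batch_size, shuffle=True):
    """Request shuffling only for map-style datasets; DataLoader raises
    a ValueError when shuffle=True is combined with an IterableDataset."""
    return {
        "batch_size": batch_size,
        "shuffle": shuffle and not isinstance(dataset, IterableDataset),
    }
```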
ananthsub bab7225507
[fix] Add barriers before and after setup hook is run (#7202)
* Update data_connector.py

* move-barrier

* Update trainer.py

* Update ddp.py

* changelog

* Spacing

Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-04-27 17:19:43 +01:00
thomas chaton f920ba29f2
[bugfix] Metric not logged properly in manual optimization (#7228)
* resolve bug

* update changelog

* typo

* Update tests/trainer/optimization/test_manual_optimization.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Apply suggestions from code review

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2021-04-27 09:16:51 -04:00
thomas chaton e147127c0e
[feat] Add better support for predict + ddp 2/3 (#7215)
* wip

* update

* update

* update

* update

* update

* typo

* update on comments

* update

* update

* update

* update

* update changelog

* update

* Fix merge

* Fix merge

* move code

* resolve test

* add extra test

* add an extra test

* update on comments

* add typing

* resolve flake8

* Refactor and Docs

* Fix tests

* Fix tests

* Fix tests

* Duplicate

* Fix tests

* resolve bug

* update

* update on comments

* update

* update changelog

* update

* update

* remove tpu

* resolve flake8

* update on comments

* update on comments

* update on comment

* resolve flake8

* add a cpu test for predict

* add None test

* update

* Update CHANGELOG.md

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* resolve tests

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-04-27 08:46:45 -04:00
Carlos Mocholí ca6c87ffbe
Add back `clip_gradients(model)` (#7231) 2021-04-27 11:34:02 +00:00
Adrian Wälchli 3b36d81c03
Fixed `num_sanity_val_steps` affecting reproducibility of training data shuffling (#7014)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: thomas chaton <thomas@grid.ai>
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
2021-04-27 09:51:39 +00:00
Kaushik B 5cf9afa176
Add fairscale install msg for Sharded Plugins (#7213) 2021-04-27 08:22:44 +00:00
shuyingsunshine21 52a5cee0a7
Set smarter default for DDP sharded for performance optimization (#6937)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-04-27 04:01:34 +05:30
ananthsub dd5ec75e48
Deprecate save_function from model checkpoint callback (#7201)
* Update model_checkpoint.py

* Update CHANGELOG.md

* fix-tests

* deprecate not remove

* Update model_checkpoint.py

* Update test_remove_1-5.py
2021-04-26 17:55:26 +01:00
Alessio Bonfiglio ac7d6a35c3
Fix `NeptuneLogger.log_text(step=None)` (#7194) 2021-04-26 15:28:55 +02:00
Kaushik B 6be0a859db
Update teardown for TPU acc (#7211) 2021-04-26 13:30:46 +01:00
ananthsub bc3f08b0e3
[fix] Add barrier to accelerator's teardown (#6814) 2021-04-26 09:23:29 +00:00
ananthsub 68eac4d948
Enforce Lightning module as source of truth for automatic optimization (#7130)
* make lightning module source of truth for automatic optimization

* Update configuration_validator.py

* Update model_connector.py

* rm-references

* Update CHANGELOG.md

* Update CHANGELOG.md

Co-authored-by: jirka <jirka.borovec@seznam.cz>
2021-04-26 05:36:26 +00:00
Kaushik B 44d775fccf
Update Error message for ProfileConnector (#7204)
* Update Error message for ProfileConnector

* Update test
2021-04-25 11:37:21 -07:00
ananthsub 31fcd7d0ab
Deprecate write_predictions on the LightningModule (#7066)
* deprecate-write-predictions

* Update CHANGELOG.md

* Update test_remove_1-5.py

Co-authored-by: thomas chaton <thomas@grid.ai>
2021-04-25 16:54:56 +00:00
ananthsub b3fe836656
Move metrics_to_scalars to a dedicated utilities file (#7180)
* rm-trainer-logging

* Update CHANGELOG.md

* Update metrics.py

* Update logging.py

* Update metrics.py
2021-04-24 10:25:33 +01:00
thomas chaton f58865aada
Properly set `LightningModule.device` after model replacement (#7188)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-04-23 16:36:52 +02:00
Sean Naren 8439aead66
Update FairScale on CI (#7017)
* Try updating CI to latest fairscale

* Update availability of imports.py

* Remove some of the fairscale custom ci stuff

* Update grad scaler within the new process as reference is incorrect for spawn

* Remove fairscale from mocks

* Install fairscale 0.3.4 into the base container, remove from extra.txt

* Update docs/source/conf.py

* Fix import issues

* Mock fairscale for docs

* Fix DeepSpeed and FairScale to specific versions

* Swap back to greater than

* extras

* Revert "extras"

This reverts commit 7353479f

* ci

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: jirka <jirka.borovec@seznam.cz>
2021-04-23 12:37:00 +01:00
Akihiro Nitta 92af363270
Fix `lr_finder` suggesting too high learning rates (#7076)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-04-23 10:59:40 +00:00
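The LR finder's suggestion heuristic picks the learning rate where the loss curve falls fastest; suggestions come out too high when the noisy region just before divergence is not excluded. A rough stand-alone sketch of the idea (the helper name and skip parameters are illustrative, not Lightning's actual implementation):

```python
def suggest_lr(lrs, losses, skip_begin=2, skip_end=2):
    """Pick the learning rate at the steepest descent of the loss curve,
    ignoring the noisy points at both ends of the sweep."""
    best_index, best_slope = None, float("inf")
    for i in range(skip_begin, len(losses) - skip_end - 1):
        slope = losses[i + 1] - losses[i]  # finite-difference slope
        if slope < best_slope:
            best_slope, best_index = slope, i
    return lrs[best_index]
```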
Adrian Wälchli d534e53ec4
add missing predict docs (#7150)
* update docs

* add datamodule predict

* fix docs

* typo
2021-04-23 10:38:44 +00:00
Tharindu Hasthika c502e47abf
Fixed setting of _save_dir when run initiated externally (#7106)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-04-23 01:14:46 +00:00
Jirka Borovec f48ac62334
fix pip install (#7170) 2021-04-22 16:48:11 -04:00
Jirka Borovec aa7d3dc6cc
Fix `torchmetrics` compatibility (#7131)
* get_num_classes

* tmp

* fix one test

* fix deprecated tests

* fix deprecate

* pep8

* deprecate 0.3

* wip

* wip

* HaCK

* branch

* branch

* format

* Apply suggestions from code review

* prune

* rev

* multilabel

* Apply suggestions from code review

* master

* rev

* .

Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>
2021-04-22 20:45:46 +00:00
Jirka Borovec ef5feac7ba
fix version + yapf (#6999) 2021-04-22 18:25:51 +00:00
Carlos Mocholí 33066f8fd9
Add `on_predict_{batch,epoch}_{start,end}` and `Callback.on_predict_{start,end}` (#7141)
* Update hooks typing and predict hooks

* Update CHANGELOG

* Progress

* Progress

* Add back `on_predict_{start,end}`

* Typing and fix

* Update tests/trainer/logging_/test_logger_connector.py

* Update tests/callbacks/test_lambda_function.py
2021-04-22 10:05:28 -04:00
ananthsub 3f1a08ab00
Fix mypy checks for double precision plugin (#7151) 2021-04-22 11:29:38 +01:00
thomas chaton 99b9dfa883
[bugfix] Remove warning for distributed values (#7132)
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: jirka <jirka.borovec@seznam.cz>
2021-04-22 02:14:46 +02:00
Carlos Mocholí 345e9a0245
Fix argparse docs (#7148) 2021-04-22 02:13:00 +02:00
Sean Naren ce14565ed9
[FSDP] Move on save checkpoint outside of zero check (#7134)
* Move on save checkpoint outside of zero check

* Remove unnecessary override

Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-04-22 01:54:47 +02:00
ananthsub 2f84459d26
Broadcast dirpath for tighter consistency in model checkpoint callback (#6978)
* Update model_checkpoint.py

* Update model_checkpoint.py

* Update model_checkpoint.py
2021-04-21 10:20:27 -07:00
thomas chaton 013756404b
[bugfix] Add set_default_tensor_type to torch.DoubleTensor with precision=64 (#7108)
* update

* Update pytorch_lightning/plugins/precision/double.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Update pytorch_lightning/plugins/precision/double.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Update pytorch_lightning/plugins/precision/double.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* resolve tests

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-04-20 15:25:37 +00:00
thomas chaton ca21da4f3b
Move save_hyperparameters to its own function (#7119)
* move hyper_parameters

* Update pytorch_lightning/core/lightning.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Update pytorch_lightning/utilities/parsing.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* resolve flake8

* update

* resolve tests

* Update pytorch_lightning/core/lightning.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-04-20 11:04:35 -04:00
Kaushik B f168a535ca
Add MpModelWrapper in TPU Spawn (#7045)
Co-authored-by: tchaton <thomas@grid.ai>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-04-20 13:05:27 +00:00
Akihiro Nitta 0302b8be32
Disable `lr_scheduler.step()` in manual optimization (#6825)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: thomas chaton <thomas@grid.ai>
2021-04-20 13:00:45 +02:00
thomas chaton 9beec26c3e
[bugfix] Add support for CombinedLoader in validation with ddp (#7102)
* add test

* add changelog

* resolve flake8

* remove print
2021-04-20 08:22:02 +00:00
Adrian Wälchli 67528c4665
Fix attribute error for _gpus_arg_default loading checkpoint prior to 1.2.8 (#7043) 2021-04-20 07:34:03 +00:00
Adrian Wälchli 6b15ca95f0
fix logger experiment version in multiple run DDP (#7077)
* fix

* changelog
2021-04-19 17:12:05 +00:00
Adrian Wälchli d12c6cf2b3
more early stopping options (convergence and divergence threshold) (#6868)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-04-19 16:49:52 +02:00
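The two new early-stopping options reduce to simple comparisons against the monitored value. A hedged sketch of the decision logic (names are illustrative; the real callback also keeps the usual patience-based check):

```python
def check_thresholds(current, mode="min",
                     stopping_threshold=None, divergence_threshold=None):
    """Return True when training should stop early: the monitored value
    either converged past `stopping_threshold` or diverged past
    `divergence_threshold`. `mode` says whether lower is better."""
    lower_is_better = mode == "min"
    if stopping_threshold is not None:
        converged = (current <= stopping_threshold if lower_is_better
                     else current >= stopping_threshold)
        if converged:
            return True
    if divergence_threshold is not None:
        diverged = (current >= divergence_threshold if lower_is_better
                    else current <= divergence_threshold)
        if diverged:
            return True
    return False
```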
Adrian Wälchli 60c1c8fe83
Auto-set `DataLoader.worker_init_fn` with `seed_everything` (#6960)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
2021-04-19 16:28:37 +02:00
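The point of the auto-set `worker_init_fn` above is that every dataloader worker in every process gets a distinct but reproducible seed derived from the global one. A toy sketch of that derivation (the offset scheme here is illustrative; Lightning's actual seed mixing differs):

```python
import random

def derive_worker_seed(base_seed, worker_id, global_rank):
    """Combine the global seed with the worker id and the process rank so
    no two workers share a seed, while staying reproducible per run."""
    return base_seed + worker_id + 1000 * global_rank

def worker_init_fn(worker_id, base_seed=42, global_rank=0):
    # Real code would seed numpy and torch as well.
    random.seed(derive_worker_seed(base_seed, worker_id, global_rank))
```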
Akihiro Nitta d1529c28a1
Optimization docs (#6907)
* .

* .

* Fix link to the section

* Fix link to the section

* Consistent indent

* Update docs

* Apply suggestions from code review

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Add note for optimizer.optimizer

* .

* Update hooks

* Update closure docstring

* Update optimizer methods

* Update optimizer

* Remove manopt + grad clipping (by @flukeskywalker)

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-04-19 10:08:49 -04:00
Adrian Wälchli 2b232d3fbd
fix docs rendering in datamodule (#7064)
* [docs]: add newline to correctly render Example

* whitespace

Co-authored-by: Matthew Sarmiento <matthewcs@me.com>
2021-04-19 10:08:09 -04:00
Carlos Mocholí a5e356adb1
Deprecate `@auto_move_data` in favor of `trainer.predict` (#6993)
* Deprecated `@auto_move_data` in favor of `trainer.predict`

* Update CHANGELOG
2021-04-19 14:53:21 +01:00
Adrian Wälchli e9fca760ac
Set `DistributedSampler` seed if `seed_everything` was called (#7024)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-04-19 14:50:31 +01:00
Nicki Skafte fbee5a86e7
Correctly reset metric objects in self.log (#7055)
* reset

* fix tests

* fix tests

* Apply suggestions from code review

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>

* move logic

* chglog

* pep8

* Add test

* Improve test

Co-authored-by: Ethan Harris <ewah1g13@soton.ac.uk>
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
2021-04-19 14:48:48 +01:00
mlech26l e61daff5cc
Typo LightningMoule -> LightningModule (#7038) 2021-04-19 13:48:44 +01:00
Carlos Mocholí 898ec8a94a
Create pytorch_lightning/utilities/types.py (#7048) 2021-04-19 14:43:16 +02:00
Kaushik B 30b7440e12
TPU Spawn Rank & root device Error (#7074)
* TPU Spawn Rank Error

* Update tpu spawn

* Fix root device property for tpu spawn

* Update changelog
2021-04-18 23:42:48 +02:00
Kaushik B 97be843226
Better approach to register plugins (#7063)
* Better approach to register plugins

* Add ddp_with_find_unused_parameters_false

* Remove unnecessary break

* Revert back the ddp commit

* Update register override logic

* Update register override logic

* fix mypy
2021-04-18 11:23:12 +02:00
thomas chaton 7b0b0d2844
update (#7056) 2021-04-16 21:22:19 +01:00
ananthsub 8bcd169767 [fix] Fix multi-node DDP launch by using local rank instead of global rank for main process (#7061)
* Update ddp.py

* Update CHANGELOG.md
2021-04-16 21:18:54 +01:00
Kaushik B 6a7b4cf5d3
Fix mypy for plugins registry (#7062) 2021-04-17 01:33:41 +05:30
Adrian Wälchli 3fb8eada34
rc2 (#7057) 2021-04-16 20:34:14 +02:00
Kaushik B 832a03af7c
Add Training Type Plugins Registry (#6982)
Co-authored-by: Sean Naren <sean@grid.ai>
Co-authored-by: thomas chaton <thomas@grid.ai>
2021-04-16 18:01:56 +05:30
Adrian Wälchli 67d21609c9
Add Trainer max_time argument + Callback (#6823)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
2021-04-16 13:38:57 +02:00
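`max_time` accepts, among other forms, a duration string. Assuming the documented `"DD:HH:MM:SS"` format, the parsing can be sketched as follows (the function name is mine, not the Trainer's internal helper):

```python
from datetime import timedelta

def parse_max_time(value):
    """Parse a 'DD:HH:MM:SS' duration string into a timedelta."""
    days, hours, minutes, seconds = (int(part) for part in value.split(":"))
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)
```

For example, `Trainer(max_time="00:12:00:00")` would cap training at twelve hours.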
ananthsub 4c07ab5e99
Use PyTorch API logging for Lightning Trainer (#6771)
* Update trainer.py

* Update trainer.py

* Update trainer.py
2021-04-16 00:10:34 +02:00
Carlos Mocholí f29ecbfd90
Typing for accelerators and plugins (#7022) 2021-04-15 16:48:16 +00:00
ananthsub f6f81f0430
[fix] Add a cluster environment teardown to clean up environment state (#6942)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-04-15 16:06:54 +00:00
Mauricio Villegas f852a4f592
Changed basic_examples to use `LightningCLI` (#6862)
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-04-15 15:01:16 +00:00
Ethan Harris f645df5e9a
Add typings for evaluation_loop.py and remove some dead code (#7015) 2021-04-15 07:36:04 +00:00
Edward Brown 5bd3cd5f71
Bugfix/cuda oom detection and handling (#6934)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-04-15 03:22:11 +02:00
Jirka Borovec 895bea1ad3
rename about (#7002)
* rename about

* .

* ..
2021-04-14 18:56:40 -04:00
Adrian Wälchli d3f73a0a74
Plugin Docs (#6952)
Co-authored-by: edenlightning <66261195+edenlightning@users.noreply.github.com>
Co-authored-by: William Falcon <waf2107@columbia.edu>
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
2021-04-14 20:53:21 +00:00
SpontaneousDuck dcff5036a8
Use PickleError base class to detect all pickle errors (#6917)
* Use PickleError base class to detect all pickle errors

* Update changelog with #6917

* Add pickle test for torch ScriptModule

Co-authored-by: Ken Witham <k.witham@kri.neu.edu>
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
2021-04-14 20:24:32 +00:00
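The fix above is about the exception hierarchy: `pickle.PickleError` is the common base of `PicklingError` and `UnpicklingError`, so catching it covers errors that catching `PicklingError` alone would miss. A minimal sketch of such a picklability probe (the helper name is illustrative):

```python
import pickle

def is_picklable(obj):
    """Return True if `obj` can be pickled; catching the PickleError base
    class also covers subclasses raised by custom __reduce__ methods."""
    try:
        pickle.dumps(obj)
        return True
    except pickle.PickleError:
        return False
```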
shuyingsunshine21 03a73b37bc
Train End Error Handling Fix (#6864)
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: Ethan Harris <ewah1g13@soton.ac.uk>
2021-04-14 20:35:42 +02:00
Nicki Skafte 7c5ad1905d
Bugfix for predict progressbar (#6884)
* gating

* tests

* pep8

* changelog
2021-04-14 09:50:36 +01:00
CeShine Lee 24d0295ff1
Fix the `gradient_clip_algorithm` has no effect issue. (#6928) 2021-04-14 14:17:06 +05:30
Adrian Wälchli 33cc9fe138
Clean up environment access in plugins (#6941)
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
2021-04-13 20:07:40 +02:00
Peng Zhang 89074fa2ad
Fix Multi-GPU join for horovod (#6954)
* fixjoin

* fix join on cpu

* fix typo

* try to undo horovod skip

* undo

* Try removing skip

* Update CHANGELOG

* add back skip for test_horovod_multi_optimizer

* Add back skip

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-04-13 17:44:41 +01:00
Carlos Mocholí 15926b462c
Add SWA warning if not running every epoch (#6987)
* Add SWA warning if not running every epoch

* Typo
2021-04-13 18:34:40 +02:00
Ethan Harris b9bc77293b
Fix inconsistent outputs in `on_*_end` and `*_end` (#6969)
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
2021-04-13 15:16:21 +01:00
ananthsub e891ceb836
Remove evaluation loop legacy dict returns for `*_epoch_end` hooks (#6973)
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-04-13 12:37:54 +01:00
Hinrich B. Winther b37b58a73e
Fix Checkpoint issue when using Horovod distributed backend (PyTorchLightning#6947) (#6958)
Co-Authored-By: Adrian Wälchli <aedu.waelchli@gmail.com>

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-04-13 09:18:52 +00:00