Commit Graph

2820 Commits

Author SHA1 Message Date
Carlos Mocholí 1dd61e4e35
Extend support for logging a collection (#7771) 2021-06-01 12:51:50 +01:00
Carlos Mocholí 0dd6d3a798
Avoid adding `None` loss values in `training_epoch_end` (#7772) 2021-05-31 19:28:28 +00:00
Adrian Wälchli 7e6010fc93
fix info message when max training time reached (#7780)
* call time_elapsed

* elapsed formatting

* format

* update test

* changelog
2021-05-31 14:50:16 +02:00
Carlos Mocholí d47173bb72
Use typing forward references (#7770)
* Use typing forward references

* Update pytorch_lightning/core/lightning.py
2021-05-31 09:54:28 +02:00
Carlos Mocholí 5f0863e5e5
Organize trainer properties (#7758)
* Organize trainer properties

* Single quote

* Double quote
2021-05-30 13:09:01 +02:00
Carlos Mocholí bc3238be8c
Remove metric tracking from dev debugger (#7759)
* Remove dev debugger metric tracking

* Fix tests

* Fix test

* Import

* Fix tests

* Fix test

* flake8

* Fix tests
2021-05-30 12:03:42 +02:00
Mauricio Villegas f6b5e3df57
Added save_config_filename init argument to LightningCLI (#7741) 2021-05-28 09:30:16 +02:00
Boris Dayma 9097347ea8
feat(wandb): log models as artifacts (#6231)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-05-27 20:15:02 +02:00
Carlos Mocholí 9304c0df8f
Rename and move Result (#7736)
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
2021-05-27 12:27:52 +00:00
Kaushik B 04dcb1786d
Add `__len__` method to IndexBatchSamplerWrapper (#7681)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-05-26 18:20:13 +02:00
Carlos Mocholí 311d9fe67e
Always run validation inside the training loop epoch (#7357)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-05-26 14:26:48 +02:00
Kaushik B 27eb0035ca
Increase TPU Check timeout (#7706)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-05-26 01:44:29 +00:00
Carlos Mocholí d26953c8bc
Add `ModelPruning(prune_on_train_epoch_end)` to choose when to apply pruning (#7704)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-05-26 00:57:56 +02:00
Xinyao(Alvin) Sun 7e2f7e956b
fix: improve UserWarning message (#7685)
* fix: improve UserWarning message
when both overfit and training dataloader shuffling are enabled

fixes issue: #7656

* chore: update changelog

* Polish userwarning msg in pytorch_lightning/trainer/data_loading.py

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>

* shuffling typo

* Update CHANGELOG.md

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-05-25 17:35:15 +00:00
Kaushik B e7057d5898
Add `should_rank_save_checkpoint` property to Training Plugins (#7684)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-05-25 23:02:05 +05:30
Carlos Mocholí a1c40f3207
Remove on epoch guard from the should stop validation check (#7701)
* Remove on epoch guard from the should stop validation check

* Formatting
2021-05-25 15:59:42 +01:00
Carlos Mocholí e2ead9abd7
Refactor some loops code and hook tests (#7682) 2021-05-25 13:27:54 +02:00
Carlos Mocholí 8ba6304c73
Increment the total batch idx before the accumulation early exit (#7692)
* Increment the total batch idx before the accumulation early exit

* Update CHANGELOG
2021-05-25 10:23:40 +02:00
Carlos Mocholí 8b01497e42
Fix global step update when the epoch is skipped (#7677)
* Fix global step update when the epoch is skipped

* Update CHANGELOG

* Move test
2021-05-24 17:36:56 +01:00
Kaushik B 3f460b150a
Move parameter validation specific to TPU Training plugins (#7415)
* Move parameter validation specific to TPU Training plugins

* update docstring
2021-05-24 16:02:01 +00:00
ananthsub fa41c588f4
Remove ProfilerConnector class (#7654)
* Remove ProfilerConnector class

* Update trainer.py

* Update CHANGELOG.md

* Update trainer.py

* Update trainer.py

* tests
2021-05-24 08:58:15 -07:00
Gyeongjae Choi a54bc5dba3
Fix progress bar print error when called before training (#7674)
* Check progress bar existence before printing

* Add tests for predict_progress_bar

* Add tests for progress_bar printing without training

* Update changelog
2021-05-24 17:33:28 +02:00
Carlos Mocholí 2103b5efc9
Move sync code from step result to lightning module [6/n] (#7651) 2021-05-24 13:13:55 +01:00
Xinyao(Alvin) Sun 0c958c5a1f
Fix dataloaders are not reset when tuning the model (#7566)
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-05-24 10:21:45 +02:00
shuyingsunshine21 299f2c481b
FSDP with full state dict (#7487)
* Fix some test errors

* checkpoint consolidation

* Update ddp_spawn.py

* Update test_metric_result_integration.py

* Update test_results.py

* Update utils.py

* Update utils.py

* Update test_all_gather_grad.py

* Update test_all_gather_grad.py

* Update test_results.py

* Revert "Update test_results.py"

This reverts commit 9d4a2b891d.

* Revert "Merge pull request #1 from shuyingsunshine21/shuyingsunshine21-checkpoint_consolidate"

This reverts commit c5053da789, reversing
changes made to 0d23d75bc9.

* Revert "Update test_all_gather_grad.py"

This reverts commit 0d23d75bc9.

* Revert "Update utils.py"

This reverts commit 70fe5da9c6.

* Revert "Update utils.py"

This reverts commit a9aae99f6e.

* Revert "Update test_results.py"

This reverts commit ea74906878.

* Revert "Update test_metric_result_integration.py"

This reverts commit bf70e431b3.

* Revert "Update ddp_spawn.py"

This reverts commit f17210183b.

* Revert "checkpoint consolidation"

This reverts commit 536c1323b0.

* Revert "Revert "checkpoint consolidation""

This reverts commit 3a9fde915a.

* Revert "Revert "Revert "checkpoint consolidation"""

This reverts commit 7a369f47e1.

* Revert "Revert "Update ddp_spawn.py""

This reverts commit 8222dc98ea.

* Revert "Revert "Update test_metric_result_integration.py""

This reverts commit 6c095b2370.

* Revert "Revert "Update test_results.py""

This reverts commit 250d0aaaa2.

* Revert "Revert "Update utils.py""

This reverts commit 8651d54d79.

* Revert "Revert "Update test_all_gather_grad.py""

This reverts commit dcdcd29731.

* modify distributed environment to make test pass

* fix version for ddp plugin test

* fix

* fix

* changelog

* Update CHANGELOG.md

* fsdp with full state dict

* fix missing import

* modify unit test

* fix

* fix

* fix typo

* modify test and add changelog

* fix

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* limit max_epoch to 1 for testing

* test

* fix

* update

* testing remove special for multi gpu

* assert gpu

* add assertion for gpu

* fix

* Re-enable special test, use ModelCheckpoint

* Fix paths

* Fix path passing

* test

* test

* fix test

* fix

* pre-commit format

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: SeanNaren <sean@grid.ai>
2021-05-24 08:11:45 +01:00
Xinyao(Alvin) Sun 01109cdf0c
Fix/mismatched toggle optimizer (#7563)
* fix: avoid potential mismatched toggling of optimizer
Refs #7405

chore: update CHANGELOG

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

fix: resolve a conflict

chore: update changelog

* feat: add a test that fails in master

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix typo in tests/trainer/optimization/test_multiple_optimizers.py

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>

* Polish tests/trainer/optimization/test_multiple_optimizers.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Polish tests/trainer/optimization/test_multiple_optimizers.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* fix: change placeholder in optimizer_step from positional args to keyword args

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-05-23 04:30:28 +02:00
shuyingsunshine21 2242423b75
refactor accelerator teardown -> training type plugin teardown (#7579) 2021-05-22 13:19:24 -07:00
Carlos Mocholí a8d9b5f783
Remove tbptt `self.log` flags and other dead code [5/n] (#7644) 2021-05-22 01:13:00 +00:00
Carlos Mocholí 33a1f5271f
[2/N] Define dataclasses for progress tracking (#7574)
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
2021-05-22 03:09:08 +02:00
Yifu Wang 8d6e2ff7b2
Improve argument validation for validate(), test(), and predict() (#7605)
Co-authored-by: Yifu Wang <yifuwang@2012@gmail.com>
2021-05-21 09:03:16 -07:00
ananthsub f6d892ac21
[feat] Support custom filesystems in LightningModule.to_torchscript (#7617)
* [feat] Support custom filesystems in LightningModule.to_torchscript

* Update CHANGELOG.md

* Update test_torchscript.py

* Update test_torchscript.py

* Update CHANGELOG.md

* Update test_torchscript.py
2021-05-21 11:23:15 +00:00
Carlos Mocholí e8a46bee15
Remove `Result(minimize)` parameter [4/n] (#7628) 2021-05-21 12:58:52 +02:00
Carlos Mocholí 603ef2cf7f
Use `trainer.call_hook` in the evaluation loop (#7626) 2021-05-21 11:54:52 +01:00
Carlos Mocholí 3d4dd28bec
Replace `CallbackHookNameValidator` with `FxValidator` [3/n] (#7627)
* Refactor FxValidator

* Fix tests

* Fix tests

* Class attribute

* Fix tests

* Better error message

* Fix tests

* Update pytorch_lightning/trainer/connectors/logger_connector/fx_validator.py
2021-05-21 11:54:16 +01:00
i-aki-y 7eafd8eac6
Add run_name argument to the MLFlowLogger constructor (#7622)
* Add run_name argument to the MLFlowLogger

* Update CHANGELOG

* Fix unnecessary line

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix style by using yapf

* Fix import error when mlflow is not installed

* Update CHANGELOG.md

* Update tests/loggers/test_mlflow.py

Co-authored-by: akiyuki ishikawa <aki.y.ishikwa@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-05-21 09:17:32 +01:00
ananthsub 94ef17ce77
Update model_checkpoint.py (#7625) 2021-05-20 23:16:18 +02:00
Andrew Tritt 92cf396de2
Override `broadcast_object_list` for `torch<1.8` (#7592)
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-05-20 08:29:55 +00:00
Yifu Wang ed271905cf
Clear predict_progress_bar in ProgressBar.__getstate__ (#7608)
Co-authored-by: Yifu Wang <yifuwang@2012@gmail.com>
Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-05-20 01:38:49 +00:00
ananthsub 8266b141ba
[feat] Support time-based checkpointing during training (#7515)
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-05-19 22:14:13 +00:00
ananthsub 9f5d4955b6
[1/N] Define dataclasses for progress tracking (#6603)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-05-19 21:02:20 +00:00
Carlos Mocholí 901b2bac98
Unify `current_fx_name` and `current_hook_fx_name` [2/n] (#7594)
* Minor logger connector cleanup [1/n]

* Missing line

* Address comments

* Rely on validator

* Unify `current_fx_name` and `current_hook_fx_name`

* Fix test
2021-05-19 20:31:06 +00:00
Carlos Mocholí dbea5bb710
Add typing to `ModelPruning` callback (#7529)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-05-19 22:01:42 +02:00
Jan-Henrik Lambrechts 608de6abf4
TensorBoardLogger sub_dir parameter for grouping logs (#6195)
* fixed a small typo

* cleaning up

* added sub_dir argument to tensorboard and wrote test

* sub dir arg exclusively for tensorboard, linted

* resolving merge conflict

* resolved merge conflict

* resolved merge conflict

* resolved merge conflict

* resolve merge conflict before revert

* resolving merge conflict

* reverted to pre-lint

* added tensorboard sub_dir test

* pep8 formatting

* removed sub_dir arg from test_all function:

* updated feature description

* typo in doc description

* updated CHANGELOG

* Update pytorch_lightning/loggers/tensorboard.py

Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>

* swapped argument position

* added expandvars tests

* added expandvars

* removed model init

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix tests

* fix failed test

* Revert "fix failed test"

This reverts commit 50b34c66da.

* add env var to test

* fix typo in tests

* fix tests

* for test consistency

* fix typo

* fix typo 2

Co-authored-by: Ubuntu <azureuser@devhenrik.evuifrmjd4lepbj4relcwwu5va.ax.internal.cloudapp.net>
Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Roger Shieh <sh.rog@protonmail.ch>
2021-05-19 19:50:58 +00:00
ananthsub b4e28e7169
[feat] Add stronger validation for checkpoint_callback argument (#7539)
* [feat] Add stronger validation for checkpoint_callback configuration

* chlog

* Update callback_connector.py

* Update test_model_checkpoint.py

* Update pytorch_lightning/trainer/connectors/callback_connector.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Update pytorch_lightning/trainer/connectors/callback_connector.py

* Update tests/checkpointing/test_model_checkpoint.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Update CHANGELOG.md

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-05-19 19:38:08 +00:00
Carlos Mocholí 76ff600898
Minor logger connector cleanup [1/n] (#7590)
* Minor logger connector cleanup [1/n]

* Missing line

* Address comments

* Rely on validator
2021-05-19 19:25:32 +00:00
TOKUNAGA Hiroyuki 20f63377f8
Fix the condition for calling update_learning_rates (#7032)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-05-17 17:20:42 +02:00
Adrian Wälchli 502adbced3
refactor optimizer loop logic for manual and automatic optimization (#7526)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
2021-05-17 14:42:01 +02:00
Kaushik B bf46730d92
Support TPU Pod Training (n/n) (#7296) 2021-05-17 11:33:44 +00:00
Nic Eggert f4f51e0dcf
Add kubeflow cluster environment (#7300)
* Add kubeflow cluster environment

* Add KubeflowEnvironment to docs

* Add KubeflowEnvironment to the changelog

* break up a long line

* Add method to detect kubeflow environment

* Select Kubeflow environment when available

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Run pre-commit

* task_idx == 0

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-05-17 09:05:24 +01:00
Adrian Wälchli 6e6e29af49
remove trainer hidden state | sanity refactor [2 / n] (#7507) 2021-05-17 08:57:15 +01:00