Commit Graph

7802 Commits

Author SHA1 Message Date
Dan Dale acaeab27f6
Fix GPU tests that fail to raise expected configuration error when run in a CUDA environment (#14983) 2022-10-04 18:40:55 -04:00
thomas chaton b936fd4380
[app] Add CloudCompute ID serializable within the flow and works state (#14819) 2022-10-04 19:46:44 +00:00
Sherin Thomas 53694eb93d
[App/Improvement] Cleaning up Queue abstraction (#14977)
2022-10-04 22:07:31 +05:30
Ethan Harris ce919ee7d6
Fix commands and API test (#14947) 2022-10-04 15:38:40 +00:00
geoffrey-g-delhomme 9832d36851
Fix `ReduceLROnPlateau` update issue while resuming from a checkpoint (#14702)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-10-04 11:55:51 +00:00
Kishan Savant c059db446e
Remove the deprecated device_stats_monitor_prefix_keys (#14890)
* Remove the deprecated device_stats_monitor_prefix_keys

* Added pr no to changelog.md

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: otaj <6065855+otaj@users.noreply.github.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
2022-10-03 17:13:02 +00:00
DP c764221615
fixes typing errors in rich_progress.py (#14963) 2022-10-03 14:11:18 +00:00
Adam J. Stewart 09a8001923
Trainer: fix support for non-distributed PyTorch (#14971)
* Trainer: fix non-distributed use
* Update CHANGELOG
2022-10-03 13:15:07 +00:00
Carlos Mocholí 3028fd287d
Fix TPU test CI (#14926)
* Fix TPU test CI

* +x first

* Lite first to uncover errors faster

* Fixes

* One more

* Simplify XLALauncher wrapping to avoid pickle error

* debug

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Debug commit successful. Trying local definitions

* Require tpu for mock test

* ValueError: The number of devices must be either 1 or 8, got 4 instead

* Fix mock test

* Simplify call, rely on defaults

* Skip OSError for now. Maybe upgrading will help

* Simplify launch tests, move some to lite

* Stricter typing

* RuntimeError: Accessing the XLA device before processes have spawned is not allowed.

* Revert "RuntimeError: Accessing the XLA device before processes have spawned is not allowed."

This reverts commit f65107ebf3.

* Alternative boring solution to the reverted commit

* Fix failing test on CUDA machine

* Workarounds

* Try latest mkl

* Revert "Try latest mkl"

This reverts commit d06813aa67.

* Wrong exception

* xfail

* Mypy

* Comment change

* Spawn launch refactor

* Accept that we cannot lazy init now

* Fix mypy and launch test failures

* The base dockerfile already includes mkl-2022.1.0 - what if we use it?

* try a different mkl version

* Revert mkl version changes

Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
2022-10-03 09:13:33 -04:00
otaj e290c206c9
Bump version of fsspec (#14975)
fsspec verbump
2022-10-03 09:53:15 +00:00
Jerome Anand e62521caf1
Update hpu mixed precision link (#14974)
Signed-off-by: Jerome <janand@habana.ai>
2022-10-03 09:05:17 +02:00
Carlos Mocholí be7bfdba27
Remove unused gcsfs dependency (#14962) 2022-10-01 16:08:36 +00:00
otaj 511a070c52
Find last checkpoints on restart (#14907)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-09-30 20:14:18 +00:00
Ziyad Sheebaelhamd db26e087e7
Close profiler when `StopIteration` is raised (#14945) 2022-09-30 19:29:12 +00:00
Adrian Wälchli d7af8ce2a5
Simplify root node resolution for SLURM environment (#14912)
Co-authored-by: Seppo Enarvi <seppo.git@marjaniemi.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-09-30 15:40:43 +00:00
Adrian Wälchli cd9247a782
Introduce primitives for input/output dtype conversion in Lite Precision (#14792)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
Co-authored-by: otaj <6065855+otaj@users.noreply.github.com>
2022-09-30 15:29:03 +00:00
Andres Algaba 3daa4c9cc0
Remove deprecated on_init_start_end (#14867)
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
Co-authored-by: otaj <6065855+otaj@users.noreply.github.com>
2022-09-30 15:11:38 +00:00
Pritam Soni 2721a2f06b
feat: option to add custom meta tags to the UI container (#14915) 2022-09-30 18:56:57 +05:30
Carlos Mocholí fd2779e55f
Fix fork skip condition in GitHub workflows (#14955)
Co-authored-by: otaj <6065855+otaj@users.noreply.github.com>
2022-09-30 08:30:47 -04:00
Mauricio Villegas 15aa9c679d
An instance of SaveConfigCallback should only save the config once (#14927) 2022-09-30 12:16:37 +00:00
Masahiro Wada abea29bfa3
Move type annotation into __init__ (#14943)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-09-30 11:03:03 +00:00
Lee Jungwon a9142d637a
Fix mypy typing errors in pytorch_lightning/trainer/trainer.py (#14204)
Co-authored-by: otaj <ota@lightning.ai>
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-09-30 10:50:42 +00:00
Akihiro Nitta 021c2f1447
Fix typo in checkgroup.yml (#14959)
Fix typo
2022-09-30 10:12:06 +00:00
Carlos Mocholí 6256a318d7
Refactor launching tests to use our launchers (#14954) 2022-09-30 09:57:18 +02:00
Akihiro Nitta e47d5a2376
CI: Combine conda and full testing into a single workflow (#14387)
* Remove conda job

* Remove conda job from readme

* Remove conda jobs from checkgroup

* Remove conda from docker builds

* Remove base-conda dockerfile

* Rewrite the strategy matrix while keeping equivalent

* Run the workflow on this branch

* Revert "Rewrite the strategy matrix while keeping equivalent"

This reverts commit e54298d60e57cffbf8107890987be3fe4a006c77.

* Add PyTorch versions

* Run on draft and disable unrelated costly CI

* Revert "Run the workflow on this branch"

This reverts commit 51ed8b905d8926b630dce4817124bd486135d3ec.

* tmp: Lightweight relevant CI

* Fix CI pathfilter

* Update matrix

* Drop skipping logic

* pip list

* reorder pip list

* tmp: lightweight ci

* Install specified pytorch

* Fix torch installation

* Uncomment steps

* Increase timeout

* bad merge

* Revert "Run on draft and disable unrelated costly CI"

This reverts commit eb5dc5e6bd.

* Update checkgroup

* Update docs and remove Python/PyTorch versions

* Remove pip-list

* Fail if wrong pytorch version installed

* Add Python 3.8, PyTorch 1.9 job

* tmp: remove azure jobs

* tmp: remove dockers

* tmp: remove others

* Run all combinations

* Include oldest

* Exclude no Python 3.10 distributions

* tmp: no concurrency

* tmp: double timeout

* Add pytest log reporter

* Add pytest-reportlog

* Fewer jobs

* Revert "tmp: no concurrency"

This reverts commit 4a7978dcb3.

* fix artifact name

* Revert test reports

* Revert unrelated changes

* Revert unrelated changes

* Add the combination of ex-conda jobs

* Update checkgroup

* revert timeout

* remove conda job

* revert docker build workflow file

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2022-09-29 22:39:04 -04:00
Atharva Phatak fdcb5cc90b
Hydra changes to lightning-lite (#14950)
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
2022-09-29 21:59:35 -04:00
Jirka Borovec f9ef19f108
Run CI helpers' doctests in a workflow (#14498)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
2022-09-30 01:56:56 +02:00
Kishan Savant 1e5411b143
Removed the deprecated datamodule_checkpointhooks (#14909)
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
Co-authored-by: otaj <6065855+otaj@users.noreply.github.com>
2022-09-29 22:31:58 +00:00
Aliaksandr Kuzmik 4c43e57b6f
Comet.ml logger - add usage tracking (#14906)
Co-authored-by: Aliaksandr.Kuzmik <AliaksandrK@comet.ml>
2022-09-29 21:10:54 +00:00
Adrian Wälchli c8059d4464
Update quick start guide with latest info (#14880)
Co-authored-by: thomas chaton <thomas@grid.ai>
2022-09-29 20:54:20 +00:00
Suyash Sonawane 72ac4b592f
Fixed docstring for unwatch method (#14920) 2022-09-29 19:20:42 +00:00
Tianshu Wang 485ab5e0de
Fix wandb `save_dir` is not overridden by `None` `dir` when using CLI (#14878)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2022-09-29 19:20:07 +00:00
Prince Canuma 04aaf83901
Fix MissingFieldException in offline mode (#14919)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: otaj <6065855+otaj@users.noreply.github.com>
2022-09-29 18:47:51 +00:00
Adrian Wälchli 498cb60417
Fairscale integration tests for Lite (#14921)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-09-29 17:46:49 +00:00
Adrian Wälchli 822a7f50af
Align ddp and ddp-spawn strategies in setting up the environment (#11073)
Co-authored-by: Kushashwa Ravi Shrimali <kushashwaravishrimali@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-09-29 19:30:09 +02:00
Rohit Gupta 3a70e5dbcb
Call `LightningDataModule.load_state_dict` hook while restoring checkpoint using `LightningDataModule.load_from_checkpoint` (#14883) 2022-09-29 16:55:59 +00:00
Ethan Harris 93e802afc2
Simplify bug report template (#14925)
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
Co-authored-by: edenlightning <66261195+edenlightning@users.noreply.github.com>
2022-09-29 16:49:45 +00:00
Adrian Wälchli d8e90f6581
Fairscale import updates (#14721)
* fairscale imports
* refactor to avoid meta package build issue

Co-authored-by: Jirka <jirka.borovec@seznam.cz>
Co-authored-by: thomas chaton <thomas@grid.ai>
2022-09-29 16:45:27 +00:00
Adrian Wälchli 5b446aec4d
DeepSpeed integration tests for Lite (#14901) 2022-09-29 16:39:32 +00:00
Kaushik B 0abdd80104
Prepare v1.8.0rc0 (#14918) 2022-09-29 18:00:25 +02:00
Carlos Mocholí 6e70f55f00
Clean up CODEOWNERS for PL and Lite (#14942)
* Clean up CODEOWNERS for PL and Lite

* Update
2022-09-29 10:17:05 -04:00
Carlos Mocholí b8cc4525bd
Skip CircleCI trigger for forks (#14930) 2022-09-29 10:16:37 -04:00
Carlos Mocholí 7893eb259a
Prepare CI to run on 3090s (#14910) 2022-09-29 14:01:59 +00:00
Carlos Mocholí 4c53eae0f4
Self-review of the recent Trainer changes (#14916) 2022-09-29 13:59:16 +00:00
Carlos Mocholí 4eb7766f3c
Make internal torchscript check a class attribute (#14904) 2022-09-29 13:40:25 +00:00
otaj 5f0c4aad12
Introduce `ckpt_path="hpc"` keyword for checkpoint loading (#14911)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-09-29 12:45:51 +00:00
Adrian Wälchli ff3c5b7b9d
Docs section for SLURM troubleshooting (#14873)
Co-authored-by: Laverne Henderson <laverne.henderson@coupa.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2022-09-29 12:41:31 +00:00
Adrian Wälchli a45c047b38
Remove deprecated LightningIPUModule (#14830)
* Remove deprecated LightningIPUModule
* chlog
* fix import
* Fix 1.10 depr test

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-09-29 13:07:45 +01:00
Masahiro Wada d377d0efde
Fix type hints of tuner/batch_size_scaling.py (#13518)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: otaj <ota@lightning.ai>
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
2022-09-29 12:00:42 +00:00
Jerome Anand 136d57312d
Upgrade HPU image to release 1.6.1 (#14932) 2022-09-29 11:22:27 +00:00