7875 Commits

Author SHA1 Message Date
Carlos Mocholí 69fee71f22
Trim flaky amp test (#15051) 2022-10-10 13:49:37 +02:00
Max Ehrlich 5a3007cd6c
Support Slurm Autorequeue for Array Jobs (#15040)
Signed-off-by: Max Ehrlich <max.ehr@gmail.com>
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
2022-10-10 13:43:57 +02:00
Mauricio Villegas ddfcddbd1c
LightningCLI add --config option after parser __init__ (#15048) 2022-10-10 11:32:08 +00:00
Adrian Wälchli 8f90084059
Remove deprecated on_load/save_checkpoint behavior (#14835)
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2022-10-10 11:08:13 +00:00
Carlos Mocholí 0b04aa879f
Resolve interactions between CUDA tests (#15042) 2022-10-09 06:20:40 -04:00
Rohit Gupta ca3c4e7f07
Add tuner callback docs (#15030) 2022-10-08 18:21:27 +00:00
Adrian Wälchli c76a95ea12
More tests for TPU accelerator in Lite (#14960)
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2022-10-08 15:42:21 +00:00
Amrutha dfc7886b24
docs: replacement of method type_as in docs to Tensor.to (#15027) 2022-10-08 10:04:15 +00:00
Krishna Kalyan 4bad54f2d7
Fix Broken Link in `lightning_app.core.work.LightningWork` (#15032) 2022-10-07 21:14:23 +00:00
Carlos Mocholí 62ca073a41
Introduce base collective and main subclasses (#15016)
Co-authored-by: otaj <ota@lightning.ai>
2022-10-07 19:53:19 +00:00
otaj 7e518cacd2
Use `torch.testing.assert_close` everywhere (#15031)
remove unnecessary version check
2022-10-07 16:59:04 +02:00
Pritam Soni 80080550d9
feat: allow root path to run the app on `/path` (#14972)
* feat: add base path
* uvicorn fix arg
* Add prefix
* update with base_path fix
* replace base path with root path
* Apply suggestions from code review

Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
Co-authored-by: Kaushik B <kaushikbokka@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2022-10-07 14:09:40 +00:00
Sherin Thomas 8ec7ffb5ce
[App] HTTP Removing Queue health check from Individual App (#15023)
* removing expensive health check from Queue abstraction
2022-10-07 17:16:19 +05:30
Sherin Thomas 129f4fa873
[App/Feature] HTTP Queues (#14978)
2022-10-06 16:01:49 +05:30
Dan Dale 3b75c52869
Support ddp_fork strategy with native AMP by attempting NVML-based CUDA availability assessment (#14984)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2022-10-05 18:52:06 -04:00
Rohit Gupta 7fed7a12c5
Add `LRFinder` callback (#13802)
* add BatchSizeFinderCallback callback
* enable fast_dev_run test
* keep tune and remove early_exit
* move exception to setup
* Apply suggestions from code review

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Laverne Henderson <laverne.henderson@coupa.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2022-10-05 13:15:38 +02:00
Jirka Borovec 5f106957f7
CI: Use self-hosted Azure GPU runners (#14632)
* move config
* Apply suggestions from code review

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
Co-authored-by: otaj <6065855+otaj@users.noreply.github.com>
2022-10-05 10:43:54 +00:00
Ethan Harris 0a9fc22b4f
Fix bug in upload file endpoint (#14924)
Co-authored-by: thomas chaton <thomas@grid.ai>
Co-authored-by: otaj <6065855+otaj@users.noreply.github.com>
2022-10-05 10:05:41 +00:00
Justus Schock 4c360bfc52
`Optimizable` structural typing (#14994)
* update optimizer typing

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* forgot one file

* update types

* hopefully_last

* zero grad not required as can also be done on model

* consistency with other typing annotations

* revert for deepspeed

* Update deepspeed.py

* Update deepspeed.py

* revert for base plugin

* Update types.py

* add protocol inheritance

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update typing for precision plugin

* Update module.py

* typo

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2022-10-05 10:04:53 +00:00
Akihiro Nitta 2a657998d4
CI: Reuse clear cache (#14593)
* Remove existing weekly reset logic
* clear cache every week
* Use main tag
2022-10-05 11:52:42 +02:00
Mauricio Villegas 3853580c81
Added support for custom parameters in subclasses of `SaveConfigCallback` (#14998) 2022-10-05 11:10:29 +02:00
Dan Dale ab1eb6531e
Fix fork tests failing in environments with CUDA available (#14982)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-10-05 00:02:55 +00:00
Carlos Mocholí 7ef87464dd
Refactor XLA and TPU checks across codebase (#14550) 2022-10-04 22:54:14 +00:00
Dan Dale acaeab27f6
Fix GPU tests that fail to raise expected configuration error when run in a CUDA environment (#14983) 2022-10-04 18:40:55 -04:00
thomas chaton b936fd4380
[app] Add CloudCompute ID serializable within the flow and works state (#14819) 2022-10-04 19:46:44 +00:00
Sherin Thomas 53694eb93d
[App/Improvement] Cleaning up Queue abstraction (#14977)
2022-10-04 22:07:31 +05:30
Ethan Harris ce919ee7d6
Fix commands and API test (#14947) 2022-10-04 15:38:40 +00:00
geoffrey-g-delhomme 9832d36851
Fix `ReduceLROnPlateau` update issue while resuming from a checkpoint (#14702)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-10-04 11:55:51 +00:00
Kishan Savant c059db446e
Remove the deprecated device_stats_monitor_prefix_keys (#14890)
* Remove the deprecated device_stats_monitor_prefix_keys

* Added pr no to changelog.md

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: otaj <6065855+otaj@users.noreply.github.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
2022-10-03 17:13:02 +00:00
DP c764221615
fixes typing errors in rich_progress.py (#14963) 2022-10-03 14:11:18 +00:00
Adam J. Stewart 09a8001923
Trainer: fix support for non-distributed PyTorch (#14971)
* Trainer: fix non-distributed use
* Update CHANGELOG
2022-10-03 13:15:07 +00:00
Carlos Mocholí 3028fd287d
Fix TPU test CI (#14926)
* Fix TPU test CI

* +x first

* Lite first to uncover errors faster

* Fixes

* One more

* Simplify XLALauncher wrapping to avoid pickle error

* debug

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Debug commit successful. Trying local definitions

* Require tpu for mock test

* ValueError: The number of devices must be either 1 or 8, got 4 instead

* Fix mock test

* Simplify call, rely on defaults

* Skip OSError for now. Maybe upgrading will help

* Simplify launch tests, move some to lite

* Stricter typing

* RuntimeError: Accessing the XLA device before processes have spawned is not allowed.

* Revert "RuntimeError: Accessing the XLA device before processes have spawned is not allowed."

This reverts commit f65107ebf3.

* Alternative boring solution to the reverted commit

* Fix failing test on CUDA machine

* Workarounds

* Try latest mkl

* Revert "Try latest mkl"

This reverts commit d06813aa67.

* Wrong exception

* xfail

* Mypy

* Comment change

* Spawn launch refactor

* Accept that we cannot lazy init now

* Fix mypy and launch test failures

* The base dockerfile already includes mkl-2022.1.0 - what if we use it?

* try a different mkl version

* Revert mkl version changes

Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
2022-10-03 09:13:33 -04:00
otaj e290c206c9
Bump version of fsspec (#14975)
fsspec verbump
2022-10-03 09:53:15 +00:00
Jerome Anand e62521caf1
Update hpu mixed precision link (#14974)
Signed-off-by: Jerome <janand@habana.ai>
2022-10-03 09:05:17 +02:00
Carlos Mocholí be7bfdba27
Remove unused gcsfs dependency (#14962) 2022-10-01 16:08:36 +00:00
otaj 511a070c52
Find last checkpoints on restart (#14907)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-09-30 20:14:18 +00:00
Ziyad Sheebaelhamd db26e087e7
Close profiler when `StopIteration` is raised (#14945) 2022-09-30 19:29:12 +00:00
Adrian Wälchli d7af8ce2a5
Simplify root node resolution for SLURM environment (#14912)
Co-authored-by: Seppo Enarvi <seppo.git@marjaniemi.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-09-30 15:40:43 +00:00
Adrian Wälchli cd9247a782
Introduce primitives for input/output dtype conversion in Lite Precision (#14792)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
Co-authored-by: otaj <6065855+otaj@users.noreply.github.com>
2022-09-30 15:29:03 +00:00
Andres Algaba 3daa4c9cc0
Remove deprecated on_init_start_end (#14867)
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
Co-authored-by: otaj <6065855+otaj@users.noreply.github.com>
2022-09-30 15:11:38 +00:00
Pritam Soni 2721a2f06b
feat: option to add custom meta tags to the UI container (#14915) 2022-09-30 18:56:57 +05:30
Carlos Mocholí fd2779e55f
Fix fork skip condition in GitHub workflows (#14955)
Co-authored-by: otaj <6065855+otaj@users.noreply.github.com>
2022-09-30 08:30:47 -04:00
Mauricio Villegas 15aa9c679d
An instance of SaveConfigCallback should only save the config once (#14927) 2022-09-30 12:16:37 +00:00
Masahiro Wada abea29bfa3
Move type annotation into __init__ (#14943)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-09-30 11:03:03 +00:00
Lee Jungwon a9142d637a
Fix mypy typing errors in pytorch_lightning/trainer/trainer.py (#14204)
Co-authored-by: otaj <ota@lightning.ai>
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-09-30 10:50:42 +00:00
Akihiro Nitta 021c2f1447
Fix typo in checkgroup.yml (#14959)
Fix typo
2022-09-30 10:12:06 +00:00
Carlos Mocholí 6256a318d7
Refactor launching tests to use our launchers (#14954) 2022-09-30 09:57:18 +02:00
Akihiro Nitta e47d5a2376
CI: Combine conda and full testing into a single workflow (#14387)
* Remove conda job

* Remove conda job from readme

* Remove conda jobs from checkgroup

* Remove conda from docker builds

* Remove base-conda dockerfile

* Rewrite the strategy matrix while keeping equivalent

* Run the workflow on this branch

* Revert "Rewrite the strategy matrix while keeping equivalent"

This reverts commit e54298d60e57cffbf8107890987be3fe4a006c77.

* Add PyTorch versions

* Run on draft and disable unrelated costly CI

* Revert "Run the workflow on this branch"

This reverts commit 51ed8b905d8926b630dce4817124bd486135d3ec.

* tmp: Lightweight relevant CI

* Fix CI pathfilter

* Update matrix

* Drop skipping logic

* pip list

* reorder pip list

* tmp: lightweight ci

* Install specified pytorch

* Fix torch installation

* Uncomment steps

* Increase timeout

* bad merge

* Revert "Run on draft and disable unrelated costly CI"

This reverts commit eb5dc5e6bd.

* Update checkgroup

* Update docs and remove Python/PyTorch versions

* Remove pip-list

* Fail if wrong pytorch version installed

* Add Python 3.8, PyTorch 1.9 job

* tmp: remove azure jobs

* tmp: remove dockers

* tmp: remove others

* Run all combinations

* Include oldest

* Exclude no Python 3.10 distributions

* tmp: no concurrency

* tmp: double timeout

* Add pytest log reporter

* Add pytest-reportlog

* Fewer jobs

* Revert "tmp: no concurrency"

This reverts commit 4a7978dcb3.

* fix artifact name

* Revert test reports

* Revert unrelated changes

* Revert unrelated changes

* Add the combination of ex-conda jobs

* Update checkgroup

* revert timeout

* remove conda job

* revert docker build workflow file

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2022-09-29 22:39:04 -04:00
Atharva Phatak fdcb5cc90b
Hydra changes to lightning-lite (#14950)
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
2022-09-29 21:59:35 -04:00
Jirka Borovec f9ef19f108
Run CI helpers' doctests in a workflow (#14498)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
2022-09-30 01:56:56 +02:00