Carlos Mocholí
69fee71f22
Trim flaky amp test ( #15051 )
2022-10-10 13:49:37 +02:00
Max Ehrlich
5a3007cd6c
Support Slurm Autorequeue for Array Jobs ( #15040 )
...
Signed-off-by: Max Ehrlich <max.ehr@gmail.com>
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
2022-10-10 13:43:57 +02:00
Mauricio Villegas
ddfcddbd1c
LightningCLI add --config option after parser __init__ ( #15048 )
2022-10-10 11:32:08 +00:00
Adrian Wälchli
8f90084059
Remove deprecated on_load/save_checkpoint behavior ( #14835 )
...
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2022-10-10 11:08:13 +00:00
Carlos Mocholí
0b04aa879f
Resolve interactions between CUDA tests ( #15042 )
2022-10-09 06:20:40 -04:00
Rohit Gupta
ca3c4e7f07
Add tuner callback docs ( #15030 )
2022-10-08 18:21:27 +00:00
Adrian Wälchli
c76a95ea12
More tests for TPU accelerator in Lite ( #14960 )
...
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2022-10-08 15:42:21 +00:00
Amrutha
dfc7886b24
docs: replacement of method type_as in docs to Tensor.to ( #15027 )
2022-10-08 10:04:15 +00:00
Krishna Kalyan
4bad54f2d7
Fix Broken Link in `lightning_app.core.work.LightningWork` ( #15032 )
2022-10-07 21:14:23 +00:00
Carlos Mocholí
62ca073a41
Introduce base collective and main subclasses ( #15016 )
...
Co-authored-by: otaj <ota@lightning.ai>
2022-10-07 19:53:19 +00:00
otaj
7e518cacd2
Use `torch.testing.assert_close` everywhere ( #15031 )
...
remove unnecessary version check
2022-10-07 16:59:04 +02:00
Pritam Soni
80080550d9
feat: allow root path to run the app on `/path` ( #14972 )
...
* feat: add base path
* uvicorn fix arg
* Add prefix
* update with base_path fix
* replace base path with root path
* Apply suggestions from code review
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
Co-authored-by: Kaushik B <kaushikbokka@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2022-10-07 14:09:40 +00:00
Sherin Thomas
8ec7ffb5ce
[App] HTTP Removing Queue health check from Individual App ( #15023 )
...
* removing expensive health check from Queue abstraction
2022-10-07 17:16:19 +05:30
Sherin Thomas
129f4fa873
[App/Feature] HTTP Queues ( #14978 )
2022-10-06 16:01:49 +05:30
Dan Dale
3b75c52869
Support ddp_fork strategy with native AMP by attempting NVML-based CUDA availability assessment ( #14984 )
...
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2022-10-05 18:52:06 -04:00
Rohit Gupta
7fed7a12c5
Add `LRFinder` callback ( #13802 )
...
* add BatchSizeFinderCallback callback
* enable fast_dev_run test
* keep tune and remove early_exit
* move exception to setup
* Apply suggestions from code review
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Laverne Henderson <laverne.henderson@coupa.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2022-10-05 13:15:38 +02:00
Jirka Borovec
5f106957f7
CI: Use self-hosted Azure GPU runners ( #14632 )
...
* move config
* Apply suggestions from code review
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
Co-authored-by: otaj <6065855+otaj@users.noreply.github.com>
2022-10-05 10:43:54 +00:00
Ethan Harris
0a9fc22b4f
Fix bug in upload file endpoint ( #14924 )
...
Co-authored-by: thomas chaton <thomas@grid.ai>
Co-authored-by: otaj <6065855+otaj@users.noreply.github.com>
2022-10-05 10:05:41 +00:00
Justus Schock
4c360bfc52
`Optimizable` structural typing ( #14994 )
...
* update optimizer typing
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* forgot one file
* update types
* hopefully_last
* zero grad not required as can also be done on model
* consistency with other typing annotations
* revert for deepspeed
* Update deepspeed.py
* Update deepspeed.py
* revert for base plugin
* Update types.py
* add protocol inheritance
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* update typing for precision plugin
* Update module.py
* typo
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2022-10-05 10:04:53 +00:00
Akihiro Nitta
2a657998d4
CI: Reuse clear cache ( #14593 )
...
* Remove existing weekly reset logic
* clear cache every week
* Use main tag
2022-10-05 11:52:42 +02:00
Mauricio Villegas
3853580c81
Added support for custom parameters in subclasses of `SaveConfigCallback` ( #14998 )
2022-10-05 11:10:29 +02:00
Dan Dale
ab1eb6531e
Fix fork tests failing in environments with CUDA available ( #14982 )
...
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-10-05 00:02:55 +00:00
Carlos Mocholí
7ef87464dd
Refactor XLA and TPU checks across codebase ( #14550 )
2022-10-04 22:54:14 +00:00
Dan Dale
acaeab27f6
Fix GPU tests that fail to raise expected configuration error when run in a CUDA environment ( #14983 )
2022-10-04 18:40:55 -04:00
thomas chaton
b936fd4380
[app] Add CloudCompute ID serializable within the flow and works state ( #14819 )
2022-10-04 19:46:44 +00:00
Sherin Thomas
53694eb93d
[App/Improvement] Cleaning up Queue abstraction ( #14977 )
2022-10-04 22:07:31 +05:30
Ethan Harris
ce919ee7d6
Fix commands and API test ( #14947 )
2022-10-04 15:38:40 +00:00
geoffrey-g-delhomme
9832d36851
Fix `ReduceLROnPlateau` update issue while resuming from a checkpoint ( #14702 )
...
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-10-04 11:55:51 +00:00
Kishan Savant
c059db446e
Remove the deprecated device_stats_monitor_prefix_keys ( #14890 )
...
* Remove the deprecated device_stats_monitor_prefix_keys
* Added pr no to changelog.md
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: otaj <6065855+otaj@users.noreply.github.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
2022-10-03 17:13:02 +00:00
DP
c764221615
fixes typing errors in rich_progress.py ( #14963 )
2022-10-03 14:11:18 +00:00
Adam J. Stewart
09a8001923
Trainer: fix support for non-distributed PyTorch ( #14971 )
...
* Trainer: fix non-distributed use
* Update CHANGELOG
2022-10-03 13:15:07 +00:00
Carlos Mocholí
3028fd287d
Fix TPU test CI ( #14926 )
...
* Fix TPU test CI
* +x first
* Lite first to uncover errors faster
* Fixes
* One more
* Simplify XLALauncher wrapping to avoid pickle error
* debug
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Debug commit successful. Trying local definitions
* Require tpu for mock test
* ValueError: The number of devices must be either 1 or 8, got 4 instead
* Fix mock test
* Simplify call, rely on defaults
* Skip OSError for now. Maybe upgrading will help
* Simplify launch tests, move some to lite
* Stricter typing
* RuntimeError: Accessing the XLA device before processes have spawned is not allowed.
* Revert "RuntimeError: Accessing the XLA device before processes have spawned is not allowed."
This reverts commit f65107ebf3.
* Alternative boring solution to the reverted commit
* Fix failing test on CUDA machine
* Workarounds
* Try latest mkl
* Revert "Try latest mkl"
This reverts commit d06813aa67.
* Wrong exception
* xfail
* Mypy
* Comment change
* Spawn launch refactor
* Accept that we cannot lazy init now
* Fix mypy and launch test failures
* The base dockerfile already includes mkl-2022.1.0 - what if we use it?
* try a different mkl version
* Revert mkl version changes
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
2022-10-03 09:13:33 -04:00
otaj
e290c206c9
Bump version of fsspec ( #14975 )
...
fsspec verbump
2022-10-03 09:53:15 +00:00
Jerome Anand
e62521caf1
Update hpu mixed precision link ( #14974 )
...
Signed-off-by: Jerome <janand@habana.ai>
2022-10-03 09:05:17 +02:00
Carlos Mocholí
be7bfdba27
Remove unused gcsfs dependency ( #14962 )
2022-10-01 16:08:36 +00:00
otaj
511a070c52
Find last checkpoints on restart ( #14907 )
...
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-09-30 20:14:18 +00:00
Ziyad Sheebaelhamd
db26e087e7
Close profiler when `StopIteration` is raised ( #14945 )
2022-09-30 19:29:12 +00:00
Adrian Wälchli
d7af8ce2a5
Simplify root node resolution for SLURM environment ( #14912 )
...
Co-authored-by: Seppo Enarvi <seppo.git@marjaniemi.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-09-30 15:40:43 +00:00
Adrian Wälchli
cd9247a782
Introduce primitives for input/output dtype conversion in Lite Precision ( #14792 )
...
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
Co-authored-by: otaj <6065855+otaj@users.noreply.github.com>
2022-09-30 15:29:03 +00:00
Andres Algaba
3daa4c9cc0
Remove deprecated on_init_start_end ( #14867 )
...
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
Co-authored-by: otaj <6065855+otaj@users.noreply.github.com>
2022-09-30 15:11:38 +00:00
Pritam Soni
2721a2f06b
feat: option to add custom meta tags to the UI container ( #14915 )
2022-09-30 18:56:57 +05:30
Carlos Mocholí
fd2779e55f
Fix fork skip condition in GitHub workflows ( #14955 )
...
Co-authored-by: otaj <6065855+otaj@users.noreply.github.com>
2022-09-30 08:30:47 -04:00
Mauricio Villegas
15aa9c679d
An instance of SaveConfigCallback should only save the config once ( #14927 )
2022-09-30 12:16:37 +00:00
Masahiro Wada
abea29bfa3
Move type annotation into __init__ ( #14943 )
...
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-09-30 11:03:03 +00:00
Lee Jungwon
a9142d637a
Fix mypy typing errors in pytorch_lightning/trainer/trainer.py ( #14204 )
...
Co-authored-by: otaj <ota@lightning.ai>
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-09-30 10:50:42 +00:00
Akihiro Nitta
021c2f1447
Fix typo in checkgroup.yml ( #14959 )
...
Fix typo
2022-09-30 10:12:06 +00:00
Carlos Mocholí
6256a318d7
Refactor launching tests to use our launchers ( #14954 )
2022-09-30 09:57:18 +02:00
Akihiro Nitta
e47d5a2376
CI: Combine conda and full testing into a single workflow ( #14387 )
...
* Remove conda job
* Remove conda job from readme
* Remove conda jobs from checkgroup
* Remove conda from docker builds
* Remove base-conda dockerfile
* Rewrite the strategy matrix while keeping equivalent
* Run the workflow on this branch
* Revert "Rewrite the strategy matrix while keeping equivalent"
This reverts commit e54298d60e57cffbf8107890987be3fe4a006c77.
* Add PyTorch versions
* Run on draft and disable unrelated costly CI
* Revert "Run the workflow on this branch"
This reverts commit 51ed8b905d8926b630dce4817124bd486135d3ec.
* tmp: Lightweight relevant CI
* Fix CI pathfilter
* Update matrix
* Drop skipping logic
* pip list
* reorder pip list
* tmp: lightweight ci
* Install specified pytorch
* Fix torch installation
* Uncomment steps
* Increase timeout
* bad merge
* Revert "Run on draft and disable unrelated costly CI"
This reverts commit eb5dc5e6bd.
* Update checkgroup
* Update docs and remove Python/PyTorch versions
* Remove pip-list
* Fail if wrong pytorch version installed
* Add Python 3.8, PyTorch 1.9 job
* tmp: remove azure jobs
* tmp: remove dockers
* tmp: remove others
* Run all combinations
* Include oldest
* Exclude no Python 3.10 distributions
* tmp: no concurrency
* tmp: double timeout
* Add pytest log reporter
* Add pytest-reportlog
* Fewer jobs
* Revert "tmp: no concurrency"
This reverts commit 4a7978dcb3.
* fix artifact name
* Revert test reports
* Revert unrelated changes
* Revert unrelated changes
* Add the combination of ex-conda jobs
* Update checkgroup
* revert timeout
* remove conda job
* revert docker build workflow file
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2022-09-29 22:39:04 -04:00
Atharva Phatak
fdcb5cc90b
Hydra changes to lightning-lite ( #14950 )
...
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
2022-09-29 21:59:35 -04:00
Jirka Borovec
f9ef19f108
Run CI helpers' doctests in a workflow ( #14498 )
...
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
2022-09-30 01:56:56 +02:00