68 Commits

Author SHA1 Message Date
Carlos Mocholí 02074f16c7
Fix PyTorch versions in Lite CI (#15338)
* replace oldest in lite
* Fix PyTorch versions in Lite CI
* This will be moved to install pkg workflow in the mirror PR
* 1.13 fixes
* Windows fix
* sorting

Co-authored-by: otaj <ota@lightning.ai>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2022-10-26 15:09:08 -04:00
Adrian Wälchli 38a9e69543
Extend the detection of interactive mode (#15293)
* extend interactive mode detection
* update test names
* changelog
* test
2022-10-26 15:24:11 +00:00
Adrian Wälchli 0f9156374d
Mark internal Lite APIs as protected (#15307)
* mark internal lite apis as protected
* formatting
* docs update

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2022-10-26 12:51:50 +00:00
otaj 76e462a0be
Do not lose references of trainer in test (#15272)
* Fix reference error
* Skip flaky hanging test
* .

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-10-25 09:23:15 -04:00
Dan Dale 27585a9bcf
Fix and refactor `test_deepspeed_engine_is_steppable` test (#15251)
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2022-10-23 21:25:36 +00:00
Carlos Mocholí 961e395677
Resolve collectives test issues (#15195)
Co-authored-by: otaj <ota@lightning.ai>
2022-10-21 01:08:38 +00:00
Carlos Mocholí b866dc3a6a
Collective's PREMUL_SUM support with PyTorch 1.13 (#15201)
* Collective's PREMUL_SUM support with PyTorch 1.13
* Fix test
* Skip under 1.13
2022-10-20 12:36:06 +00:00
Carlos Mocholí bf458701de
Avoid underscore suffix in filenames (#15189) 2022-10-20 07:39:19 -04:00
otaj 741462f373
[LAI] Make lite tests safe for combined package (#15204)
Make lite tests safe for combined package

Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
2022-10-20 09:10:39 +00:00
Adrian Wälchli 576757fd79
Validate SRUN variables when launching in SLURM (#15011) 2022-10-19 21:42:11 +00:00
Adrian Wälchli 045c2f5715
Efficient gradient accumulation in LightningLite (#14966)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-10-19 19:55:12 +00:00
Jirka Borovec d0b092fda8
Lite: setting extras & fix CI (#15192)
* extras
* test.txt
* doctest
* Apply suggestions from code review
* Fix imports
* Oops

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-10-19 19:05:23 +00:00
Carlos Mocholí 24c26f7db2
Standardize Lite's filenames (#15058) 2022-10-19 14:09:41 +02:00
Carlos Mocholí 0e18266023
Fix collective tests with PyTorch 1.13 (#15167) 2022-10-18 14:31:48 +02:00
Justus Schock 27965cc36b
Fix locally failing lite tests (#15137) 2022-10-18 09:49:14 +00:00
Adrian Wälchli ed891e5049
Force NVML-based CUDA check in PyTorch 1.14+ (#15110) 2022-10-13 13:10:29 -04:00
Carlos Mocholí da25d1d30d
Remove unused Lite code (#15000)
* Remove unused Lite code
* Remove duplicate import
* Group variable
* Fix monkeypatch
2022-10-10 22:16:56 +00:00
Carlos Mocholí c334b7766c
Remove old testing artifacts (#15052) 2022-10-10 17:34:18 +00:00
Carlos Mocholí d15bd1520e
[Lite] precision_plugin -> precision (#15001)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2022-10-10 15:00:32 +00:00
Carlos Mocholí 0b04aa879f
Resolve interactions between CUDA tests (#15042) 2022-10-09 06:20:40 -04:00
Adrian Wälchli c76a95ea12
More tests for TPU accelerator in Lite (#14960)
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2022-10-08 15:42:21 +00:00
Carlos Mocholí 62ca073a41
Introduce base collective and main subclasses (#15016)
Co-authored-by: otaj <ota@lightning.ai>
2022-10-07 19:53:19 +00:00
Dan Dale 3b75c52869
Support ddp_fork strategy with native AMP by attempting NVML-based CUDA availability assessment (#14984)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2022-10-05 18:52:06 -04:00
Dan Dale ab1eb6531e
Fix fork tests failing in environments with CUDA available (#14982)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-10-05 00:02:55 +00:00
Carlos Mocholí 7ef87464dd
Refactor XLA and TPU checks across codebase (#14550) 2022-10-04 22:54:14 +00:00
Carlos Mocholí 3028fd287d
Fix TPU test CI (#14926)
* Fix TPU test CI
* +x first
* Lite first to uncover errors faster
* Fixes
* One more
* Simplify XLALauncher wrapping to avoid pickle error
* debug
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Debug commit successful. Trying local definitions
* Require tpu for mock test
* ValueError: The number of devices must be either 1 or 8, got 4 instead
* Fix mock test
* Simplify call, rely on defaults
* Skip OSError for now. Maybe upgrading will help
* Simplify launch tests, move some to lite
* Stricter typing
* RuntimeError: Accessing the XLA device before processes have spawned is not allowed.
* Revert "RuntimeError: Accessing the XLA device before processes have spawned is not allowed."
  This reverts commit f65107ebf3.
* Alternative boring solution to the reverted commit
* Fix failing test on CUDA machine
* Workarounds
* Try latest mkl
* Revert "Try latest mkl"
  This reverts commit d06813aa67.
* Wrong exception
* xfail
* Mypy
* Comment change
* Spawn launch refactor
* Accept that we cannot lazy init now
* Fix mypy and launch test failures
* The base dockerfile already includes mkl-2022.1.0 - what if we use it?
* try a different mkl version
* Revert mkl version changes

Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
2022-10-03 09:13:33 -04:00
Adrian Wälchli d7af8ce2a5
Simplify root node resolution for SLURM environment (#14912)
Co-authored-by: Seppo Enarvi <seppo.git@marjaniemi.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-09-30 15:40:43 +00:00
Adrian Wälchli cd9247a782
Introduce primitives for input/output dtype conversion in Lite Precision (#14792)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
Co-authored-by: otaj <6065855+otaj@users.noreply.github.com>
2022-09-30 15:29:03 +00:00
Carlos Mocholí 6256a318d7
Refactor launching tests to use our launchers (#14954) 2022-09-30 09:57:18 +02:00
Atharva Phatak fdcb5cc90b
Hydra changes to lightning-lite (#14950)
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
2022-09-29 21:59:35 -04:00
Adrian Wälchli 498cb60417
Fairscale integration tests for Lite (#14921)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-09-29 17:46:49 +00:00
Adrian Wälchli 5b446aec4d
DeepSpeed integration tests for Lite (#14901) 2022-09-29 16:39:32 +00:00
Adrian Wälchli ea5e817973
Better error message when trying to re-initialize CUDA in forked subprocess (#14709)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-09-28 05:07:33 -04:00
Carlos Mocholí 9fc4ff3278
Move logic to error out on deprecation warnings into conftest (#14902) 2022-09-27 17:49:25 +02:00
Adrian Wälchli d572a7e2ec
Fix double precision support in Lite (#14827) 2022-09-27 08:38:20 +00:00
Adrian Wälchli d7404c775a
Integration tests for Precision in Lite (#14815)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
2022-09-26 18:50:11 +00:00
Adrian Wälchli dc1dc0df36
Attempt to query device count via NVML (#14631)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-09-22 09:57:13 +00:00
otaj 5ee2b86c44
Tests for fixed TypeError (#14821)
* tests for 14809
* Apply suggestions from code review

Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2022-09-22 09:04:27 +02:00
Carlos Mocholí 7e803ba53e
Clean-up dtype management (#14823)
Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
2022-09-22 00:07:36 +00:00
Adrian Wälchli 3f0fec591d
Update device attribute in Lite's module wrapper (#14822)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-09-21 19:06:10 +00:00
Carlos Mocholí abc805f9ef
Remove the model argument from Lite's `optimizer_step` via structural typing (#14810)
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
2022-09-21 19:28:45 +02:00
awaelchli c0ff7a1b77
Add backward-compatibility for LightningLite in PL (#14735) 2022-09-20 13:31:56 +02:00
awaelchli e3e71670e6
Move src/pytorch_lightning/lite to src/lightning_lite (#14735) 2022-09-20 13:31:56 +02:00
Carlos Mocholí e9c571d39f
Move accelerator-specific parsing functions with their accelerators (#14753)
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
2022-09-18 22:48:45 +00:00
Adrian Wälchli 1092265140
Remove check `num_slurm_tasks` in Lite (#14761) 2022-09-18 14:01:49 -04:00
Adrian Wälchli 35c65b0287
Fix test suite when running on MPS-enabled hardware (#14708) 2022-09-16 19:21:36 +00:00
Adrian Wälchli 47f0d336f1
Standalone Lite: Update LightningLite (#14726) 2022-09-16 17:25:27 +00:00
Adrian Wälchli 619e76f22d
Remove silent behavior when `num_slurm_tasks` does not correspond to number of processes in Trainer (#14300)
* simplify logic
* remove hpc
* update
* add changelog
* more tests
* update test

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2022-09-16 11:00:09 +00:00
Adrian Wälchli 38d89713a5
Standalone Lite: Connector (#14692)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
2022-09-15 14:14:51 +00:00
Adrian Wälchli d3dcd68852
Standalone Lite: DDP Spawn Strategy Family (#14675)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2022-09-15 10:51:12 +00:00