36 Commits

Author SHA1 Message Date
Atharva Phatak cdb7006b98
Fix ddp_spawn -> ddp fallback logic when on LSF cluster (#15657)
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
2022-11-12 17:26:16 +00:00
Adrian Wälchli 18288eb3f3
Checkpoint migration for `ModelCheckpoint` state-key changes (#15606)
Co-authored-by: rohitgr7 <rohitgr1998@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-11-11 13:06:25 +00:00
Adrian Wälchli 75b5042081
Validate that state-key is unique when using multiple callbacks of the same type (#15634)
2022-11-11 05:15:03 -05:00
Rohit Gupta f4ca5623d2
Make checkpointing on train epoch end condition dynamic (#15300)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2022-11-09 14:27:53 +00:00
Yuxuan Lu ee8a57da0f
Fix usage of fs.listdir in CheckpointConnector (#15413)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: otaj <6065855+otaj@users.noreply.github.com>
2022-11-04 20:21:52 +00:00
Adrian Wälchli 38a9e69543
Extend the detection of interactive mode (#15293)
* extend interactive mode detection
* update test names
* changelog
* test
2022-10-26 15:24:11 +00:00
Adrian Wälchli 576757fd79
Validate SRUN variables when launching in SLURM (#15011)
2022-10-19 21:42:11 +00:00
Carlos Mocholí 24c26f7db2
Standardize Lite's filenames (#15058)
2022-10-19 14:09:41 +02:00
Rohit Gupta eb17dc9839
Deprecate tuning enum and trainer properties (#15100)
2022-10-13 13:29:50 +00:00
Max Ehrlich 5a3007cd6c
Support Slurm Autorequeue for Array Jobs (#15040)
Signed-off-by: Max Ehrlich <max.ehr@gmail.com>
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
2022-10-10 13:43:57 +02:00
Adrian Wälchli c76a95ea12
More tests for TPU accelerator in Lite (#14960)
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2022-10-08 15:42:21 +00:00
Carlos Mocholí 7ef87464dd
Refactor XLA and TPU checks across codebase (#14550)
2022-10-04 22:54:14 +00:00
otaj 5f0c4aad12
Introduce `ckpt_path="hpc"` keyword for checkpoint loading (#14911)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-09-29 12:45:51 +00:00
Rohit Gupta d1a3a3ebf5
Add BatchSizeFinder callback (#11089)
* add BatchSizeFinderCallback callback

* temp rm from init

* skip with lr_finder tests

* restore loops and integrate early exit

* enable fast_dev_run test

* add docs and tests

* keep tune and remove early_exit

* add more tests

* patch lr finder

* disable skip

* force_save and fix test

* mypy and circular import fix

* fix mypy

* fix

* updates

* rebase

* address reviews

* add more exceptions for unsupported functionalities

* move exception to setup

* chlog

* unit test

* address reviews

* Apply suggestions from code review

* update

* update

* mypy

* fix

* use it as a util func

* license

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* mypy

* mypy

* review

* fix

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

* updates

* updates

* fix import

* Protect callback attrs

* don't reset val dataloader

* update test

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: otaj <6065855+otaj@users.noreply.github.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-09-27 08:54:37 -04:00
Adrian Wälchli dc1dc0df36
Attempt to query device count via NVML (#14631)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-09-22 09:57:13 +00:00
Carlos Mocholí e9c571d39f
Move accelerator-specific parsing functions with their accelerators (#14753)
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
2022-09-18 22:48:45 +00:00
Adrian Wälchli 35c65b0287
Fix test suite when running on MPS-enabled hardware (#14708)
2022-09-16 19:21:36 +00:00
Adrian Wälchli 47f0d336f1
Standalone Lite: Update LightningLite (#14726)
2022-09-16 17:25:27 +00:00
Adrian Wälchli 619e76f22d
Remove silent behavior when `num_slurm_tasks` does not correspond to number of processes in Trainer (#14300)
* simplify logic
* remove hpc
* update
* add changelog
* more tests
* update test

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2022-09-16 11:00:09 +00:00
Adrian Wälchli 19a1274093
Better error message when dataloader and datamodule is None (V2) (#14637)
2022-09-13 12:26:03 +00:00
Max Ehrlich e5998e6bf2
Make the SLURM Preemption/Timeout Signal Configurable (#14626)
* Add parameter to change the preemption signal
* Make the signal connector use the custom signal from SLURMEnvironment

Signed-off-by: Max Ehrlich <max.ehr@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2022-09-12 19:24:35 +00:00
Adrian Wälchli d013bcc5bf
Standalone Lite: Accelerators (#14578)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2022-09-12 16:00:14 +00:00
Adrian Wälchli 024e7b8204
Standalone Lite: Cluster Environments (#14509)
2022-09-12 12:20:08 +02:00
Adrian Wälchli d2459df2ff
Standalone Lite: Remaining Utilities (#14492)
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Laverne Henderson <laverne.henderson@coupa.com>
Co-authored-by: Felonious-Spellfire <felonious.spellfire@gmail.com>
2022-09-07 15:25:23 +00:00
Adrian Wälchli 250c06e406
Remove deprecated HPC model hooks (#14315)
Co-authored-by: rohitgr7 <rohitgr1998@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2022-08-26 20:59:32 +00:00
Adrian Wälchli fafd254678
Fix device parser logic to avoid creating CUDA context (#14319)
* let environment disable forking

* add helper function and error messages

* tests

* changelog

Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2022-08-26 15:41:38 +00:00
Rohit Gupta c8e22b4572
Avoid raising the sampler warning if num_replicas=1 (#14097)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: otaj <6065855+otaj@users.noreply.github.com>
2022-08-12 08:44:21 +00:00
Adrian Wälchli 807f9d8c96
Replace unwrapping logic in strategies (#13738)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
2022-08-12 08:24:04 +00:00
Rohit Gupta 2d9e00fab6
Profile batch transfer and gradient clipping hooks (#14069)
2022-08-11 23:21:53 +00:00
Carlos Mocholí 3dc08b1ef5
Fix flaky test caused by weak reference (#14157)
2022-08-11 09:33:19 +02:00
Adrian Wälchli a7cebf2416
Fix entry point test for Python 3.10 (#14154)
2022-08-11 01:32:32 +02:00
Rohit Gupta a4e4cab7a6
Deprecate `amp_level` from `Trainer` (#13898)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2022-08-05 08:31:19 +00:00
Adrian Wälchli e6a8283e9c
Organize accelerator tests (#13986)
2022-08-03 13:49:55 +00:00
Rohit Gupta c67b075cf5
Use `global_step` while restoring logging step for old checkpoints (#13645)
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
2022-07-19 18:53:22 +00:00
otaj 33bd270845
Adds Sampler Wrappers for custom samplers in distributed environment (#12959)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2022-06-22 12:17:53 +02:00
Jirka Borovec ab59f308b1
Future 4/n: test & legacy in test/ folder (#13295)
* move: legacy >> test/

* move: tests >> test/

* rename unittests

* update CI

* tests4pl

* tests_pytorch

* proxi

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* ci

* link

* cli

* standalone

* fixing

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* .

* Apply suggestions from code review

Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* alone

* test -> tests

* Standalone fixes

* ci

* Update

* More fixes

* Fix coverage

* Fix mypy

* mypy

* Empty-Commit

* Fix

* mypy just for pl

* Fix standalone

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-06-15 18:10:49 -04:00