Commit Graph

77 Commits

Author SHA1 Message Date
Yuxuan Lu ee8a57da0f
Fix usage of fs.listdir in CheckpointConnector (#15413)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: otaj <6065855+otaj@users.noreply.github.com>
2022-11-04 20:21:52 +00:00
Rohit Gupta 61ae35c378
Use sklearn in runif (#15426)
* Use sklearn in runif
* test by removing sklearn dep
* remove repeated code
* seed
2022-11-01 11:40:32 +00:00
Adrian Wälchli 38a9e69543
Extend the detection of interactive mode (#15293)
* extend interactive mode detection
* update test names
* changelog
* test
2022-10-26 15:24:11 +00:00
Adrian Wälchli 0f9156374d
Mark internal Lite APIs as protected (#15307)
* mark internal lite apis as protected
* formatting
* docs update

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2022-10-26 12:51:50 +00:00
Adrian Wälchli 576757fd79
Validate SRUN variables when launching in SLURM (#15011) 2022-10-19 21:42:11 +00:00
Carlos Mocholí 24c26f7db2
Standardize Lite's filenames (#15058) 2022-10-19 14:09:41 +02:00
Rohit Gupta eb17dc9839
Deprecate tuning enum and trainer properties (#15100) 2022-10-13 13:29:50 +00:00
Ray Schireman 0a5e75e8d1
Add `inference_mode` flag to Trainer (#15034)
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2022-10-12 12:22:01 +00:00
Carlos Mocholí c334b7766c
Remove old testing artifacts (#15052) 2022-10-10 17:34:18 +00:00
Adrian Wälchli 3183079204
Remove deprecated callback hooks (#14834)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: otaj <ota@lightning.ai>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2022-10-10 15:46:28 +00:00
Max Ehrlich 5a3007cd6c
Support Slurm Autorequeue for Array Jobs (#15040)
Signed-off-by: Max Ehrlich <max.ehr@gmail.com>
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
2022-10-10 13:43:57 +02:00
Adrian Wälchli c76a95ea12
More tests for TPU accelerator in Lite (#14960)
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2022-10-08 15:42:21 +00:00
otaj 7e518cacd2
Use `torch.testing.assert_close` everywhere (#15031)
remove unnecessary version check
2022-10-07 16:59:04 +02:00
Rohit Gupta 7fed7a12c5
Add `LRFinder` callback (#13802)
* add BatchSizeFinderCallback callback
* enable fast_dev_run test
* keep tune and remove early_exit
* move exception to setup
* Apply suggestions from code review

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Laverne Henderson <laverne.henderson@coupa.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2022-10-05 13:15:38 +02:00
Carlos Mocholí 7ef87464dd
Refactor XLA and TPU checks across codebase (#14550) 2022-10-04 22:54:14 +00:00
otaj 511a070c52
Find last checkpoints on restart (#14907)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-09-30 20:14:18 +00:00
Andres Algaba 3daa4c9cc0
Remove deprecated on_init_start_end (#14867)
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
Co-authored-by: otaj <6065855+otaj@users.noreply.github.com>
2022-09-30 15:11:38 +00:00
otaj 5f0c4aad12
Introduce `ckpt_path="hpc"` keyword for checkpoint loading (#14911)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-09-29 12:45:51 +00:00
Rohit Gupta d1a3a3ebf5
Add BatchSizeFinder callback (#11089)
* add BatchSizeFinderCallback callback

* temp rm from init

* skip with lr_finder tests

* restore loops and integrate early exit

* enable fast_dev_run test

* add docs and tests

* keep tune and remove early_exit

* add more tests

* patch lr finder

* disable skip

* force_save and fix test

* mypy and circular import fix

* fix mypy

* fix

* updates

* rebase

* address reviews

* add more exceptions for unsupported functionalities

* move exception to setup

* chlog

* unit test

* address reviews

* Apply suggestions from code review

* update

* update

* mypy

* fix

* use it as a util func

* license

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* mypy

* mypy

* review

* fix

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

* updates

* updates

* fix import

* Protect callback attrs

* don't reset val dataloader

* update test

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: otaj <6065855+otaj@users.noreply.github.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-09-27 08:54:37 -04:00
William Falcon c77d4a8394
Make Trainer Debuggable and understandable again (1/n) (#14861)
* clean trainer 1/n

* clean trainer 1/n

* clean trainer 1/n

* clean trainer 1/n

* clean trainer 1/n

* clean trainer 1/n

* clean trainer 1/n

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* clean trainer 1/n

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2022-09-23 01:15:59 -04:00
Adrian Wälchli dc1dc0df36
Attempt to query device count via NVML (#14631)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-09-22 09:57:13 +00:00
ritsuki1227 6855f653bb
Set `MLFlowLogger` status to FAILED when training raises an error (#12292)
Co-authored-by: Ritsuki Yamada <ritsuki.yamada@uzabase.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-09-20 07:43:32 -04:00
Carlos Mocholí e9c571d39f
Move accelerator-specific parsing functions with their accelerators (#14753)
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
2022-09-18 22:48:45 +00:00
Adrian Wälchli 35c65b0287
Fix test suite when running on MPS-enabled hardware (#14708) 2022-09-16 19:21:36 +00:00
Adrian Wälchli 47f0d336f1
Standalone Lite: Update LightningLite (#14726) 2022-09-16 17:25:27 +00:00
Adrian Wälchli 619e76f22d
Remove silent behavior when `num_slurm_tasks` does not correspond to number of processes in Trainer (#14300)
* simplify logic
* remove hpc
* update
* add changelog
* more tests
* update test

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2022-09-16 11:00:09 +00:00
Adrian Wälchli 19a1274093
Better error message when dataloader and datamodule is None (V2) (#14637) 2022-09-13 12:26:03 +00:00
Adrian Wälchli 1ee3d1eb72
Avoid warning when cloning tensor in self.log (#14599)
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2022-09-13 16:23:46 +05:30
Adrian Wälchli 4bd135a6f6
Remove deprecated `LoggerCollection` (#14283)
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2022-09-12 21:46:46 +00:00
Max Ehrlich e5998e6bf2
Make the SLURM Preemption/Timeout Signal Configurable (#14626)
* Add parameter to change the preemption signal
* Make the signal connector use the custom signal from SLURMEnvironment

Signed-off-by: Max Ehrlich <max.ehr@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2022-09-12 19:24:35 +00:00
Adrian Wälchli d013bcc5bf
Standalone Lite: Accelerators (#14578)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2022-09-12 16:00:14 +00:00
Adrian Wälchli 024e7b8204
Standalone Lite: Cluster Environments (#14509) 2022-09-12 12:20:08 +02:00
Adrian Wälchli d2459df2ff
Standalone Lite: Remaining Utilities (#14492)
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Laverne Henderson <laverne.henderson@coupa.com>
Co-authored-by: Felonious-Spellfire <felonious.spellfire@gmail.com>
2022-09-07 15:25:23 +00:00
Carlos Mocholí 273a9ed8c1
Integrate `lightning_utilities.core.apply_func` (#14537) 2022-09-06 13:52:54 +00:00
awaelchli 9fea2ed9d5
move pl/utilities/apply_func.py to pl/utilities/apply_func.py (#14516) 2022-09-05 20:30:42 +02:00
awaelchli def6548596
move pl/utilities/cloud_io.py to lite/utilities/cloud_io.py (#14515) 2022-09-05 18:30:31 +02:00
Rohit Gupta ce702fd40e
Squeeze tensor while logging (#14489)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2022-09-05 14:01:51 +00:00
Roberto de Moura Estevão Filho ed0164a3d2
Estimate stepping batches with max_steps if max_epochs is not set (#14317)
Co-authored-by: Roberto Estevão <robertode@microsoft.com>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2022-09-05 09:05:21 +00:00
Adrian Wälchli 28e18881a9
Mark stage argument in hooks as required (#14064)
Co-authored-by: rohitgr7 <rohitgr1998@gmail.com>
2022-09-01 15:47:40 +02:00
ananthsub d0d1818d50
Update `has_len_all_ranks` to use `Strategy.root_device` (#12144)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2022-08-29 20:23:34 +00:00
Rohit Gupta f3574176e2
Change `trainer.should_stop` to not stop in between an epoch and run until `min_steps/min_epochs` only (#13890) 2022-08-27 12:12:24 +00:00
Adrian Wälchli 250c06e406
Remove deprecated HPC model hooks (#14315)
Co-authored-by: rohitgr7 <rohitgr1998@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2022-08-26 20:59:32 +00:00
Tianshu Wang 8950613552
save checkpoints and profiler output to the first logger (#14325)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2022-08-26 17:23:54 +00:00
Adrian Wälchli fafd254678
Fix device parser logic to avoid creating CUDA context (#14319)
* let environment disable forking

* add helper function and error messages

* tests

* changelog

Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2022-08-26 15:41:38 +00:00
Justin Goheen 94e567e6f0
Fix mypy errors attributed to `pytorch_lightning.trainer.connectors.data_connector.py` (#13806)
Co-authored-by: rohitgr7 <rohitgr1998@gmail.com>
Co-authored-by: otaj <6065855+otaj@users.noreply.github.com>
2022-08-26 13:28:27 +00:00
Carlos Mocholí 7a617ec90e
Add back support for logging in the gradient clipping hooks (#14298)
* Add back support for logging in the gradient clipping hooks

* Docs and CHANGELOG

* Fix tests
2022-08-22 09:19:53 -04:00
Rohit Gupta db1835a82c
Fix an issue to avoid the impact of sanity check on `reload_dataloaders_every_n_epochs` for validation (#13964) 2022-08-21 23:55:03 +05:30
Rohit Gupta e949362a6b
Enable `on_before_batch_transfer` for `DPStrategy` and `IPUAccelerator` (#14023)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2022-08-18 12:12:29 +00:00
Rohit Gupta c8e22b4572
Avoid raising the sampler warning if num_replicas=1 (#14097)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: otaj <6065855+otaj@users.noreply.github.com>
2022-08-12 08:44:21 +00:00
Adrian Wälchli 807f9d8c96
Replace unwrapping logic in strategies (#13738)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
2022-08-12 08:24:04 +00:00