Yuxuan Lu
ee8a57da0f
Fix usage of fs.listdir in CheckpointConnector ( #15413 )
...
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: otaj <6065855+otaj@users.noreply.github.com>
2022-11-04 20:21:52 +00:00
Rohit Gupta
61ae35c378
Use sklearn in runif ( #15426 )
...
* Use sklearn in runif
* test by removing sklearn dep
* remove repeated code
* seed
2022-11-01 11:40:32 +00:00
Adrian Wälchli
38a9e69543
Extend the detection of interactive mode ( #15293 )
...
* extend interactive mode detection
* update test names
* changelog
* test
2022-10-26 15:24:11 +00:00
Adrian Wälchli
0f9156374d
Mark internal Lite APIs as protected ( #15307 )
...
* mark internal lite apis as protected
* formatting
* docs update
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2022-10-26 12:51:50 +00:00
Adrian Wälchli
576757fd79
Validate SRUN variables when launching in SLURM ( #15011 )
2022-10-19 21:42:11 +00:00
Carlos Mocholí
24c26f7db2
Standardize Lite's filenames ( #15058 )
2022-10-19 14:09:41 +02:00
Rohit Gupta
eb17dc9839
Deprecate tuning enum and trainer properties ( #15100 )
2022-10-13 13:29:50 +00:00
Ray Schireman
0a5e75e8d1
Add `inference_mode` flag to Trainer ( #15034 )
...
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2022-10-12 12:22:01 +00:00
Carlos Mocholí
c334b7766c
Remove old testing artifacts ( #15052 )
2022-10-10 17:34:18 +00:00
Adrian Wälchli
3183079204
Remove deprecated callback hooks ( #14834 )
...
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: otaj <ota@lightning.ai>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2022-10-10 15:46:28 +00:00
Max Ehrlich
5a3007cd6c
Support Slurm Autorequeue for Array Jobs ( #15040 )
...
Signed-off-by: Max Ehrlich <max.ehr@gmail.com>
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
2022-10-10 13:43:57 +02:00
Adrian Wälchli
c76a95ea12
More tests for TPU accelerator in Lite ( #14960 )
...
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2022-10-08 15:42:21 +00:00
otaj
7e518cacd2
Use `torch.testing.assert_close` everywhere ( #15031 )
...
remove unnecessary version check
2022-10-07 16:59:04 +02:00
Rohit Gupta
7fed7a12c5
Add `LRFinder` callback ( #13802 )
...
* add BatchSizeFinderCallback callback
* enable fast_dev_run test
* keep tune and remove early_exit
* move exception to setup
* Apply suggestions from code review
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Laverne Henderson <laverne.henderson@coupa.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2022-10-05 13:15:38 +02:00
Carlos Mocholí
7ef87464dd
Refactor XLA and TPU checks across codebase ( #14550 )
2022-10-04 22:54:14 +00:00
otaj
511a070c52
Find last checkpoints on restart ( #14907 )
...
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-09-30 20:14:18 +00:00
Andres Algaba
3daa4c9cc0
Remove deprecated on_init_start_end ( #14867 )
...
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
Co-authored-by: otaj <6065855+otaj@users.noreply.github.com>
2022-09-30 15:11:38 +00:00
otaj
5f0c4aad12
Introduce `ckpt_path="hpc"` keyword for checkpoint loading ( #14911 )
...
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-09-29 12:45:51 +00:00
Rohit Gupta
d1a3a3ebf5
Add BatchSizeFinder callback ( #11089 )
...
* add BatchSizeFinderCallback callback
* temp rm from init
* skip with lr_finder tests
* restore loops and integrate early exit
* enable fast_dev_run test
* add docs and tests
* keep tune and remove early_exit
* add more tests
* patch lr finder
* disable skip
* force_save and fix test
* mypy and circular import fix
* fix mypy
* fix
* updates
* rebase
* address reviews
* add more exceptions for unsupported functionalities
* move exception to setup
* chlog
* unit test
* address reviews
* Apply suggestions from code review
* update
* update
* mypy
* fix
* use it as a util func
* license
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* mypy
* mypy
* review
* fix
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix
* updates
* updates
* fix import
* Protect callback attrs
* don't reset val dataloader
* update test
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: otaj <6065855+otaj@users.noreply.github.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-09-27 08:54:37 -04:00
William Falcon
c77d4a8394
Make Trainer Debuggable and understandable again (1/n) ( #14861 )
...
* clean trainer 1/n
* clean trainer 1/n
* clean trainer 1/n
* clean trainer 1/n
* clean trainer 1/n
* clean trainer 1/n
* clean trainer 1/n
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* clean trainer 1/n
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2022-09-23 01:15:59 -04:00
Adrian Wälchli
dc1dc0df36
Attempt to query device count via NVML ( #14631 )
...
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-09-22 09:57:13 +00:00
ritsuki1227
6855f653bb
Set `MLFlowLogger` status to FAILED when training raises an error ( #12292 )
...
Co-authored-by: Ritsuki Yamada <ritsuki.yamada@uzabase.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-09-20 07:43:32 -04:00
Carlos Mocholí
e9c571d39f
Move accelerator-specific parsing functions with their accelerators ( #14753 )
...
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
2022-09-18 22:48:45 +00:00
Adrian Wälchli
35c65b0287
Fix test suite when running on MPS-enabled hardware ( #14708 )
2022-09-16 19:21:36 +00:00
Adrian Wälchli
47f0d336f1
Standalone Lite: Update LightningLite ( #14726 )
2022-09-16 17:25:27 +00:00
Adrian Wälchli
619e76f22d
Remove silent behavior when `num_slurm_tasks` does not correspond to number of processes in Trainer ( #14300 )
...
* simplify logic
* remove hpc
* update
* add changelog
* more tests
* update test
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2022-09-16 11:00:09 +00:00
Adrian Wälchli
19a1274093
Better error message when dataloader and datamodule is None (V2) ( #14637 )
2022-09-13 12:26:03 +00:00
Adrian Wälchli
1ee3d1eb72
Avoid warning when cloning tensor in self.log ( #14599 )
...
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2022-09-13 16:23:46 +05:30
Adrian Wälchli
4bd135a6f6
Remove deprecated `LoggerCollection` ( #14283 )
...
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2022-09-12 21:46:46 +00:00
Max Ehrlich
e5998e6bf2
Make the SLURM Preemption/Timeout Signal Configurable ( #14626 )
...
* Add parameter to change the preemption signal
* Make the signal connector use the custom signal from SLURMEnvironment
Signed-off-by: Max Ehrlich <max.ehr@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2022-09-12 19:24:35 +00:00
Adrian Wälchli
d013bcc5bf
Standalone Lite: Accelerators ( #14578 )
...
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2022-09-12 16:00:14 +00:00
Adrian Wälchli
024e7b8204
Standalone Lite: Cluster Environments ( #14509 )
2022-09-12 12:20:08 +02:00
Adrian Wälchli
d2459df2ff
Standalone Lite: Remaining Utilities ( #14492 )
...
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Laverne Henderson <laverne.henderson@coupa.com>
Co-authored-by: Felonious-Spellfire <felonious.spellfire@gmail.com>
2022-09-07 15:25:23 +00:00
Carlos Mocholí
273a9ed8c1
Integrate `lightning_utilities.core.apply_func` ( #14537 )
2022-09-06 13:52:54 +00:00
awaelchli
9fea2ed9d5
move pl/utilities/apply_func.py to pl/utilities/apply_func.py ( #14516 )
2022-09-05 20:30:42 +02:00
awaelchli
def6548596
move pl/utilities/cloud_io.py to lite/utilities/cloud_io.py ( #14515 )
2022-09-05 18:30:31 +02:00
Rohit Gupta
ce702fd40e
Squeeze tensor while logging ( #14489 )
...
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2022-09-05 14:01:51 +00:00
Roberto de Moura Estevão Filho
ed0164a3d2
Estimate stepping batches with max_steps if max_epochs is not set ( #14317 )
...
Co-authored-by: Roberto Estevão <robertode@microsoft.com>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2022-09-05 09:05:21 +00:00
Adrian Wälchli
28e18881a9
Mark stage argument in hooks as required ( #14064 )
...
Co-authored-by: rohitgr7 <rohitgr1998@gmail.com>
2022-09-01 15:47:40 +02:00
ananthsub
d0d1818d50
Update `has_len_all_ranks` to use `Strategy.root_device` ( #12144 )
...
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2022-08-29 20:23:34 +00:00
Rohit Gupta
f3574176e2
Change `trainer.should_stop` to not stop in between an epoch and run until `min_steps/min_epochs` only ( #13890 )
2022-08-27 12:12:24 +00:00
Adrian Wälchli
250c06e406
Remove deprecated HPC model hooks ( #14315 )
...
Co-authored-by: rohitgr7 <rohitgr1998@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2022-08-26 20:59:32 +00:00
Tianshu Wang
8950613552
save checkpoints and profiler output to the first logger ( #14325 )
...
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2022-08-26 17:23:54 +00:00
Adrian Wälchli
fafd254678
Fix device parser logic to avoid creating CUDA context ( #14319 )
...
* let environment disable forking
* add helper function and error messages
* tests
* changelog
Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2022-08-26 15:41:38 +00:00
Justin Goheen
94e567e6f0
Fix mypy errors attributed to `pytorch_lightning.trainer.connectors.data_connector.py` ( #13806 )
...
Co-authored-by: rohitgr7 <rohitgr1998@gmail.com>
Co-authored-by: otaj <6065855+otaj@users.noreply.github.com>
2022-08-26 13:28:27 +00:00
Carlos Mocholí
7a617ec90e
Add back support for logging in the gradient clipping hooks ( #14298 )
...
* Add back support for logging in the gradient clipping hooks
* Docs and CHANGELOG
* Fix tests
2022-08-22 09:19:53 -04:00
Rohit Gupta
db1835a82c
Fix an issue to avoid the impact of sanity check on `reload_dataloaders_every_n_epochs` for validation ( #13964 )
2022-08-21 23:55:03 +05:30
Rohit Gupta
e949362a6b
Enable `on_before_batch_transfer` for `DPStrategy` and `IPUAccelerator` ( #14023 )
...
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2022-08-18 12:12:29 +00:00
Rohit Gupta
c8e22b4572
Avoid raising the sampler warning if num_replicas=1 ( #14097 )
...
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: otaj <6065855+otaj@users.noreply.github.com>
2022-08-12 08:44:21 +00:00
Adrian Wälchli
807f9d8c96
Replace unwrapping logic in strategies ( #13738 )
...
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
2022-08-12 08:24:04 +00:00