Commit Graph

10334 Commits

Author SHA1 Message Date
elmuz cec6ae123d
Fix typo `scrict` -> `strict` in types.py (#19998) 2024-06-20 10:57:35 -04:00
Etay Livne 1e83a1bd32
Check if CometLogger experiment is alive (#19915)
Co-authored-by: Etay Livne <etay.livne@mobileye.com>
2024-06-18 13:15:12 -04:00
liambsmith 394c42aaf6
Fix callback call in Fabric Trainer example (#19986) 2024-06-18 13:14:32 -04:00
awaelchli c1af4d0527
Better graceful shutdown for KeyboardInterrupt (#19976) 2024-06-16 10:43:42 -04:00
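The graceful-shutdown commit above is about catching Ctrl-C and tearing down cleanly instead of dying mid-step. A minimal sketch of the pattern, with hypothetical names (`run_loop`, `on_interrupt`) that are not Lightning's actual API:

```python
def run_loop(steps, on_interrupt):
    """Run a sequence of step callables; on Ctrl-C, invoke cleanup
    and return what completed instead of propagating mid-step."""
    completed = []
    try:
        for step in steps:
            completed.append(step())
    except KeyboardInterrupt:
        # Graceful path: close loggers, flush checkpoints, etc.,
        # rather than letting the interrupt kill the process abruptly.
        on_interrupt()
    return completed
```

The key design point is that the interrupt is caught at the loop boundary, so teardown always runs exactly once.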
PL Ghost b16e998a6e
Adding test for legacy checkpoint created with 2.3.0 (#19974) 2024-06-16 09:37:39 -04:00
Samuel Larkin bb511b0baf
Fix minor typo in Trainer's documentation (#19969) 2024-06-13 18:26:46 -04:00
awaelchli a42484cf8e
Fix failing app tests (#19971) 2024-06-13 20:58:34 +01:00
awaelchli f6fd046552
Release 2.3.0 (#19954) 2024-06-11 12:38:56 -04:00
William Falcon a97814af13
Update README.md 2024-06-11 11:01:22 -04:00
William Falcon fa5da26e39
Update README.md (#19968) 2024-06-11 10:04:51 -04:00
Alexander Jipa 06ea3a0571
Fix resetting epoch loop restarting flag in LearningRateFinder (#19819) 2024-06-07 10:52:58 -04:00
Björn Barz 5fa32d95e3
Ignore parameters causing ValueError when dumping to YAML (#19804) 2024-06-06 18:36:28 -04:00
Douwe den Blanken 4f96c83ba0
Sanitize argument-free object params before logging (#19771)
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
2024-06-06 14:51:48 -04:00
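The "sanitize argument-free object params" fix above concerns hyperparameter logging choking on raw object instances. One way such sanitization can work, sketched with `inspect` — this is an illustrative stand-in, not Lightning's actual helper:

```python
import inspect

def sanitize_params(params):
    """Replace objects whose __init__ takes no arguments with their
    class name, so hyperparameter logging gets a plain string instead
    of an unserializable instance. Illustrative sketch only."""
    out = {}
    for key, value in params.items():
        if not isinstance(value, (int, float, str, bool, type(None))):
            sig = inspect.signature(type(value).__init__)
            # Only `self` in the signature means the object is argument-free.
            if len(sig.parameters) == 1:
                out[key] = type(value).__name__
                continue
        out[key] = value
    return out
```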
Bhavay Malhotra a611de0c15
Removing numpy requirement from all files in examples/pytorch/domain_templates (#19947) 2024-06-06 11:02:01 -04:00
Mario Vasilev 812ffdec84
Fix `save_last` type annotation for ModelCheckpoint (#19808) 2024-06-05 20:24:45 -04:00
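For context on the `save_last` annotation fix above: `ModelCheckpoint.save_last` accepts more than a plain bool — notably the special string `"link"` and `None`. A hedged sketch of the widened annotation (the stand-in function is hypothetical, not Lightning's code):

```python
from typing import Literal, Optional, Union

# Sketch of the corrected annotation: True/False, the special
# string "link", or None -- not just `bool`.
SaveLast = Optional[Union[bool, Literal["link"]]]

def make_checkpoint_config(save_last: SaveLast = None) -> dict:
    """Tiny stand-in showing how the annotation is consumed."""
    if save_last not in (None, True, False, "link"):
        raise ValueError(f"Invalid save_last: {save_last!r}")
    return {"save_last": save_last}
```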
Liyang90 7668a6bf59
Flexible and easy to use HSDP setting (#19504)
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
2024-06-05 20:15:03 -04:00
awaelchli 1a6786d682
Destroy process group in atexit handler (#19931) 2024-06-05 19:31:43 -04:00
Gilles Peiffer b9f215d7fd
Replace usage of `grep -P` with `perl` in `run_standalone_tests.sh` (#19942) 2024-06-05 12:32:56 -04:00
Jirka Borovec e0b7c04e63
ci/docs: enable dispatch build without warning as errors (#19948) 2024-06-05 12:32:36 -04:00
Yurij Mikhalevich 5aadfa6250
fix(docs): fix broken link to ensure the docs can be built (#19941)
* fix(docs): fix broken link to ensure the docs can be built

* nit
2024-06-04 22:11:20 -04:00
awaelchli 8bfbe0c908
Fix strict loading from distributed checkpoints vs PyTorch nightly (#19946)
* strict loading

* docstring
2024-06-04 22:09:01 -04:00
Federico Berto 19f0fb978c
Set `_choose_auto_accelerator` to `staticmethod` (#19822) 2024-06-04 21:12:27 -04:00
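The `staticmethod` change above reflects a common refactor: a method that never touches `self` can be a `staticmethod`, callable without constructing the class. A generic sketch (the class and return value are hypothetical, not Lightning's real detection logic):

```python
class Connector:
    @staticmethod
    def choose_auto_accelerator() -> str:
        """Pure decision logic with no instance state, so it is
        callable as Connector.choose_auto_accelerator() directly."""
        # Illustrative placeholder decision only.
        return "cpu"
```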
Alex Spies 351bec7625
Fix typo on `estimated_stepping_batches` property (#19847) 2024-06-04 21:06:16 -04:00
Gilles Peiffer 785f15d148
Remove `numpy` dependencies in `src/lightning/pytorch` (#19841) 2024-06-04 19:45:05 -04:00
Matthew Hoffman bac82b83a8
Remove unknown `[metadata]` table from `pyproject.toml` (#19904) 2024-06-04 19:43:18 -04:00
Gilles Peiffer fd86ea7356
Fix typos in CONTRIBUTING.md (#19937) 2024-06-03 21:20:01 +02:00
PL Ghost a99a6d3af1
Adding test for legacy checkpoint created with 2.2.5 (#19806) 2024-05-31 12:53:54 -04:00
awaelchli 427fdfaf6e
Update docstring for `self.log` about keys in distributed training (#19917) 2024-05-30 19:47:48 +02:00
Ivan Yashchuk dffc0f96ec
Update FlopCounterMode usage in throughput.py (#19926)
The `mods` argument is no longer needed for `FlopCounterMode`:
ffe506e853/torch/utils/flop_counter.py (L595-L596)
2024-05-30 12:14:56 -04:00
awaelchli 95d6b6b9da
Disable skipping training step in distributed training (#19918) 2024-05-30 11:54:48 -04:00
awaelchli 5d7932546d
Update code owners file (#19925)
2024-05-30 11:50:02 -04:00
awaelchli 014cdd84ed
Update code owners file (#19922)
* update code owners

* update

* Update .github/CODEOWNERS

Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>


2024-05-30 06:12:41 -04:00
awaelchli 98005bbed0
Add Studio badge to tensor parallel docs (#19913) 2024-05-28 09:04:55 -04:00
awaelchli 896c2a656a
Error for unsupported precision types with ModelParallelStrategy (#19902) 2024-05-23 13:43:46 -04:00
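The commit above adds an early error for precision types the strategy cannot handle. The fail-fast validation pattern looks roughly like this — the supported set below is an assumption for illustration, not the strategy's actual list:

```python
SUPPORTED_PRECISION = {"32-true", "16-mixed", "bf16-mixed"}  # assumed set

def validate_precision(precision: str) -> str:
    """Raise a clear error up front instead of misbehaving later.
    Illustrative sketch; the real supported set may differ."""
    if precision not in SUPPORTED_PRECISION:
        raise ValueError(
            f"Unsupported precision {precision!r}; "
            f"choose from {sorted(SUPPORTED_PRECISION)}"
        )
    return precision
```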
awaelchli c09356db1e
(10/10) Support 2D Parallelism - Port Fabric docs to PL (#19899) 2024-05-23 08:55:52 -04:00
awaelchli 7874cd08ec
[TPU] Fix test assertion error from artifacts (#19825) 2024-05-23 07:11:28 -04:00
Jirka Borovec e0d7ede643
docs: prune unused `linkcode` (#19897) 2024-05-23 11:35:53 +02:00
awaelchli 414c86332e
(9/n) Support 2D Parallelism - Remaining Checkpoint Logic (#19888)
Co-authored-by: Luca Antiga <luca.antiga@gmail.com>
2024-05-22 18:13:41 -04:00
Jirka Borovec fa1126ea53
docs: fix link to CLIP (#19896)
* docs: fix link to CLIP

* www

* ignore
2024-05-22 17:46:51 -04:00
awaelchli 341474aaac
(8/n) Support 2D Parallelism - 2D Parallel Fabric Docs (#19887) 2024-05-22 13:47:55 -04:00
awaelchli 8fc7b4ae94
Remove the requirement for FSDPStrategy subclasses to only support GPU (#19894) 2024-05-22 18:31:40 +02:00
awaelchli 987c2c4093
(7/n) Support 2D Parallelism - TP Fabric Docs (#19884)
Co-authored-by: Sebastian Raschka <mail@sebastianraschka.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2024-05-22 06:20:40 -04:00
awaelchli 7e87ce05c8
Fix state dict loading in bitsandbytes plugin when checkpoint is already quantized (#19886)
* bugfix

* add test

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update

* add chlog


Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-05-21 13:46:01 -04:00
Gilles Peiffer b1bb3f3173
Update `LearningRateMonitor` docs and tests for `log_weight_decay` (#19805) 2024-05-21 13:31:54 -04:00
awaelchli d76feef0d6
Enable loss-parallel in example (#19882) 2024-05-20 13:19:38 +02:00
awaelchli 82e6e61bea
Remove redundant code to set the device on the LightningModule (#19877)
Co-authored-by: Luca Antiga <luca.antiga@gmail.com>
2024-05-20 06:29:37 +02:00
Luca Antiga d5bf4b9ed3
[App] Extend retry to 4xx except 400, 401, 403, 404 (#19842)
* Extend retry to 4xx except 400, 401, 403, 404

* Remove unused intersphinx mapping for app


Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
2024-05-18 22:03:16 -04:00
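The retry-policy commit above encodes a simple rule: retry transient server errors (5xx) and most 4xx responses, but treat 400, 401, 403, and 404 as permanent client errors. A sketch of that predicate (hypothetical helper, not the App framework's actual code):

```python
def should_retry(status: int) -> bool:
    """Retry on 5xx and on 4xx, except the client errors the
    commit lists as permanent: 400, 401, 403, 404."""
    if status in (400, 401, 403, 404):
        return False
    return 400 <= status < 600
```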
awaelchli c8059d7bfd
(6/n) Support 2D Parallelism - Trainer example (#19879)
* Add 2D parallel example

* replace with torchtitan code
2024-05-18 20:35:58 -04:00
awaelchli 32e241870b
(5/n) Support 2D Parallelism in Lightning Trainer (#19878)
* ModelParallelStrategy for Lightning Trainer

* mypy

* import fix

* fix torchscript errors

* [pre-commit.ci] auto fixes from pre-commit.com hooks


* fix docs issue

* fix test execution

* Update src/lightning/pytorch/strategies/model_parallel.py


Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Luca Antiga <luca.antiga@gmail.com>
2024-05-17 19:03:31 -04:00
awaelchli 1d0c6aae96
(4/n) Support 2D Parallelism - Loading optimizer states correctly (#19872)
* Load optimizer state

* move to utility

* [pre-commit.ci] auto fixes from pre-commit.com hooks



Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-05-17 17:17:32 -04:00