10455 Commits

Author SHA1 Message Date
awaelchli 95d6b6b9da
Disable skipping training step in distributed training (#19918) 2024-05-30 11:54:48 -04:00
awaelchli 5d7932546d
Update code owners file (#19925)
update
2024-05-30 11:50:02 -04:00
awaelchli 014cdd84ed
Update code owners file (#19922)
* update code owners

* update

* Update .github/CODEOWNERS

Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>

---------

Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
2024-05-30 06:12:41 -04:00
awaelchli 98005bbed0
Add Studio badge to tensor parallel docs (#19913) 2024-05-28 09:04:55 -04:00
awaelchli 896c2a656a
Error for unsupported precision types with ModelParallelStrategy (#19902) 2024-05-23 13:43:46 -04:00
awaelchli c09356db1e
(10/10) Support 2D Parallelism - Port Fabric docs to PL (#19899) 2024-05-23 08:55:52 -04:00
awaelchli 7874cd08ec
[TPU] Fix test assertion error from artifacts (#19825) 2024-05-23 07:11:28 -04:00
Jirka Borovec e0d7ede643
docs: prune unused `linkcode` (#19897) 2024-05-23 11:35:53 +02:00
awaelchli 414c86332e
(9/n) Support 2D Parallelism - Remaining Checkpoint Logic (#19888)
Co-authored-by: Luca Antiga <luca.antiga@gmail.com>
2024-05-22 18:13:41 -04:00
Jirka Borovec fa1126ea53
docs: fix link to CLIP (#19896)
* docs: fix link to CLIP

* www

* ignore
2024-05-22 17:46:51 -04:00
awaelchli 341474aaac
(8/n) Support 2D Parallelism - 2D Parallel Fabric Docs (#19887) 2024-05-22 13:47:55 -04:00
awaelchli 8fc7b4ae94
Remove the requirement for FSDPStrategy subclasses to only support GPU (#19894) 2024-05-22 18:31:40 +02:00
awaelchli 987c2c4093
(7/n) Support 2D Parallelism - TP Fabric Docs (#19884)
Co-authored-by: Sebastian Raschka <mail@sebastianraschka.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2024-05-22 06:20:40 -04:00
awaelchli 7e87ce05c8
Fix state dict loading in bitsandbytes plugin when checkpoint is already quantized (#19886)
* bugfix

* add test

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update

* add chlog

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-05-21 13:46:01 -04:00
Gilles Peiffer b1bb3f3173
Update `LearningRateMonitor` docs and tests for `log_weight_decay` (#19805) 2024-05-21 13:31:54 -04:00
awaelchli d76feef0d6
Enable loss-parallel in example (#19882) 2024-05-20 13:19:38 +02:00
awaelchli 82e6e61bea
Remove redundant code to set the device on the LightningModule (#19877)
Co-authored-by: Luca Antiga <luca.antiga@gmail.com>
2024-05-20 06:29:37 +02:00
Luca Antiga d5bf4b9ed3
[App] Extend retry to 4xx except 400, 401, 403, 404 (#19842)
* Extend retry to 4xx except 400, 401, 403, 404

* Remove unused intersphinx mapping for app

---------

Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
2024-05-18 22:03:16 -04:00
awaelchli c8059d7bfd
(6/n) Support 2D Parallelism - Trainer example (#19879)
* Add 2D parallel example

* replace with torchtitan code
2024-05-18 20:35:58 -04:00
awaelchli 32e241870b
(5/n) Support 2D Parallelism in Lightning Trainer (#19878)
* ModelParallelStrategy for Lightning Trainer

* mypy

* import fix

* fix torchscript errors

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix docs issue

* fix test execution

* Update src/lightning/pytorch/strategies/model_parallel.py

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Luca Antiga <luca.antiga@gmail.com>
2024-05-17 19:03:31 -04:00
awaelchli 1d0c6aae96
(4/n) Support 2D Parallelism - Loading optimizer states correctly (#19872)
* Load optimizer state

* move to utility

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-05-17 17:17:32 -04:00
awaelchli cd8acc26c3
(3/n) Support 2D Parallelism - Efficient loading of full-state checkpoints (#19870)
* memory-optimized loading of full checkpoints into dist model

* simplify

* handle buffers

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* handle strict loading, buffers, and add test

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* chlog

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-05-15 13:07:31 -04:00
awaelchli 9455871c93
(2/n) Support 2D Parallelism - Distributed Checkpoints (#19852)
* distributed checkpoints

* use decorator

* refactor if-strict

* update example

* filter non-persistent buffers (todo, add test)

* simplify checkpoint loading for model
2024-05-15 08:19:08 -04:00
thomas chaton 90d04b5b86
Update Lightning Cloud 0.5.69 (#19857) 2024-05-09 16:12:30 +01:00
thomas chaton 8453e31028
Reduce queue fetching (#19856)
* update

* update
2024-05-09 07:46:27 -04:00
awaelchli e0307277a0
Add function to explicitly mark forward methods in Fabric (#19690)
Co-authored-by: Sebastian Raschka <mail@sebastianraschka.com>
2024-05-08 16:58:33 -04:00
awaelchli 0c8a193d3c
(1/n) Support 2D Parallelism (#19846) 2024-05-07 17:02:58 -04:00
Adrian Wälchli 0f12271d7f bump lightning cloud 2024-05-01 18:45:35 -04:00
Luca Antiga d623708192 xfail tests for deprecated functionality 2024-05-01 17:51:51 -04:00
Luca Antiga 4219f30c96 Fix formatting 2024-05-01 17:51:51 -04:00
Luca Antiga 8103bd7e01 Make sure the HTTP client for queues retries for POST and 5xx 2024-05-01 17:51:51 -04:00
Adrian Wälchli d1949766f8
Fix TensorBoardLogger test on Windows (#19824) 2024-04-29 08:51:56 -04:00
Adrian Wälchli 49ed2b102b
Add PyTorch 2.3 to CI matrix (#19708) 2024-04-29 07:16:13 -04:00
Adrian Wälchli 29136332d6
Avoid interactions through test artifacts (#19821) 2024-04-28 11:56:40 -04:00
Adrian Wälchli 5e0e02b79e
Remove support for PyTorch 1.13 (#19706) 2024-04-27 01:24:07 -04:00
Adrian Wälchli b9680a364d
Update changelog after 2.2.2 release (#19770) 2024-04-22 13:52:43 -04:00
thomas chaton a2b3dddf1d
Update Lightning Cloud to 0.5.67 (#19795) 2024-04-22 17:47:04 +01:00
awaelchli c235f20e71
Remove the requirement for FSDPStrategy subclasses to only support GPU (#19781) 2024-04-17 01:28:44 +02:00
David de la Iglesia Castro 58ad56afec
Use `step` interval in `estimated_stepping_batches` docs example (#19774) 2024-04-15 10:16:17 -04:00
awaelchli ce90b3898a
Sanitize hparams that can't be json-serialized in `WandbLogger.log_hyperparameters()` (#19769) 2024-04-14 15:01:58 +02:00
PL Ghost 67b270bd4d
Adding test for legacy checkpoint created with 2.2.2 (#19760) 2024-04-12 09:19:39 -04:00
Jirka Borovec f642d68508
ci/lint: simlify prettier (#19742) 2024-04-12 13:11:21 +02:00
pre-commit-ci[bot] 3f97e16cd4
[pre-commit.ci] pre-commit suggestions (#19723)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
2024-04-12 06:40:25 -04:00
awaelchli dcb91d53d2
Fix initialized weights resetting in `Fabric.setup()` when using FSDP (#19755) 2024-04-11 05:52:28 -04:00
awaelchli 316cc71c2b
Skip tests that cause CLI argparse errors on Python 3.11.9 (#19756) 2024-04-11 05:01:27 -04:00
Dominic Kerr 76b691d80c
Support pathlib.Path file paths when saving ONNX models (#19727)
Co-authored-by: dominicgkerr <dominicgkerr1@gmail.co>
2024-04-03 20:42:25 -04:00
Alexander Jipa ce88483c6f
Add synchronous parameter to MLflowLogger (#19639)
Co-authored-by: Alexander Jipa <azzhipa@amazon.com>
2024-04-03 18:16:14 -04:00
awaelchli 8947d135d6
Skip test with compile error on torch=2.2.2 on Windows (#19734) 2024-04-03 17:53:46 -04:00
dependabot[bot] d25014dbda
build(deps): bump Lightning-AI/utilities from 0.11.0 to 0.11.2 (#19719)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-04-01 10:38:05 -04:00
awaelchli 438f29f07a
Relax restrictions on wrapping a custom batch sampler in predict (#19678) 2024-03-27 23:45:50 +01:00