awaelchli
95d6b6b9da
Disable skipping training step in distributed training ( #19918 )
2024-05-30 11:54:48 -04:00
awaelchli
5d7932546d
Update code owners file ( #19925 )
update
2024-05-30 11:50:02 -04:00
awaelchli
014cdd84ed
Update code owners file ( #19922 )
* update code owners
* update
* Update .github/CODEOWNERS
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
---------
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
2024-05-30 06:12:41 -04:00
awaelchli
98005bbed0
Add Studio badge to tensor parallel docs ( #19913 )
2024-05-28 09:04:55 -04:00
awaelchli
896c2a656a
Error for unsupported precision types with ModelParallelStrategy ( #19902 )
2024-05-23 13:43:46 -04:00
awaelchli
c09356db1e
(10/10) Support 2D Parallelism - Port Fabric docs to PL ( #19899 )
2024-05-23 08:55:52 -04:00
awaelchli
7874cd08ec
[TPU] Fix test assertion error from artifacts ( #19825 )
2024-05-23 07:11:28 -04:00
Jirka Borovec
e0d7ede643
docs: prune unused `linkcode` ( #19897 )
2024-05-23 11:35:53 +02:00
awaelchli
414c86332e
(9/n) Support 2D Parallelism - Remaining Checkpoint Logic ( #19888 )
Co-authored-by: Luca Antiga <luca.antiga@gmail.com>
2024-05-22 18:13:41 -04:00
Jirka Borovec
fa1126ea53
docs: fix link to CLIP ( #19896 )
* docs: fix link to CLIP
* www
* ignore
2024-05-22 17:46:51 -04:00
awaelchli
341474aaac
(8/n) Support 2D Parallelism - 2D Parallel Fabric Docs ( #19887 )
2024-05-22 13:47:55 -04:00
awaelchli
8fc7b4ae94
Remove the requirement for FSDPStrategy subclasses to only support GPU ( #19894 )
2024-05-22 18:31:40 +02:00
awaelchli
987c2c4093
(7/n) Support 2D Parallelism - TP Fabric Docs ( #19884 )
Co-authored-by: Sebastian Raschka <mail@sebastianraschka.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2024-05-22 06:20:40 -04:00
awaelchli
7e87ce05c8
Fix state dict loading in bitsandbytes plugin when checkpoint is already quantized ( #19886 )
* bugfix
* add test
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* update
* add chlog
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-05-21 13:46:01 -04:00
Gilles Peiffer
b1bb3f3173
Update `LearningRateMonitor` docs and tests for `log_weight_decay` ( #19805 )
2024-05-21 13:31:54 -04:00
awaelchli
d76feef0d6
Enable loss-parallel in example ( #19882 )
2024-05-20 13:19:38 +02:00
awaelchli
82e6e61bea
Remove redundant code to set the device on the LightningModule ( #19877 )
Co-authored-by: Luca Antiga <luca.antiga@gmail.com>
2024-05-20 06:29:37 +02:00
Luca Antiga
d5bf4b9ed3
[App] Extend retry to 4xx except 400, 401, 403, 404 ( #19842 )
* Extend retry to 4xx except 400, 401, 403, 404
* Remove unused intersphinx mapping for app
---------
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
2024-05-18 22:03:16 -04:00
awaelchli
c8059d7bfd
(6/n) Support 2D Parallelism - Trainer example ( #19879 )
* Add 2D parallel example
* replace with torchtitan code
2024-05-18 20:35:58 -04:00
awaelchli
32e241870b
(5/n) Support 2D Parallelism in Lightning Trainer ( #19878 )
* ModelParallelStrategy for Lightning Trainer
* mypy
* import fix
* fix torchscript errors
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix docs issue
* fix test execution
* Update src/lightning/pytorch/strategies/model_parallel.py
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Luca Antiga <luca.antiga@gmail.com>
2024-05-17 19:03:31 -04:00
awaelchli
1d0c6aae96
(4/n) Support 2D Parallelism - Loading optimizer states correctly ( #19872 )
* Load optimizer state
* move to utility
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-05-17 17:17:32 -04:00
awaelchli
cd8acc26c3
(3/n) Support 2D Parallelism - Efficient loading of full-state checkpoints ( #19870 )
* memory-optimized loading of full checkpoints into dist model
* simplify
* handle buffers
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* handle strict loading, buffers, and add test
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* chlog
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-05-15 13:07:31 -04:00
awaelchli
9455871c93
(2/n) Support 2D Parallelism - Distributed Checkpoints ( #19852 )
* distributed checkpoints
* use decorator
* refactor if-strict
* update example
* filter non-persistent buffers (todo, add test)
* simplify checkpoint loading for model
2024-05-15 08:19:08 -04:00
thomas chaton
90d04b5b86
Update Lightning Cloud 0.5.69 ( #19857 )
2024-05-09 16:12:30 +01:00
thomas chaton
8453e31028
Reduce queue fetching ( #19856 )
* update
* update
2024-05-09 07:46:27 -04:00
awaelchli
e0307277a0
Add function to explicitly mark forward methods in Fabric ( #19690 )
Co-authored-by: Sebastian Raschka <mail@sebastianraschka.com>
2024-05-08 16:58:33 -04:00
awaelchli
0c8a193d3c
(1/n) Support 2D Parallelism ( #19846 )
2024-05-07 17:02:58 -04:00
Adrian Wälchli
0f12271d7f
bump lightning cloud
2024-05-01 18:45:35 -04:00
Luca Antiga
d623708192
xfail tests for deprecated functionality
2024-05-01 17:51:51 -04:00
Luca Antiga
4219f30c96
Fix formatting
2024-05-01 17:51:51 -04:00
Luca Antiga
8103bd7e01
Make sure the HTTP client for queues retries for POST and 5xx
2024-05-01 17:51:51 -04:00
Adrian Wälchli
d1949766f8
Fix TensorBoardLogger test on Windows ( #19824 )
2024-04-29 08:51:56 -04:00
Adrian Wälchli
49ed2b102b
Add PyTorch 2.3 to CI matrix ( #19708 )
2024-04-29 07:16:13 -04:00
Adrian Wälchli
29136332d6
Avoid interactions through test artifacts ( #19821 )
2024-04-28 11:56:40 -04:00
Adrian Wälchli
5e0e02b79e
Remove support for PyTorch 1.13 ( #19706 )
2024-04-27 01:24:07 -04:00
Adrian Wälchli
b9680a364d
Update changelog after 2.2.2 release ( #19770 )
2024-04-22 13:52:43 -04:00
thomas chaton
a2b3dddf1d
Update Lightning Cloud to 0.5.67 ( #19795 )
2024-04-22 17:47:04 +01:00
awaelchli
c235f20e71
Remove the requirement for FSDPStrategy subclasses to only support GPU ( #19781 )
2024-04-17 01:28:44 +02:00
David de la Iglesia Castro
58ad56afec
Use `step` interval in `estimated_stepping_batches` docs example ( #19774 )
2024-04-15 10:16:17 -04:00
awaelchli
ce90b3898a
Sanitize hparams that can't be json-serialized in `WandbLogger.log_hyperparameters()` ( #19769 )
2024-04-14 15:01:58 +02:00
PL Ghost
67b270bd4d
Adding test for legacy checkpoint created with 2.2.2 ( #19760 )
2024-04-12 09:19:39 -04:00
Jirka Borovec
f642d68508
ci/lint: simlify prettier ( #19742 )
2024-04-12 13:11:21 +02:00
pre-commit-ci[bot]
3f97e16cd4
[pre-commit.ci] pre-commit suggestions ( #19723 )
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
2024-04-12 06:40:25 -04:00
awaelchli
dcb91d53d2
Fix initialized weights resetting in `Fabric.setup()` when using FSDP ( #19755 )
2024-04-11 05:52:28 -04:00
awaelchli
316cc71c2b
Skip tests that cause CLI argparse errors on Python 3.11.9 ( #19756 )
2024-04-11 05:01:27 -04:00
Dominic Kerr
76b691d80c
Support pathlib.Path file paths when saving ONNX models ( #19727 )
Co-authored-by: dominicgkerr <dominicgkerr1@gmail.co>
2024-04-03 20:42:25 -04:00
Alexander Jipa
ce88483c6f
Add synchronous parameter to MLflowLogger ( #19639 )
Co-authored-by: Alexander Jipa <azzhipa@amazon.com>
2024-04-03 18:16:14 -04:00
awaelchli
8947d135d6
Skip test with compile error on torch=2.2.2 on Windows ( #19734 )
2024-04-03 17:53:46 -04:00
dependabot[bot]
d25014dbda
build(deps): bump Lightning-AI/utilities from 0.11.0 to 0.11.2 ( #19719 )
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-04-01 10:38:05 -04:00
awaelchli
438f29f07a
Relax restrictions on wrapping a custom batch sampler in predict ( #19678 )
2024-03-27 23:45:50 +01:00