130 Commits

Author SHA1 Message Date
awaelchli 7d1a70752f
Update PyTorch 2.4 tests (#20079) 2024-07-13 05:09:09 -04:00
awaelchli 5829ef8ab3
Set `weights_only` in tests to avoid warnings in PyTorch 2.4 (#20057) 2024-07-08 04:38:27 -04:00
awaelchli 693c21ac1b
Add testing for PyTorch 2.4 (Fabric) (#20028) 2024-07-02 18:01:03 -04:00
awaelchli 14493c0685
Drop PyTorch 2.0 from the test matrix (#20009) 2024-06-30 18:02:00 -04:00
awaelchli e330da5870
Fix torch-numpy compatibility conflict in tests (#20004) 2024-06-21 20:20:59 -04:00
Liyang90 7668a6bf59
Flexible and easy to use HSDP setting (#19504)
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
2024-06-05 20:15:03 -04:00
awaelchli 896c2a656a
Error for unsupported precision types with ModelParallelStrategy (#19902) 2024-05-23 13:43:46 -04:00
awaelchli 32e241870b
(5/n) Support 2D Parallelism in Lightning Trainer (#19878)
* ModelParallelStrategy for Lightning Trainer

* mypy

* import fix

* fix torchscript errors

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix docs issue

* fix test execution

* Update src/lightning/pytorch/strategies/model_parallel.py

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Luca Antiga <luca.antiga@gmail.com>
2024-05-17 19:03:31 -04:00
awaelchli 1d0c6aae96
(4/n) Support 2D Parallelism - Loading optimizer states correctly (#19872)
* Load optimizer state

* move to utility

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-05-17 17:17:32 -04:00
awaelchli cd8acc26c3
(3/n) Support 2D Parallelism - Efficient loading of full-state checkpoints (#19870)
* memory-optimized loading of full checkpoints into dist model

* simplify

* handle buffers

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* handle strict loading, buffers, and add test

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* chlog

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-05-15 13:07:31 -04:00
awaelchli 9455871c93
(2/n) Support 2D Parallelism - Distributed Checkpoints (#19852)
* distributed checkpoints

* use decorator

* refactor if-strict

* update example

* filter non-persistent buffers (todo, add test)

* simplify checkpoint loading for model
2024-05-15 08:19:08 -04:00
awaelchli 0c8a193d3c
(1/n) Support 2D Parallelism (#19846) 2024-05-07 17:02:58 -04:00
Adrian Wälchli 5e0e02b79e
Remove support for PyTorch 1.13 (#19706) 2024-04-27 01:24:07 -04:00
awaelchli dcb91d53d2
Fix initialized weights resetting in `Fabric.setup()` when using FSDP (#19755) 2024-04-11 05:52:28 -04:00
Carlos Mocholí 06eb3cc28b
Pass `enabled` down to `_BackwardSyncControl` (#19577) 2024-03-08 11:48:16 +01:00
awaelchli b3c869f636
Revise checkpoint consolidation with PyTorch 2.3 (#19561) 2024-03-04 10:13:31 -05:00
awaelchli a41528c2a6
Update tests for PyTorch 2.2.1 (#19521) 2024-02-23 13:11:34 -05:00
Jirka Borovec 99fe6563ef
precommit: ruff-format (#19434)
* precommit: ruff-format

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* manual update

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* manual update

* order

* mypy

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* mypy

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-02-15 13:39:17 -05:00
awaelchli 265025bd5d
Inform the user about a missing `fabric.backward()` call (#19447) 2024-02-14 17:49:11 -05:00
Carlos Mocholí 67459944ea
Avoid FSDP deprecations during save/load with newer torch versions (#19463)
* Avoid FSDP deprecations during save/load with newer torch versions

* Refactor

* Tests
2024-02-14 19:43:59 +01:00
nik777 7a56ac5182
Support shortcut name for DeepSpeed stage 1 offload (#19075) 2024-02-05 20:53:18 -05:00
awaelchli fb0ce03a9c
Fix input validation to support passing `device_mesh` to FSDP (#19392)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2024-02-02 06:48:12 -05:00
awaelchli 01f8531c9d
Refactor BoringFabric in tests (#19364) 2024-01-30 23:32:45 +01:00
awaelchli 1a59097ab2
Drop support for PyTorch 1.12 (#19300)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
2024-01-26 11:44:24 -05:00
Jirka Borovec 3bd133b107
CI: enable testing with coming PT 2.2 (#19289)
* ci: build dockers for PT 2.2
* py3.12
* --pre --extra-index-url
* typing-extensions
* bump jsonargparse
* install latest jsonargparse
* Add windows skips for Fabric
* convert to xfail
* add pytorch skips
* skip checkpoint consolidation test
* set max torch

---------

Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-01-26 16:42:09 +01:00
awaelchli 7cc79fe7ba
Reapply `torch.compile` in Fabric.setup() (#19280) 2024-01-23 21:17:41 -05:00
awaelchli b1127e3608
Utility to consolidate sharded checkpoints (#19213)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2024-01-23 17:15:22 -05:00
awaelchli 75e112f138
Support gradient clipping by value in Fabric FSDP (#19236) 2024-01-11 17:28:30 +01:00
awaelchli b5d4ee5e61
Fix XLA test for syncing module states (#19264)
Fix tpu test
2024-01-10 19:36:18 +01:00
awaelchli d10e918ce0
Rewrite gradient clipping tests (#19262) 2024-01-10 12:39:56 -05:00
Carlos Mocholí a1dd9efcf7
Drop XLA XRT support (#19232)
* Drop XLA XRT support
* update test
* set launched
* update conftest
* xla available check
---------

Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-01-10 18:39:20 +01:00
Carlos Mocholí c3e2ba52ca
`set_device` before `init_process_group` (#19184) 2023-12-21 16:28:16 +01:00
Carlos Mocholí 234ded89d4
Avoid moving the model to device if `move_to_device=False` (#19152) 2023-12-15 00:00:21 +01:00
Adrian Wälchli d4614d043e
Address test flakiness (#19022)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2023-11-21 17:11:00 -05:00
Adrian Wälchli e66be675d2
Refined FSDP saving logic and error messaging when path exists (#18884) 2023-10-30 10:05:28 -04:00
Carlos Mocholí 5a83f541da
Minor strategy fixes [TPU] (#18774) 2023-10-11 15:26:30 +02:00
Adrian Wälchli 5d819c91fb
Remove `fsdp_overlap_step_with_backward` in favor of native solution (#18726) 2023-10-06 08:11:41 -04:00
Adrian Wälchli c514f1cbea
Enable PyTorch 2.1 (#18718) 2023-10-06 07:17:03 -04:00
Adrian Wälchli d31ef1f7d3
Drop support for PyTorch 1.11 (#18691)
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
2023-10-04 20:30:44 +02:00
Adrian Wälchli d05cd3fa0a
Fix KeyError when calling `Fabric.load_raw` before setting up an FSDP model (#18647) 2023-09-29 07:35:27 -04:00
Carlos Mocholí 70a11d9739
Forbid non-FSDP precision plugins with FSDP (#18664) 2023-09-29 10:07:51 +02:00
Jirka Borovec 830a62a722
ruff: replace isort with ruff +TPU (#17684)
* ruff: replace isort with ruff

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fixing & imports

* lines in warning test

* docs

* fix enum import

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fixing

* import

* fix lines

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* type ClusterEnvironment

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-09-26 11:54:55 -04:00
Adrian Wälchli 8094855137
Avoid passing process group to enable FSDP's hybrid-shard (#18583) 2023-09-19 13:46:24 -04:00
Jirka Borovec dbe7ed46a3
replace tests skip with soft xfail (#18486)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-09-12 23:11:03 +02:00
Adrian Wälchli c959df74b8
Support saving and loading stateful objects in Fabric (#18513)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-09-12 07:58:52 -04:00
Carlos Mocholí e1c5c5ae4a
[TPU] Set the compute_dtype with XLAFSDP (#18497) 2023-09-07 18:43:21 +02:00
Carlos Mocholí 729e833935
[TPU] XLAFSDP checkpointing fixes (#18424)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2023-08-31 05:47:11 -07:00
Carlos Mocholí fdda1968d3
[TPU] Support setting the XLAFSDP policy with a set (#18430) 2023-08-30 09:25:12 -07:00
Adrian Wälchli 2f3491a739
Fix saving FSDP checkpoint when world size = 1 and torch <= 2.0 (#18371) 2023-08-23 06:45:20 -04:00
Adrian Wälchli fcc7b00116
Improve error message when "if name == main" guard is needed (#18298) 2023-08-18 08:53:01 -04:00