Commit Graph

115 Commits

Author SHA1 Message Date
awaelchli b3c869f636
Revise checkpoint consolidation with PyTorch 2.3 (#19561) 2024-03-04 10:13:31 -05:00
awaelchli a41528c2a6
Update tests for PyTorch 2.2.1 (#19521) 2024-02-23 13:11:34 -05:00
Jirka Borovec 99fe6563ef
precommit: ruff-format (#19434)
* precommit: ruff-format

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* manual update

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* manual update

* order

* mypy

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* mypy

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-02-15 13:39:17 -05:00
awaelchli 265025bd5d
Inform the user about a missing `fabric.backward()` call (#19447) 2024-02-14 17:49:11 -05:00
Carlos Mocholí 67459944ea
Avoid FSDP deprecations during save/load with newer torch versions (#19463)
* Avoid FSDP deprecations during save/load with newer torch versions

* Refactor

* Tests
2024-02-14 19:43:59 +01:00
nik777 7a56ac5182
Support shortcut name for DeepSpeed stage 1 offload (#19075) 2024-02-05 20:53:18 -05:00
awaelchli fb0ce03a9c
Fix input validation to support passing `device_mesh` to FSDP (#19392)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2024-02-02 06:48:12 -05:00
awaelchli 01f8531c9d
Refactor BoringFabric in tests (#19364) 2024-01-30 23:32:45 +01:00
awaelchli 1a59097ab2
Drop support for PyTorch 1.12 (#19300)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
2024-01-26 11:44:24 -05:00
Jirka Borovec 3bd133b107
CI: enable testing with coming PT 2.2 (#19289)
* ci: build dockers for PT 2.2
* py3.12
* --pre --extra-index-url
* typing-extensions
* bump jsonargparse
* install latest jsonargparse
* Add windows skips for Fabric
* convert to xfail
* add pytorch skips
* skip checkpoint consolidation test
* set max torch

---------

Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-01-26 16:42:09 +01:00
awaelchli 7cc79fe7ba
Reapply `torch.compile` in Fabric.setup() (#19280) 2024-01-23 21:17:41 -05:00
awaelchli b1127e3608
Utility to consolidate sharded checkpoints (#19213)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2024-01-23 17:15:22 -05:00
awaelchli 75e112f138
Support gradient clipping by value in Fabric FSDP (#19236) 2024-01-11 17:28:30 +01:00
awaelchli b5d4ee5e61
Fix XLA test for syncing module states (#19264)
Fix tpu test
2024-01-10 19:36:18 +01:00
awaelchli d10e918ce0
Rewrite gradient clipping tests (#19262) 2024-01-10 12:39:56 -05:00
Carlos Mocholí a1dd9efcf7
Drop XLA XRT support (#19232)
* Drop XLA XRT support
* update test
* set launched
* update conftest
* xla available check
---------

Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-01-10 18:39:20 +01:00
Carlos Mocholí c3e2ba52ca
`set_device` before `init_process_group` (#19184) 2023-12-21 16:28:16 +01:00
Carlos Mocholí 234ded89d4
Avoid moving the model to device if `move_to_device=False` (#19152) 2023-12-15 00:00:21 +01:00
Adrian Wälchli d4614d043e
Address test flakiness (#19022)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2023-11-21 17:11:00 -05:00
Adrian Wälchli e66be675d2
Refined FSDP saving logic and error messaging when path exists (#18884) 2023-10-30 10:05:28 -04:00
Carlos Mocholí 5a83f541da
Minor strategy fixes [TPU] (#18774) 2023-10-11 15:26:30 +02:00
Adrian Wälchli 5d819c91fb
Remove `fsdp_overlap_step_with_backward` in favor of native solution (#18726) 2023-10-06 08:11:41 -04:00
Adrian Wälchli c514f1cbea
Enable PyTorch 2.1 (#18718) 2023-10-06 07:17:03 -04:00
Adrian Wälchli d31ef1f7d3
Drop support for PyTorch 1.11 (#18691)
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
2023-10-04 20:30:44 +02:00
Adrian Wälchli d05cd3fa0a
Fix KeyError when calling `Fabric.load_raw` before setting up an FSDP model (#18647) 2023-09-29 07:35:27 -04:00
Carlos Mocholí 70a11d9739
Forbid non-FSDP precision plugins with FSDP (#18664) 2023-09-29 10:07:51 +02:00
Jirka Borovec 830a62a722
ruff: replace isort with ruff +TPU (#17684)
* ruff: replace isort with ruff

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fixing & imports

* lines in warning test

* docs

* fix enum import

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fixing

* import

* fix lines

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* type ClusterEnvironment

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-09-26 11:54:55 -04:00
Adrian Wälchli 8094855137
Avoid passing process group to enable FSDP's hybrid-shard (#18583) 2023-09-19 13:46:24 -04:00
Jirka Borovec dbe7ed46a3
replace tests skip with soft xfail (#18486)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-09-12 23:11:03 +02:00
Adrian Wälchli c959df74b8
Support saving and loading stateful objects in Fabric (#18513)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-09-12 07:58:52 -04:00
Carlos Mocholí e1c5c5ae4a
[TPU] Set the compute_dtype with XLAFSDP (#18497) 2023-09-07 18:43:21 +02:00
Carlos Mocholí 729e833935
[TPU] XLAFSDP checkpointing fixes (#18424)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2023-08-31 05:47:11 -07:00
Carlos Mocholí fdda1968d3
[TPU] Support setting the XLAFSDP policy with a set (#18430) 2023-08-30 09:25:12 -07:00
Adrian Wälchli 2f3491a739
Fix saving FSDP checkpoint when world size = 1 and torch <= 2.0 (#18371) 2023-08-23 06:45:20 -04:00
Adrian Wälchli fcc7b00116
Improve error message when "if name == main" guard is needed (#18298) 2023-08-18 08:53:01 -04:00
Carlos Mocholí fcb8e17303
[TPU] Preserve the device with XLA's collectives (#18275) 2023-08-16 22:56:41 +02:00
Carlos Mocholí 58e21b4b74
[TPU] Add `sequential_save` to save FSDP checkpoint shards sequentially (#18283) 2023-08-16 21:07:19 +02:00
Adrian Wälchli a0ca2c8bcd
Disable memory sharing on model parameters in ddp-spawn (#18238)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-08-15 14:39:51 +02:00
Adrian Wälchli 3142ed5e44
Integration tests for XLA precision (#18286) 2023-08-13 09:20:26 -04:00
Adrian Wälchli c95dbac2e8
Validate Trainer settings against cluster environment (#18292) 2023-08-12 21:26:37 +02:00
Adrian Wälchli 03ca31c3d3
Avoid updating the device for XLA FSDP in `Fabric.setup()` [TPU] (#18276) 2023-08-11 22:00:23 -04:00
Jirka Borovec efa7b2f9ef
docformatter: config with black (#18064)
* docformatter: config with black

* additional_dependencies: [tomli]

* 119

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-08-09 10:44:20 -04:00
Adrian Wälchli 7e13eb7299
Monitor subprocesses to avoid zombies (#18218)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-08-08 09:25:21 +02:00
Gerson Kroiz d7c2e597a1
[TPU] Add Fabric support for PyTorch XLA FSDP (#18126)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
2023-08-02 12:56:00 -04:00
Adrian Wälchli 50e01c7012
Meta device initialization for FSDP in Fabric (#18122)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-08-02 07:58:32 -04:00
Adrian Wälchli d9493545cf
Allow accessing rank information before processes are launched in XLA (#18194) 2023-07-31 10:37:35 -04:00
Adrian Wälchli 508f02a624
Remove the unused `checkpoint_io` argument from the `FSDPStrategy` in Fabric (#18192) 2023-07-31 04:07:32 -04:00
Carlos Mocholí 0e7e6b31c5
Fix [TPU] tests (#18140)
* Fix [TPU] tests

* More
2023-07-24 15:13:36 +02:00
Carlos Mocholí 01b82e4fb1
Minor miscellaneous fixes (#18077)
* Various miscellaneous fixes

* Update

* Update

* succeeded

* Comment everywhere

* hasattr
2023-07-20 14:44:51 +02:00
Adrian Wälchli d6b5f3af15
Fix "optimizer in backward" compatibility with torch 2.1 nightly (#18119)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-07-20 07:22:54 -04:00