Commit Graph

322 commits. Each entry below lists the author, abbreviated SHA-1, commit message (with PR number), and commit date.
Corwin Joy 631911c004
Add special logic for 'step' in _optimizer_to_device (#20019)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2024-08-05 17:17:06 -04:00
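
The `step` special-casing can be sketched as follows. This is a simplified illustration of the idea, not the library's implementation, and the function name is illustrative: PyTorch's non-fused optimizers keep the `step` counter as a CPU scalar tensor, so it must be excluded when the rest of the state is moved to the accelerator.

```python
import torch

def optimizer_state_to_device(optimizer: torch.optim.Optimizer, device: torch.device) -> None:
    # Move all optimizer state tensors to `device`, except the `step`
    # counter: non-fused optimizers expect it as a CPU scalar tensor,
    # and moving it can break training resumption.
    for state in optimizer.state.values():
        for key, value in state.items():
            if isinstance(value, torch.Tensor) and key != "step":
                state[key] = value.to(device)
```
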
awaelchli 345450b0c3
Fix parameter count in ModelSummary when parameters are DTensors (#20163) 2024-08-05 10:57:31 -04:00
awaelchli 6c70dd7cf0
Fix attribute error on `_NotYetLoadedTensor` after loading checkpoint into quantized model with `_lazy_load()` (#20121) 2024-07-24 05:39:40 -04:00
awaelchli d0a6b34ea9
Avoid printing the seed info message multiple times (#20108) 2024-07-20 20:25:11 +02:00
awaelchli e214395d31
Remove confusing warning "Missing logger folder" (#20109) 2024-07-20 20:24:38 +02:00
awaelchli bdafe5e739
Add Python 3.12 to the CPU test matrix (#20078) 2024-07-13 06:07:35 -04:00
awaelchli 7d1a70752f
Update PyTorch 2.4 tests (#20079) 2024-07-13 05:09:09 -04:00
Abhishek Singh d5ae9ec568
Make numpy an optional dependency in `utilities/seed.py` (#20055)
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
2024-07-12 17:24:04 -04:00
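
The optional-dependency pattern introduced here looks roughly like this (a minimal sketch; the helper name is hypothetical):

```python
def _seed_numpy_if_available(seed: int) -> None:
    # numpy is no longer a hard requirement for seeding, so guard the
    # import and only touch numpy's RNG when the package is installed.
    try:
        import numpy as np
    except ImportError:
        return
    np.random.seed(seed)
```
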
awaelchli 9987d993a0
Remove support for Python 3.8 (#20071) 2024-07-12 10:33:35 -04:00
awaelchli 5829ef8ab3
Set `weights_only` in tests to avoid warnings in PyTorch 2.4 (#20057) 2024-07-08 04:38:27 -04:00
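
For context: PyTorch 2.4 emits a FutureWarning whenever `torch.load` is called without an explicit `weights_only` argument, so the tests now pass it explicitly, e.g.:

```python
import torch

torch.save({"weight": torch.ones(3)}, "checkpoint.pt")
# Passing weights_only=True opts into the restricted unpickler that
# only reconstructs tensors and plain Python containers, and silences
# the PyTorch 2.4 warning about the unset default.
state = torch.load("checkpoint.pt", weights_only=True)
```
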
awaelchli 693c21ac1b
Add testing for PyTorch 2.4 (Fabric) (#20028) 2024-07-02 18:01:03 -04:00
awaelchli 14493c0685
Drop PyTorch 2.0 from the test matrix (#20009) 2024-06-30 18:02:00 -04:00
awaelchli e330da5870
Fix torch-numpy compatibility conflict in tests (#20004) 2024-06-21 20:20:59 -04:00
Douwe den Blanken 4f96c83ba0
Sanitize argument-free object params before logging (#19771)
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
2024-06-06 14:51:48 -04:00
Liyang90 7668a6bf59
Flexible and easy to use HSDP setting (#19504)
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
2024-06-05 20:15:03 -04:00
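
Judging from the PR title, usage looks roughly like the sketch below; the 8-GPU layout (2 replica groups of 4 shards) and the exact arguments are assumptions, not a verbatim excerpt from the PR:

```python
from lightning.fabric.strategies import FSDPStrategy

# Hybrid Sharded Data Parallel: shard parameters within each group of
# 4 GPUs and replicate across the 2 groups. The 2-tuple is shorthand
# for a (replicate, shard) device mesh.
strategy = FSDPStrategy(sharding_strategy="HYBRID_SHARD", device_mesh=(2, 4))
```
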
awaelchli 1a6786d682
Destroy process group in atexit handler (#19931) 2024-06-05 19:31:43 -04:00
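
The pattern here is the standard `atexit` hook around `torch.distributed` teardown, roughly:

```python
import atexit
import torch.distributed as dist

def _destroy_process_group() -> None:
    # Release NCCL/Gloo resources at interpreter exit even when the
    # user never tears the process group down manually.
    if dist.is_available() and dist.is_initialized():
        dist.destroy_process_group()

atexit.register(_destroy_process_group)
```
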
awaelchli 896c2a656a
Error for unsupported precision types with ModelParallelStrategy (#19902) 2024-05-23 13:43:46 -04:00
awaelchli 8fc7b4ae94
Remove the requirement for FSDPStrategy subclasses to only support GPU (#19894) 2024-05-22 18:31:40 +02:00
awaelchli 7e87ce05c8
Fix state dict loading in bitsandbytes plugin when checkpoint is already quantized (#19886)
* bugfix

* add test

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update

* add chlog

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-05-21 13:46:01 -04:00
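
A sketch of the code path this fixes, assuming a CUDA machine with bitsandbytes installed (saving produces already-quantized weights; reloading them is what used to fail):

```python
import torch
from lightning.fabric import Fabric
from lightning.fabric.plugins import BitsandbytesPrecision

fabric = Fabric(devices=1, plugins=BitsandbytesPrecision(mode="nf4"))
model = fabric.setup(torch.nn.Linear(16, 16))

fabric.save("quantized.ckpt", {"model": model})  # weights are stored quantized
fabric.load("quantized.ckpt", {"model": model})  # loading them back is what this commit fixes
```
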
awaelchli 32e241870b
(5/n) Support 2D Parallelism in Lightning Trainer (#19878)
* ModelParallelStrategy for Lightning Trainer

* mypy

* import fix

* fix torchscript errors

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix docs issue

* fix test execution

* Update src/lightning/pytorch/strategies/model_parallel.py

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Luca Antiga <luca.antiga@gmail.com>
2024-05-17 19:03:31 -04:00
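
Based on the commit titles in this series, the strategy is used roughly as below; the argument values are an assumed 4-GPU layout, and the model is expected to apply its parallelisms in its `configure_model` hook:

```python
from lightning.pytorch import Trainer
from lightning.pytorch.strategies import ModelParallelStrategy

# 2-way data parallelism combined with 2-way tensor parallelism
# across 4 GPUs (a 2 x 2 device mesh).
strategy = ModelParallelStrategy(data_parallel_size=2, tensor_parallel_size=2)
trainer = Trainer(accelerator="cuda", devices=4, strategy=strategy)
```
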
awaelchli 1d0c6aae96
(4/n) Support 2D Parallelism - Loading optimizer states correctly (#19872)
* Load optimizer state

* move to utility

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-05-17 17:17:32 -04:00
awaelchli cd8acc26c3
(3/n) Support 2D Parallelism - Efficient loading of full-state checkpoints (#19870)
* memory-optimized loading of full checkpoints into dist model

* simplify

* handle buffers

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* handle strict loading, buffers, and add test

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* chlog

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-05-15 13:07:31 -04:00
awaelchli 9455871c93
(2/n) Support 2D Parallelism - Distributed Checkpoints (#19852)
* distributed checkpoints

* use decorator

* refactor if-strict

* update example

* filter non-persistent buffers (todo, add test)

* simplify checkpoint loading for model
2024-05-15 08:19:08 -04:00
awaelchli e0307277a0
Add function to explicitly mark forward methods in Fabric (#19690)
Co-authored-by: Sebastian Raschka <mail@sebastianraschka.com>
2024-05-08 16:58:33 -04:00
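
The function referenced here is `mark_forward_method` on the Fabric-wrapped module; a minimal sketch:

```python
import torch
from lightning.fabric import Fabric

class LitModel(torch.nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * 2

    def generate(self, x: torch.Tensor) -> torch.Tensor:
        return self.forward(x) + 1

fabric = Fabric(accelerator="cpu")
model = fabric.setup(LitModel())
# Route `generate` through the same strategy machinery (precision
# casts, etc.) that already wraps `forward`:
model.mark_forward_method("generate")
out = model.generate(torch.randn(2))
```
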
awaelchli 0c8a193d3c
(1/n) Support 2D Parallelism (#19846) 2024-05-07 17:02:58 -04:00
Adrian Wälchli 49ed2b102b
Add PyTorch 2.3 to CI matrix (#19708) 2024-04-29 07:16:13 -04:00
Adrian Wälchli 29136332d6
Avoid interactions through test artifacts (#19821) 2024-04-28 11:56:40 -04:00
Adrian Wälchli 5e0e02b79e
Remove support for PyTorch 1.13 (#19706) 2024-04-27 01:24:07 -04:00
awaelchli ce90b3898a
Sanitize hparams that can't be json-serialized in `WandbLogger.log_hyperparameters()` (#19769) 2024-04-14 15:01:58 +02:00
awaelchli dcb91d53d2
Fix initialized weights resetting in `Fabric.setup()` when using FSDP (#19755) 2024-04-11 05:52:28 -04:00
Carlos Mocholí ca6c94c208
Fix monkeypatching of `_FabricModule` methods (#19705) 2024-03-27 11:03:21 -04:00
awaelchli 14e98ecbf2
Fix `torch.compile` patching when applied as decorator (#19627) 2024-03-15 08:12:48 -04:00
Carlos Mocholí 06eb3cc28b
Pass `enabled` down to `_BackwardSyncControl` (#19577) 2024-03-08 11:48:16 +01:00
awaelchli b3c869f636
Revise checkpoint consolidation with PyTorch 2.3 (#19561) 2024-03-04 10:13:31 -05:00
awaelchli 13f15b38fc
Support consolidating sharded checkpoints with the `fabric` CLI (#19560) 2024-03-04 08:01:33 -05:00
awaelchli 7880c110e3
Alternative mechanism to detect missing `Fabric.backward()` call (#19493) 2024-02-27 17:57:32 +01:00
awaelchli ea89133c65
Rename `fabric run model` to `fabric run` (#19527) 2024-02-27 11:36:46 -05:00
awaelchli a41528c2a6
Update tests for PyTorch 2.2.1 (#19521) 2024-02-23 13:11:34 -05:00
Jirka Borovec 99fe6563ef
precommit: ruff-format (#19434)
* precommit: ruff-format

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* manual update

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* manual update

* order

* mypy

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* mypy

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-02-15 13:39:17 -05:00
awaelchli 265025bd5d
Inform the user about a missing `fabric.backward()` call (#19447) 2024-02-14 17:49:11 -05:00
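
The pattern the check enforces, in a minimal CPU-only sketch:

```python
import torch
from lightning.fabric import Fabric

fabric = Fabric(accelerator="cpu")
module = torch.nn.Linear(4, 1)
model, optimizer = fabric.setup(module, torch.optim.SGD(module.parameters(), lr=0.1))

loss = model(torch.randn(2, 4)).sum()
# Calling loss.backward() directly would bypass the strategy (gradient
# scaling, sharded-gradient sync, ...); Fabric now warns about this and
# points the user to:
fabric.backward(loss)
optimizer.step()
```
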
Carlos Mocholí 67459944ea
Avoid FSDP deprecations during save/load with newer torch versions (#19463)
* Avoid FSDP deprecations during save/load with newer torch versions

* Refactor

* Tests
2024-02-14 19:43:59 +01:00
awaelchli 3fbc29ba21
Fix `CSVLogger` trying to append to file from previous run in same version folder (#19446) 2024-02-13 13:59:04 -05:00
awaelchli 3c5a465cfc
Create barrier without timeout in `prepare_data()` (#19448) 2024-02-13 12:10:07 +01:00
awaelchli e950bb4828
Remove the Graphcore IPU integration (#19405)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2024-02-12 16:16:02 -05:00
Justus Schock 2ed7282f7c
Rename Lightning Fabric CLI (#19442)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
2024-02-12 17:22:53 +01:00
nik777 7a56ac5182
Support shortcut name for DeepSpeed stage 1 offload (#19075) 2024-02-05 20:53:18 -05:00
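
The shortcut presumably mirrors the existing `deepspeed_stage_2_offload` and `deepspeed_stage_3_offload` registry names; the exact string below is an assumption based on the PR title:

```python
from lightning.pytorch import Trainer

# Assumed equivalent of DeepSpeedStrategy(stage=1, offload_optimizer=True).
trainer = Trainer(accelerator="cuda", devices=2, strategy="deepspeed_stage_1_offload")
```
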
awaelchli fb0ce03a9c
Fix input validation to support passing `device_mesh` to FSDP (#19392)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2024-02-02 06:48:12 -05:00
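
With this fix, an explicitly constructed `DeviceMesh` can be handed to the strategy; a sketch assuming 8 GPUs and PyTorch >= 2.2:

```python
from torch.distributed.device_mesh import init_device_mesh
from lightning.fabric.strategies import FSDPStrategy

# Build a 2D (replicate x shard) mesh and pass it through to FSDP
# instead of letting the strategy derive one from a tuple.
mesh = init_device_mesh("cuda", (2, 4), mesh_dim_names=("replicate", "shard"))
strategy = FSDPStrategy(sharding_strategy="HYBRID_SHARD", device_mesh=mesh)
```
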
awaelchli 01f8531c9d
Refactor BoringFabric in tests (#19364) 2024-01-30 23:32:45 +01:00
awaelchli 6018b0743c
Error message to inform bitsandbytes is only supported on CUDA (#19360) 2024-01-29 19:52:28 -05:00
awaelchli 1a59097ab2
Drop support for PyTorch 1.12 (#19300)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
2024-01-26 11:44:24 -05:00