Commit Graph

322 Commits

Author SHA1 Message Date
Adrian Wälchli 888466b144
Support true 16-bit precision with FSDP in Trainer (#18219) 2023-08-10 04:15:35 -04:00
Adrian Wälchli 70e31b6480
Make `all_reduce` consistent for both NCCL and GLOO (#18235)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2023-08-09 17:39:57 -04:00
Jirka Borovec efa7b2f9ef
docformatter: config with black (#18064)
* docformatter: config with black

* additional_dependencies: [tomli]

* 119

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-08-09 10:44:20 -04:00
pre-commit-ci[bot] 834bd61164
[pre-commit.ci] pre-commit suggestions (#17983)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
Co-authored-by: Jirka B <j.borovec+github@gmail.com>
2023-08-08 16:26:06 +02:00
Adrian Wälchli 7e13eb7299
Monitor subprocesses to avoid zombies (#18218)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-08-08 09:25:21 +02:00
Gerson Kroiz d7c2e597a1
[TPU] Add Fabric support for PyTorch XLA FSDP (#18126)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
2023-08-02 12:56:00 -04:00
Adrian Wälchli 50e01c7012
Meta device initialization for FSDP in Fabric (#18122)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-08-02 07:58:32 -04:00
Adrian Wälchli 74dfd88090
Avoid reinstantiation of DataLoader if distributed sampler not required (#18191)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
2023-08-01 15:27:50 -04:00
Bilel Omrani b4435bd29c
Fix Google Cloud Storage checkpointing (#18088)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2023-08-01 20:08:42 +02:00
Adrian Wälchli 1db471305d
Avoid setting the multiprocessing context when importing lightning (#18177)
* avoid import at top module

* tests

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove comment

* update docs

* changelog

* mypy

* trigger app tests

* can't import lightning on py 3.8

* Update .github/workflows/ci-tests-app.yml

Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
2023-07-31 18:05:21 +02:00
Adrian Wälchli d9493545cf
Allow accessing rank information before processes are launched in XLA (#18194) 2023-07-31 10:37:35 -04:00
Adrian Wälchli 508f02a624
Remove the unused `checkpoint_io` argument from the `FSDPStrategy` in Fabric (#18192) 2023-07-31 04:07:32 -04:00
Adrian Wälchli 41f0425a8d
Disable auto-detection of Kubeflow environment (#18137) 2023-07-28 05:03:48 -04:00
Adrian Wälchli 220e3b8e04
Add lazy checkpoint loading for FSDP full-state checkpoints (#18150)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-07-26 18:38:15 -04:00
Carlos Mocholí 4c57c0bc07
[TPU] Do not cancel all jobs when one fails (#18052)
* Update tpu-tests.yml

* Update tpu-tests.yml

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Needs

* if:

* missed this

* Fix issue on multinode

* Latest fixes

* last fix?

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-07-25 14:24:50 +02:00
Carlos Mocholí 0e7e6b31c5
Fix [TPU] tests (#18140)
* Fix [TPU] tests

* More
2023-07-24 15:13:36 +02:00
Carlos Mocholí 3d573d5e79
Fix [TPU] tests (#18136)
* Debug [TPU] tests

* -U

* Uninstall typing extensions

* Minor simplifications

* Silly cancelling logic

* pip3?

* sudo

* More

* Revert "Silly cancelling logic"

This reverts commit ce31d874f3.
2023-07-23 13:39:00 +02:00
Carlos Mocholí 01b82e4fb1
Minor miscellaneous fixes (#18077)
* Various miscellaneous fixes

* Update

* Update

* succeeded

* Comment everywhere

* hasattr
2023-07-20 14:44:51 +02:00
Adrian Wälchli d6b5f3af15
Fix "optimizer in backward" compatibility with torch 2.1 nightly (#18119)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-07-20 07:22:54 -04:00
Adrian Wälchli ed6a48ed57
DeepSpeed precision simplifications (#18113)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-07-20 07:13:31 -04:00
Carlos Mocholí 071f85842e
Support NVIDIA's Transformer Engine as a precision plugin (#17597) 2023-07-19 18:21:58 +02:00
Carlos Mocholí d653e4e088
Relax the assumption that the root module is FSDP wrapped (#18054) 2023-07-19 15:34:03 +02:00
Adrian Wälchli dab373de54
Support loading a raw PyTorch state-dict checkpoint in Fabric (#18049)
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-07-18 14:06:17 -04:00
Ishan Dutta 7116a9f9bb
Include parent directory validation check for deepspeed (#17795)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
2023-07-17 19:09:38 -04:00
Shihao Yin c31ef77510
Fix `TensorBoardLogger.log_graph` not recording the graph (#17926)
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-07-17 18:18:39 -04:00
Adrian Wälchli 080eaf38fa
Enable setting the sharding strategy as string in FSDP (#18087) 2023-07-15 18:07:09 +02:00
Carlos Mocholí c60f67e736
Support sets for policies in FSDP (#18084) 2023-07-15 17:39:28 +02:00
Carlos Mocholí e9c42ed11f
More XLA fixes for nightly support (#18085) 2023-07-15 01:16:42 +02:00
Adrian Wälchli 356f5d0c65
Fix detection of next version in Fabric's CSVLogger (#17986) 2023-07-14 16:08:16 -04:00
Carlos Mocholí 2f657ae46e
Support custom policies for activation checkpointing with FSDP (#18045) 2023-07-14 20:00:52 +02:00
Carlos Mocholí 340eecd846
Add `Trainer.init_module` and `LightningModule.configure_model` (#18004) 2023-07-14 19:15:05 +02:00
Carlos Mocholí 3a55f0c0a1
Minor miscellaneous fixes (#18068) 2023-07-13 06:01:58 -04:00
Carlos Mocholí ad74f8623f
Don't reapply activation checkpointing (#18006) 2023-07-10 13:24:09 +00:00
Justus Schock 7ca49f2cb7
Requirements update (#18014)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-07-10 13:00:20 +00:00
Adrian Wälchli acc70d0ae5
Support all half-precision modes in FSDP precision plugin (#17807)
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
2023-07-09 18:40:46 +00:00
Adrian Wälchli b14ddd9c49
Fix state dict loading for ddp/dp in Fabric (#17997)
* fix state dict loading for ddp/dp

* test

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* changelog

* update test

* move params to same device before equality test

* test strategy

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-07-06 13:47:17 +02:00
Adrian Wälchli 3f4790bd27
Validate selected device indices in `DeepSpeedStrategy` (#17952) 2023-07-04 18:58:38 +00:00
Adrian Wälchli c5fae6426e
Show CUDA matmul precision info only ever once (#17960) 2023-07-04 03:47:27 -04:00
Adrian Wälchli c03dd38c6c
Refactor more Fabric tests that use the old .run() method (#17930)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-07-03 16:26:58 +02:00
Adrian Wälchli 5d7669af46
Remove requirement to call `Fabric.launch()` with DP strategy (#17931) 2023-06-30 08:20:01 +00:00
Adrian Wälchli 7eca2a2fdd
Fix automatic step tracking in Fabric's CSVLogger (#17942) 2023-06-28 14:33:37 +02:00
Adrian Wälchli 8f7ad991ff
Reduce false positive warnings when calling module methods in Fabric (#17875) 2023-06-26 17:35:27 +02:00
Carlos Mocholí 58d2387e0c
Add `Fabric.save(filter=...)` (#17845)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2023-06-20 18:18:59 +00:00
Carlos Mocholí f78db4c674
Remove automatic sharding support with `Fabric.run` or `fabric.launch(fn)` (#17832)
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
2023-06-15 16:02:09 +00:00
Boon 377bfd2768
Pass-through setattr for FabricModule (#17731)
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
2023-06-12 19:33:51 +00:00
Adrian Wälchli 9ff7d7120b
Add `rank_zero_first` utility (#17784)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-06-12 10:32:32 +00:00
Leng Yue a23bae39c4
Enable loading full optimizer checkpoints with FSDP (#17747)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2023-06-10 11:28:02 +00:00
Adrian Wälchli 24a3115995
Support empty weight initialization in `Fabric.init_module()` (#17627) 2023-06-07 18:33:53 +00:00
Alexander Kreuzer f111bd483b
Fix to Parameters to `MixedPrecisionPlugin` are not validated and do not match doc string (#17687)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2023-06-07 14:35:54 +00:00
Carlos Mocholí f3c49b8e77
Remove warning on `no_backward_sync` with XLA strategy (#17761) 2023-06-07 16:07:03 +02:00
Bas Krahmer 420eb6f248
Added configurable strict loading for Fabric strategies (#17645)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: bas <bas.krahmer@talentflyxpert.com>
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
2023-06-06 18:26:13 -04:00
Taylor Robie 9c07cb397c
[FSDP] utility to apply optimizer during backward (#17710)
* utility to apply optimizer during backward

* start to address CI failures

* address CI failures

* address review comments and harden test

* change union annotation syntax

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* try to debug CI

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add skip_windows and standalone to fsdp test

---------

Co-authored-by: Taylor Robie <taylor.robie@lightning.ai>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-06-06 21:41:26 +02:00
M. Fox f67031b832
Add Fabric internal hooks (#17759)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-06-06 16:04:19 +00:00
M. Fox e2986fab14
External callback registry through entry points for Fabric (#17756) 2023-06-06 11:53:19 +00:00
Adrian Wälchli 67a14795cf
Address feedback for `Fabric.init_module()` (4/4) (#17607) 2023-06-03 02:07:02 +00:00
Adrian Wälchli fd296e0605
Enable loading full state dict checkpoints with FSDP (#17623)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-05-31 11:30:07 -04:00
Adrian Wälchli e0ce34e8e0
Address feedback for `Fabric.init_module()` (3/4) (#17723)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-05-31 15:03:49 +00:00
Adrian Wälchli 41cfa33c01
Address feedback for `Fabric.init_module()` (2/4) (#17722) 2023-05-31 14:31:24 +00:00
Adrian Wälchli 88cd100369
Address feedback for `Fabric.init_module()` (1/4) (#17721)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-05-31 14:05:29 +00:00
Jirka Borovec 51b0e81105
replace local adjustment script with external (#17582) 2023-05-29 19:34:04 +00:00
Jirka Borovec 0cc458e237
runif consistency (#17686) 2023-05-25 16:56:28 +00:00
Jirka Borovec 56377d9b1f
ci: separate parity/benchmarks (#17502)
* ci: separate benchmarks

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* measure

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* conf

* isort

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* ci

* parity

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* taska

* name

* ...

* var

* ...

* ...

* ...

* cd

* reset_cudnn_benchmark

* import

* imports

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* models

* xfail

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-05-24 19:16:41 -04:00
Leng Yue 2c8758f0a8
Fix Mix Precision settings for FSDP Plugins (#17670) 2023-05-23 11:35:37 -04:00
Adrian Wälchli 00909ba3ff
Raise environment variable collision errors only when Fabric CLI is used (#17679)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-05-22 19:12:26 -04:00
Adrian Wälchli e6b7f1383c
Refactor run-method-style Fabric tests (#17669)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-05-21 09:04:01 -04:00
Bas Krahmer ca9e006681
refactor Fabric tests to use launch method (#17648)
Co-authored-by: bas <bas.krahmer@talentflyxpert.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-05-19 13:42:49 -04:00
Adrian Wälchli 7268670d1a
Support true 16-bit precision with deepspeed (#17576)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-05-12 23:21:32 +00:00
David Carreto Fidalgo 1ade737488
Allow setting the `SLURMEnvironment.main_address` via an env variable (#17596)
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
2023-05-12 11:31:48 +00:00
Adrian Wälchli c712ec1ba9
Add support for saving with full state-dict in Fabric's FSDP (#17526)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-05-11 13:02:30 -04:00
Zixuan Zhao a36af3f9f8
Fixes a bug that causes `CSVLogger` to overwrite `version_0` when `root_dir` is a relative path. (#17139)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
2023-05-06 00:10:12 +00:00
Gerson Kroiz 8e6f24baa6
[TPU] For XLA Strategy, added function arg to control `broadcast_master_param()` (#17522)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2023-05-05 17:57:24 +00:00
Carlos Mocholí 54e8095a78
Split `init_module` into `init` + `sharded_model` (#17488) 2023-05-05 15:54:52 +02:00
Jirka Borovec 4413e98e4e
ruff: enable & fixing RET (#17540) 2023-05-05 09:34:40 +00:00
Adrian Wälchli fd5cae4635
Verify `Fabric.launch()` was called (#17570)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-05-05 06:36:21 +00:00
Jirka Borovec 384c203532
ruff: PT some more fixes (#17569) 2023-05-05 08:25:15 +02:00
Carlos Mocholí 76caa81bf2
Compose RunIf utilities (#17520) 2023-05-05 01:21:58 +02:00
Jirka Borovec f55d10f5ee
ruff: autofix PT (#17541)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-05-04 11:50:39 -04:00
Adrian Wälchli a533f68693
Support compiling a module after it was set up by Fabric (#17529)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-05-03 09:00:11 +02:00
Adrian Wälchli 249395bfe0
DDP Parity tests as standalone task (#17503) 2023-05-03 05:36:07 +02:00
Adrian Wälchli 7523dd3199
Avoid creating CUDA stream if not running on CUDA (#17499) 2023-04-29 03:13:56 +00:00
Carlos Mocholí 6ec9a6bd9e
[TPU] Rename classes to use XLA instead of TPU (#17383) 2023-04-28 12:36:22 -04:00
Jirka Borovec 77889aa6bb
fabric: upstream runif to pkg (#17504) 2023-04-28 15:32:45 +00:00
Adrian Wälchli ce3701bfc0
Update `Fabric.init_module` for FSDP (#17510) 2023-04-28 12:44:52 +00:00
Carlos Mocholí 114a6d64a3
[TPU] Call `auto_device_count` for `is_available` (#17509) 2023-04-28 12:32:23 +00:00
Carlos Mocholí abc634d17c
Fix setup_model typos in Fabric (#17498) 2023-04-28 00:31:17 +00:00
Anton Kiselev 6b6594b831
Add timeout argument for `FSDPStrategy` (#17274)
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
2023-04-28 00:27:06 +00:00
Jirka Borovec db9f095b0b
Replace IPU with external implementation (#17075)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-04-27 16:09:51 +00:00
Adrian Wälchli 614dcdf502
True half-precision support in Fabric (#17287)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
2023-04-27 12:37:33 +00:00
Jirka Borovec 156786343b
adding check for bandit vulnerabilities 1/n (#17382)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-04-27 09:43:12 +00:00
pre-commit-ci[bot] 91cb4b9b87
[pre-commit.ci] pre-commit suggestions (#17271)
* [pre-commit.ci] pre-commit suggestions

updates:
- [github.com/PyCQA/docformatter: v1.4 → v1.6.0](https://github.com/PyCQA/docformatter/compare/v1.4...v1.6.0)
- [github.com/psf/black: 22.12.0 → 23.3.0](https://github.com/psf/black/compare/22.12.0...23.3.0)
- [github.com/charliermarsh/ruff-pre-commit: v0.0.237 → v0.0.260](https://github.com/charliermarsh/ruff-pre-commit/compare/v0.0.237...v0.0.260)

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update

* apply

* fixing

* docs/lines

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
2023-04-26 21:37:41 +02:00
Adrian Wälchli 4d17b5fe77
Improved model initialization API for Fabric (#17462)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
2023-04-26 11:25:33 -04:00
dependabot[bot] b792c90ea7
Update deepspeed requirement support window (#16813)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
2023-04-25 17:26:49 +02:00
Carlos Mocholí f4b1fc0f71
Input validation for `Fabric.launch` (#17423)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2023-04-25 00:44:48 +02:00
Jirka Borovec df97141781
add & apply flake8-simplify (#17386) 2023-04-24 21:57:08 +00:00
Adrian Wälchli d9b4ebd726
Enable precision autocast for `LightningModule` step methods in Fabric (#17439) 2023-04-24 11:50:59 +00:00
Adrian Wälchli 0631fa02ef
Handle edge case in `Fabric.setup()` when model has no parameters (#17441) 2023-04-24 10:13:36 +02:00
Adrian Wälchli 877d95f8d7
Minor Fabric backward refactor (#17433) 2023-04-21 19:36:46 +00:00
Adrian Wälchli 0ee71d6a7a
Fix LightningModule step methods bypassing DDP wrapper in Fabric (#17424)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-04-21 15:29:32 -04:00
Jirka Borovec 111d1ba088
ruff: fixing flake8-comprehensions (#17385) 2023-04-21 09:07:58 +00:00
Carlos Mocholí 8dac251273
[TPU] Fix PjRT tests (#17408) 2023-04-19 16:39:00 +02:00
Adrian Wälchli 21ae19c69f
Add dynamo RunIf skip condition (#17404)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-04-19 01:09:42 +02:00
Liyang90 47726391ad
[TPU] Add support for PJRT from PyTorch/XLA 2.0 (#17352)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2023-04-18 18:52:36 +02:00
Carlos Mocholí 90ad36795a
[TPU] Refactor availability check (#17384) 2023-04-18 17:52:13 +02:00
Ryan Smith 8d5a91a2dd
Update Fabric CPU tests to work on GPU machines (#17391) 2023-04-18 14:03:40 +00:00
Adrian Wälchli affe72cc3e
Add test for compiling FSDP model in Fabric (#17394) 2023-04-17 15:34:23 -04:00
Adrian Wälchli 0dc42f523e
Save and load sharded checkpoints with FSDP in Fabric (#17323)
Co-authored-by: Luca Antiga <luca.antiga@gmail.com>
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-04-16 14:11:49 -04:00
Ishan Dutta e9d6856355
NumPy to Torch for lightning/fabric (#17291)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2023-04-15 15:21:56 +00:00
Carlos Mocholí 05b481e3ae
[TPU] Add testing matrix with PJRT (#17368)
* Replace GKE in CI with manual gcloud usage

* Fix XRT test

* Reduce timeout to 35 minutes

* [TPU] Run tests with PJRT

* runtime as part of the job name

* CHANGELOG

* Update for app too
2023-04-14 16:39:13 +02:00
Carlos Mocholí 856b29fc72
[TPU] Replace GKE in CI with manual gcloud usage (#17362) 2023-04-14 12:47:31 +00:00
Adrian Wälchli 50662eb078
Fixes around `Strategy.set_world_ranks` (#16966)
* don't call set_world_ranks in xla strategy

* update

* fabric and other strategies

* CHANGELOG

* Typos

* Reuse test

---------

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2023-04-13 17:45:42 +02:00
Carlos Mocholí 0489f2efed
[TPU] v4 support (#17227) 2023-04-11 22:24:11 +00:00
Gerson Kroiz 7b8fd85e01
[TPU] Remove error check for IterableDatasets (#17331)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2023-04-11 22:04:17 +00:00
Adrian Wälchli 51697a8bd6
Combined setup of model and optimizer with FSDP (#17305)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-04-11 19:58:53 +00:00
Jirka Borovec 355dd9d343
test: adjust `is_timing_close` (#17178) 2023-03-24 12:07:07 +00:00
belerico bb861cba7e
Let TorchCollective work on the `torch.distributed` WORLD process group by default (#16995)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
2023-03-20 23:30:27 +00:00
Atharva Phatak ea708da55a
Add `is_wrapped` utility function for Fabric (#16953)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
2023-03-14 13:03:38 +00:00
janEbert dd02397720
Allow frozen data classes in optimizer state dict (#16656)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2023-03-10 15:37:18 +00:00
Adrian Wälchli aa7f2522dc
Fix race condition in Fabric test (#17002) 2023-03-08 16:36:00 -05:00
Adrian Wälchli b6c693d345
Add test for `torch.compile()` with `Fabric.setup()` (#16977) 2023-03-07 10:57:31 -05:00
Adrian Wälchli 7749525cbd
Document SLURM interactive mode (#16955) 2023-03-06 20:58:46 +00:00
Adrian Wälchli 3e04353c1c
New fabric parity tests (#16899)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
2023-03-06 20:19:25 +00:00
Carlos Mocholí fca69e68da
Fabric: Test PyTorch 2.0 pre-release on CPU and CUDA (#16905)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
2023-03-03 17:48:49 +00:00
Jirka Borovec 760612fb8a
update list of first party packages (#16859)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-03-03 16:55:48 +00:00
Carlos Mocholí 888686e72b
Fix tests on single-GPU machine (#16911) 2023-03-03 01:33:45 +01:00
Adrian Wälchli 7820a117bc
Optimize precision conversion in forward of Fabric module wrapper (#16903) 2023-03-02 23:41:37 +00:00
Justus Schock 3d1927e6bc
Adds Gradient Clipping to Fabric (#16715)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
2023-02-27 23:44:13 +00:00
Yi Heng Lim 4444d0c37d
Fix support for passing -1 to `find_usable_cuda_devices` function (#16866)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2023-02-27 20:08:42 +00:00
Adrian Wälchli e3efbaa7f6
Incorporate pytorch's fixes in device_count_nvml (#16795)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2023-02-27 18:07:55 +00:00
Adrian Wälchli 462f1ee691
Fix amp ddp test in Fabric (#16862) 2023-02-23 19:05:30 -05:00
Carlos Mocholí d486f94dd2
Fabric: auto default (#16842) 2023-02-23 13:45:27 +00:00
Carlos Mocholí 235e692259
Fabric: do `set_epoch` for `batch_sampler.sampler` (#16841) 2023-02-23 00:11:29 +00:00
Carlos Mocholí 914effa04c
Rename `replace_sampler_ddp|replace_sampler` to `use_distributed_sampler` (#16829)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2023-02-22 14:07:02 +01:00
Adrian Wälchli 0e4ca7c286
Set accelerator through CLI only if set explicitly (#16818) 2023-02-20 13:45:06 +00:00
Adrian Wälchli 81b7c30291
Make DDP subprocess the default launcher for multi-device (#16780) 2023-02-20 11:20:50 +00:00
Adrian Wälchli 2844e9e246
Fix XLAEnvironment detection on TPU pod (#16806) 2023-02-20 11:01:06 +01:00
Justus Schock ac5fa03385
Introduce new precision layout in fabric (#16767) 2023-02-17 10:41:18 +00:00
Adrian Wälchli 91e692c767
Rename the TPUSpawnStrategy to XLAStrategy (#16781)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-02-17 02:06:24 +00:00
Adrian Wälchli c4c4793d56
Fix strategy type validation in connectors (#16693) 2023-02-10 10:50:56 +00:00
Adrian Wälchli 923a842e9c
Fix import from torch.distributed when distributed not available (#16658) 2023-02-07 04:51:59 -05:00
Carlos Mocholí 1b1241ceb1
Fix TPU tests (#16628) 2023-02-06 17:21:26 +00:00
Jirka Borovec 770b792925
copyright Lightning AI team (#16647)
* copyright Lightning AI team

* more...
2023-02-06 15:26:51 +01:00
Adrian Wälchli 0f75dce8b4
Add MPI cluster environment (#16570)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-02-03 10:45:11 +00:00
Liyang90 e20172d370
Avoid wrapping prediction dataloader twice on TPU (#16571)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2023-02-03 10:36:56 +01:00
Adrian Wälchli 85f7e1c9c8
Show tf32 info only on rank 0 (#16152)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
2023-02-03 00:56:12 +01:00
Jirka Borovec 377210d85d
tests: switch imports for fabric (#16592) 2023-02-01 20:34:38 +00:00
Carlos Mocholí ef2a6088ff
Drop support for PyTorch 1.10 (#16492)
* Drop support for PyTorch 1.10

* CHANGELOG

* READMEs

* mypy

* ls

* New poplar version

* Fixed tests

* links

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* skip azure badges

* Table

* Matching dockerfiles

* Drop unnecessary channels and packages

* Push nightly

* Undo unrelated changes

* Revert "Push nightly"

This reverts commit 9618f737c4.

---------

Co-authored-by: Jirka <jirka.borovec@seznam.cz>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-02-01 14:09:12 -05:00
Carlos Mocholí dc298f2340
Drop support for Python 3.7 (#16579)
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
2023-02-01 01:36:42 +00:00
Carlos Mocholí b2387136ba
Fix `torch.compile` tests (#16503) 2023-01-27 02:41:45 +00:00
Adrian Wälchli 23e71a880a
Fabric checkpointing 3/n: Implement missing `get_module_state_dict` for strategies (#16487) 2023-01-26 13:10:14 +00:00
Jirka Borovec 50fd12f841
fabric: test with tbX (#16511) 2023-01-26 12:52:02 +00:00
Carlos Mocholí d78cf99176
Remove the "native" suffix from the codebase (#16490) 2023-01-25 14:09:09 +00:00
Adrian Wälchli 96b7ed77e6
Enable more shorthand strategy names in the Fabric CLI (#16485) 2023-01-25 09:52:03 +00:00
Adrian Wälchli c87bb71fa8
Add `Fabric.all_reduce` (#16459) 2023-01-24 22:35:00 +00:00
Adrian Wälchli 7603dd09cb
Fabric checkpointing 2/n: DeepSpeed implementation (#16452)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2023-01-24 18:53:26 +01:00
Adrian Wälchli 9faa25f86f
Test that connector defaults match the ones in Trainer/Fabric (#16463) 2023-01-23 05:09:45 -05:00
Nikhil Shenoy 81914c7167
LightningFabric: Error handling for accelerator="mps" and ddp strategy pairing (#16455)
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
2023-01-22 17:57:24 +00:00
Adrian Wälchli 39acb81b9b
Fabric checkpointing 1/n: base implementation (#16434)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2023-01-19 20:40:12 +00:00
Adrian Wälchli 285cc53738
Make subprocess launcher the default in Lite (#16388) 2023-01-17 10:16:33 +00:00
Adrian Wälchli f1e0fda879
Rename `Strategy.reduce` to `Strategy.all_reduce` in Lite (#16370) 2023-01-16 08:17:45 -05:00
Adrian Wälchli 8f1269283f
Add CSVLogger for Lightning Lite (#16346)
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
2023-01-13 13:09:44 +00:00
Adrian Wälchli 0a2ee68ea0
Fix configuration validation error message in Lite CLI (#16334) 2023-01-12 15:09:28 +00:00
Carlos Mocholí 428844d01d
Fabric: drop FairScale's sharded implementation (#16329)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2023-01-11 17:08:18 +00:00
Carlos Mocholí 3c3bff5e6e
Fabric: Remove `_Connector.is_distributed` (#16327) 2023-01-11 16:29:51 +01:00
Carlos Mocholí 794685493d
Remove `_StrategyType` (#16328) 2023-01-10 23:05:12 +01:00
Carlos Mocholí 047b4374a5
Annotate `Fabric.log_dict` with mapping input (#16325) 2023-01-10 23:02:55 +01:00
Lightning Forever 91aaa5313a
Lite: Support `self.log` from a LightningModule (#16311)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2023-01-10 16:11:47 +00:00
Adrian Wälchli b085fa12d3
Rename leftover definitions in Lite tests (#16309) 2023-01-10 15:02:05 +00:00
Lightning Forever f24349bb64
Logger support in Lite (#16121)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2023-01-09 18:33:18 +00:00
Adrian Wälchli c656307127
Handle `set_to_none` when using DeepSpeed optimizer in Lite (#16275) 2023-01-09 09:01:11 -05:00
Adrian Wälchli 4c3ce605ad
Update precision input type annotations (#14857)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2023-01-06 20:08:20 +00:00
pre-commit-ci[bot] b59941cc52
[pre-commit.ci] pre-commit suggestions (#16224)
* [pre-commit.ci] pre-commit suggestions

updates:
- [github.com/pre-commit/pre-commit-hooks: v4.3.0 → v4.4.0](https://github.com/pre-commit/pre-commit-hooks/compare/v4.3.0...v4.4.0)
- [github.com/asottile/pyupgrade: v2.34.0 → v3.3.1](https://github.com/asottile/pyupgrade/compare/v2.34.0...v3.3.1)
- https://github.com/myint/docformatter → https://github.com/PyCQA/docformatter
- [github.com/PyCQA/docformatter: v1.4 → v1.5.1](https://github.com/PyCQA/docformatter/compare/v1.4...v1.5.1)
- [github.com/asottile/yesqa: v1.3.0 → v1.4.0](https://github.com/asottile/yesqa/compare/v1.3.0...v1.4.0)
- [github.com/PyCQA/isort: 5.10.1 → 5.11.4](https://github.com/PyCQA/isort/compare/5.10.1...5.11.4)
- [github.com/psf/black: 22.6.0 → 22.12.0](https://github.com/psf/black/compare/22.6.0...22.12.0)
- [github.com/executablebooks/mdformat: 0.7.14 → 0.7.16](https://github.com/executablebooks/mdformat/compare/0.7.14...0.7.16)

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-01-04 18:48:35 -05:00
Carlos Mocholí 15ef52bc73
Rename LightningLite to Fabric (#16244)
* Rename LightningLite to Fabric

* Fix introspection test

* Fix deprecated Lite tests

* Undo accidental Horovod removal

* Fixes
2023-01-04 10:57:18 -05:00