Commit Graph

251 Commits

Author SHA1 Message Date
Adrian Wälchli da79480054
PyTest random order for Fabric tests (#19040) 2023-11-22 16:41:49 -05:00
Adrian Wälchli d4614d043e
Address test flakiness (#19022)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2023-11-21 17:11:00 -05:00
Adrian Wälchli e3be762538
Re-enable dynamo tests that were fixed in PyTorch 2.1 (#19038) 2023-11-21 16:30:20 -05:00
Adrian Wälchli f652e6c00e
Fix `rank_zero_only` rank not set in ddp-spawn based strategies (#19030) 2023-11-20 10:49:14 -05:00
Adrian Wälchli 45c2fcb341
Add AttributeDict container for Fabric (#18943)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
2023-11-18 09:25:26 -05:00
Adrian Wälchli 340961a6ec
Fix test interactions (#18994) 2023-11-13 12:35:46 -05:00
Carlos Mocholí 466f772e3e
Fix precision default from environment (#18928)
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
2023-11-10 23:03:51 +01:00
Carlos Mocholí d9aa833628
Add more CUDA card FLOPs (#18958)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-11-07 04:13:20 +01:00
Adrian Wälchli 195a3bf5b5
Fix parsing v100s in `get_available_flops` (#18952)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2023-11-06 21:50:11 +01:00
Jason Won 8d68607cef
Flatten dataclass hyperparameters for logging (#18906)
Co-authored-by: jaswon <jason@jwon.xyz>
2023-11-03 19:30:19 -04:00
Carlos Mocholí 2b6b594dab
Rename Throughput flops argument (#18924) 2023-11-02 16:06:40 +01:00
Carlos Mocholí 5f6669f6b3
Add batches argument to throughput (#18905)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2023-11-02 04:15:03 +01:00
Adrian Wälchli 98685c332b
Fix parsing of version in TensorBoardLogger and CSVLogger (#18897) 2023-11-01 12:48:36 -04:00
Adrian Wälchli 7a5b7f5561
Skip hanging collective test (#18908) 2023-11-01 15:45:25 +01:00
Adrian Wälchli 018a308269
Enable RUF018 rule for walrus assignments in asserts (#18886) 2023-10-30 21:16:02 -04:00
Adrian Wälchli 079544a902
Rename PrecisionPlugin -> Precision (#18840) 2023-10-30 16:53:13 -04:00
Carlos Mocholí 800b87eb46
Add throughput utilities to Fabric and the Trainer (#18848) 2023-10-30 17:10:29 +01:00
Adrian Wälchli e66be675d2
Refined FSDP saving logic and error messaging when path exists (#18884) 2023-10-30 10:05:28 -04:00
Adrian Wälchli 9e75bc9572
Fix failing lightning cli entry point (#18821)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
2023-10-24 20:51:11 -04:00
Carlos Mocholí 78ad390b5b
Restore support for builds without distributed (#18859)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2023-10-25 02:48:44 +02:00
Adrian Wälchli 6bfde6a80c
Change dangerous default random seed selection (#18846) 2023-10-24 19:59:38 -04:00
Adrian Wälchli 97303b0168
Avoid false-positive warnings about method calls on the Fabric-wrapped module (#18819) 2023-10-22 22:26:28 -04:00
Carlos Mocholí 5a83f541da
Minor strategy fixes [TPU] (#18774) 2023-10-11 15:26:30 +02:00
Carlos Mocholí 27ad9e9243
xfail collective tests (#18779) 2023-10-11 05:54:55 +02:00
Adrian Wälchli e02bb391af
Utility to disable all instances of `PossibleUserWarning` (#18744)
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
2023-10-10 06:53:32 -04:00
Adrian Wälchli acc0cf02cf
Refinements to the num-workers warning (#18737) 2023-10-09 22:17:47 -04:00
Adrian Wälchli 377534072b
Split `Precision.init_context` (#18734) 2023-10-09 12:34:30 -04:00
Adrian Wälchli 87dff9928e
Handle edge case for `find_usable_cuda_devices(0)` (#18722) 2023-10-06 23:44:33 -04:00
Adrian Wälchli 5d819c91fb
Remove `fsdp_overlap_step_with_backward` in favor of native solution (#18726) 2023-10-06 08:11:41 -04:00
Adrian Wälchli c514f1cbea
Enable PyTorch 2.1 (#18718) 2023-10-06 07:17:03 -04:00
Carlos Mocholí 71aed751f7
Forbid passing precision and a precision plugin (#18671) 2023-10-05 17:41:36 +02:00
Carlos Mocholí 31a1dad099
Fix BNB int8-training support (#18721) 2023-10-05 16:01:59 +02:00
Adrian Wälchli 09a0fb26d2
Set an upper limit on CPU threads in distributed training (#18677) 2023-10-04 19:57:37 -04:00
Carlos Mocholí 4c83ffd04c
Avoid importing bitsandbytes unless requested (#18680) 2023-10-05 01:10:10 +02:00
Carlos Mocholí e3960749d8
Forbid init_module on-device instantiation with bnb ignored modules (#18704) 2023-10-05 00:57:07 +02:00
Adrian Wälchli d31ef1f7d3
Drop support for PyTorch 1.11 (#18691)
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
2023-10-04 20:30:44 +02:00
pre-commit-ci[bot] c0ec0decec
[pre-commit.ci] pre-commit suggestions (#18697)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
2023-10-03 22:07:21 +02:00
Adrian Wälchli 256f16ed42
Enable passing `load_state_dict(..., assign=True|False)` in FabricModule (#18690) 2023-10-03 13:49:39 -04:00
Carlos Mocholí 5120ad20f2
Bitsandbytes precision plugin (#18655)
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
2023-09-29 19:17:18 +02:00
Adrian Wälchli 3cd463efa8
Remove outdated workaround for PyTorch autocast bug (#18634) 2023-09-29 08:33:43 -04:00
Adrian Wälchli d05cd3fa0a
Fix KeyError when calling `Fabric.load_raw` before setting up an FSDP model (#18647) 2023-09-29 07:35:27 -04:00
Carlos Mocholí 70a11d9739
Forbid non-FSDP precision plugins with FSDP (#18664) 2023-09-29 10:07:51 +02:00
Jirka Borovec 830a62a722
ruff: replace isort with ruff +TPU (#17684)
* ruff: replace isort with ruff

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fixing & imports

* lines in warning test

* docs

* fix enum import

* fixing

* import

* fix lines

* type ClusterEnvironment

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-09-26 11:54:55 -04:00
Jirka Borovec 358336268f
enable codespell for docs & fixing +TPU (#18629)
* precommit/codespell

* run

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* disable

* more fixing

* Apply suggestions from code review

* more fixing

* json

* note

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-09-26 11:54:44 -04:00
Adrian Wälchli 894952d33e
Avoid redundant input-type casting in FSDP precision (#18630)
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
2023-09-26 08:55:13 -04:00
Adrian Wälchli 38764f0746
Enable launching via torchrun in slurm environment (#18618) 2023-09-26 07:40:22 -04:00
Adrian Wälchli f83ad093e5
Utility function to check shared filesystem (#18586)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2023-09-25 15:49:52 -04:00
Adrian Wälchli 57f5268eb3
Improve the suggested `num_workers` warning (#18591)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2023-09-21 09:38:25 -04:00
Adrian Wälchli 66f15cf327
Input validation for `num_nodes` argument (#18598) 2023-09-20 11:09:50 -04:00
Adrian Wälchli 8094855137
Avoid passing process group to enable FSDP's hybrid-shard (#18583) 2023-09-19 13:46:24 -04:00