77 Commits

Author SHA1 Message Date
awaelchli 1a59097ab2
Drop support for PyTorch 1.12 (#19300)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
2024-01-26 11:44:24 -05:00
Jirka Borovec 3bd133b107
CI: enable testing with coming PT 2.2 (#19289)
* ci: build dockers for PT 2.2
* py3.12
* --pre --extra-index-url
* typing-extensions
* bump jsonargparse
* install latest jsonargparse
* Add windows skips for Fabric
* convert to xfail
* add pytorch skips
* skip checkpoint consolidation test
* set max torch

---------

Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-01-26 16:42:09 +01:00
Carlos Mocholí a1dd9efcf7
Drop XLA XRT support (#19232)
* Drop XLA XRT support
* update test
* set launched
* update conftest
* xla available check
---------

Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-01-10 18:39:20 +01:00
Carlos Mocholí 6dfa5cca9d
Support 4bit BNB layers meta-device materialization (#19150) 2023-12-20 22:13:18 +01:00
Carlos Mocholí 97469c600f
TransformerEngine fallback compute dtype (#19082)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2023-12-14 03:02:09 +01:00
Adrian Wälchli 197b22586a
Fix comm initialization in `MPIEnvironment` (#19074) 2023-11-28 16:14:46 -05:00
Adrian Wälchli 7a5b7f5561
Skip hanging collective test (#18908) 2023-11-01 15:45:25 +01:00
Adrian Wälchli 018a308269
Enable RUF018 rule for walrus assignments in asserts (#18886) 2023-10-30 21:16:02 -04:00
Adrian Wälchli 079544a902
Rename PrecisionPlugin -> Precision (#18840) 2023-10-30 16:53:13 -04:00
Carlos Mocholí 78ad390b5b
Restore support for builds without distributed (#18859)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2023-10-25 02:48:44 +02:00
Carlos Mocholí 5a83f541da
Minor strategy fixes [TPU] (#18774) 2023-10-11 15:26:30 +02:00
Carlos Mocholí 27ad9e9243
xfail collective tests (#18779) 2023-10-11 05:54:55 +02:00
Adrian Wälchli 377534072b
Split `Precision.init_context` (#18734) 2023-10-09 12:34:30 -04:00
Carlos Mocholí 4c83ffd04c
Avoid importing bitsandbytes unless requested (#18680) 2023-10-05 01:10:10 +02:00
Carlos Mocholí e3960749d8
Forbid init_module on-device instantiation with bnb ignored modules (#18704) 2023-10-05 00:57:07 +02:00
Adrian Wälchli d31ef1f7d3
Drop support for PyTorch 1.11 (#18691)
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
2023-10-04 20:30:44 +02:00
pre-commit-ci[bot] c0ec0decec
[pre-commit.ci] pre-commit suggestions (#18697)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
2023-10-03 22:07:21 +02:00
Carlos Mocholí 5120ad20f2
Bitsandbytes precision plugin (#18655)
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
2023-09-29 19:17:18 +02:00
Adrian Wälchli 3cd463efa8
Remove outdated workaround for PyTorch autocast bug (#18634) 2023-09-29 08:33:43 -04:00
Jirka Borovec 830a62a722
ruff: replace isort with ruff +TPU (#17684)
* ruff: replace isort with ruff

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fixing & imports

* lines in warning test

* docs

* fix enum import

* fixing

* import

* fix lines

* type ClusterEnvironment

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-09-26 11:54:55 -04:00
Adrian Wälchli 894952d33e
Avoid redundant input-type casting in FSDP precision (#18630)
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
2023-09-26 08:55:13 -04:00
Carlos Mocholí d8e9eba606
[Fabric] Replace `@contextlib.contextmanager` (#18557) 2023-09-15 17:27:29 +02:00
Carlos Mocholí eb3b96d8bd
Avoid modifying the default dtype on exception (#18500) 2023-09-14 15:32:32 +02:00
Jirka Borovec dbe7ed46a3
replace tests skip with soft xfail (#18486)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-09-12 23:11:03 +02:00
Dan Dale c081b48324
Accommodate FSDP full-precision `param_dtype` training with PyTorch < 2.0 (#18278) 2023-08-14 12:22:26 +02:00
Adrian Wälchli 3142ed5e44
Integration tests for XLA precision (#18286) 2023-08-13 09:20:26 -04:00
Adrian Wälchli c95dbac2e8
Validate Trainer settings against cluster environment (#18292) 2023-08-12 21:26:37 +02:00
Adrian Wälchli 7fe8756917
[TPU] Proper half-precision implementation for XLA (#18213)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-08-11 11:37:41 -04:00
Adrian Wälchli 888466b144
Support true 16-bit precision with FSDP in Trainer (#18219) 2023-08-10 04:15:35 -04:00
Jirka Borovec efa7b2f9ef
docformatter: config with black (#18064)
* docformatter: config with black

* additional_dependencies: [tomli]

* 119

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-08-09 10:44:20 -04:00
Adrian Wälchli 41f0425a8d
Disable auto-detection of Kubeflow environment (#18137) 2023-07-28 05:03:48 -04:00
Carlos Mocholí 4c57c0bc07
[TPU] Do not cancel all jobs when one fails (#18052)
* Update tpu-tests.yml

* Update tpu-tests.yml

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Needs

* if:

* missed this

* Fix issue on multinode

* Latest fixes

* last fix?

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-07-25 14:24:50 +02:00
Carlos Mocholí 0e7e6b31c5
Fix [TPU] tests (#18140)
* Fix [TPU] tests

* More
2023-07-24 15:13:36 +02:00
Carlos Mocholí 3d573d5e79
Fix [TPU] tests (#18136)
* Debug [TPU] tests

* -U

* Uninstall typing extensions

* Minor simplifications

* Silly cancelling logic

* pip3?

* sudo

* More

* Revert "Silly cancelling logic"

This reverts commit ce31d874f3.
2023-07-23 13:39:00 +02:00
Carlos Mocholí e9c42ed11f
More XLA fixes for nightly support (#18085) 2023-07-15 01:16:42 +02:00
Carlos Mocholí 3a55f0c0a1
Minor miscellaneous fixes (#18068) 2023-07-13 06:01:58 -04:00
Adrian Wälchli acc70d0ae5
Support all half-precision modes in FSDP precision plugin (#17807)
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
2023-07-09 18:40:46 +00:00
Adrian Wälchli c03dd38c6c
Refactor more Fabric tests that use the old .run() method (#17930)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-07-03 16:26:58 +02:00
Carlos Mocholí f78db4c674
Remove automatic sharding support with `Fabric.run` or `fabric.launch(fn)` (#17832)
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
2023-06-15 16:02:09 +00:00
Alexander Kreuzer f111bd483b
Fix to Parameters to `MixedPrecisionPlugin` are not validated and do not match doc string (#17687)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2023-06-07 14:35:54 +00:00
Leng Yue 2c8758f0a8
Fix Mix Precision settings for FSDP Plugins (#17670) 2023-05-23 11:35:37 -04:00
Adrian Wälchli 7268670d1a
Support true 16-bit precision with deepspeed (#17576)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-05-12 23:21:32 +00:00
David Carreto Fidalgo 1ade737488
Allow setting the `SLURMEnvironment.main_address` via an env variable (#17596)
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
2023-05-12 11:31:48 +00:00
Carlos Mocholí 54e8095a78
Split `init_module` into `init` + `sharded_model` (#17488) 2023-05-05 15:54:52 +02:00
Jirka Borovec 4413e98e4e
ruff: enable & fixing RET (#17540) 2023-05-05 09:34:40 +00:00
Jirka Borovec 384c203532
ruff: PT some more fixes (#17569) 2023-05-05 08:25:15 +02:00
Jirka Borovec f55d10f5ee
ruff: autofix PT (#17541)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-05-04 11:50:39 -04:00
Carlos Mocholí 6ec9a6bd9e
[TPU] Rename classes to use XLA instead of TPU (#17383) 2023-04-28 12:36:22 -04:00
Carlos Mocholí abc634d17c
Fix setup_model typos in Fabric (#17498) 2023-04-28 00:31:17 +00:00
Adrian Wälchli 614dcdf502
True half-precision support in Fabric (#17287)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
2023-04-27 12:37:33 +00:00