849 Commits

Author SHA1 Message Date
Carlos Mocholí 67459944ea
Avoid FSDP deprecations during save/load with newer torch versions (#19463)
* Avoid FSDP deprecations during save/load with newer torch versions

* Refactor

* Tests
2024-02-14 19:43:59 +01:00
awaelchli 3fbc29ba21
Fix `CSVLogger` trying to append to file from previous run in same version folder (#19446) 2024-02-13 13:59:04 -05:00
awaelchli 3c5a465cfc
Create barrier without timeout in `prepare_data()` (#19448) 2024-02-13 12:10:07 +01:00
awaelchli e950bb4828
Remove the Graphcore IPU integration (#19405)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2024-02-12 16:16:02 -05:00
awaelchli 8d4768f2ae
Remove the Bagua integration (#19445) 2024-02-12 20:58:52 +01:00
Carlos Mocholí 45103516ad
Delay `Precision.convert_module` until `configure_model` has run (#19061) 2024-02-07 16:27:19 -05:00
awaelchli 9624aae07e
Support non-strict loading in Trainer (#19404) 2024-02-05 19:57:43 -05:00
awaelchli 277869205a
Update return type of `LightningModule.configure_optimizers()` (#19408) 2024-02-05 17:59:56 -05:00
awaelchli fb0ce03a9c
Fix input validation to support passing `device_mesh` to FSDP (#19392)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2024-02-02 06:48:12 -05:00
awaelchli 34a34a0754
Enable saving and loading stateful DataLoaders in Trainer (#19361) 2024-01-31 21:11:19 -05:00
Wouter Zwerink 5d178d07b7
Support TQDM_MINITERS env variable (#19381)
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
2024-01-31 20:17:35 -05:00
Jirka Borovec 6421dd8d4f
precommit: drop Black in favor of Ruff (#19380) 2024-01-31 17:09:39 +00:00
awaelchli 6018b0743c
Error message to inform bitsandbytes is only supported on CUDA (#19360) 2024-01-29 19:52:28 -05:00
awaelchli 1a59097ab2
Drop support for PyTorch 1.12 (#19300)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
2024-01-26 11:44:24 -05:00
Jirka Borovec 3bd133b107
CI: enable testing with coming PT 2.2 (#19289)
* ci: build dockers for PT 2.2
* py3.12
* --pre --extra-index-url
* typing-extensions
* bump jsonargparse
* install latest jsonargparse
* Add windows skips for Fabric
* convert to xfail
* add pytorch skips
* skip checkpoint consolidation test
* set max torch

---------

Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-01-26 16:42:09 +01:00
Laurits Fredsgaard Larsen 3044e83d11
`_restricted_classmethod`: add wrapper, to allow inspection (#19332) 2024-01-23 18:23:06 -05:00
awaelchli b1127e3608
Utility to consolidate sharded checkpoints (#19213)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2024-01-23 17:15:22 -05:00
shenmishajing d02009af76
Fix saving relative symlink for ModelCheckpoint callback (#19303)
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
2024-01-20 09:32:08 -05:00
awaelchli 6dfaebabe5
Avoid deprecated `load_state_dict` for distributed checkpoints in PyTorch 2.2+ (#19298) 2024-01-16 21:09:20 -05:00
awaelchli 23c3454edc
Assert job id when requeuing SLURM job (#19283) 2024-01-15 16:25:50 +01:00
awaelchli 6bc27d54a0
Request `torch.cuda` RNG states only if CUDA is available (#19234) 2024-01-10 16:16:29 -05:00
pre-commit-ci[bot] f120c91e9f
[pre-commit.ci] pre-commit suggestions (#19229)
* [pre-commit.ci] pre-commit suggestions

updates:
- [github.com/pre-commit/pre-commit-hooks: v4.4.0 → v4.5.0](https://github.com/pre-commit/pre-commit-hooks/compare/v4.4.0...v4.5.0)
- [github.com/asottile/pyupgrade: v3.14.0 → v3.15.0](https://github.com/asottile/pyupgrade/compare/v3.14.0...v3.15.0)
- [github.com/astral-sh/ruff-pre-commit: v0.1.3 → v0.1.9](https://github.com/astral-sh/ruff-pre-commit/compare/v0.1.3...v0.1.9)
- [github.com/psf/black: 23.9.1 → 23.12.1](https://github.com/psf/black/compare/23.9.1...23.12.1)
- [github.com/pre-commit/mirrors-prettier: v3.0.3 → v4.0.0-alpha.8](https://github.com/pre-commit/mirrors-prettier/compare/v3.0.3...v4.0.0-alpha.8)

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update .pre-commit-config.yaml

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* drop unused

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
2024-01-10 13:11:48 -05:00
Carlos Mocholí a1dd9efcf7
Drop XLA XRT support (#19232)
* Drop XLA XRT support
* update test
* set launched
* update conftest
* xla available check
---------

Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-01-10 18:39:20 +01:00
Shubhashis Roy Dipta 8663460423
Fix warning for Dataloader if num_workers = cpu count = 1 (#19224)
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
2024-01-08 09:45:02 -05:00
awaelchli f75f3bc1c6
Simplify `_get_rank()` utility function (#19220) 2024-01-02 16:24:52 +01:00
awaelchli e040ef2f82
Ignore pytest cleanup warning (#19164) 2023-12-29 07:52:19 +01:00
awaelchli 3518f9e092
Delay DeepSpeed config setup (#19209) 2023-12-24 17:04:04 -05:00
awaelchli 858803236e
Fix ModelCheckpoint tests from incomplete PR (#19205)
* Update src/lightning/pytorch/trainer/trainer.py

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-12-22 08:49:09 +01:00
awaelchli 59d2600acb
Make saving 'last' checkpoint as symbolic link opt-in (#19191) 2023-12-21 11:38:48 -05:00
Carlos Mocholí c3e2ba52ca
`set_device` before `init_process_group` (#19184) 2023-12-21 16:28:16 +01:00
awaelchli 9d25e9aad3
Handle more of the flaky tests (#19193)
handle more of the flaky tests
2023-12-21 14:04:24 +01:00
Ryan Smith 002a465f84
Fix filtering test names in `run_standalone_tests.sh` when checking for errors (#19176) 2023-12-20 21:22:25 -05:00
Abhinav Singh 6d47bf1fac
Fix expanding home directory for Trainer's `default_root_dir` (#19179) 2023-12-20 17:08:03 -05:00
Rafał Jankowski 37952fe87d
Use `step` parameter when logging metrics with NeptuneLogger (#19126) 2023-12-14 09:55:37 -05:00
Carlos Mocholí 11bac946ff
Drop nvtx test (#19154) 2023-12-14 15:43:23 +01:00
Carlos Mocholí 97469c600f
TransformerEngine fallback compute dtype (#19082)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2023-12-14 03:02:09 +01:00
Jirka Borovec 2a8789e1c6
ci/tests: cleaning standalone script (#19141)
* tests: cleaning standalone script

* switch

* from tests

* -m

* collect

* array

* tests_fabric/

* ..

* path prefix

* pl

* cleaning

* test_pytorch_profiler_nested_emit_nvtx

* Apply suggestions from code review

* Apply suggestions from code review

* todo
2023-12-13 13:27:49 -06:00
Carlos Mocholí 7d04de697e
Reorder `configure_model` (#19060) 2023-12-05 02:29:32 +01:00
Adrian Wälchli 9bcb983d26
Fix `item_per_sec` metric in ThroughputMonitor (#19080) 2023-11-28 21:48:29 -05:00
Adrian Wälchli 482da0a140
Fix ModelCheckpoint alternating between versioned and unversioned file (#19064) 2023-11-27 10:18:05 -05:00
AleksanderWWW af852ff590
Handle checkpoint dirpath suffix in NeptuneLogger (#18863)
Co-authored-by: Siddhant Sadangi <siddhant.sadangi@gmail.com>
Co-authored-by: Sabine <sabine.nyholm@neptune.ai>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
2023-11-25 08:39:46 -05:00
Adrian Wälchli 58c905b940
Fix ModelCheckpoint dirpath expanding home prefix (#19058) 2023-11-23 09:11:43 -05:00
Adrian Wälchli 9a26da8081
Make `ModelCheckpoint._format_checkpoint_name` an instance method (#19054) 2023-11-22 19:05:48 -05:00
Yasser Souri 67d3844818
Fix last checkpoint finding in filtered files with correct extension (#17072)
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
2023-11-21 17:12:02 -05:00
Adrian Wälchli d4614d043e
Address test flakiness (#19022)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2023-11-21 17:11:00 -05:00
Adrian Wälchli e3be762538
Re-enable dynamo tests that were fixed in PyTorch 2.1 (#19038) 2023-11-21 16:30:20 -05:00
Adrian Wälchli 49caddde6e
Call `configure_model()` in `LM.load_from_checkpoint()` (#19036) 2023-11-21 09:44:18 -05:00
Adrian Wälchli f652e6c00e
Fix `rank_zero_only` rank not set in ddp-spawn based strategies (#19030) 2023-11-20 10:49:14 -05:00
Adrian Wälchli 4f4c890cd7
Improve handling the positional encoding in Transformer example (#18987) 2023-11-19 14:37:31 +01:00
Adrian Wälchli 45c2fcb341
Add AttributeDict container for Fabric (#18943)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
2023-11-18 09:25:26 -05:00