Carlos Mocholí
67459944ea
Avoid FSDP deprecations during save/load with newer torch versions ( #19463 )
...
* Avoid FSDP deprecations during save/load with newer torch versions
* Refactor
* Tests
2024-02-14 19:43:59 +01:00
awaelchli
3fbc29ba21
Fix `CSVLogger` trying to append to file from previous run in same version folder ( #19446 )
2024-02-13 13:59:04 -05:00
awaelchli
3c5a465cfc
Create barrier without timeout in `prepare_data()` ( #19448 )
2024-02-13 12:10:07 +01:00
awaelchli
e950bb4828
Remove the Graphcore IPU integration ( #19405 )
...
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2024-02-12 16:16:02 -05:00
awaelchli
8d4768f2ae
Remove the Bagua integration ( #19445 )
2024-02-12 20:58:52 +01:00
Carlos Mocholí
45103516ad
Delay `Precision.convert_module` until `configure_model` has run ( #19061 )
2024-02-07 16:27:19 -05:00
awaelchli
9624aae07e
Support non-strict loading in Trainer ( #19404 )
2024-02-05 19:57:43 -05:00
awaelchli
277869205a
Update return type of `LightningModule.configure_optimizers()` ( #19408 )
2024-02-05 17:59:56 -05:00
awaelchli
fb0ce03a9c
Fix input validation to support passing `device_mesh` to FSDP ( #19392 )
...
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2024-02-02 06:48:12 -05:00
awaelchli
34a34a0754
Enable saving and loading stateful DataLoaders in Trainer ( #19361 )
2024-01-31 21:11:19 -05:00
Wouter Zwerink
5d178d07b7
Support TQDM_MINITERS env variable ( #19381 )
...
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
2024-01-31 20:17:35 -05:00
Jirka Borovec
6421dd8d4f
precommit: drop Black in favor of Ruff ( #19380 )
2024-01-31 17:09:39 +00:00
awaelchli
6018b0743c
Error message to inform bitsandbytes is only supported on CUDA ( #19360 )
2024-01-29 19:52:28 -05:00
awaelchli
1a59097ab2
Drop support for PyTorch 1.12 ( #19300 )
...
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
2024-01-26 11:44:24 -05:00
Jirka Borovec
3bd133b107
CI: enable testing with coming PT 2.2 ( #19289 )
...
* ci: build dockers for PT 2.2
* py3.12
* --pre --extra-index-url
* typing-extensions
* bump jsonargparse
* install latest jsonargparse
* Add windows skips for Fabric
* convert to xfail
* add pytorch skips
* skip checkpoint consolidation test
* set max torch
---------
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-01-26 16:42:09 +01:00
Laurits Fredsgaard Larsen
3044e83d11
`_restricted_classmethod`: add wrapper, to allow inspection ( #19332 )
2024-01-23 18:23:06 -05:00
awaelchli
b1127e3608
Utility to consolidate sharded checkpoints ( #19213 )
...
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2024-01-23 17:15:22 -05:00
shenmishajing
d02009af76
Fix saving relative symlink for ModelCheckpoint callback ( #19303 )
...
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
2024-01-20 09:32:08 -05:00
awaelchli
6dfaebabe5
Avoid deprecated `load_state_dict` for distributed checkpoints in PyTorch 2.2+ ( #19298 )
2024-01-16 21:09:20 -05:00
awaelchli
23c3454edc
Assert job id when requeuing SLURM job ( #19283 )
2024-01-15 16:25:50 +01:00
awaelchli
6bc27d54a0
Request `torch.cuda` RNG states only if CUDA is available ( #19234 )
2024-01-10 16:16:29 -05:00
pre-commit-ci[bot]
f120c91e9f
[pre-commit.ci] pre-commit suggestions ( #19229 )
...
* [pre-commit.ci] pre-commit suggestions
updates:
- [github.com/pre-commit/pre-commit-hooks: v4.4.0 → v4.5.0](https://github.com/pre-commit/pre-commit-hooks/compare/v4.4.0...v4.5.0)
- [github.com/asottile/pyupgrade: v3.14.0 → v3.15.0](https://github.com/asottile/pyupgrade/compare/v3.14.0...v3.15.0)
- [github.com/astral-sh/ruff-pre-commit: v0.1.3 → v0.1.9](https://github.com/astral-sh/ruff-pre-commit/compare/v0.1.3...v0.1.9)
- [github.com/psf/black: 23.9.1 → 23.12.1](https://github.com/psf/black/compare/23.9.1...23.12.1)
- [github.com/pre-commit/mirrors-prettier: v3.0.3 → v4.0.0-alpha.8](https://github.com/pre-commit/mirrors-prettier/compare/v3.0.3...v4.0.0-alpha.8)
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Update .pre-commit-config.yaml
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* drop unused
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
2024-01-10 13:11:48 -05:00
Carlos Mocholí
a1dd9efcf7
Drop XLA XRT support ( #19232 )
...
* Drop XLA XRT support
* update test
* set launched
* update conftest
* xla available check
---------
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-01-10 18:39:20 +01:00
Shubhashis Roy Dipta
8663460423
Fix warning for Dataloader if num_workers = cpu count = 1 ( #19224 )
...
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
2024-01-08 09:45:02 -05:00
awaelchli
f75f3bc1c6
Simplify `_get_rank()` utility function ( #19220 )
2024-01-02 16:24:52 +01:00
awaelchli
e040ef2f82
Ignore pytest cleanup warning ( #19164 )
2023-12-29 07:52:19 +01:00
awaelchli
3518f9e092
Delay DeepSpeed config setup ( #19209 )
2023-12-24 17:04:04 -05:00
awaelchli
858803236e
Fix ModelCheckpoint tests from incomplete PR ( #19205 )
...
* Update src/lightning/pytorch/trainer/trainer.py
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-12-22 08:49:09 +01:00
awaelchli
59d2600acb
Make saving 'last' checkpoint as symbolic link opt-in ( #19191 )
2023-12-21 11:38:48 -05:00
Carlos Mocholí
c3e2ba52ca
`set_device` before `init_process_group` ( #19184 )
2023-12-21 16:28:16 +01:00
awaelchli
9d25e9aad3
Handle more of the flaky tests ( #19193 )
...
handle more of the flaky tests
2023-12-21 14:04:24 +01:00
Ryan Smith
002a465f84
Fix filtering test names in `run_standalone_tests.sh` when checking for errors ( #19176 )
2023-12-20 21:22:25 -05:00
Abhinav Singh
6d47bf1fac
Fix expanding home directory for Trainer's `default_root_dir` ( #19179 )
2023-12-20 17:08:03 -05:00
Rafał Jankowski
37952fe87d
Use `step` parameter when logging metrics with NeptuneLogger ( #19126 )
2023-12-14 09:55:37 -05:00
Carlos Mocholí
11bac946ff
Drop nvtx test ( #19154 )
2023-12-14 15:43:23 +01:00
Carlos Mocholí
97469c600f
TransformerEngine fallback compute dtype ( #19082 )
...
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2023-12-14 03:02:09 +01:00
Jirka Borovec
2a8789e1c6
ci/tests: cleaning standalone script ( #19141 )
...
* tests: cleaning standalone script
* switch
* from tests
* -m
* collect
* array
* tests_fabric/
* ..
* path prefix
* pl
* cleaning
* test_pytorch_profiler_nested_emit_nvtx
* Apply suggestions from code review
* Apply suggestions from code review
* todo
2023-12-13 13:27:49 -06:00
Carlos Mocholí
7d04de697e
Reorder `configure_model` ( #19060 )
2023-12-05 02:29:32 +01:00
Adrian Wälchli
9bcb983d26
Fix `item_per_sec` metric in ThroughputMonitor ( #19080 )
2023-11-28 21:48:29 -05:00
Adrian Wälchli
482da0a140
Fix ModelCheckpoint alternating between versioned and unversioned file ( #19064 )
2023-11-27 10:18:05 -05:00
AleksanderWWW
af852ff590
Handle checkpoint dirpath suffix in NeptuneLogger ( #18863 )
...
Co-authored-by: Siddhant Sadangi <siddhant.sadangi@gmail.com>
Co-authored-by: Sabine <sabine.nyholm@neptune.ai>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
2023-11-25 08:39:46 -05:00
Adrian Wälchli
58c905b940
Fix ModelCheckpoint dirpath expanding home prefix ( #19058 )
2023-11-23 09:11:43 -05:00
Adrian Wälchli
9a26da8081
Make `ModelCheckpoint._format_checkpoint_name` an instance method ( #19054 )
2023-11-22 19:05:48 -05:00
Yasser Souri
67d3844818
Fix last checkpoint finding in filtered files with correct extension ( #17072 )
...
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
2023-11-21 17:12:02 -05:00
Adrian Wälchli
d4614d043e
Address test flakiness ( #19022 )
...
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2023-11-21 17:11:00 -05:00
Adrian Wälchli
e3be762538
Re-enable dynamo tests that were fixed in PyTorch 2.1 ( #19038 )
2023-11-21 16:30:20 -05:00
Adrian Wälchli
49caddde6e
Call `configure_model()` in `LM.load_from_checkpoint()` ( #19036 )
2023-11-21 09:44:18 -05:00
Adrian Wälchli
f652e6c00e
Fix `rank_zero_only` rank not set in ddp-spawn based strategies ( #19030 )
2023-11-20 10:49:14 -05:00
Adrian Wälchli
4f4c890cd7
Improve handling the positional encoding in Transformer example ( #18987 )
2023-11-19 14:37:31 +01:00
Adrian Wälchli
45c2fcb341
Add AttributeDict container for Fabric ( #18943 )
...
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
2023-11-18 09:25:26 -05:00