elmuz
cec6ae123d
Fix typo `scrict` -> `strict` in types.py (#19998)
2024-06-20 10:57:35 -04:00
Etay Livne
1e83a1bd32
Check if CometLogger experiment is alive (#19915)
...
Co-authored-by: Etay Livne <etay.livne@mobileye.com>
2024-06-18 13:15:12 -04:00
liambsmith
394c42aaf6
Fix callback call in Fabric Trainer example (#19986)
2024-06-18 13:14:32 -04:00
awaelchli
c1af4d0527
Better graceful shutdown for KeyboardInterrupt (#19976)
2024-06-16 10:43:42 -04:00
PL Ghost
b16e998a6e
Adding test for legacy checkpoint created with 2.3.0 (#19974)
2024-06-16 09:37:39 -04:00
Samuel Larkin
bb511b0baf
Fix minor typo in Trainer's documentation (#19969)
2024-06-13 18:26:46 -04:00
awaelchli
a42484cf8e
Fix failing app tests (#19971)
2024-06-13 20:58:34 +01:00
awaelchli
f6fd046552
Release 2.3.0 (#19954)
2024-06-11 12:38:56 -04:00
William Falcon
a97814af13
Update README.md
2024-06-11 11:01:22 -04:00
William Falcon
fa5da26e39
Update README.md (#19968)
2024-06-11 10:04:51 -04:00
Alexander Jipa
06ea3a0571
Fix resetting epoch loop restarting flag in LearningRateFinder (#19819)
2024-06-07 10:52:58 -04:00
Björn Barz
5fa32d95e3
Ignore parameters causing ValueError when dumping to YAML (#19804)
2024-06-06 18:36:28 -04:00
Douwe den Blanken
4f96c83ba0
Sanitize argument-free object params before logging (#19771)
...
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
2024-06-06 14:51:48 -04:00
Bhavay Malhotra
a611de0c15
Removing numpy requirement from all files in examples/pytorch/domain_templates (#19947)
2024-06-06 11:02:01 -04:00
Mario Vasilev
812ffdec84
Fix `save_last` type annotation for ModelCheckpoint (#19808)
2024-06-05 20:24:45 -04:00
Liyang90
7668a6bf59
Flexible and easy to use HSDP setting (#19504)
...
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
2024-06-05 20:15:03 -04:00
awaelchli
1a6786d682
Destroy process group in atexit handler (#19931)
2024-06-05 19:31:43 -04:00
Gilles Peiffer
b9f215d7fd
Replace usage of `grep -P` with `perl` in `run_standalone_tests.sh` (#19942)
2024-06-05 12:32:56 -04:00
Jirka Borovec
e0b7c04e63
ci/docs: enable dispatch build without warning as errors (#19948)
2024-06-05 12:32:36 -04:00
Yurij Mikhalevich
5aadfa6250
fix(docs): fix broken link to ensure the docs can be built (#19941)
...
* fix(docs): fix broken link to ensure the docs can be built
* nit
2024-06-04 22:11:20 -04:00
awaelchli
8bfbe0c908
Fix strict loading from distributed checkpoints vs PyTorch nightly (#19946)
...
* strict loading
* docstring
2024-06-04 22:09:01 -04:00
Federico Berto
19f0fb978c
Set `_choose_auto_accelerator` to `staticmethod` (#19822)
2024-06-04 21:12:27 -04:00
Alex Spies
351bec7625
Fix typo on `estimated_stepping_batches` property (#19847)
2024-06-04 21:06:16 -04:00
Gilles Peiffer
785f15d148
Remove `numpy` dependencies in `src/lightning/pytorch` (#19841)
2024-06-04 19:45:05 -04:00
Matthew Hoffman
bac82b83a8
Remove unknown `[metadata]` table from `pyproject.toml` (#19904)
2024-06-04 19:43:18 -04:00
Gilles Peiffer
fd86ea7356
Fix typos in CONTRIBUTING.md (#19937)
2024-06-03 21:20:01 +02:00
PL Ghost
a99a6d3af1
Adding test for legacy checkpoint created with 2.2.5 (#19806)
2024-05-31 12:53:54 -04:00
awaelchli
427fdfaf6e
Update docstring for `self.log` about keys in distributed training (#19917)
2024-05-30 19:47:48 +02:00
Ivan Yashchuk
dffc0f96ec
Update FlopCounterMode usage in throughput.py (#19926)
...
`mods` argument is not needed anymore for `FlopCounterMode`:
ffe506e853/torch/utils/flop_counter.py (L595-L596)
2024-05-30 12:14:56 -04:00
awaelchli
95d6b6b9da
Disable skipping training step in distributed training (#19918)
2024-05-30 11:54:48 -04:00
awaelchli
5d7932546d
Update code owners file (#19925)
...
update
2024-05-30 11:50:02 -04:00
awaelchli
014cdd84ed
Update code owners file (#19922)
...
* update code owners
* update
* Update .github/CODEOWNERS
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
---------
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
2024-05-30 06:12:41 -04:00
awaelchli
98005bbed0
Add Studio badge to tensor parallel docs (#19913)
2024-05-28 09:04:55 -04:00
awaelchli
896c2a656a
Error for unsupported precision types with ModelParallelStrategy (#19902)
2024-05-23 13:43:46 -04:00
awaelchli
c09356db1e
(10/10) Support 2D Parallelism - Port Fabric docs to PL (#19899)
2024-05-23 08:55:52 -04:00
awaelchli
7874cd08ec
[TPU] Fix test assertion error from artifacts (#19825)
2024-05-23 07:11:28 -04:00
Jirka Borovec
e0d7ede643
docs: prune unused `linkcode` (#19897)
2024-05-23 11:35:53 +02:00
awaelchli
414c86332e
(9/n) Support 2D Parallelism - Remaining Checkpoint Logic (#19888)
...
Co-authored-by: Luca Antiga <luca.antiga@gmail.com>
2024-05-22 18:13:41 -04:00
Jirka Borovec
fa1126ea53
docs: fix link to CLIP (#19896)
...
* docs: fix link to CLIP
* www
* ignore
2024-05-22 17:46:51 -04:00
awaelchli
341474aaac
(8/n) Support 2D Parallelism - 2D Parallel Fabric Docs (#19887)
2024-05-22 13:47:55 -04:00
awaelchli
8fc7b4ae94
Remove the requirement for FSDPStrategy subclasses to only support GPU (#19894)
2024-05-22 18:31:40 +02:00
awaelchli
987c2c4093
(7/n) Support 2D Parallelism - TP Fabric Docs (#19884)
...
Co-authored-by: Sebastian Raschka <mail@sebastianraschka.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2024-05-22 06:20:40 -04:00
awaelchli
7e87ce05c8
Fix state dict loading in bitsandbytes plugin when checkpoint is already quantized (#19886)
...
* bugfix
* add test
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* update
* add chlog
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-05-21 13:46:01 -04:00
Gilles Peiffer
b1bb3f3173
Update `LearningRateMonitor` docs and tests for `log_weight_decay` (#19805)
2024-05-21 13:31:54 -04:00
awaelchli
d76feef0d6
Enable loss-parallel in example (#19882)
2024-05-20 13:19:38 +02:00
awaelchli
82e6e61bea
Remove redundant code to set the device on the LightningModule (#19877)
...
Co-authored-by: Luca Antiga <luca.antiga@gmail.com>
2024-05-20 06:29:37 +02:00
Luca Antiga
d5bf4b9ed3
[App] Extend retry to 4xx except 400, 401, 403, 404 (#19842)
...
* Extend retry to 4xx except 400, 401, 403, 404
* Remove unused intersphinx mapping for app
---------
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
2024-05-18 22:03:16 -04:00
awaelchli
c8059d7bfd
(6/n) Support 2D Parallelism - Trainer example (#19879)
...
* Add 2D parallel example
* replace with torchtitan code
2024-05-18 20:35:58 -04:00
awaelchli
32e241870b
(5/n) Support 2D Parallelism in Lightning Trainer (#19878)
...
* ModelParallelStrategy for Lightning Trainer
* mypy
* import fix
* fix torchscript errors
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix docs issue
* fix test execution
* Update src/lightning/pytorch/strategies/model_parallel.py
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Luca Antiga <luca.antiga@gmail.com>
2024-05-17 19:03:31 -04:00
awaelchli
1d0c6aae96
(4/n) Support 2D Parallelism - Loading optimizer states correctly (#19872)
...
* Load optimizer state
* move to utility
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-05-17 17:17:32 -04:00