Commit Graph

9136 Commits

Author SHA1 Message Date
Carlos Mocholí 5edb3b5126
[TPU] Do not delete jobs with "keepalive" in the name (#17411)
2023-04-19 14:02:02 +02:00
Jirka Borovec 26a549e87f
drop failing e2e quick app (#17409)
* drop failing e2e quick app

* codeowners

* Apply suggestions from code review
2023-04-19 12:31:56 +02:00
Adrian Wälchli 21ae19c69f
Add dynamo RunIf skip condition (#17404)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-04-19 01:09:42 +02:00
Liyang90 47726391ad
[TPU] Add support for PJRT from PyTorch/XLA 2.0 (#17352)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2023-04-18 18:52:36 +02:00
Carlos Mocholí b0b5b9ce57
[TPU] Fix workflow (#17406)
* [TPU] Fix workflow

* Whitespace
2023-04-18 18:11:03 +02:00
Carlos Mocholí 90ad36795a
[TPU] Refactor availability check (#17384) 2023-04-18 17:52:13 +02:00
Dmitry Frolov 7893d22b6f
Bump default E2E tests image version (#17403)
Bump default E2E image version

Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
2023-04-18 10:54:59 -04:00
Carlos Mocholí 3fd4d96cf3
Fix PyTorch MPS test failure in master (#17405) 2023-04-18 16:53:53 +02:00
Ryan Smith 8d5a91a2dd
Update Fabric CPU tests to work on GPU machines (#17391) 2023-04-18 14:03:40 +00:00
Carlos Mocholí 369ae61f85
[TPU] Fix workflow condition (#17379) 2023-04-18 08:30:56 -04:00
Carlos Mocholí ae8838b0a1
GPU suggestion does not require devices anymore (#17217) 2023-04-18 13:12:16 +02:00
Carlos Mocholí 3511de3693
Replace range with RandomDataset in example (#17325) 2023-04-18 13:10:31 +02:00
belerico cf5e63dcfa
Fabric PPO example `share_data` flag (#17397) 2023-04-18 12:56:12 +02:00
Ruslan Mukhametshin 7312b3f321
Add `header_style` argument to `RichModelSummary` callback (#16788) 2023-04-18 12:55:55 +02:00
Jirka Borovec 6da4b0f490
skip some App tests (#17401)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-04-18 11:20:07 +02:00
Carlos Mocholí c22276f945
Update `from_datasets` to support arbitrary iterables (#17402) 2023-04-17 23:27:24 +00:00
Egor Spirin bb4e495627
Wraps sharded model for proper access to its `state_dict` in `FSDP` strategy (#16558)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2023-04-18 01:03:43 +02:00
Adrian Wälchli affe72cc3e
Add test for compiling FSDP model in Fabric (#17394) 2023-04-17 15:34:23 -04:00
Carlos Mocholí 16377339cb
Remove reference to outdated Kaggle tutorial (#17390) 2023-04-17 19:12:44 +02:00
Adrian Wälchli 71df9962b6
Update pip upgrade command in CI (#17395) 2023-04-17 17:05:57 +02:00
Carlos Mocholí b02d9c1590
Cleanup the `is_distributed` property [TPU] (#17381) 2023-04-17 16:14:07 +02:00
Adrian Wälchli 0dc42f523e
Save and load sharded checkpoints with FSDP in Fabric (#17323)
Co-authored-by: Luca Antiga <luca.antiga@gmail.com>
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-04-16 14:11:49 -04:00
Carlos Mocholí 13905a3464
Support all `CombinedLoader` modes during evaluation (#17163) 2023-04-16 20:01:52 +02:00
Ishan Dutta e9d6856355
NumPy to Torch for lightning/fabric (#17291)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2023-04-15 15:21:56 +00:00
Carlos Mocholí 97a61868fb
Sync module states during non-fit (#17370)
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
2023-04-15 02:35:51 +00:00
Ishan Dutta 9becc15ddf
Remove numpy from src/lightning/pytorch and use torch only (#17278)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-04-15 02:26:56 +00:00
Arturo Ghinassi b5d6bb6e8d
Added return in `convert_zero_checkpoint_to_fp32_state_dict` (#17342)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-04-15 02:18:16 +00:00
OneSixth a69fbcf71b
Add a warning to the detect_anomaly flag. (#17380)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2023-04-15 02:08:48 +00:00
Ryan Smith e1ce887fde
Fix `load_from_checkpoint` to return model on correct device (#17308)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
2023-04-15 02:08:00 +00:00
Alican Bozkurt 84eb82a595
change STEP_OUTPUT type from dict to mapping (#17387) 2023-04-15 02:02:11 +00:00
Jirka Borovec 8e7b949ef3
App: drop flaky doctest example (#17366)
* skip app doctest

* Proposed change

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* delete

---------

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-04-14 23:48:03 +02:00
Carlos Mocholí da883e73b9
[TPU] Use `pull_request_target` event (#17377) 2023-04-14 17:10:41 +02:00
Carlos Mocholí 05b481e3ae
[TPU] Add testing matrix with PJRT (#17368)
* Replace GKE in CI with manual gcloud usage

* Fix XRT test

* Reduce timeout to 35 minutes

* [TPU] Run tests with PJRT

* runtime as part of the job name

* CHANGELOG

* Update for app too
2023-04-14 16:39:13 +02:00
Carlos Mocholí 856b29fc72
[TPU] Replace GKE in CI with manual gcloud usage (#17362) 2023-04-14 12:47:31 +00:00
Adrian Wälchli 50662eb078
Fixes around `Strategy.set_world_ranks` (#16966)
* don't call set_world_ranks in xla strategy

* update

* fabric and other strategies

* CHANGELOG

* Typos

* Reuse test

---------

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2023-04-13 17:45:42 +02:00
Adrian Wälchli 17548d5cfd
Update link to image in Fabric PPO example (#17360) 2023-04-13 13:08:55 +02:00
Jirka Borovec f14ee9edbc
docker: fix building PL image (#17353) 2023-04-12 17:52:42 -04:00
Carlos Mocholí afd3123486
Update CHANGELOG after the 2.0.1 release (#17235) 2023-04-12 17:44:17 +02:00
Carlos Mocholí 04fb30bd97
Update fastapi dependency pins (#17173)
* Update fastapi dependency pins

* Apply suggestions from code review

* Update test.txt

* Update requirements/app/base.txt

* Revert "Update requirements/app/base.txt"

This reverts commit 59918ffc6c.

* cloud update

* Bad merge

* fastapi 0.69.0 which pins starlette 0.15.0

* https://github.com/pydantic/pydantic/issues/1985

* Avoid CVE: https://github.com/tiangolo/fastapi/pull/3213

* Strict trio

* Skip windows test

---------

Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
2023-04-12 17:33:56 +02:00
Carlos Mocholí 472333c9b8
Replace codecov pip package with codecov uploader (#17349) 2023-04-12 17:24:48 +02:00
Carlos Mocholí 1aa23267ab
Various Fabric documentation updates (#17236) 2023-04-11 23:05:57 +00:00
Carlos Mocholí 0489f2efed
[TPU] v4 support (#17227) 2023-04-11 22:24:11 +00:00
Gerson Kroiz 7b8fd85e01
[TPU] Remove error check for IterableDatasets (#17331)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2023-04-11 22:04:17 +00:00
Adrian Wälchli 0c02c44c6d
Simplified setup of optimizers in FSDP (#17309)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-04-11 20:13:41 +00:00
Adrian Wälchli 51697a8bd6
Combined setup of model and optimizer with FSDP (#17305)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-04-11 19:58:53 +00:00
Jirka Borovec 86dbe38913
fix missing tutorials (#17311) 2023-04-11 15:18:18 -04:00
dependabot[bot] 78cd51a2ce
Bump peter-evans/create-pull-request from 4 to 5 (#17313)
Bumps [peter-evans/create-pull-request](https://github.com/peter-evans/create-pull-request) from 4 to 5.
- [Release notes](https://github.com/peter-evans/create-pull-request/releases)
- [Commits](https://github.com/peter-evans/create-pull-request/compare/v4...v5)

---
updated-dependencies:
- dependency-name: peter-evans/create-pull-request
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-04-11 20:56:40 +02:00
Ethan Harris 621644d0bc
App: Fix frontends when using multiprocessing in the cloud (#17324)
* App: Fix frontends when using multiprocessing in the cloud

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update CHANGELOG.md

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-04-11 12:28:44 -04:00
Carlos Mocholí b2717f6878
[TPU] Improve TPU workflow (#17237)
* Trigger TPU tests if [TPU] is in the PR title

* Remove TODO

* checkgroup

* DEBUG

* Update

* 1h timeout

* Update

* Update

* Update

* Update

* Remove DEBUG
2023-04-11 16:33:32 +02:00
Carlos Mocholí ebb62214c6
checkgroup 2023-04-11 16:30:58 +02:00