Commit Graph

222 Commits

Author SHA1 Message Date
Adrian Wälchli e87c11a592
Upgrade GPU CI to PyTorch 1.13 (#15583)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
2022-11-12 14:58:37 +00:00
Carlos Mocholí a3edbec501
Delete unused TPU CI files (#15611)
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
2022-11-11 18:30:02 +00:00
Carlos Mocholí 6ba00af1e0
Drop PyTorch 1.9 support (#15347)
* Drop 1.9

* Everything else

* READMEs

* Missed some

* IPU skips

* Remove exception type

* Add back
2022-11-10 08:59:13 -05:00
Jerome Anand e79a69a9ee
Upgrade to HPU release 1.7.0 (#15616)
Signed-off-by: Jerome <janand@habana.ai>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2022-11-10 10:47:17 +01:00
Jirka Borovec fb9dae8df3
ci: update install lite & cut pkg dependency (#14517)
* ci: update install lite

* try without lite in req file

* ci: install

* app

* init

* Revert "app"

This reverts commit f3f09e7888.

* ci: cpu

* ci: gpu

* pkg

* env

* bench

* trigger

* notes

* prune

* set version

* fix version

* git reset

* hpu, ipu

* adjust

* --hard

* git checkout

* Apply suggestions from code review

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>

* rc2

* L

* docs

* hpu

Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
Co-authored-by: Luca Antiga <luca.antiga@gmail.com>
2022-10-31 20:50:51 +01:00
Carlos Mocholí 7f3e9de726
Fix TPU tests on master builds (#15349)
2022-10-31 15:58:02 +00:00
Jirka Borovec 95ae393ca8
LAI: creating mirror package (#15105)
* placeholder

* mirror + prune

* makedir

* setup

* ci

* ci

* name

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* ci clean

* empty

* py

* parallel

* doctest

* flake8

* ci

* typo

* replace

* clean

* Apply suggestions from code review

* re.sub

* fix UI path

* full replace

* ui path?

* replace

* updates

* regex

* ci

* fix

* ci

* path

* ci

* replace

* Update .actions/setup_tools.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* also convert lightning_lite tests for PL tests to adapt mocking paths

* fix app example test

* update logger propagation for PL tests

* update logger propagation for PL tests

* Apply suggestions from code review

* Revert "update logger propagation for PL tests"

This reverts commit c1a5e119c7.

* playwright

* py

* update import in tests

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* try edit import in overwrite

* debug code

* rev playwright

* Revert "try edit import in overwrite"

This reverts commit c02f766521.

* ci: adjust examples

* adjust examples cloud

* mock lightning_app

* Install assistant dependencies

* lightning

* setup

* Apply suggestions from code review

* Apply suggestions from code review

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Apply suggestions from code review

* disable cache

* move doctest to install

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* )

* echo ./

* ci

* lru

* revert disabling cache, prints

* ci

* prune ci jobs

* prune ci jobs

* training loop standalone tests

* add sys modules cleanup fixture

* make use of fixture

* revert standalone

* ci e2e

* fix imports in lightning

* fix imports of lightning in tests

* Revert "make use of fixture"

This reverts commit c15efdd205.

* Revert other commits for fixtures

* revert use of fixture

* py3.9

* fix mocking

* fix paths

* hack mocking

* docs

* Apply suggestions from code review

* rev suggestion

* Minor changes to the parametrizations

* Update checkgroup with the new and changed jobs

* include frontend dir

* cli

* fix imports and entry point

* Revert standalone

* rc1

* e2e on staging

* Revert "Revert standalone"

This reverts commit 9df96685b8.

* groups

* to

* ci: pt ver

* docker

* Apply suggestions from code review

* Copy over changes from previous commit to other groups

* Add back changes from bad merge

* Uppercase step name everywhere

* update

* ci

* ci: lai oldest

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
Co-authored-by: Justus Schock <justus.schock@posteo.de>
Co-authored-by: manskx <ahmed.mansy156@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: thomas chaton <thomas@grid.ai>
Co-authored-by: Luca Antiga <luca.antiga@gmail.com>
2022-10-27 12:32:49 +02:00
Carlos Mocholí 375ab53861
Migrate TPU tests to GitHub actions (#14687)
* Migrate TPU tests to GitHub actions

* No working dir

* Keep _target

* Dont skip draft

* CHECK_SLEEP

* Not yet

* Remove recurrent cleanup script

* Set secrets

* a step cannot have both the `uses` and `run` keys

* Version $PYTHON_VER was not found in the local cache

* can't load package ... ($GOPATH not set)

* The `set-env` command is disabled

* Try updating go

* Match timeout

* simplify path

* More cleanup

* Install coverage. Unmark draft

* Update .github/workflows/ci-pytorch-test-tpu.yml

* DEBUG echo

* Revert "DEBUG echo"

This reverts commit 4011856e6e.

* More debug

* SSH

* Im stupid

* Remove always()

* Forgot some

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Luca Antiga <luca.antiga@gmail.com>
2022-10-21 20:01:39 +02:00
otaj 099580cf2b
Assistant fixes (#15221)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-10-20 18:23:47 +00:00
Justus Schock 775e9ebc0f
Assistant for Unified Package (#15207)
* Update assistant and workflow files
* Update .actions/assistant.py

Co-authored-by: otaj <6065855+otaj@users.noreply.github.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: otaj <ota@lightning.ai>
2022-10-20 14:17:27 +00:00
Jirka Borovec 4b9d028541
CI: enable CI run for PT 1.13 (#15128)
* Apply suggestions from code review
* enable CI to run for PT 1.13

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-10-20 08:33:56 +00:00
ver217 2fef6d9403
Add ColossalAI strategy (#14224)
Co-authored-by: HELSON <c2h214748@gmail.com>
Co-authored-by: rohitgr7 <rohitgr1998@gmail.com>
Co-authored-by: otaj <ota@lightning.ai>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-10-11 13:59:09 +02:00
Jirka Borovec 5f106957f7
CI: Use self-hosted Azure GPU runners (#14632)
* move config
* Apply suggestions from code review

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
Co-authored-by: otaj <6065855+otaj@users.noreply.github.com>
2022-10-05 10:43:54 +00:00
Carlos Mocholí 7ef87464dd
Refactor XLA and TPU checks across codebase (#14550)
2022-10-04 22:54:14 +00:00
Carlos Mocholí 3028fd287d
Fix TPU test CI (#14926)
* Fix TPU test CI

* +x first

* Lite first to uncover errors faster

* Fixes

* One more

* Simplify XLALauncher wrapping to avoid pickle error

* debug

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Debug commit successful. Trying local definitions

* Require tpu for mock test

* ValueError: The number of devices must be either 1 or 8, got 4 instead

* Fix mock test

* Simplify call, rely on defaults

* Skip OSError for now. Maybe upgrading will help

* Simplify launch tests, move some to lite

* Stricter typing

* RuntimeError: Accessing the XLA device before processes have spawned is not allowed.

* Revert "RuntimeError: Accessing the XLA device before processes have spawned is not allowed."

This reverts commit f65107ebf3.

* Alternative boring solution to the reverted commit

* Fix failing test on CUDA machine

* Workarounds

* Try latest mkl

* Revert "Try latest mkl"

This reverts commit d06813aa67.

* Wrong exception

* xfail

* Mypy

* Comment change

* Spawn launch refactor

* Accept that we cannot lazy init now

* Fix mypy and launch test failures

* The base dockerfile already includes mkl-2022.1.0 - what if we use it?

* try a different mkl version

* Revert mkl version changes

Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
2022-10-03 09:13:33 -04:00
Akihiro Nitta e47d5a2376
CI: Combine conda and full testing into a single workflow (#14387)
* Remove conda job

* Remove conda job from readme

* Remove conda jobs from checkgroup

* Remove conda from docker builds

* Remove base-conda dockerfile

* Rewrite the strategy matrix while keeping equivalent

* Run the workflow on this branch

* Revert "Rewrite the strategy matrix while keeping equivalent"

This reverts commit e54298d60e57cffbf8107890987be3fe4a006c77.

* Add PyTorch versions

* Run on draft and disable unrelated costly CI

* Revert "Run the workflow on this branch"

This reverts commit 51ed8b905d8926b630dce4817124bd486135d3ec.

* tmp: Lightweight relevant CI

* Fix CI pathfilter

* Update matrix

* Drop skipping logic

* pip list

* reorder pip list

* tmp: lightweight ci

* Install specified pytorch

* Fix torch installation

* Uncomment steps

* Increase timeout

* bad merge

* Revert "Run on draft and disable unrelated costly CI"

This reverts commit eb5dc5e6bd.

* Update checkgroup

* Update docs and remove Python/PyTorch versions

* Remove pip-list

* Fail if wrong pytorch version installed

* Add Python 3.8, PyTorch 1.9 job

* tmp: remove azure jobs

* tmp: remove dockers

* tmp: remove others

* Run all combinations

* Include oldest

* Exclude no Python 3.10 distributions

* tmp: no concurrency

* tmp: double timeout

* Add pytest log reporter

* Add pytest-reportlog

* Fewer jobs

* Revert "tmp: no concurrency"

This reverts commit 4a7978dcb3.

* fix artifact name

* Revert test reports

* Revert unrelated changes

* Revert unrelated changes

* Add the combination of ex-conda jobs

* Update checkgroup

* revert timeout

* remove conda job

* revert docker build workflow file

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2022-09-29 22:39:04 -04:00
Jerome Anand 136d57312d
Upgrade HPU image to release 1.6.1 (#14932)
2022-09-29 11:22:27 +00:00
otaj b06f9b7468
Improve building times of IPU docker image (#14934)
2022-09-29 09:55:12 +00:00
Akarsha Rao f167d76508
CI: HPU support v1.6.0 release (#14794)
* Update hpu-tests.yml to support v1.6.0 release
* Update Dockerfile
2022-09-20 12:26:27 +02:00
Carlos Mocholí dfa570ef9f
Run CircleCI with the HEAD sha, not the base (#14625)
* Run CircleCI with the HEAD sha, not the base
* Different solution
2022-09-12 11:25:54 -04:00
Rui Wang 40868f7f43
Add bagua support for CUDA 11.6 images (#14529)
* Add support for bagua-cuda116

* Remove bagua-cuda115 from installation

Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
2022-09-09 20:07:25 +00:00
Adrian Wälchli 291dc1b615
Standalone Lite CI setup (#14451)
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2022-09-01 22:13:12 +00:00
Carlos Mocholí 00aefa82b7
Cleanup TPU CI script error management (#14389)
2022-08-31 11:38:54 +00:00
Jirka Borovec 74304db6f8
CI: update TPU docker (#14448)
2022-08-31 00:47:38 +05:30
Carlos Mocholí 3ba0f56b18
Remove support for the deprecated torchtext legacy (#14375)
2022-08-26 20:01:51 +00:00
otaj 1ae14ca754
[CI] fix horovod tests (#14382)
2022-08-25 17:30:06 +00:00
Adrian Wälchli 34f98836fb
Fix silent TPU CI failures (#14034)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-08-24 13:24:24 +00:00
otaj 0bd5703b81
[CI] Trick Bagua into installing appropriate wheel in GPU tests (#14380)
Bagua trick needs to be replicated on everywhere applicable
2022-08-24 08:59:49 +00:00
otaj bb634310e7
[CI] Bump CUDA in Docker images to 11.6.1 (#14348)
* bump cuda in docker images to 11.6.1

* PUSH TO HUB. REVERT THIS!

* conda forge for 11.6

* cuda 11.5

* revert conda changes

* 11.6 back again

* 11.6 back again, all of them

* maybe all passes now

* maybe all passes now

* final push

* Revert "PUSH TO HUB. REVERT THIS!"

This reverts commit 602bfce224.

* Apply suggestions from code review

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2022-08-23 12:10:52 -04:00
Akihiro Nitta d5f35ece72
CI/CD: Add CUDA version to docker image tags (#13831)
* append cuda version to tags

* revertme: push to hub

* Update docker readme

* Build base-conda-py3.9-torch1.12-cuda11.3.1

* Use new images in conda tests

* revertme: push to hub

* Revert "revertme: push to hub"

This reverts commit 0f7d534b2a.

* Revert "revertme: push to hub"

This reverts commit 46a05fccbb.

* Run conda if workflow edited

* Run gpu testing if workflow edited

* Use new tags in release/Dockerfile

* Build base-cuda and PL release images with all combinations

* Update release docker

* Update conda from py3.9-torch1.12 to py3.10-torch.1.12

* Fix ubuntu version

* Revert conda

* revertme: push to hub

* Don't build Python 3.10 for now...

* Fix pl release builder

* updating version contribute to the error? https://github.com/docker/buildx/issues/456

* Update actions' versions

* Update slack user to notify

* Don't use 11.6.0 to avoid bagua incompatibility

* Don't use 11.1, and use 11.1.1

* Update .github/workflows/ci-pytorch_test-conda.yml

Co-authored-by: Luca Medeiros <67411094+luca-medeiros@users.noreply.github.com>

* Update trigger

* Ignore artifacts from tutorials

* Trim docker images to distribute

* Add an image for tutorials

* Update conda image 3.8x1.10

* Try different conda variants

* No need to set cuda for conda jobs

* Update who to notify ipu failure

* Don't push

* update filename

Co-authored-by: Luca Medeiros <67411094+luca-medeiros@users.noreply.github.com>
2022-08-10 10:37:50 +00:00
Akihiro Nitta 0883971ccb
CI: Update XLA from 1.9 to 1.12 (#14013)
2022-08-05 05:04:45 -04:00
Adrian Wälchli caaf35689c
Improvements to standalone scripts (#13840)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-07-28 23:33:22 +00:00
Carlos Mocholí 1299e4f984
Run GPU tests with PyTorch 1.12 (#13716)
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
2022-07-28 19:37:57 +05:30
Adrian Wälchli fff62f0ae5
Fix TPU testing and collect all tests (#11098)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
2022-07-27 15:40:40 +00:00
Adrian Wälchli a8d7b4476c
Fix PyTorch spelling errors (#13774)
* Fix PyTorch spelling errors

* more
2022-07-25 12:51:16 -04:00
Jirka Borovec 64e8e8eb4b
CI: debug HPU flow (#13419)
* Update the hpu-tests.yml to pull docker from vault
* fire & sudo
* habana-gaudi-hpus
* Check the driver status on gaudi server (#13718)

Co-authored-by: arao <arao@habana.ai>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Akarsha Rao <94624926+raoakarsha@users.noreply.github.com>
2022-07-20 12:35:01 +02:00
Jirka Borovec e23756b15d
CI: debug TPU failing tests (#13679)
* list pytest

* docs

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* list

* test

* fix GK

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2022-07-15 17:40:04 -04:00
Jirka Borovec 954fd7e5a3
bump base NGC image (#13346)
2022-07-15 21:36:19 +00:00
Jirka Borovec aa62fe36df
add testing PT 1.12 (#13386)
* add testing PT 1.12
* Fix quantization tests
* Fix another set of tests
* Fix check since https://github.com/pytorch/pytorch/pull/80139 is only going to be available for 1.13
* Skip this test for now for 1.12

Co-authored-by: SeanNaren <sean@grid.ai>
2022-07-15 19:41:23 +02:00
Adrian Wälchli bb5e8be2e8
Simplify TPUSpawn rank management (#11163)
Co-authored-by: Kaushik B <kaushikbokka@gmail.com>
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
2022-07-14 15:43:41 +00:00
Kaushik B 56ff89743b
Fix TPU circleci tests (#13432)
* Fix TPU circleci tests

* Fix TPU circleci tests

* Fix TPU circleci tests

* Fix TPU circleci tests

* Fix TPU circleci tests

* Fix rank issue

* Fix rank issue

* debug alternative fix

* Revert properties

Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
2022-07-11 13:25:32 -04:00
Jirka Borovec 30dce29005
fix PL release docker (#13439)
2022-06-29 19:36:36 +02:00
Jirka Borovec b137ef7134
CI: fix requirements freeze (#13441)
* allow freeze

* ci

* typo

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* ipu

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2022-06-29 09:35:57 -04:00
awaelchli 511f1a6515
Reroute profiler to profilers (#12308)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
2022-06-22 20:55:39 -04:00
Adrian Wälchli b08259d536
Add `XLAEnvironment` plugin (#11330)
* add xla environment class
* add api reference
* integrate
* use xenv
* remove properties

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Kaushik B <kaushikbokka@gmail.com>
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
2022-06-22 10:57:50 +02:00
Carlos Mocholí ad87d2cad0
Future 5/n: Move requirements (#13306)
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
2022-06-21 17:11:33 +02:00
Akarsha Rao 388ea92386
Update HPU Dockerfile to latest version (#13344)
2022-06-21 17:08:44 +02:00
Jirka Borovec 8ceab223c0
Fix repository links (#13304)
* GH org rename Lightning-AI

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* repo name

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2022-06-15 19:33:43 -04:00
Jirka Borovec ab59f308b1
Future 4/n: test & legacy in test/ folder (#13295)
* move: legacy >> test/

* move: tests >> test/

* rename unittests

* update CI

* tests4pl

* tests_pytorch

* proxi

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* ci

* link

* cli

* standalone

* fixing

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* .

* Apply suggestions from code review

Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* alone

* test -> tests

* Standalone fixes

* ci

* Update

* More fixes

* Fix coverage

* Fix mypy

* mypy

* Empty-Commit

* Fix

* mypy just for pl

* Fix standalone

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-06-15 18:10:49 -04:00
Jirka Borovec 9cc714cdd1
Future 2/n: stand-alone examples (#13294)
* move: pl_examples >> src/

* convert pl_examples package to plain examples

* update CI for examples

* ci

* missing

* install
2022-06-15 08:53:51 -04:00
Jirka Borovec 759e89df21
Future 1/n: package in src/ folder (#13293)
* move: pytorch_lightning >> src/

* update setup & install

* update CI

* ci

* update CI for examples

* Self review

* mypy

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* ci

* make

* docs

* typo

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* ci: gpu

* .

* hpu

* typing

* docs

* tpu

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2022-06-14 20:54:55 -04:00
Carlos Mocholí 0cf9d73d28
Drop PyTorch 1.8 support (#13155)
* Drop PyTorch 1.8 support

* Missed update

* Skip profiler test until supported

* Upgrade ipu dockerfile pytorch version

* Update XLA version
2022-06-14 20:46:44 -04:00
Jirka Borovec 78ff201c7e
Update CI setup (#13291)
* drop mamba
* use legacy GPU machines
2022-06-14 17:11:54 +00:00
Akarsha Rao bfa8b7be2d
Create hpu-ci-runner Dockerfile (#13239)
* Create  hpu-ci-runner Dockerfile

* Add ENTRYPOINT script 'start.sh' to hpu-ci-runner

* rename dirs

* ci

* add docker

* Fix build failure

* Fix build failure

* Fix title of nightly ci runner build

* Fix comments

* Fix comments

Co-authored-by: Jirka <jirka.borovec@seznam.cz>
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
2022-06-08 16:02:16 -04:00
Akihiro Nitta 3c5a8a833e
Decouple pulling legacy checkpoints from existing GHA workflows and docker files (#13185)
* Add pull-legacy-checkpoints action
* Replace pulls with the new action and script
* Simplify
2022-06-02 15:39:14 +02:00
Jirka Borovec de4ab1c027
update NGC docker (#13136)
* update docker
* Apply suggestions from code review

Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-06-02 12:54:13 +00:00
Jirka Borovec fab2ff35ad
CI: Azure - multiple configs (#12984)
* CI: Azure - multiple configs
* names
* benchmark
* Apply suggestions from code review

Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-05-14 01:59:03 +00:00
Jirka Borovec fec9a09672
add freeze for development and full range for install (#12994)
* freeze versions

* unfreeze

* dependabot

* Apply suggestions from code review

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* fix all req

* ...

* use base

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix refs

* Apply suggestions from code review

Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>

* Apply suggestions from code review

* dockers

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
2022-05-12 09:14:18 -04:00
Eric Wiener 3f78c4ca7a
Track CPU stats with DeviceStatsMonitor (#11795)
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Kaushik B <kaushikbokka@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-05-10 10:57:38 +00:00
Jirka Borovec 783ec43a85
parse strategies as own extras (#12975)
* parse strategies as own extras

* prune devel

* Update Makefile

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* revert parse_requirements

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-05-09 09:25:53 -04:00
Jirka Borovec 7ce948edb6
Unpin CUDA docker image for GPU CI (#12373)
* unpin CUDA docker image for GPU CI
* Apply suggestions from code review

Co-authored-by: Aki Nitta <nitta@akihironitta.com>
Co-authored-by: Akihiro Nitta <akihiro@pytorchlightning.ai>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2022-05-06 02:56:57 +00:00
Jirka Borovec bb51e2a55b
Merge pull request #12723 from PyTorchLightning/req/strategies
Separate strategies' requirements
2022-05-04 10:06:02 -04:00
Akihiro Nitta ecd135e939
Update nvidia gpg key to fix nightly docker builds (#12930)
* Update gpg key
* Use curl instead of wget
* Install key manually
2022-05-02 09:00:44 +02:00
Akihiro Nitta 98b206e836
Use cmake installed with apt (#12907)
2022-04-28 07:44:52 +00:00
Akihiro Nitta ace6a5827b
Update building docker images (#12837)
Co-authored-by: Akihiro Nitta <akihiro@pytorchlightning.ai>
2022-04-21 22:10:42 +00:00
Jirka Borovec 16b9580958
build more dockers & slack fails (#12675)
* build dockers
* add slack
* Apply suggestions from code review

Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
2022-04-13 17:24:08 +02:00
Jirka Borovec f9b69ce5b0
CI: check docker requires (#12677)
* check docker requires
* ci update
* bagua
* conda
* cuda
2022-04-12 00:29:54 +09:00
Kaushik B bd035af78a
Fix TPU CI (#12419)
2022-03-23 11:35:38 +05:30
Jirka Borovec fe940e195d
CI: update prune_pkgs (#12382)
2022-03-21 12:50:50 +00:00
four4fish 1eff3b53c1
Update fairscale version (#11567)
Co-authored-by: Aki Nitta <nitta@akihironitta.com>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2022-03-21 11:38:55 +00:00
Jirka Borovec efa870eebc
Docker: fix NCCL building Horovod (#12318)
* Horovod w. MPI
* nccl_built
* fix
2022-03-18 14:23:19 +00:00
Jirka Borovec 7ee690758c
CI: fix running PT 1.11 (#12304)
* fix fire
* horovod
* assistant
* cmake
* u20
* cuda
* -j2
* fix mypy

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2022-03-12 09:00:20 +00:00
Jirka Borovec bc8172856f
aggregate multiple helper scripts to single CLI (#11147)
* nightly release
* min version
* fire
2022-03-11 11:13:43 +00:00
Jirka Borovec 1144673cd9
CI: sanity check for req. pkgs (#11819)
* CI: sanity check for req. pkgs
* scripts
* rename
* gcsfs ?
* rich !
* install extra
* move
* set -e

Co-authored-by: Aki Nitta <nitta@akihironitta.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2022-03-11 09:20:47 +00:00
Jirka Borovec 3b4061f39a
CI: enable testing for PT 1.11 (#11792)
* enable PT 1.11
* horovod
* Apply suggestions from code review

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Aki Nitta <nitta@akihironitta.com>
2022-03-10 18:38:47 +00:00
Jirka Borovec 8577ef7bba
Skip horovod 0.24.0 only (#12248)
* try skip horovod 0.24.0 only
* HOROVOD_BUILD_CUDA_CC_LIST
* fix test

Co-authored-by: Aki Nitta <nitta@akihironitta.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-03-10 16:01:08 +00:00
wangraying a0655611de
Add bagua installation in dockerfile (#11283)
Co-authored-by: Aki Nitta <nitta@akihironitta.com>
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
2022-02-24 15:17:31 +01:00
Jirka Borovec 7bc87015ea
Unblock GPU CI (#11934)
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2022-02-16 21:15:44 +01:00
Aki Nitta 0a1b8b880d
Fix horovod installation `base-cuda` Dockerfile (#11811)
* pip install --user

* add checks

* rm unrelated comment

* consistent format

* Fail if horovod not found

Co-authored-by: Jirka <jirka.borovec@seznam.cz>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-02-10 16:48:33 +09:00
Aki Nitta 86b177ebe5
Fix `apex` installation path in Dockerfile (#11596)
* empty commit

* Specify apex installation target directory

* pip install --user
2022-01-27 20:14:16 -05:00
Kaushik B 650c710efa
Rename training plugin test files & names to strategy (#11303)
2022-01-04 14:32:45 +01:00
Carlos Mocholí 3692eba807
Drop Python 3.6 support (#11117)
2021-12-21 17:06:15 +00:00
Kaushik B 2a5d05b562
Fix tpu spawn plugin test (#11131)
2021-12-18 02:53:37 +00:00
Sean Naren c66cd12445
Remove partitioning of model in ZeRO 3 (#10655)
2021-12-17 12:36:53 +00:00
Jirka Borovec e8659bd40e
update NGC (#10770)
2021-11-29 14:14:37 +00:00
Carlos Mocholí d2aaf6b4cc
Upgrade CI after the 1.10 release (#10075)
2021-11-10 17:59:10 +01:00
Carlos Mocholí 939a861853
Update Python testing (#10269)
2021-11-04 18:26:24 +01:00
Carlos Mocholí 70570f9eaa
Minimize the number of docker jobs (#10202)
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2021-10-29 07:48:05 +01:00
Carlos Mocholí 3a4e9970d6
Pin fairscale version (#10200)
2021-10-27 23:24:17 +00:00
Carlos Mocholí a0e45dc071
Some minor CI cleanup (#10088)
2021-10-26 13:58:20 +02:00
Kaushik B af4a8f1950
Refactor tests for TPU Accelerator (#9718)
Co-authored-by: tchaton <thomas@grid.ai>
2021-10-14 19:45:15 +00:00
Danielle Pintz 940b910d27
[2/4] Add DeviceStatsMonitor callback (#9712)
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: thomas chaton <thomas@grid.ai>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Kaushik B <kaushikbokka@gmail.com>
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
2021-10-13 18:29:36 +00:00
edwardpwtsoi 7c6efbc8a8
Resolved wrong mv usage for extracted directory (#9678)
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-10-05 12:56:33 +00:00
Jirka Borovec 0e6ee9c39d
CI: add mdformat (#8673)
* add mdformat
* exclude chlog
* fix ***

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-08-03 18:19:09 +00:00
Jirka Borovec 66cc505339
update NGC (#8652)
* update NGC

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-08-02 16:05:36 +00:00
Jirka Borovec abbcfa1ab7
fix CI for PT 1.10 (#8526)
* fix CI for PT 1.10
* Apply suggestions from code review

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-07-23 19:24:31 +02:00
thomas chaton 8d0df6fad2
[Feat] Improve TPU CI (#6078)
* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* update

* update ci

* i

* i

* i

* i
2021-07-19 19:43:21 +05:30
Jirka Borovec 74a09a23f1
CI: support PT 1.10 (#8133)
* prepare PT 1.10

* dockers

* fixes

* readme
2021-07-14 18:04:33 +03:00
Carlos Mocholí 6ce77a102b
Set minimum PyTorch version to 1.6 (#8288)
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
2021-07-13 17:12:49 +00:00
Jirka Borovec ed6d4baea2
ngc (#8242)
2021-07-02 13:12:45 +01:00