* Migrate TPU tests to GitHub actions
* No working dir
* Keep _target
* Don't skip draft
* CHECK_SLEEP
* Not yet
* Remove recurrent cleanup script
* Set secrets
* a step cannot have both the `uses` and `run` keys
* Version $PYTHON_VER was not found in the local cache
* can't load package ... ($GOPATH not set)
* The `set-env` command is disabled
* Try updating go
* Match timeout
* simplify path
* More cleanup
* Install coverage. Unmark draft
* Update .github/workflows/ci-pytorch-test-tpu.yml
* DEBUG echo
* Revert "DEBUG echo"
This reverts commit 4011856e6e.
* More debug
* SSH
* I'm stupid
* Remove always()
* Forgot some
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Luca Antiga <luca.antiga@gmail.com>
* Apply suggestions from code review
* enable CI to run for PT 1.13
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
* Fix TPU test CI
* +x first
* Lite first to uncover errors faster
* Fixes
* One more
* Simplify XLALauncher wrapping to avoid pickle error
* debug
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Debug commit successful. Trying local definitions
* Require tpu for mock test
* ValueError: The number of devices must be either 1 or 8, got 4 instead
* Fix mock test
* Simplify call, rely on defaults
* Skip OSError for now. Maybe upgrading will help
* Simplify launch tests, move some to lite
* Stricter typing
* RuntimeError: Accessing the XLA device before processes have spawned is not allowed.
* Revert "RuntimeError: Accessing the XLA device before processes have spawned is not allowed."
This reverts commit f65107ebf3.
* Alternative boring solution to the reverted commit
* Fix failing test on CUDA machine
* Workarounds
* Try latest mkl
* Revert "Try latest mkl"
This reverts commit d06813aa67.
* Wrong exception
* xfail
* Mypy
* Comment change
* Spawn launch refactor
* Accept that we cannot lazy init now
* Fix mypy and launch test failures
* The base dockerfile already includes mkl-2022.1.0 - what if we use it?
* try a different mkl version
* Revert mkl version changes
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
* bump cuda in docker images to 11.6.1
* PUSH TO HUB. REVERT THIS!
* conda forge for 11.6
* cuda 11.5
* revert conda changes
* 11.6 back again
* 11.6 back again, all of them
* maybe all passes now
* maybe all passes now
* final push
* Revert "PUSH TO HUB. REVERT THIS!"
This reverts commit 602bfce224.
* Apply suggestions from code review
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
* append cuda version to tags
* revertme: push to hub
* Update docker readme
* Build base-conda-py3.9-torch1.12-cuda11.3.1
* Use new images in conda tests
* revertme: push to hub
* Revert "revertme: push to hub"
This reverts commit 0f7d534b2a.
* Revert "revertme: push to hub"
This reverts commit 46a05fccbb.
* Run conda if workflow edited
* Run gpu testing if workflow edited
* Use new tags in release/Dockerfile
* Build base-cuda and PL release images with all combinations
* Update release docker
* Update conda from py3.9-torch1.12 to py3.10-torch1.12
* Fix ubuntu version
* Revert conda
* revertme: push to hub
* Don't build Python 3.10 for now...
* Fix pl release builder
* Does updating the version contribute to the error? https://github.com/docker/buildx/issues/456
* Update actions' versions
* Update slack user to notify
* Don't use 11.6.0 to avoid bagua incompatibility
* Don't use 11.1, and use 11.1.1
* Update .github/workflows/ci-pytorch_test-conda.yml
Co-authored-by: Luca Medeiros <67411094+luca-medeiros@users.noreply.github.com>
* Update trigger
* Ignore artifacts from tutorials
* Trim docker images to distribute
* Add an image for tutorials
* Update conda image 3.8x1.10
* Try different conda variants
* No need to set cuda for conda jobs
* Update who to notify ipu failure
* Don't push
* update filename
Co-authored-by: Luca Medeiros <67411094+luca-medeiros@users.noreply.github.com>
* Update the hpu-tests.yml to pull docker from vault
* fire & sudo
* habana-gaudi-hpus
* Check the driver status on gaudi server (#13718)
Co-authored-by: arao <arao@habana.ai>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Akarsha Rao <94624926+raoakarsha@users.noreply.github.com>
* list pytest
* docs
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* list
* test
* fix GK
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* add testing PT 1.12
* Fix quantization tests
* Fix another set of tests
* Fix check since https://github.com/pytorch/pytorch/pull/80139 is only going to be available for 1.13
* Skip this test for now for 1.12
Co-authored-by: SeanNaren <sean@grid.ai>
* allow freeze
* ci
* typo
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* ipu
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* GH org rename Lightning-AI
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* repo name
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>