lightning/dockers/base-xla/tpu_workflow_lite.jsonnet

local base = import 'templates/base.libsonnet';
local tpus = import 'templates/tpus.libsonnet';
local utils = import 'templates/utils.libsonnet';

local tputests = base.BaseTest {
  frameworkPrefix: 'pl',
  modelName: 'tpu-tests',
  mode: 'postsubmit',
  configMaps: [],

  timeout: 6000, # 100 minutes, in seconds.

  image: 'pytorchlightning/pytorch_lightning',
  imageTag: 'base-xla-py{PYTHON_VERSION}-torch{PYTORCH_VERSION}',

  tpuSettings+: {
    softwareVersion: 'pytorch-{PYTORCH_VERSION}',
  },
  accelerator: tpus.v3_8,

  command: utils.scriptCommand(
    |||
      set +x  # turn off tracing, spammy
      set -e  # exit on error

      source ~/.bashrc
      conda activate lightning

      echo "--- Cloning lightning repo ---"
      git clone --single-branch --depth 1 https://github.com/Lightning-AI/lightning.git
      cd lightning
      # If a PR triggered this run, fetch it and check out the tested commit
      if [ -n "{PR_NUMBER}" ]; then  # PR number is not empty
        echo "--- Fetch the PR changes ---"
        git fetch origin --depth 1 pull/{PR_NUMBER}/head:test/{PR_NUMBER}
        echo "--- Checkout PR changes ---"
        git -c advice.detachedHead=false checkout {SHA}
      fi

      echo "--- Install packages ---"
      PACKAGE_NAME=lite pip install .[dev]
      pip list

      echo "$KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS"
      # ${VAR:7} drops the first 7 characters, presumably a leading "grpc://" scheme
      export XRT_TPU_CONFIG="tpu_worker;0;${KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS:7}"

      echo "--- Sanity check TPU availability ---"
      python -c "from lightning_lite.accelerators import TPUAccelerator; assert TPUAccelerator.is_available()"
      echo "Sanity check passed!"

      echo "--- Running Lite tests ---"
      cd tests/tests_lite
      PL_RUN_TPU_TESTS=1 coverage run --source=lightning_lite -m pytest -vv --durations=0 ./

      echo "--- Running standalone Lite tests ---"
      PL_STANDALONE_TESTS_SOURCE=lightning_lite PL_STANDALONE_TESTS_BATCH_SIZE=1 bash run_standalone_tests.sh

      echo "--- Generating coverage ---"
      coverage xml
      # Print the report to stdout with tabs stripped so it can be read back from the job logs
      cat coverage.xml | tr -d '\t'
    |||
  ),
};

tputests.oneshotJob