README.md

Creating a GPU self-hosted agent pool

Prepare the machine

This is a slightly modified version of the script from https://docs.microsoft.com/en-us/azure/devops/pipelines/agents/docker

apt-get update
apt-get install -y --no-install-recommends \
    ca-certificates \
    curl \
    jq \
    git \
    iputils-ping \
    libcurl4 \
    libunwind8 \
    netcat \
    libssl1.0

curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
mkdir /azp

Starting the agents

export TARGETARCH=linux-x64
export AZP_URL="https://dev.azure.com/Lightning-AI"
export AZP_TOKEN="xxxxxxxxxxxxxxxxxxxxxxxxxx"
export AZP_POOL="lit-rtx-3090"

for i in {0..7..2}
do
     nohup bash .azure/start.sh \
        "AZP_AGENT_NAME=litGPU-YX_$i,$((i+1))" \
        "CUDA_VISIBLE_DEVICES=$i,$((i+1))" \
     > "agent-$i.log" &
done
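
To preview which agents the loop above would launch, a dry run can print each agent name next to its GPU pair before anything is started. This is only a sketch: the helper name list_agent_pairs is hypothetical, and it assumes an even number of GPUs and the litGPU-YX naming used above.

```shell
# Hypothetical helper: print the name and GPU pair of each agent the start
# loop would launch, without starting anything.
list_agent_pairs() {
    num_gpus="${1:-8}"   # assumption: 8 GPUs, matching {0..7..2} above
    i=0
    while [ "$i" -lt "$num_gpus" ]; do
        echo "AZP_AGENT_NAME=litGPU-YX_$i,$((i + 1)) CUDA_VISIBLE_DEVICES=$i,$((i + 1))"
        i=$((i + 2))
    done
}

list_agent_pairs 8
```

With 8 GPUs this prints four lines, one per agent, each owning two consecutive GPU indices.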

Check running agents

ps aux | grep start.sh
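
The plain grep above can also list the grep process itself, because its command line contains start.sh. A small variation, sketched here with a hypothetical helper name, counts only the real agent processes:

```shell
# Hypothetical helper: count processes whose command line mentions start.sh.
# The bracket in '[s]tart\.sh' keeps grep from matching its own command line.
count_running_agents() {
    # grep -c prints 0 and exits non-zero when nothing matches, hence || true.
    ps aux | grep -c '[s]tart\.sh' || true
}

echo "running agents: $(count_running_agents)"
```

On a machine started with the loop above, the expected count is one per launched agent, assuming no other process has start.sh in its command line.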

Machine maintenance

Since most of our jobs/checks run in Docker containers, leftover images, containers, and caches accumulate on the machine until the disk fills up and jobs fail with errors such as:

No space left on device : '/azp/agent-litGPU-21_0,1/_diag/pages/8bb191f4-a8c2-419a-8788-66e3f0522bea_1.log'

In such cases, you need to log in to the machine and run docker system prune.
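
Before pruning, it can help to confirm that Docker is what fills the disk. The following is a minimal sketch, assuming the agents live under /azp as created earlier; the helper name report_disk_usage is hypothetical.

```shell
# Hypothetical helper: show free space on the filesystem holding a directory
# and, when docker is available, a summary of the space Docker uses.
report_disk_usage() {
    target="${1:-/azp}"   # assumption: agent directory from "Prepare the machine"
    df -h "$target"
    # `docker system df` summarizes space used by images, containers,
    # volumes, and build cache; skip it if docker is not on PATH.
    if command -v docker >/dev/null 2>&1; then
        docker system df || echo "docker daemon not reachable" >&2
    fi
}
```

Running report_disk_usage before and after docker system prune shows how much space was reclaimed.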

Automated ways

Let's add a cron job that periodically removes unused Docker data (stopped containers, unused networks, dangling images, and build cache):

Open your user's crontab for editing: crontab -e
Schedule/add the command with the --force flag to force pruning without interactive confirmation:
```
# every day at 2:00 AM clean docker caches
0 2 * * * docker system prune --force
```
Verify the entry: crontab -l
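
When several machines need the same entry, the manual steps above can be scripted. The sketch below is one possible approach, not part of the original setup: the helper name with_prune_entry is hypothetical, and it only transforms crontab text, so nothing is installed until its output is piped to crontab.

```shell
# Entry from the steps above: every day at 2:00 AM clean docker caches.
PRUNE_ENTRY='0 2 * * * docker system prune --force'

# Hypothetical helper: read an existing crontab on stdin and write it back
# out with the prune entry appended, unless that exact line already exists.
with_prune_entry() {
    current="$(cat)"
    if printf '%s\n' "$current" | grep -qxF "$PRUNE_ENTRY"; then
        printf '%s\n' "$current"
    else
        # Avoid a leading blank line when the crontab was empty.
        if [ -n "$current" ]; then
            printf '%s\n' "$current"
        fi
        printf '%s\n' "$PRUNE_ENTRY"
    fi
}

# Usage (installs the result as the current user's crontab):
#   crontab -l 2>/dev/null | with_prune_entry | crontab -
```

Because the helper checks for the exact line first, running it repeatedly does not duplicate the entry.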

Note: To run this command without sudo and a password prompt, you may need to add yourself to the docker group by running sudo usermod -aG docker <your_username>, then log out and back in so the new group membership takes effect.
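
To check whether the group change is needed at all, a quick test along these lines can be used. It is a sketch with a hypothetical helper name, and it assumes the group is called docker, as in the usermod command above.

```shell
# Hypothetical helper: succeed if the given user (default: current user)
# is listed in the docker group according to the group database.
in_docker_group() {
    user="${1:-$(id -un)}"
    id -nG "$user" | tr ' ' '\n' | grep -qx docker
}

if in_docker_group; then
    echo "docker group: ok"
else
    echo "docker group: missing"
fi
```

Note that this reads the group database, so it reports the membership a fresh login would get; an already open session keeps its old groups.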