Creating a GPU self-hosted agent pool
Prepare the machine
This is a slightly modified version of the script from https://docs.microsoft.com/en-us/azure/devops/pipelines/agents/docker
apt-get update
apt-get install -y --no-install-recommends \
ca-certificates \
curl \
jq \
git \
iputils-ping \
libcurl4 \
libunwind8 \
netcat \
libssl1.0
curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
mkdir /azp
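After the install step it can be useful to confirm that the tools the agent relies on are actually on the PATH. The following is a minimal sketch, not part of the original script; the function name `missing_tools` is an illustrative assumption.

```shell
#!/usr/bin/env bash
# Sketch: report which of the given commands are missing from PATH.
# Prints one missing command per line and returns 1 if any is missing.
# The function name is hypothetical, chosen for this example.
missing_tools() {
    local status=0 tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            status=1
        fi
    done
    return "$status"
}
```

For example, `missing_tools curl jq git az` prints nothing when all four commands are available.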
Starting the agents
export TARGETARCH=linux-x64
export AZP_URL="https://dev.azure.com/Lightning-AI"
export AZP_TOKEN="xxxxxxxxxxxxxxxxxxxxxxxxxx"
export AZP_POOL="lit-rtx-3090"
for i in {0..7..2}
do
    nohup bash .azure/start.sh \
        "AZP_AGENT_NAME=litGPU-YX_$i,$((i+1))" \
        "CUDA_VISIBLE_DEVICES=$i,$((i+1))" \
        > "agent-$i.log" &
done
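The loop above starts one agent per pair of GPUs: (0,1), (2,3), (4,5), (6,7). To preview the agent names and GPU assignments without launching anything, a dry run along these lines may help. This is a sketch that assumes an even number of GPUs; the function name `preview_agents` is hypothetical.

```shell
#!/usr/bin/env bash
# Dry run: print the agent name, GPU pair, and log file that each
# iteration of the start loop would use. Nothing is started here.
# The function name and the default of 8 GPUs are illustrative assumptions.
preview_agents() {
    local num_gpus="${1:-8}" i=0
    while [ "$i" -lt "$num_gpus" ]; do
        echo "AZP_AGENT_NAME=litGPU-YX_$i,$((i+1)) CUDA_VISIBLE_DEVICES=$i,$((i+1)) log=agent-$i.log"
        i=$((i+2))
    done
}
```

Calling `preview_agents 8` lists four agents, one for each GPU pair.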
Check running agents
ps aux | grep start.sh
Machine maintenance
Since most of our jobs/checks run in Docker containers, leftover images, containers, and caches accumulate on the machine over time. Once the disk fills up, jobs fail with errors such as:
No space left on device : '/azp/agent-litGPU-21_0,1/_diag/pages/8bb191f4-a8c2-419a-8788-66e3f0522bea_1.log'
In such cases, log in to the machine and run `docker system prune`.
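Before pruning, it can help to see how full the disk is and how much of it Docker accounts for. The sketch below is an illustration, not part of the original instructions; the function name `show_disk_usage` and the `/azp` default are assumptions.

```shell
#!/usr/bin/env bash
# Sketch: show how full a filesystem is and, when Docker is available,
# how much space images, containers, volumes, and build cache occupy.
# The function name and the /azp default are illustrative assumptions.
show_disk_usage() {
    local path="${1:-/azp}"
    LC_ALL=C df -h "$path"
    if command -v docker >/dev/null 2>&1; then
        docker system df || echo "docker daemon not reachable" >&2
    fi
}
```

Running `show_disk_usage /azp` prints the `df` summary for the agent directory, followed by Docker's own space report where the daemon is reachable.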
Automated cleanup
To avoid manual cleanup, add a cron job that periodically removes all Docker caches:
- Open your user's cron tab for editing:
crontab -e
- Add the command with the `--force` flag, which prunes without asking for interactive confirmation:
# every day at 2:00 AM clean docker caches
0 2 * * * docker system prune --force
- Verify the entry:
crontab -l
Note: You may need to add yourself to the Docker group by running `sudo usermod -aG docker <your_username>` so that you can execute this command without `sudo` and without entering a password.
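As a variation on the unconditional daily prune, the cron job could prune only when the disk is nearly full. The following is a minimal sketch under stated assumptions: the function names, the `/azp` default path, and the 80% threshold are all illustrative and do not come from the original setup.

```shell
#!/usr/bin/env bash
# Sketch: prune Docker only when the disk holding a path is nearly full.
# Function names, the /azp default, and the 80% threshold are assumptions.

# Print the used-space percentage (integer, no % sign) of the
# filesystem containing the path given as $1.
disk_usage_percent() {
    LC_ALL=C df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Succeed (exit 0) when usage $1 is at or above threshold $2.
over_threshold() {
    [ "$1" -ge "$2" ]
}

# Prune when the filesystem containing $1 (default /azp) is at or
# above $2 percent full (default 80); otherwise report and skip.
prune_if_full() {
    local path="${1:-/azp}" threshold="${2:-80}" used
    used="$(disk_usage_percent "$path")"
    if over_threshold "$used" "$threshold"; then
        docker system prune --force
    else
        echo "disk usage ${used}% is below ${threshold}%, skipping prune"
    fi
}
```

Saved as a script that ends with a call to `prune_if_full`, this could replace the plain `docker system prune --force` command in the crontab entry. The trade-off is that caches stay warm while there is still free space, at the cost of one more script to maintain.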