####################################
Run on an on-prem cluster (advanced)
####################################

.. _slurm:

----

******************************
Run on a SLURM-managed cluster
******************************

Lightning automates the details behind training on a SLURM-powered cluster. In contrast to the general-purpose
cluster above, the user does not start the jobs manually on each node. Instead, the job is submitted to SLURM, which
schedules the resources and the time for which it is allowed to run.

----

***************************
Design your training script
***************************

To train a model using multiple nodes, do the following:

1. Design your :ref:`lightning_module` (no need to add anything specific here).

2. Enable DDP in the trainer

   .. code-block:: python

      # train on 32 GPUs across 4 nodes
      trainer = Trainer(accelerator="gpu", devices=8, num_nodes=4, strategy="ddp")

3. It's a good idea to structure your training script like this (a sketch of one way to fill in the ``args = ...`` placeholder follows this list):

   .. testcode::

      # train.py
      def main(args):
          model = YourLightningModule(args)

          trainer = Trainer(accelerator="gpu", devices=8, num_nodes=4, strategy="ddp")
          trainer.fit(model)


      if __name__ == "__main__":
          args = ...  # you can use your CLI parser of choice, or the `LightningCLI`

          # TRAIN
          main(args)

4. Create the appropriate SLURM job:

   .. code-block:: bash

      # (submit.sh)
      #!/bin/bash -l

      # SLURM SUBMIT SCRIPT
      #SBATCH --nodes=4             # This needs to match Trainer(num_nodes=...)
      #SBATCH --gres=gpu:8
      #SBATCH --ntasks-per-node=8   # This needs to match Trainer(devices=...)
      #SBATCH --mem=0
      #SBATCH --time=0-02:00:00

      # activate conda env
      source activate $1

      # debugging flags (optional)
      export NCCL_DEBUG=INFO
      export PYTHONFAULTHANDLER=1

      # on your cluster you might need these:
      # set the network interface
      # export NCCL_SOCKET_IFNAME=^docker0,lo

      # might need the latest CUDA
      # module load NCCL/2.4.7-1-cuda.10.0

      # run script from above
      srun python3 train.py

5. If you want to auto-resubmit (read below), add this line to the submit.sh script:

   .. code-block:: bash

      #SBATCH --signal=SIGUSR1@90

6. Submit the SLURM job:

   .. code-block:: bash

      sbatch submit.sh

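The ``args = ...`` placeholder in step 3 is left open on purpose. As a rough sketch only (``LightningCLI`` or any other parser works just as well), an ``argparse``-based variant could look like the following; the argument names here are purely illustrative:

.. code-block:: python

    # train.py -- hypothetical argparse-based variant of the template in step 3
    from argparse import ArgumentParser

    from lightning.pytorch import Trainer


    def main(args):
        model = YourLightningModule(args)  # your LightningModule from step 1

        trainer = Trainer(accelerator="gpu", devices=8, num_nodes=4, strategy="ddp")
        trainer.fit(model)


    if __name__ == "__main__":
        parser = ArgumentParser()
        # example hyperparameters -- replace with whatever your model actually expects
        parser.add_argument("--learning_rate", type=float, default=1e-3)
        parser.add_argument("--batch_size", type=int, default=32)
        args = parser.parse_args()

        # TRAIN
        main(args)

Also note that the template script activates the conda environment passed as its first argument (``source activate $1``), so you would submit it as ``sbatch submit.sh <conda-env-name>``.
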

----

***********************************
Enable auto wall-time resubmissions
***********************************

When you use Lightning in a SLURM cluster, it automatically detects when it is about
to run into the wall time and does the following:

1. Saves a temporary checkpoint.
2. Requeues the job.
3. When the job starts, it loads the temporary checkpoint.

To get this behavior, make sure to add the correct signal to your SLURM script:

.. code-block:: bash

    # 90 seconds before training ends
    #SBATCH --signal=SIGUSR1@90

You can change this signal if your environment requires the use of a different one, for example:

.. code-block:: bash

    #SBATCH --signal=SIGHUP@90

Then, when you make your trainer, pass the ``requeue_signal`` option to the :class:`~lightning.pytorch.plugins.environments.slurm_environment.SLURMEnvironment` plugin:

.. code-block:: python

    import signal

    from lightning.pytorch.plugins.environments import SLURMEnvironment

    trainer = Trainer(plugins=[SLURMEnvironment(requeue_signal=signal.SIGHUP)])

If auto-resubmit is not desired, it can be turned off in the :class:`~lightning.pytorch.plugins.environments.slurm_environment.SLURMEnvironment` plugin:

.. code-block:: python

    from lightning.pytorch.plugins.environments import SLURMEnvironment

    trainer = Trainer(plugins=[SLURMEnvironment(auto_requeue=False)])

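If you want to see from inside your script whether the current run is a fresh start or a requeued continuation, SLURM exposes this through the ``SLURM_RESTART_COUNT`` environment variable. Whether and how it is set can depend on your SLURM configuration, so treat this as a rough sketch:

.. code-block:: python

    import os

    # SLURM typically increments this counter each time the job is requeued;
    # it is unset (or 0) on the first run of the job
    restart_count = int(os.environ.get("SLURM_RESTART_COUNT", 0))

    if restart_count > 0:
        print(f"Job was requeued {restart_count} time(s); Lightning will resume from the temporary checkpoint.")
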

----

****************
Interactive Mode
****************

You can also let SLURM schedule a machine for you and then log in to the machine to run scripts manually.
This is useful for development and debugging.
If you set the job name to *bash* or *interactive*, and then log in and run scripts, Lightning's SLURM auto-detection will get bypassed and it can launch processes normally:

.. code-block:: bash

    # make sure to set `--job-name "interactive"`
    srun --account <your-account> --job-name "interactive" --pty bash ...

    # now run scripts normally
    python train.py ...

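To double-check that the auto-detection is really bypassed in such a session, you can inspect the job name from inside Python. This is only a quick sanity check and assumes your cluster exposes the standard ``SLURM_JOB_NAME`` environment variable:

.. code-block:: python

    import os

    # Lightning bypasses SLURM detection when the job name is "bash" or "interactive"
    job_name = os.environ.get("SLURM_JOB_NAME")
    print("SLURM job name:", job_name)
    print("SLURM auto-detection bypassed:", job_name in ("bash", "interactive"))
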

----

***************
Troubleshooting
***************

**The Trainer is stuck initializing at startup, what is causing this?**

You are seeing a message like this in the logs but nothing happens:

.. code-block::

    Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4

The most likely reasons and how to fix it:

- You forgot to run the ``python train.py`` command with ``srun``:
  Please have a look at the SLURM template script above, which includes the ``srun`` at the bottom of the script.

- The number of nodes or the number of devices per node is configured incorrectly:
  There are two parameters in the SLURM submission script that determine how many processes will run your training, the ``#SBATCH --nodes=X`` setting and the ``#SBATCH --ntasks-per-node=Y`` setting.
  The numbers there need to match what is configured in your Trainer in the code: ``Trainer(num_nodes=X, devices=Y)``.
  If you change the numbers, update them in BOTH places. A quick way to compare the values SLURM actually assigned against your Trainer settings is sketched below.
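
To verify the second point, it can help to print what SLURM actually assigned and compare it against the ``Trainer`` arguments. A small sanity-check sketch (the exact set of ``SLURM_*`` variables available can vary with your SLURM version and submission flags):

.. code-block:: python

    import os

    # these should match Trainer(num_nodes=...) and Trainer(devices=...) respectively
    print("SLURM_NNODES          =", os.environ.get("SLURM_NNODES"))
    print("SLURM_NTASKS_PER_NODE =", os.environ.get("SLURM_NTASKS_PER_NODE"))

    # total number of processes; should equal num_nodes * devices
    print("SLURM_NTASKS          =", os.environ.get("SLURM_NTASKS"))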