diff --git a/docs/source/slurm.rst b/docs/source/slurm.rst
index db49aaa38c..6c0bf190cf 100644
--- a/docs/source/slurm.rst
+++ b/docs/source/slurm.rst
@@ -46,10 +46,11 @@ To train a model using multiple-nodes do the following:
         # TRAIN
         main(hyperparams)
 
-4. Submit the appropriate SLURM job
+4. Create the appropriate SLURM job
 
 .. code-block:: bash
 
+    # (submit.sh)
     #!/bin/bash -l
 
     # SLURM SUBMIT SCRIPT
@@ -78,15 +79,26 @@ To train a model using multiple-nodes do the following:
     # run script from above
     srun python3 train.py
 
+5. If you want auto-resubmit (read below), add this line to the submit.sh script
+
+.. code-block:: bash
+
+    #SBATCH --signal=SIGUSR1@90
+
+6. Submit the SLURM job
+
+.. code-block:: bash
+
+    sbatch submit.sh
 
 Walltime auto-resubmit
 -----------------------------------
 When you use Lightning in a SLURM cluster, lightning automatically detects when it is about to run into the walltime, and it does the following:
 
-1. Saves a temporary checkpoint.
-2. Requeues the job.
-3. When the job starts, it loads the temporary checkpoint.
+ 1. Saves a temporary checkpoint.
+ 2. Requeues the job.
+ 3. When the job starts, it loads the temporary checkpoint.
 
 To get this behavior make sure to add the correct signal to your SLURM script
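
The auto-resubmit flow described in this change (save a temporary checkpoint, requeue the job, reload the checkpoint on restart) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Lightning's actual implementation: the helper names (`save_checkpoint`, `load_checkpoint`, `make_handler`, `install`), the JSON checkpoint format, and the checkpoint path are all hypothetical. It relies only on `scontrol requeue <job_id>` and the `SLURM_JOB_ID` environment variable, both of which SLURM provides, and on `#SBATCH --signal=SIGUSR1@90` delivering `SIGUSR1` roughly 90 seconds before the walltime limit.

```python
import json
import os
import signal
import subprocess
import tempfile


def save_checkpoint(state, path):
    """Write the state to disk atomically: temp file first, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or None when no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


def make_handler(state, path, run=subprocess.call):
    """Build a signal handler that checkpoints and then requeues the job.

    `run` is injectable so the requeue call can be replaced in tests
    (hypothetical design choice for this sketch).
    """
    def handler(signum, frame):
        # Step 1: save a temporary checkpoint.
        save_checkpoint(state, path)
        # Step 2: requeue the job, but only when running under SLURM.
        job_id = os.environ.get("SLURM_JOB_ID")
        if job_id is not None:
            run(["scontrol", "requeue", job_id])
    return handler


def install(state, path):
    """Register the handler for SIGUSR1, the signal named in submit.sh."""
    signal.signal(signal.SIGUSR1, make_handler(state, path))
```

In a training script, step 3 would amount to calling `load_checkpoint(path)` at startup and falling back to a fresh state when it returns `None`, then calling `install(state, path)` before the training loop begins.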