parent 1e2c9eaf89
commit e684fdf60b
@@ -46,10 +46,11 @@ To train a model using multiple-nodes do the following:

    # TRAIN
    main(hyperparams)

-4. Submit the appropriate SLURM job
+4. Create the appropriate SLURM job

.. code-block:: bash

    # (submit.sh)
    #!/bin/bash -l

    # SLURM SUBMIT SCRIPT

@@ -78,15 +79,26 @@ To train a model using multiple-nodes do the following:

    # run script from above
    srun python3 train.py

5. If you want auto-resubmit (read below), add this line to the submit.sh script

.. code-block:: bash

    #SBATCH --signal=SIGUSR1@90

6. Submit the SLURM job

.. code-block:: bash

    sbatch submit.sh

Walltime auto-resubmit
-----------------------------------
When you use Lightning on a SLURM cluster, Lightning automatically detects when the job is about
to hit its walltime limit and does the following:

1. Saves a temporary checkpoint.
2. Requeues the job.
3. When the requeued job starts, it loads the temporary checkpoint.

To get this behavior, make sure to add the correct signal to your SLURM script
(the ``#SBATCH --signal=SIGUSR1@90`` line from step 5).
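
The three steps above can be pictured with a small, self-contained sketch in plain Python. This is not Lightning's implementation, only an illustration of the general pattern under stated assumptions: SLURM delivers ``SIGUSR1`` shortly before the time limit (90 seconds with the ``--signal`` line above), the process saves its state, and then asks SLURM to requeue the job with ``scontrol requeue``. The function names (``save_checkpoint``, ``load_checkpoint``, ``slurm_requeue``, ``install_walltime_handler``) and the JSON checkpoint format are hypothetical and chosen only for this example.

.. code-block:: python

    import json
    import os
    import signal
    import subprocess


    def save_checkpoint(state, path):
        """Write the state to a side file, then rename it into place."""
        tmp_path = path + ".part"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        # os.replace is an atomic rename, so a half-written file is never loaded
        os.replace(tmp_path, path)


    def load_checkpoint(path):
        """Return the saved state, or None when no checkpoint exists yet."""
        if not os.path.exists(path):
            return None
        with open(path) as f:
            return json.load(f)


    def slurm_requeue():
        """Ask SLURM to put the current job back in the queue."""
        job_id = os.environ["SLURM_JOB_ID"]  # set by SLURM inside a job
        subprocess.run(["scontrol", "requeue", job_id], check=True)


    def install_walltime_handler(state, path, requeue=slurm_requeue):
        """Register a SIGUSR1 handler that checkpoints, then requeues."""

        def handler(signum, frame):
            save_checkpoint(state, path)  # step 1: save a temporary checkpoint
            requeue()                     # step 2: requeue the job

        signal.signal(signal.SIGUSR1, handler)
        return handler

In a training script, step 3 would then amount to calling ``load_checkpoint`` at startup and falling back to a fresh state when it returns ``None``, before calling ``install_walltime_handler``. The ``requeue`` argument is injectable so the handler can be exercised outside a SLURM job.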