Clean docs (#1604)

* spacing

* slurm docs
William Falcon 2020-04-25 13:21:53 -04:00 committed by GitHub
parent 1e2c9eaf89
commit e684fdf60b
GPG Key ID: 4AEE18F83AFDEB23
1 changed file with 16 additions and 4 deletions


@@ -46,10 +46,11 @@ To train a model using multiple-nodes do the following:
# TRAIN
main(hyperparams)
-4. Submit the appropriate SLURM job
+4. Create the appropriate SLURM job
.. code-block:: bash
# (submit.sh)
#!/bin/bash -l
# SLURM SUBMIT SCRIPT
@@ -78,15 +79,26 @@ To train a model using multiple-nodes do the following:
# run script from above
srun python3 train.py
+5. If you want auto-resubmit (read below), add this line to the submit.sh script
+.. code-block:: bash
+#SBATCH --signal=SIGUSR1@90
+6. Submit the SLURM job
+.. code-block:: bash
+sbatch submit.sh
Walltime auto-resubmit
-----------------------------------
When you use Lightning on a SLURM cluster, Lightning automatically detects when the job is about
to hit its walltime and does the following:
-1. Saves a temporary checkpoint.
-2. Requeues the job.
-3. When the job starts, it loads the temporary checkpoint.
+1. Saves a temporary checkpoint.
+2. Requeues the job.
+3. When the job starts, it loads the temporary checkpoint.
To get this behavior make sure to add the correct signal to your SLURM script
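For readers wondering what the ``#SBATCH --signal=SIGUSR1@90`` line enables: it asks SLURM to send ``SIGUSR1`` to the job roughly 90 seconds before the walltime is reached. The sketch below illustrates the general checkpoint-then-requeue pattern in plain Python. It is not Lightning's actual implementation; the names ``make_walltime_handler``, ``requeue_current_job`` and ``install`` are hypothetical, and the requeue step assumes the ``scontrol`` CLI is available and that ``SLURM_JOB_ID`` is set by the scheduler.

```python
import os
import signal
import subprocess


def make_walltime_handler(save_checkpoint, requeue):
    """Build a signal handler that checkpoints first, then requeues the job.

    Both arguments are caller-supplied callables (hypothetical hooks).
    """

    def handler(signum, frame):
        save_checkpoint()  # step 1: save a temporary checkpoint
        requeue()          # step 2: ask the scheduler to requeue the job

    return handler


def requeue_current_job():
    # Assumption: running inside a SLURM job, so SLURM_JOB_ID is defined.
    job_id = os.environ["SLURM_JOB_ID"]
    # `scontrol requeue <job_id>` puts the job back in the queue.
    subprocess.call(["scontrol", "requeue", job_id])


def install(save_checkpoint, requeue=requeue_current_job):
    """Register the handler for SIGUSR1 (call this from the main thread)."""
    handler = make_walltime_handler(save_checkpoint, requeue)
    signal.signal(signal.SIGUSR1, handler)
    return handler
```

A training script would call ``install(...)`` once at startup with its own checkpoint-saving function. Step 3 (loading the temporary checkpoint on restart) would then be an ordinary "resume if a checkpoint file exists" check at the top of the script.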