docs: Fix typos and wording in cluster_advanced.rst (#18465)

Alex Morehead 2023-09-03 08:06:33 -05:00 committed by GitHub
parent 129e18df6f
commit 095d9cf279
1 changed file with 7 additions and 7 deletions


@@ -7,7 +7,7 @@ Run on an on-prem cluster (advanced)
 ----
 
 ******************************
-Run on a SLURM managed cluster
+Run on a SLURM-managed cluster
 ******************************
 Lightning automates the details behind training on a SLURM-powered cluster. In contrast to the general purpose
 cluster above, the user does not start the jobs manually on each node and instead submits it to SLURM which
@@ -79,7 +79,7 @@ To train a model using multiple nodes, do the following:
       # run script from above
       srun python3 train.py
 
-5. If you want auto-resubmit (read below), add this line to the submit.sh script
+5. If you want to auto-resubmit (read below), add this line to the submit.sh script
 
    .. code-block:: bash
 
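(For orientation: the hunk above edits step 5 of the submit.sh walkthrough. Below is a minimal sketch of such a script for a generic SLURM cluster; the node/GPU counts and the SIGUSR1 signal line are illustrative assumptions, not text taken from this commit.)

    #!/bin/bash
    #SBATCH --nodes=2                 # hypothetical: 2 nodes
    #SBATCH --ntasks-per-node=4       # hypothetical: 4 training processes per node
    #SBATCH --gres=gpu:4              # GPU request; exact syntax is site-specific
    #SBATCH --time=0-02:00:00         # wall time for the job
    #SBATCH --signal=SIGUSR1@90       # common auto-resubmit hook: ask SLURM to send SIGUSR1 90 s before the wall time

    # launch one training process per task on every allocated node
    srun python3 train.py

Submitting it with ``sbatch submit.sh`` is what actually starts the job on the cluster.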
@@ -93,9 +93,9 @@ To train a model using multiple nodes, do the following:
 
 ----
 
-**********************************
-Enable auto wall-time resubmitions
-**********************************
+***********************************
+Enable auto wall-time resubmissions
+***********************************
 
 When you use Lightning in a SLURM cluster, it automatically detects when it is about
 to run into the wall time and does the following:
@@ -169,9 +169,9 @@ You are seeing a message like this in the logs but nothing happens:
 The most likely reasons and how to fix it:
 
 - You forgot to run the ``python train.py`` command with ``srun``:
-  Please have a look at the SLURM template script above which includes the ``srun`` at the botton of the script.
+  Please have a look at the SLURM template script above which includes the ``srun`` at the bottom of the script.
 
 - The number of nodes or number of devices per node is configured incorrectly:
-  There are two parametres in the SLURM submission script that determine how many processes will run your training, the ``#SBATCH --nodes=X`` setting and ``#SBATCH --ntasks-per-node=Y`` settings.
+  There are two parameters in the SLURM submission script that determine how many processes will run your training, the ``#SBATCH --nodes=X`` setting and ``#SBATCH --ntasks-per-node=Y`` settings.
   The numbers there need to match what is configured in your Trainer in the code: ``Trainer(num_nodes=X, devices=Y)``.
   If you change the numbers, update them in BOTH places.
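(The last hunk is about keeping the SLURM resource request and the Trainer arguments in sync. Below is a hedged illustration with made-up numbers, 2 nodes and 4 tasks per node, plus an optional runtime check that prints SLURM's standard environment variables; if the printed values disagree with ``Trainer(num_nodes=..., devices=...)``, the stall described in this troubleshooting entry is the usual symptom.)

    #!/bin/bash
    #SBATCH --nodes=2               # must agree with Trainer(num_nodes=2, ...) in train.py
    #SBATCH --ntasks-per-node=4     # must agree with Trainer(devices=4, ...) in train.py

    # optional sanity check: log what SLURM actually allocated before training starts
    # (SLURM_NTASKS_PER_NODE is only exported when --ntasks-per-node was requested)
    echo "nodes=${SLURM_NNODES} tasks-per-node=${SLURM_NTASKS_PER_NODE}"

    srun python3 train.py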