docs: Fix typos and wording in cluster_advanced.rst (#18465)

parent 129e18df6f
commit 095d9cf279
@@ -7,7 +7,7 @@ Run on an on-prem cluster (advanced)
 ----

 ******************************
-Run on a SLURM managed cluster
+Run on a SLURM-managed cluster
 ******************************
 Lightning automates the details behind training on a SLURM-powered cluster. In contrast to the general purpose
 cluster above, the user does not start the jobs manually on each node and instead submits it to SLURM which
@@ -79,7 +79,7 @@ To train a model using multiple nodes, do the following:
     # run script from above
     srun python3 train.py

-5. If you want auto-resubmit (read below), add this line to the submit.sh script
+5. If you want to auto-resubmit (read below), add this line to the submit.sh script

 .. code-block:: bash

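The diff only shows fragments of the submit.sh script being edited. For context, a minimal SLURM submission script of the shape the hunk refers to might look like the sketch below; the resource values and wall time are illustrative assumptions, not taken from the diff:

```shell
#!/bin/bash
# Minimal SLURM submission script sketch (values are illustrative)
#SBATCH --nodes=2              # should match Trainer(num_nodes=2)
#SBATCH --ntasks-per-node=4    # should match Trainer(devices=4)
#SBATCH --time=02:00:00        # job wall-time limit (example value)

# run script from above
srun python3 train.py
```

The key point from the docs being edited is that the training command must be launched through ``srun`` so SLURM starts one process per task.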
@ -93,9 +93,9 @@ To train a model using multiple nodes, do the following:
|
||||||
|
|
||||||
----
|
----
|
||||||
|
|
||||||
**********************************
|
***********************************
|
||||||
Enable auto wall-time resubmitions
|
Enable auto wall-time resubmissions
|
||||||
**********************************
|
***********************************
|
||||||
When you use Lightning in a SLURM cluster, it automatically detects when it is about
|
When you use Lightning in a SLURM cluster, it automatically detects when it is about
|
||||||
to run into the wall time and does the following:
|
to run into the wall time and does the following:
|
||||||
|
|
||||||
@@ -169,9 +169,9 @@ You are seeing a message like this in the logs but nothing happens:
 The most likely reasons and how to fix it:

 - You forgot to run the ``python train.py`` command with ``srun``:
-  Please have a look at the SLURM template script above which includes the ``srun`` at the botton of the script.
+  Please have a look at the SLURM template script above which includes the ``srun`` at the bottom of the script.

 - The number of nodes or number of devices per node is configured incorrectly:
-  There are two parametres in the SLURM submission script that determine how many processes will run your training, the ``#SBATCH --nodes=X`` setting and ``#SBATCH --ntasks-per-node=Y`` settings.
+  There are two parameters in the SLURM submission script that determine how many processes will run your training, the ``#SBATCH --nodes=X`` setting and ``#SBATCH --ntasks-per-node=Y`` settings.
   The numbers there need to match what is configured in your Trainer in the code: ``Trainer(num_nodes=X, devices=Y)``.
   If you change the numbers, update them in BOTH places.
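The parameter-matching rule in the last hunk (SBATCH topology must equal the Trainer arguments) can be sketched as a quick consistency check. The helper name and the script contents below are illustrative assumptions, not part of the commit:

```python
# Hypothetical sanity check: the SBATCH settings in submit.sh must match
# the Trainer(num_nodes=X, devices=Y) arguments used in train.py.
import re

submit_sh = """\
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
"""

def sbatch_topology(script: str) -> tuple[int, int]:
    """Extract (nodes, tasks_per_node) from an sbatch script."""
    nodes = int(re.search(r"--nodes=(\d+)", script).group(1))
    tasks = int(re.search(r"--ntasks-per-node=(\d+)", script).group(1))
    return nodes, tasks

num_nodes, devices = sbatch_topology(submit_sh)
# These values must equal Trainer(num_nodes=2, devices=4) in the training code.
assert (num_nodes, devices) == (2, 4)
```

Running a check like this before submission catches the mismatch the docs describe, where the job appears to hang because SLURM launched a different number of processes than the Trainer expects.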