:orphan:

########################################
Run on an on-prem cluster (intermediate)
########################################

**Audience**: Users who need to run on an academic or enterprise private cluster.

----

.. _non-slurm:

*****************
Setup the cluster
*****************

This guide shows how to run a training job on a general-purpose cluster. We recommend that beginners try this method
first because it requires the least amount of configuration and the fewest changes to the code.

To set up a multi-node computing cluster you need:

1) Multiple computers with PyTorch Lightning installed
2) Network connectivity between them, with firewall rules that allow traffic flow on a specified *MASTER_PORT* (see the sketch after this list)
3) The environment variables required for PyTorch Lightning multi-node distributed training defined on each node
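
As a quick sanity check for items 1) and 2), you can run something like the following on each node. This is only a sketch: the port ``12345`` and the address ``10.10.10.1`` are placeholder examples, and ``nc`` (netcat) is just one of several tools that can probe whether the chosen *MASTER_PORT* is reachable.

.. code-block:: bash

    # Install PyTorch Lightning on every node (for example from a shared requirements.txt).
    python -m pip install pytorch-lightning

    # From a worker node, probe the MASTER_PORT (example: 12345) on the NODE_RANK 0
    # machine (example address: 10.10.10.1). "Connection refused" means the port is
    # reachable but nothing is listening yet; a timeout usually points to a firewall block.
    nc -zv 10.10.10.1 12345
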
PyTorch Lightning follows the design of the `PyTorch distributed communication package <https://pytorch.org/docs/stable/distributed.html#environment-variable-initialization>`_ and requires the following environment variables to be defined on each node:

- *MASTER_PORT* - required; must be a free port on the machine with NODE_RANK 0
- *MASTER_ADDR* - required (except on the NODE_RANK 0 node); the address of the NODE_RANK 0 node
- *WORLD_SIZE* - required; the number of nodes in the cluster
- *NODE_RANK* - required; the id of the node in the cluster
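
A minimal sketch of how these variables could be set is shown below. The values are placeholders for a hypothetical 4-node cluster whose NODE_RANK 0 machine sits at ``10.10.10.1``; only *NODE_RANK* changes from node to node.

.. code-block:: bash

    # Example values for one node of a hypothetical 4-node cluster.
    export MASTER_PORT=12345       # a free port on the NODE_RANK 0 machine
    export MASTER_ADDR=10.10.10.1  # address of the NODE_RANK 0 machine
    export WORLD_SIZE=4            # number of nodes in the cluster
    export NODE_RANK=1             # 0 on the first node, 1 on the second, and so on
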
.. _training_script_setup:

----

*************************
Setup the training script
*************************

To train a model using multiple nodes, do the following:

1. Design your :ref:`lightning_module` (no need to add anything specific here).

2. Enable DDP in the trainer

   .. code-block:: python

       # train on 32 GPUs across 4 nodes
       trainer = Trainer(accelerator="gpu", devices=8, num_nodes=4, strategy="ddp")

----

***************************
Submit a job to the cluster
***************************

To submit a training job to the cluster you need to run the same training script on each node of the cluster.
This means that you need to:

1. Copy all third-party libraries to each node (usually this means distributing a requirements.txt file and installing it).
2. Copy all your import dependencies and the script itself to each node.
3. Run the script on each node, as sketched below.
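
Putting the steps together, the launch on each node might look roughly like the sketch below. It reuses the placeholder values from the earlier sections (4 nodes, the NODE_RANK 0 machine at ``10.10.10.1``, port ``12345``) and assumes the training script is called ``train.py``; adapt the names and values to your cluster.

.. code-block:: bash

    # Run on EVERY node, changing only NODE_RANK (0, 1, 2, 3).
    pip install -r requirements.txt   # step 1: install the third-party libraries

    export MASTER_ADDR=10.10.10.1     # environment variables from "Setup the cluster"
    export MASTER_PORT=12345
    export WORLD_SIZE=4
    export NODE_RANK=0                # 0 on the first node, 1 on the second, ...

    python train.py                   # step 3: run the (already copied) script
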
----

******************
Debug on a cluster
******************

When running in DDP mode, some errors in your code can show up as an NCCL issue.
Set the ``NCCL_DEBUG=INFO`` environment variable to see the ACTUAL error.

.. code-block:: bash

    NCCL_DEBUG=INFO python train.py ...

----

********
Get help
********

Setting up a cluster for distributed training is not trivial. Lightning offers lightning-grid, which allows you to configure a cluster easily and run experiments via the CLI and web UI.

Try it out for free today:

.. raw:: html

    <div class="display-card-container">
        <div class="row">

.. Add callout items below this line

.. displayitem::
   :header: Train models on the cloud
   :description: Learn to run a model in the background on a cloud machine.
   :col_css: col-md-6
   :button_link: cloud_training.html
   :height: 150
   :tag: intermediate

.. raw:: html

        </div>
    </div>