**Audience**: Users who need to run on an academic or enterprise private cluster.
----
.. _non-slurm:
******************
Set up the cluster
******************
This guide shows how to run a training job on a general purpose cluster. We recommend that beginners try this method
first because it requires the least configuration and the fewest code changes.
To set up a multi-node computing cluster you need:
1) Multiple computers with PyTorch Lightning installed
2) Network connectivity between them, with firewall rules that allow traffic on a specified *MASTER_PORT*.
3) The environment variables required for PyTorch Lightning multi-node distributed training defined on each node.
PyTorch Lightning follows the design of the `PyTorch distributed communication package <https://pytorch.org/docs/stable/distributed.html#environment-variable-initialization>`_ and requires the following environment variables to be defined on each node (an example follows the list):
- *MASTER_PORT* - required; must be a free port on the machine with NODE_RANK 0
- *MASTER_ADDR* - required (except on the NODE_RANK 0 node); the address of the NODE_RANK 0 node
- *WORLD_SIZE* - required; the total number of nodes in the cluster
- *NODE_RANK* - required; the rank (id) of this node within the cluster
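For example, on a hypothetical two-node cluster where the NODE_RANK 0 machine is reachable at ``10.10.10.1`` and port ``29500`` is free (both values are placeholders; adjust them for your cluster), each node would export:

.. code-block:: bash

    # On the NODE_RANK 0 node (MASTER_ADDR is not required here):
    export MASTER_PORT=29500
    export WORLD_SIZE=2
    export NODE_RANK=0

    # On the NODE_RANK 1 node:
    export MASTER_ADDR=10.10.10.1
    export MASTER_PORT=29500
    export WORLD_SIZE=2
    export NODE_RANK=1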
To submit a training job to the cluster, you need to run the same training script on each node.
This means that you need to:
1. Copy all third-party libraries to each node (usually by distributing a ``requirements.txt`` file and installing it).
2. Copy all your import dependencies and the script itself to each node.
3. Run the script on each node, as sketched in the example below.
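A minimal sketch of these three steps for a hypothetical two-node cluster. The hostnames ``node0``/``node1``, the ``~/job`` directory, and ``train.py`` are placeholders; it assumes passwordless SSH access and that the environment variables above are already set on each node:

.. code-block:: bash

    # Steps 1 and 2: distribute the requirements file plus your code,
    # then install the dependencies on each node
    for host in node0 node1; do
        ssh "$host" "mkdir -p ~/job"
        scp requirements.txt train.py "$host:~/job/"
        ssh "$host" "pip install -r ~/job/requirements.txt"
    done

    # Step 3: run the same script on every node
    ssh node0 "cd ~/job && python train.py" &
    ssh node1 "cd ~/job && python train.py" &
    wait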
----
******************
Debug on a cluster
******************
When running in DDP mode, some errors in your code can show up as an NCCL issue.
Set the ``NCCL_DEBUG=INFO`` environment variable to see the ACTUAL error.
.. code-block:: bash

    NCCL_DEBUG=INFO python train.py ...
----
********
Get help
********
Setting up a cluster for distributed training is not trivial. Lightning offers lightning-grid, which allows you to configure a cluster easily and run experiments via the CLI and a web UI.
Try it out for free today:
.. raw:: html

   <div class="display-card-container">
      <div class="row">

.. Add callout items below this line

.. displayitem::
   :header: Train models on the cloud
   :description: Learn to run a model in the background on a cloud machine.