:orphan:

########################################
Run on an on-prem cluster (intermediate)
########################################

**Audience**: Users who need to run on an academic or enterprise private cluster.

----

.. _non-slurm:

*****************
Setup the cluster
*****************

This guide shows how to run a training job on a general-purpose cluster. We recommend that beginners try this method
first because it requires the least amount of configuration and the fewest changes to the code.

To set up a multi-node computing cluster you need:

1) Multiple computers with PyTorch Lightning installed
2) Network connectivity between them, with firewall rules that allow traffic flow on a specified *MASTER_PORT* (see the sketch after this list)
3) The environment variables required for PyTorch Lightning multi-node distributed training defined on each node
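
As a quick sanity check for items 1) and 2), you can run something like the following on each node. This is only a sketch: the port ``12345`` and the address ``10.10.10.1`` are placeholder examples, and ``nc`` (netcat) is just one of several tools that can probe whether the chosen *MASTER_PORT* is reachable.

.. code-block:: bash

    # Install PyTorch Lightning on every node (for example from a shared requirements.txt).
    python -m pip install pytorch-lightning

    # From a worker node, probe the MASTER_PORT (example: 12345) on the NODE_RANK 0
    # machine (example address: 10.10.10.1). "Connection refused" means the port is
    # reachable but nothing is listening yet; a timeout usually points to a firewall block.
    nc -zv 10.10.10.1 12345
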
PyTorch Lightning follows the design of the `PyTorch distributed communication package <https://pytorch.org/docs/stable/distributed.html#environment-variable-initialization>`_ and requires the following environment variables to be defined on each node:

- *MASTER_PORT* - required; must be a free port on the machine with NODE_RANK 0
- *MASTER_ADDR* - required (except on the NODE_RANK 0 node); the address of the NODE_RANK 0 node
- *WORLD_SIZE* - required; the number of nodes in the cluster
- *NODE_RANK* - required; the id of the node in the cluster
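
A minimal sketch of how these variables could be set is shown below. The values are placeholders for a hypothetical 4-node cluster whose NODE_RANK 0 machine sits at ``10.10.10.1``; only *NODE_RANK* changes from node to node.

.. code-block:: bash

    # Example values for one node of a hypothetical 4-node cluster.
    export MASTER_PORT=12345       # a free port on the NODE_RANK 0 machine
    export MASTER_ADDR=10.10.10.1  # address of the NODE_RANK 0 machine
    export WORLD_SIZE=4            # number of nodes in the cluster
    export NODE_RANK=1             # 0 on the first node, 1 on the second, and so on
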
.. _training_script_setup:

----

*************************
Setup the training script
*************************

To train a model using multiple nodes, do the following:

1. Design your :ref:`lightning_module` (no need to add anything specific here).

2. Enable DDP in the trainer

   .. code-block:: python

       # train on 32 GPUs across 4 nodes
       trainer = Trainer(accelerator="gpu", devices=8, num_nodes=4, strategy="ddp")

----

***************************
Submit a job to the cluster
***************************

To submit a training job to the cluster you need to run the same training script on each node of the cluster.
This means that you need to:

1. Copy all third-party libraries to each node (usually this means distributing a requirements.txt file and installing it).
2. Copy all your import dependencies and the script itself to each node.
3. Run the script on each node, as sketched below.
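
Putting the steps together, the launch on each node might look roughly like the sketch below. It reuses the placeholder values from the earlier sections (4 nodes, the NODE_RANK 0 machine at ``10.10.10.1``, port ``12345``) and assumes the training script is called ``train.py``; adapt the names and values to your cluster.

.. code-block:: bash

    # Run on EVERY node, changing only NODE_RANK (0, 1, 2, 3).
    pip install -r requirements.txt   # step 1: install the third-party libraries

    export MASTER_ADDR=10.10.10.1     # environment variables from "Setup the cluster"
    export MASTER_PORT=12345
    export WORLD_SIZE=4
    export NODE_RANK=0                # 0 on the first node, 1 on the second, ...

    python train.py                   # step 3: run the (already copied) script
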
----

******************
Debug on a cluster
******************

When running in DDP mode, some errors in your code can show up as an NCCL issue.
Set the ``NCCL_DEBUG=INFO`` environment variable to see the ACTUAL error.

.. code-block:: bash

    NCCL_DEBUG=INFO python train.py ...

----

********
Get help
********

Setting up a cluster for distributed training is not trivial. Lightning offers lightning-grid, which allows you to configure a cluster easily and run experiments via the CLI and web UI.

Try it out for free today:

.. raw:: html

    <div class="display-card-container">
        <div class="row">

.. Add callout items below this line

.. displayitem::
   :header: Train models on the cloud
   :description: Learn to run a model in the background on a cloud machine.
   :col_css: col-md-6
   :button_link: cloud_training.html
   :height: 150
   :tag: intermediate

.. raw:: html

        </div>
    </div>