########################################
Run on an on-prem cluster (intermediate)
########################################

.. _torch_distributed_run:

*************************
Run with TorchDistributed
*************************

`Torch Distributed Run <https://pytorch.org/docs/stable/elastic/run.html>`__ provides helper functions to set up distributed environment variables from the `PyTorch distributed communication package <https://pytorch.org/docs/stable/distributed.html#environment-variable-initialization>`__ that need to be defined on each node.
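
Concretely, each worker process launched this way sees variables along the following lines. This is only an illustrative sketch: the node and GPU counts (2 nodes with 8 processes each) are assumed, and the exact variables and values depend on your cluster and PyTorch version.

.. code-block:: bash

    # Illustration: environment of one worker on the second node of an
    # assumed 2-node job with 8 processes per node.
    MASTER_ADDR=10.10.10.16   # address of the main node
    MASTER_PORT=29500         # open port on the main node used for rendezvous
    WORLD_SIZE=16             # total number of processes (nnodes * nproc_per_node)
    RANK=9                    # global rank of this process across all nodes
    LOCAL_RANK=1              # rank of this process within its own node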

Once the script is set up as described in :ref:`Training Script Setup <training_script_setup>`, you can run the command below across your nodes to start multi-node training.

As with a custom cluster, you have to ensure there is network connectivity between the nodes, with firewall rules that allow traffic on the specified *MASTER_PORT*.
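
As a quick sanity check before launching, you can probe the chosen port on the main node from each of the other nodes. This is just one option and assumes netcat (``nc``) is installed on the probing node:

.. code-block:: bash

    # From a worker node, check that the main node is reachable on the chosen port.
    # "Connection refused" means the host is reachable (nothing is listening yet);
    # a timeout usually points to a firewall blocking the port.
    nc -zv <MASTER_ADDR> <MASTER_PORT>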

Finally, you'll need to decide which node you'd like to be the main node (*MASTER_ADDR*), and the rank of each node (*NODE_RANK*).

For example:

* *MASTER_ADDR* 10.10.10.16
* *MASTER_PORT* 29500
* *NODE_RANK* 0 for the first node, 1 for the second node

Run the command below, with the appropriate variables set, on each node.

.. code-block:: bash

    # --nnodes is the number of nodes you'd like to run with
    python -m torch.distributed.run \
        --nnodes=2 \
        --master_addr <MASTER_ADDR> \
        --master_port <MASTER_PORT> \
        --node_rank <NODE_RANK> \
        train.py (--arg1 ... train script args...)
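
With the example values above, the full launch commands for a two-node run would look roughly like this (a sketch; substitute your own training script arguments):

.. code-block:: bash

    # On the first node (10.10.10.16), which is also the main node:
    python -m torch.distributed.run \
        --nnodes=2 --node_rank 0 \
        --master_addr 10.10.10.16 --master_port 29500 \
        train.py

    # On the second node:
    python -m torch.distributed.run \
        --nnodes=2 --node_rank 1 \
        --master_addr 10.10.10.16 --master_port 29500 \
        train.py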

.. note::

    ``torch.distributed.run`` assumes that you'd like to spawn a process per GPU if GPU devices are found on the node. This can be adjusted with ``--nproc_per_node``.
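
    For instance, a sketch that forces two worker processes per node (an arbitrary count, chosen here only to illustrate the flag) would be:

    .. code-block:: bash

        # Spawn exactly 2 processes on this node instead of one per detected GPU.
        python -m torch.distributed.run \
            --nnodes=2 --nproc_per_node=2 \
            --node_rank <NODE_RANK> \
            --master_addr <MASTER_ADDR> --master_port <MASTER_PORT> \
            train.py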

----

********
Get help
********

Setting up a cluster for distributed training is not trivial. Lightning offers lightning-grid, which allows you to configure a cluster easily and run experiments via the CLI and web UI.

Try it out for free today:

.. raw:: html

    <div class="display-card-container">
        <div class="row">

.. Add callout items below this line

.. displayitem::
    :header: Train models on the cloud
    :description: Learn to run a model in the background on a cloud machine.
    :col_css: col-md-6
    :button_link: cloud_training.html
    :height: 150
    :tag: intermediate

.. raw:: html

        </div>
    </div>