From e461e90f845310b0866ebe7c1046c706d362e78b Mon Sep 17 00:00:00 2001
From: awaelchli
Date: Tue, 27 Feb 2024 04:29:26 +0100
Subject: [PATCH] Update the Multi-GPU docs (#19525)
---
 .gitignore                                    |   1 +
 .../accelerators/gpu_intermediate.rst         | 168 +++++-------------
 .../clouds/cluster_advanced.rst               |   1 +
 .../clouds/cluster_intermediate_1.rst         |  20 ++-
 .../clouds/cluster_intermediate_2.rst         |  65 +++++--
 docs/source-pytorch/glossary/index.rst        |   2 +-
 6 files changed, 105 insertions(+), 152 deletions(-)

diff --git a/.gitignore b/.gitignore
index 0420135b02..de1de44fec 100644
--- a/.gitignore
+++ b/.gitignore
@@ -23,6 +23,7 @@ docs/source-pytorch/_static/images/course_UvA-DL
 docs/source-pytorch/_static/images/lightning_examples
 docs/source-pytorch/_static/fetched-s3-assets
 docs/source-pytorch/integrations/hpu
+docs/source-pytorch/integrations/strategies/Hivemind.rst
 docs/source-fabric/*/generated

diff --git a/docs/source-pytorch/accelerators/gpu_intermediate.rst b/docs/source-pytorch/accelerators/gpu_intermediate.rst
index 90d4256324..023fd02c18 100644
--- a/docs/source-pytorch/accelerators/gpu_intermediate.rst
+++ b/docs/source-pytorch/accelerators/gpu_intermediate.rst
@@ -8,18 +8,19 @@ GPU training (Intermediate)

 ----

-Distributed Training strategies
+
+Distributed training strategies
 -------------------------------
 Lightning supports multiple ways of doing distributed training.

+- Regular (``strategy='ddp'``)
+- Spawn (``strategy='ddp_spawn'``)
+- Notebook/Fork (``strategy='ddp_notebook'``)
+
 .. video:: https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/yt/Trainer+flags+4-+multi+node+training_3.mp4
     :poster: https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/yt_thumbs/thumb_multi_gpus.png
     :width: 400

-- DistributedDataParallel (multiple-gpus across many machines)
-    - Regular (``strategy='ddp'``)
-    - Spawn (``strategy='ddp_spawn'``)
-    - Notebook/Fork (``strategy='ddp_notebook'``)

 .. note:: If you request multiple GPUs or nodes without setting a strategy, DDP will be automatically used.

@@ -28,22 +29,22 @@ For a deeper understanding of what Lightning is doing, feel free to read this
 `guide `_.

+----
+
+
 Distributed Data Parallel
 ^^^^^^^^^^^^^^^^^^^^^^^^^
 :class:`~torch.nn.parallel.DistributedDataParallel` (DDP) works as follows:

 1. Each GPU across each node gets its own process.
-
 2. Each GPU gets visibility into a subset of the overall dataset. It will only ever see that subset.
-
 3. Each process inits the model.
-
 4. Each process performs a full forward and backward pass in parallel.
-
 5. The gradients are synced and averaged across all processes.
-
 6. Each process updates its optimizer.
+|
+
 .. code-block:: python

     # train on 8 GPUs (same machine (ie: node))
     trainer = Trainer(accelerator="gpu", devices=8, strategy="ddp")
@@ -59,34 +60,31 @@ variables:

     # example for 3 GPUs DDP
     MASTER_ADDR=localhost MASTER_PORT=random() WORLD_SIZE=3 NODE_RANK=0 LOCAL_RANK=0 python my_file.py --accelerator 'gpu' --devices 3 --etc
-    MASTER_ADDR=localhost MASTER_PORT=random() WORLD_SIZE=3 NODE_RANK=1 LOCAL_RANK=0 python my_file.py --accelerator 'gpu' --devices 3 --etc
-    MASTER_ADDR=localhost MASTER_PORT=random() WORLD_SIZE=3 NODE_RANK=2 LOCAL_RANK=0 python my_file.py --accelerator 'gpu' --devices 3 --etc
+    MASTER_ADDR=localhost MASTER_PORT=random() WORLD_SIZE=3 NODE_RANK=0 LOCAL_RANK=1 python my_file.py --accelerator 'gpu' --devices 3 --etc
+    MASTER_ADDR=localhost MASTER_PORT=random() WORLD_SIZE=3 NODE_RANK=0 LOCAL_RANK=2 python my_file.py --accelerator 'gpu' --devices 3 --etc

-We use DDP this way because `ddp_spawn` has a few limitations (due to Python and PyTorch):
+Using DDP this way has a few advantages over ``torch.multiprocessing.spawn()``:

-1. Since `.spawn()` trains the model in subprocesses, the model on the main process does not get updated.
-2. Dataloader(num_workers=N), where N is large, bottlenecks training with DDP... ie: it will be VERY slow or won't work at all. This is a PyTorch limitation.
-3. Forces everything to be picklable.
+1. All processes (including the main process) participate in training and have the updated state of the model and Trainer state.
+2. No multiprocessing pickle errors.
+3. Easily scales to multi-node training.

-There are cases in which it is NOT possible to use DDP. Examples are:
+|

-- Jupyter Notebook, Google COLAB, Kaggle, etc.
-- You have a nested script without a root package
+It is NOT possible to use DDP in interactive environments like Jupyter Notebook, Google Colab, Kaggle, etc.
+In these situations you should use `ddp_notebook`.
+
+
+----

-In these situations you should use `ddp_notebook` or `dp` instead.

 Distributed Data Parallel Spawn
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-`ddp_spawn` is exactly like `ddp` except that it uses .spawn to start the training processes.

-.. warning:: It is STRONGLY recommended to use `DDP` for speed and performance.
+.. warning:: It is STRONGLY recommended to use DDP for speed and performance.

-.. code-block:: python
-
-    mp.spawn(self.ddp_train, nprocs=self.num_processes, args=(model,))
-
-If your script does not support being called from the command line (ie: it is nested without a root
-project module) you can use the following method:
+The `ddp_spawn` strategy is similar to `ddp` except that it uses ``torch.multiprocessing.spawn()`` to start the training processes.
+Use this for debugging only, or if you are converting a code base to Lightning that relies on spawn.

 .. code-block:: python

@@ -95,54 +93,12 @@ project module) you can use the following method:

 We STRONGLY discourage this use because it has limitations (due to Python and PyTorch):

-1. The model you pass in will not update. Please save a checkpoint and restore from there.
-2. Set Dataloader(num_workers=0) or it will bottleneck training.
+1. After ``.fit()``, only the model's weights get restored to the main process, but no other state of the Trainer.
+2. Does not support multi-node training.
+3. It is generally slower than DDP.

-`ddp` is MUCH faster than `ddp_spawn`. We recommend you
-
-1. Install a top-level module for your project using setup.py
-
-..
code-block:: python - - # setup.py - #!/usr/bin/env python - - from setuptools import setup, find_packages - - setup( - name="src", - version="0.0.1", - description="Describe Your Cool Project", - author="", - author_email="", - url="https://github.com/YourSeed", # REPLACE WITH YOUR OWN GITHUB PROJECT LINK - install_requires=["lightning"], - packages=find_packages(), - ) - -2. Setup your project like so: - -.. code-block:: bash - - /project - /src - some_file.py - /or_a_folder - setup.py - -3. Install as a root-level package - -.. code-block:: bash - - cd /project - pip install -e . - -You can then call your scripts anywhere - -.. code-block:: bash - - cd /project/src - python some_file.py --accelerator 'gpu' --devices 8 --strategy 'ddp' +---- Distributed Data Parallel in Notebooks @@ -165,8 +121,11 @@ The Trainer enables it by default when such environments are detected. Among the native distributed strategies, regular DDP (``strategy="ddp"``) is still recommended as the go-to strategy over Spawn and Fork/Notebook for its speed and stability but it can only be used with scripts. +---- + + Comparison of DDP variants and tradeoffs -**************************************** +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. list-table:: DDP variants and their tradeoffs :widths: 40 20 20 20 @@ -202,68 +161,23 @@ Comparison of DDP variants and tradeoffs - Fast -Distributed and 16-bit precision -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Below are the possible configurations we support. - -+-------+---------+-----+--------+-----------------------------------------------------------------------+ -| 1 GPU | 1+ GPUs | DDP | 16-bit | command | -+=======+=========+=====+========+=======================================================================+ -| Y | | | | `Trainer(accelerator="gpu", devices=1)` | -+-------+---------+-----+--------+-----------------------------------------------------------------------+ -| Y | | | Y | `Trainer(accelerator="gpu", devices=1, precision=16)` | -+-------+---------+-----+--------+-----------------------------------------------------------------------+ -| | Y | Y | | `Trainer(accelerator="gpu", devices=k, strategy='ddp')` | -+-------+---------+-----+--------+-----------------------------------------------------------------------+ -| | Y | Y | Y | `Trainer(accelerator="gpu", devices=k, strategy='ddp', precision=16)` | -+-------+---------+-----+--------+-----------------------------------------------------------------------+ - -DDP can also be used with 1 GPU, but there's no reason to do so other than debugging distributed-related issues. +---- -Implement Your Own Distributed (DDP) training -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -If you need your own way to init PyTorch DDP you can override :meth:`lightning.pytorch.strategies.ddp.DDPStrategy.setup_distributed`. - -If you also need to use your own DDP implementation, override :meth:`lightning.pytorch.strategies.ddp.DDPStrategy.configure_ddp`. - ----------- - -Torch Distributed Elastic -------------------------- -Lightning supports the use of Torch Distributed Elastic to enable fault-tolerant and elastic distributed job scheduling. To use it, specify the 'ddp' backend and the number of GPUs you want to use in the trainer. +TorchRun (TorchElastic) +----------------------- +Lightning supports the use of TorchRun (previously known as TorchElastic) to enable fault-tolerant and elastic distributed job scheduling. +To use it, specify the DDP strategy and the number of GPUs you want to use in the Trainer. .. 
code-block:: python Trainer(accelerator="gpu", devices=8, strategy="ddp") -To launch a fault-tolerant job, run the following on all nodes. +Then simply launch your script with the :doc:`torchrun <../clouds/cluster_intermediate_2>` command. -.. code-block:: bash - python -m torch.distributed.run - --nnodes=NUM_NODES - --nproc_per_node=TRAINERS_PER_NODE - --rdzv_id=JOB_ID - --rdzv_backend=c10d - --rdzv_endpoint=HOST_NODE_ADDR - YOUR_LIGHTNING_TRAINING_SCRIPT.py (--arg1 ... train script args...) +---- -To launch an elastic job, run the following on at least ``MIN_SIZE`` nodes and at most ``MAX_SIZE`` nodes. - -.. code-block:: bash - - python -m torch.distributed.run - --nnodes=MIN_SIZE:MAX_SIZE - --nproc_per_node=TRAINERS_PER_NODE - --rdzv_id=JOB_ID - --rdzv_backend=c10d - --rdzv_endpoint=HOST_NODE_ADDR - YOUR_LIGHTNING_TRAINING_SCRIPT.py (--arg1 ... train script args...) - -See the official `Torch Distributed Elastic documentation `_ for details -on installation and more use cases. Optimize multi-machine communication ------------------------------------ diff --git a/docs/source-pytorch/clouds/cluster_advanced.rst b/docs/source-pytorch/clouds/cluster_advanced.rst index 9fe1a4bd36..0d5aefefc8 100644 --- a/docs/source-pytorch/clouds/cluster_advanced.rst +++ b/docs/source-pytorch/clouds/cluster_advanced.rst @@ -15,6 +15,7 @@ schedules the resources and time for which the job is allowed to run. ---- + *************************** Design your training script *************************** diff --git a/docs/source-pytorch/clouds/cluster_intermediate_1.rst b/docs/source-pytorch/clouds/cluster_intermediate_1.rst index d668b2bf9e..391c9b1779 100644 --- a/docs/source-pytorch/clouds/cluster_intermediate_1.rst +++ b/docs/source-pytorch/clouds/cluster_intermediate_1.rst @@ -5,13 +5,15 @@ Run on an on-prem cluster (intermediate) ######################################## **Audience**: Users who need to run on an academic or enterprise private cluster. + ---- + .. _non-slurm: -***************** -Setup the cluster -***************** +****************** +Set up the cluster +****************** This guide shows how to run a training job on a general purpose cluster. We recommend beginners to try this method first because it requires the least amount of configuration and changes to the code. To setup a multi-node computing cluster you need: @@ -29,11 +31,13 @@ PyTorch Lightning follows the design of `PyTorch distributed communication packa .. _training_script_setup: + ---- -************************* -Setup the training script -************************* + +************************** +Set up the training script +************************** To train a model using multiple nodes, do the following: 1. Design your :ref:`lightning_module` (no need to add anything specific here). @@ -45,8 +49,10 @@ To train a model using multiple nodes, do the following: # train on 32 GPUs across 4 nodes trainer = Trainer(accelerator="gpu", devices=8, num_nodes=4, strategy="ddp") + ---- + *************************** Submit a job to the cluster *************************** @@ -57,8 +63,10 @@ This means that you need to: 2. Copy all your import dependencies and the script itself to each node. 3. Run the script on each node. 
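
To make the steps above concrete, here is a minimal sketch of a multi-node training script that could be copied to and launched on every node. It is illustrative only and not part of this patch: ``LitModel`` and the random dataset are hypothetical placeholders, and ``devices``/``num_nodes`` should match your actual cluster.

.. code-block:: python

    import torch
    from torch.utils.data import DataLoader, TensorDataset
    from lightning.pytorch import LightningModule, Trainer


    class LitModel(LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(32, 2)

        def training_step(self, batch, batch_idx):
            x, y = batch
            loss = torch.nn.functional.cross_entropy(self.layer(x), y)
            self.log("train_loss", loss)
            return loss

        def configure_optimizers(self):
            return torch.optim.Adam(self.parameters(), lr=1e-3)


    if __name__ == "__main__":
        # Random tensors stand in for a real dataset so the sketch stays self-contained
        dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 2, (1024,)))
        train_loader = DataLoader(dataset, batch_size=64)

        # 8 GPUs per node on 4 nodes -> 32 processes in total;
        # Lightning inserts the DistributedSampler for you
        trainer = Trainer(accelerator="gpu", devices=8, num_nodes=4, strategy="ddp")
        trainer.fit(LitModel(), train_loader)

The script and its Trainer arguments stay identical on every node; the node-specific information comes from the environment variables shown earlier (``MASTER_ADDR``, ``MASTER_PORT``, ``NODE_RANK``, ``WORLD_SIZE``), not from the script itself.
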
+
 ----
+
 ******************
 Debug on a cluster
 ******************
diff --git a/docs/source-pytorch/clouds/cluster_intermediate_2.rst b/docs/source-pytorch/clouds/cluster_intermediate_2.rst
index 8e0d8d1b4d..fbe2d8f781 100644
--- a/docs/source-pytorch/clouds/cluster_intermediate_2.rst
+++ b/docs/source-pytorch/clouds/cluster_intermediate_2.rst
@@ -4,34 +4,63 @@ Run on an on-prem cluster (intermediate)

 .. _torch_distributed_run:

-*************************
-Run with TorchDistributed
-*************************
-`Torch Distributed Run `__ provides helper functions to setup distributed environment variables from the `PyTorch distributed communication package `__ that need to be defined on each node.
-
-Once the script is setup like described in :ref:` Training Script Setup`, you can run the below command across your nodes to start multi-node training.
+********************************
+Run with TorchRun (TorchElastic)
+********************************
+`TorchRun `__ (previously known as TorchElastic) provides helper functions to set up distributed environment variables from the `PyTorch distributed communication package `__ that need to be defined on each node.
+Once the script is set up as described in :ref:`Training Script Setup `, you can run the below command across your nodes to start multi-node training.
 Like a custom cluster, you have to ensure that there is network connectivity between the nodes with firewall rules that allow traffic flow on a specified *MASTER_PORT*.
-
 Finally, you'll need to decide which node you'd like to be the main node (*MASTER_ADDR*), and the ranks of each node (*NODE_RANK*).

 For example:

-* *MASTER_ADDR* 10.10.10.16
-* *MASTER_PORT* 29500
-* *NODE_RANK* 0 for the first node, 1 for the second node
+* **MASTER_ADDR:** 10.10.10.16
+* **MASTER_PORT:** 29500
+* **NODE_RANK:** 0 for the first node, 1 for the second node, etc.

 Run the below command with the appropriate variables set on each node.

 .. code-block:: bash

-    python -m torch.distributed.run
-        --nnodes=2 # number of nodes you'd like to run with
-        --master_addr
-        --master_port
-        --node_rank
-        train.py (--arg1 ... train script args...)
+    torchrun \
+        --nproc_per_node= \
+        --nnodes= \
+        --node_rank \
+        --master_addr \
+        --master_port \
+        train.py --arg1 --arg2

-.. note::
-    ``torch.distributed.run`` assumes that you'd like to spawn a process per GPU if GPU devices are found on the node. This can be adjusted with ``-nproc_per_node``.
+- **--nproc_per_node:** Number of processes that will be launched per node (default 1). This number must match the number set in ``Trainer(devices=...)`` if specified in Trainer.
+- **--nnodes:** Number of nodes/machines (default 1). This number must match the number set in ``Trainer(num_nodes=...)`` if specified in Trainer.
+- **--node_rank:** The index of the node/machine.
+- **--master_addr:** The IP address of the main node with node rank 0.
+- **--master_port:** The port that will be used for communication between the nodes. Must be open in the firewall on each node to permit TCP traffic.
+
+For more advanced configuration options in TorchRun, such as elastic and fault-tolerant training, see the `official documentation `_.
+
+|
+
+**Example running on 2 nodes with 8 GPUs each:**
+
+Assume the main node has the IP address 10.10.10.16.
+On the first node, you would run this command:
+
.. code-block:: bash

    torchrun \
    --nproc_per_node=8 --nnodes=2 --node_rank 0 \
    --master_addr 10.10.10.16 --master_port 50000 \
    train.py

On the second node, you would run this command:

.. code-block:: bash

    torchrun \
    --nproc_per_node=8 --nnodes=2 --node_rank 1 \
    --master_addr 10.10.10.16 --master_port 50000 \
    train.py

Note that the only difference between the two commands is the node rank!
diff --git a/docs/source-pytorch/glossary/index.rst b/docs/source-pytorch/glossary/index.rst
index 5ca677c48e..c91ca9125c 100644
--- a/docs/source-pytorch/glossary/index.rst
+++ b/docs/source-pytorch/glossary/index.rst
@@ -44,7 +44,7 @@
    SLURM <../clouds/cluster_advanced>
    Transfer learning <../advanced/transfer_learning>
    Trainer <../common/trainer>
-   Torch distributed <../clouds/cluster_intermediate_2>
+   TorchRun (TorchElastic) <../clouds/cluster_intermediate_2>
    Warnings <../advanced/warnings>
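
As a companion to the ``torchrun`` commands above, the snippet below sketches what the ``train.py`` they launch might contain. This is an illustrative assumption, not part of the patch: ``LitModel`` and ``train_loader`` are hypothetical placeholders, and the ``devices``/``num_nodes`` arguments must agree with ``--nproc_per_node`` and ``--nnodes``.

.. code-block:: python

    # train.py -- illustrative sketch only; LitModel and train_loader are hypothetical placeholders
    from lightning.pytorch import Trainer

    from my_project import LitModel, train_loader


    if __name__ == "__main__":
        # Mirror the torchrun flags used above:
        #   --nproc_per_node=8  ->  devices=8
        #   --nnodes=2          ->  num_nodes=2
        # torchrun exports MASTER_ADDR, MASTER_PORT, RANK, LOCAL_RANK, WORLD_SIZE, etc.,
        # and Lightning picks them up automatically.
        trainer = Trainer(accelerator="gpu", devices=8, num_nodes=2, strategy="ddp")
        trainer.fit(LitModel(), train_loader)
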