From e461e90f845310b0866ebe7c1046c706d362e78b Mon Sep 17 00:00:00 2001
From: awaelchli
Date: Tue, 27 Feb 2024 04:29:26 +0100
Subject: [PATCH] Update the Multi-GPU docs (#19525)
---
 .gitignore                                    |   1 +
 .../accelerators/gpu_intermediate.rst         | 168 +++++-------------
 .../clouds/cluster_advanced.rst               |   1 +
 .../clouds/cluster_intermediate_1.rst         |  20 ++-
 .../clouds/cluster_intermediate_2.rst         |  65 +++++--
 docs/source-pytorch/glossary/index.rst        |   2 +-
 6 files changed, 105 insertions(+), 152 deletions(-)

diff --git a/.gitignore b/.gitignore
index 0420135b02..de1de44fec 100644
--- a/.gitignore
+++ b/.gitignore
@@ -23,6 +23,7 @@ docs/source-pytorch/_static/images/course_UvA-DL
 docs/source-pytorch/_static/images/lightning_examples
 docs/source-pytorch/_static/fetched-s3-assets
 docs/source-pytorch/integrations/hpu
+docs/source-pytorch/integrations/strategies/Hivemind.rst
 docs/source-fabric/*/generated

diff --git a/docs/source-pytorch/accelerators/gpu_intermediate.rst b/docs/source-pytorch/accelerators/gpu_intermediate.rst
index 90d4256324..023fd02c18 100644
--- a/docs/source-pytorch/accelerators/gpu_intermediate.rst
+++ b/docs/source-pytorch/accelerators/gpu_intermediate.rst
@@ -8,18 +8,19 @@ GPU training (Intermediate)

 ----

-Distributed Training strategies
+
+Distributed training strategies
 -------------------------------
 Lightning supports multiple ways of doing distributed training.

+- Regular (``strategy='ddp'``)
+- Spawn (``strategy='ddp_spawn'``)
+- Notebook/Fork (``strategy='ddp_notebook'``)
+
 .. video:: https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/yt/Trainer+flags+4-+multi+node+training_3.mp4
     :poster: https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/yt_thumbs/thumb_multi_gpus.png
     :width: 400

-- DistributedDataParallel (multiple-gpus across many machines)
-    - Regular (``strategy='ddp'``)
-    - Spawn (``strategy='ddp_spawn'``)
-    - Notebook/Fork (``strategy='ddp_notebook'``)

 .. note:: If you request multiple GPUs or nodes without setting a strategy, DDP will be automatically used.

@@ -28,22 +29,22 @@ For a deeper understanding of what Lightning is doing, feel free to read this
 `guide `_.

+----
+
+
 Distributed Data Parallel
 ^^^^^^^^^^^^^^^^^^^^^^^^^
 :class:`~torch.nn.parallel.DistributedDataParallel` (DDP) works as follows:

 1. Each GPU across each node gets its own process.
-
 2. Each GPU gets visibility into a subset of the overall dataset. It will only ever see that subset.
-
 3. Each process inits the model.
-
 4. Each process performs a full forward and backward pass in parallel.
-
 5. The gradients are synced and averaged across all processes.
-
 6. Each process updates its optimizer.
+|
+
 .. code-block:: python

     # train on 8 GPUs (same machine (ie: node))
     trainer = Trainer(accelerator="gpu", devices=8, strategy="ddp")
@@ -59,34 +60,31 @@ variables:

     # example for 3 GPUs DDP
     MASTER_ADDR=localhost MASTER_PORT=random() WORLD_SIZE=3 NODE_RANK=0 LOCAL_RANK=0 python my_file.py --accelerator 'gpu' --devices 3 --etc
-    MASTER_ADDR=localhost MASTER_PORT=random() WORLD_SIZE=3 NODE_RANK=1 LOCAL_RANK=0 python my_file.py --accelerator 'gpu' --devices 3 --etc
-    MASTER_ADDR=localhost MASTER_PORT=random() WORLD_SIZE=3 NODE_RANK=2 LOCAL_RANK=0 python my_file.py --accelerator 'gpu' --devices 3 --etc
+    MASTER_ADDR=localhost MASTER_PORT=random() WORLD_SIZE=3 NODE_RANK=0 LOCAL_RANK=1 python my_file.py --accelerator 'gpu' --devices 3 --etc
+    MASTER_ADDR=localhost MASTER_PORT=random() WORLD_SIZE=3 NODE_RANK=0 LOCAL_RANK=2 python my_file.py --accelerator 'gpu' --devices 3 --etc

-We use DDP this way because `ddp_spawn` has a few limitations (due to Python and PyTorch):
+Using DDP this way has a few advantages over ``torch.multiprocessing.spawn()``:

-1. Since `.spawn()` trains the model in subprocesses, the model on the main process does not get updated.
-2. Dataloader(num_workers=N), where N is large, bottlenecks training with DDP... ie: it will be VERY slow or won't work at all. This is a PyTorch limitation.
-3. Forces everything to be picklable.
+1. All processes (including the main process) participate in training and have the updated state of the model and Trainer state.
+2. No multiprocessing pickle errors.
+3. Easily scales to multi-node training.

-There are cases in which it is NOT possible to use DDP. Examples are:
+|

-- Jupyter Notebook, Google COLAB, Kaggle, etc.
-- You have a nested script without a root package
+It is NOT possible to use DDP in interactive environments like Jupyter Notebook, Google Colab, Kaggle, etc.
+In these situations you should use `ddp_notebook`.
+
+
+----

-In these situations you should use `ddp_notebook` or `dp` instead.

 Distributed Data Parallel Spawn
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-`ddp_spawn` is exactly like `ddp` except that it uses .spawn to start the training processes.

-.. warning:: It is STRONGLY recommended to use `DDP` for speed and performance.
+.. warning:: It is STRONGLY recommended to use DDP for speed and performance.

-.. code-block:: python
-
-    mp.spawn(self.ddp_train, nprocs=self.num_processes, args=(model,))
-
-If your script does not support being called from the command line (ie: it is nested without a root
-project module) you can use the following method:
+The `ddp_spawn` strategy is similar to `ddp` except that it uses ``torch.multiprocessing.spawn()`` to start the training processes.
+Use this for debugging only, or if you are converting a code base to Lightning that relies on spawn.

 .. code-block:: python

@@ -95,54 +93,12 @@ project module) you can use the following method:

 We STRONGLY discourage this use because it has limitations (due to Python and PyTorch):

-1. The model you pass in will not update. Please save a checkpoint and restore from there.
-2. Set Dataloader(num_workers=0) or it will bottleneck training.
+1. After ``.fit()``, only the model's weights get restored to the main process, but no other state of the Trainer.
+2. Does not support multi-node training.
+3. It is generally slower than DDP.

-`ddp` is MUCH faster than `ddp_spawn`. We recommend you
-
-1. Install a top-level module for your project using setup.py
-
-..
code-block:: python - - # setup.py - #!/usr/bin/env python - - from setuptools import setup, find_packages - - setup( - name="src", - version="0.0.1", - description="Describe Your Cool Project", - author="", - author_email="", - url="https://github.com/YourSeed", # REPLACE WITH YOUR OWN GITHUB PROJECT LINK - install_requires=["lightning"], - packages=find_packages(), - ) - -2. Setup your project like so: - -.. code-block:: bash - - /project - /src - some_file.py - /or_a_folder - setup.py - -3. Install as a root-level package - -.. code-block:: bash - - cd /project - pip install -e . - -You can then call your scripts anywhere - -.. code-block:: bash - - cd /project/src - python some_file.py --accelerator 'gpu' --devices 8 --strategy 'ddp' +---- Distributed Data Parallel in Notebooks @@ -165,8 +121,11 @@ The Trainer enables it by default when such environments are detected. Among the native distributed strategies, regular DDP (``strategy="ddp"``) is still recommended as the go-to strategy over Spawn and Fork/Notebook for its speed and stability but it can only be used with scripts. +---- + + Comparison of DDP variants and tradeoffs -**************************************** +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. list-table:: DDP variants and their tradeoffs :widths: 40 20 20 20 @@ -202,68 +161,23 @@ Comparison of DDP variants and tradeoffs - Fast -Distributed and 16-bit precision -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Below are the possible configurations we support. - -+-------+---------+-----+--------+-----------------------------------------------------------------------+ -| 1 GPU | 1+ GPUs | DDP | 16-bit | command | -+=======+=========+=====+========+=======================================================================+ -| Y | | | | `Trainer(accelerator="gpu", devices=1)` | -+-------+---------+-----+--------+-----------------------------------------------------------------------+ -| Y | | | Y | `Trainer(accelerator="gpu", devices=1, precision=16)` | -+-------+---------+-----+--------+-----------------------------------------------------------------------+ -| | Y | Y | | `Trainer(accelerator="gpu", devices=k, strategy='ddp')` | -+-------+---------+-----+--------+-----------------------------------------------------------------------+ -| | Y | Y | Y | `Trainer(accelerator="gpu", devices=k, strategy='ddp', precision=16)` | -+-------+---------+-----+--------+-----------------------------------------------------------------------+ - -DDP can also be used with 1 GPU, but there's no reason to do so other than debugging distributed-related issues. +---- -Implement Your Own Distributed (DDP) training -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -If you need your own way to init PyTorch DDP you can override :meth:`lightning.pytorch.strategies.ddp.DDPStrategy.setup_distributed`. - -If you also need to use your own DDP implementation, override :meth:`lightning.pytorch.strategies.ddp.DDPStrategy.configure_ddp`. - ----------- - -Torch Distributed Elastic -------------------------- -Lightning supports the use of Torch Distributed Elastic to enable fault-tolerant and elastic distributed job scheduling. To use it, specify the 'ddp' backend and the number of GPUs you want to use in the trainer. +TorchRun (TorchElastic) +----------------------- +Lightning supports the use of TorchRun (previously known as TorchElastic) to enable fault-tolerant and elastic distributed job scheduling. +To use it, specify the DDP strategy and the number of GPUs you want to use in the Trainer. .. 
code-block:: python Trainer(accelerator="gpu", devices=8, strategy="ddp") -To launch a fault-tolerant job, run the following on all nodes. +Then simply launch your script with the :doc:`torchrun <../clouds/cluster_intermediate_2>` command. -.. code-block:: bash - python -m torch.distributed.run - --nnodes=NUM_NODES - --nproc_per_node=TRAINERS_PER_NODE - --rdzv_id=JOB_ID - --rdzv_backend=c10d - --rdzv_endpoint=HOST_NODE_ADDR - YOUR_LIGHTNING_TRAINING_SCRIPT.py (--arg1 ... train script args...) +---- -To launch an elastic job, run the following on at least ``MIN_SIZE`` nodes and at most ``MAX_SIZE`` nodes. - -.. code-block:: bash - - python -m torch.distributed.run - --nnodes=MIN_SIZE:MAX_SIZE - --nproc_per_node=TRAINERS_PER_NODE - --rdzv_id=JOB_ID - --rdzv_backend=c10d - --rdzv_endpoint=HOST_NODE_ADDR - YOUR_LIGHTNING_TRAINING_SCRIPT.py (--arg1 ... train script args...) - -See the official `Torch Distributed Elastic documentation `_ for details -on installation and more use cases. Optimize multi-machine communication ------------------------------------ diff --git a/docs/source-pytorch/clouds/cluster_advanced.rst b/docs/source-pytorch/clouds/cluster_advanced.rst index 9fe1a4bd36..0d5aefefc8 100644 --- a/docs/source-pytorch/clouds/cluster_advanced.rst +++ b/docs/source-pytorch/clouds/cluster_advanced.rst @@ -15,6 +15,7 @@ schedules the resources and time for which the job is allowed to run. ---- + *************************** Design your training script *************************** diff --git a/docs/source-pytorch/clouds/cluster_intermediate_1.rst b/docs/source-pytorch/clouds/cluster_intermediate_1.rst index d668b2bf9e..391c9b1779 100644 --- a/docs/source-pytorch/clouds/cluster_intermediate_1.rst +++ b/docs/source-pytorch/clouds/cluster_intermediate_1.rst @@ -5,13 +5,15 @@ Run on an on-prem cluster (intermediate) ######################################## **Audience**: Users who need to run on an academic or enterprise private cluster. + ---- + .. _non-slurm: -***************** -Setup the cluster -***************** +****************** +Set up the cluster +****************** This guide shows how to run a training job on a general purpose cluster. We recommend beginners to try this method first because it requires the least amount of configuration and changes to the code. To setup a multi-node computing cluster you need: @@ -29,11 +31,13 @@ PyTorch Lightning follows the design of `PyTorch distributed communication packa .. _training_script_setup: + ---- -************************* -Setup the training script -************************* + +************************** +Set up the training script +************************** To train a model using multiple nodes, do the following: 1. Design your :ref:`lightning_module` (no need to add anything specific here). @@ -45,8 +49,10 @@ To train a model using multiple nodes, do the following: # train on 32 GPUs across 4 nodes trainer = Trainer(accelerator="gpu", devices=8, num_nodes=4, strategy="ddp") + ---- + *************************** Submit a job to the cluster *************************** @@ -57,8 +63,10 @@ This means that you need to: 2. Copy all your import dependencies and the script itself to each node. 3. Run the script on each node. 
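
To make the steps above concrete, here is a minimal sketch of a multi-node training script that could be copied to and launched on every node. It is illustrative only and not part of this patch: ``LitModel`` and the random dataset are hypothetical placeholders, and ``devices``/``num_nodes`` should match your actual cluster.

.. code-block:: python

    import torch
    from torch.utils.data import DataLoader, TensorDataset
    from lightning.pytorch import LightningModule, Trainer


    class LitModel(LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(32, 2)

        def training_step(self, batch, batch_idx):
            x, y = batch
            loss = torch.nn.functional.cross_entropy(self.layer(x), y)
            self.log("train_loss", loss)
            return loss

        def configure_optimizers(self):
            return torch.optim.Adam(self.parameters(), lr=1e-3)


    if __name__ == "__main__":
        # Random tensors stand in for a real dataset so the sketch stays self-contained
        dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 2, (1024,)))
        train_loader = DataLoader(dataset, batch_size=64)

        # 8 GPUs per node on 4 nodes -> 32 processes in total;
        # Lightning inserts the DistributedSampler for you
        trainer = Trainer(accelerator="gpu", devices=8, num_nodes=4, strategy="ddp")
        trainer.fit(LitModel(), train_loader)

The script and its Trainer arguments stay identical on every node; the node-specific information comes from the environment variables shown earlier (``MASTER_ADDR``, ``MASTER_PORT``, ``NODE_RANK``, ``WORLD_SIZE``), not from the script itself.
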
+
 ----
+
 ******************
 Debug on a cluster
 ******************
diff --git a/docs/source-pytorch/clouds/cluster_intermediate_2.rst b/docs/source-pytorch/clouds/cluster_intermediate_2.rst
index 8e0d8d1b4d..fbe2d8f781 100644
--- a/docs/source-pytorch/clouds/cluster_intermediate_2.rst
+++ b/docs/source-pytorch/clouds/cluster_intermediate_2.rst
@@ -4,34 +4,63 @@ Run on an on-prem cluster (intermediate)

 .. _torch_distributed_run:

-*************************
-Run with TorchDistributed
-*************************
-`Torch Distributed Run `__ provides helper functions to setup distributed environment variables from the `PyTorch distributed communication package `__ that need to be defined on each node.
-
-Once the script is setup like described in :ref:` Training Script Setup`, you can run the below command across your nodes to start multi-node training.
+********************************
+Run with TorchRun (TorchElastic)
+********************************
+`TorchRun `__ (previously known as TorchElastic) provides helper functions to set up distributed environment variables from the `PyTorch distributed communication package `__ that need to be defined on each node.
+Once the script is set up as described in :ref:`Training Script Setup `, you can run the below command across your nodes to start multi-node training.
 Like a custom cluster, you have to ensure that there is network connectivity between the nodes with firewall rules that allow traffic flow on a specified *MASTER_PORT*.
-
 Finally, you'll need to decide which node you'd like to be the main node (*MASTER_ADDR*), and the ranks of each node (*NODE_RANK*).

 For example:

-* *MASTER_ADDR* 10.10.10.16
-* *MASTER_PORT* 29500
-* *NODE_RANK* 0 for the first node, 1 for the second node
+* **MASTER_ADDR:** 10.10.10.16
+* **MASTER_PORT:** 29500
+* **NODE_RANK:** 0 for the first node, 1 for the second node, etc.

 Run the below command with the appropriate variables set on each node.

 .. code-block:: bash

-    python -m torch.distributed.run
-        --nnodes=2 # number of nodes you'd like to run with
-        --master_addr
-        --master_port
-        --node_rank
-        train.py (--arg1 ... train script args...)
+    torchrun \
+        --nproc_per_node= \
+        --nnodes= \
+        --node_rank \
+        --master_addr \
+        --master_port \
+        train.py --arg1 --arg2

-.. note::
-    ``torch.distributed.run`` assumes that you'd like to spawn a process per GPU if GPU devices are found on the node. This can be adjusted with ``-nproc_per_node``.
+- **--nproc_per_node:** Number of processes that will be launched per node (default 1). This number must match the number set in ``Trainer(devices=...)`` if specified in Trainer.
+- **--nnodes:** Number of nodes/machines (default 1). This number must match the number set in ``Trainer(num_nodes=...)`` if specified in Trainer.
+- **--node_rank:** The index of the node/machine.
+- **--master_addr:** The IP address of the main node with node rank 0.
+- **--master_port:** The port that will be used for communication between the nodes. Must be open in the firewall on each node to permit TCP traffic.
+
+For more advanced configuration options in TorchRun, such as elastic and fault-tolerant training, see the `official documentation `_.
+
+|
+
+**Example running on 2 nodes with 8 GPUs each:**
+
+Assume the main node has the IP address 10.10.10.16.
+On the first node, you would run this command:
+
.. code-block:: bash

    torchrun \
    --nproc_per_node=8 --nnodes=2 --node_rank 0 \
    --master_addr 10.10.10.16 --master_port 50000 \
    train.py

On the second node, you would run this command:

.. code-block:: bash

    torchrun \
    --nproc_per_node=8 --nnodes=2 --node_rank 1 \
    --master_addr 10.10.10.16 --master_port 50000 \
    train.py

Note that the only difference between the two commands is the node rank!
diff --git a/docs/source-pytorch/glossary/index.rst b/docs/source-pytorch/glossary/index.rst
index 5ca677c48e..c91ca9125c 100644
--- a/docs/source-pytorch/glossary/index.rst
+++ b/docs/source-pytorch/glossary/index.rst
@@ -44,7 +44,7 @@
    SLURM <../clouds/cluster_advanced>
    Transfer learning <../advanced/transfer_learning>
    Trainer <../common/trainer>
-   Torch distributed <../clouds/cluster_intermediate_2>
+   TorchRun (TorchElastic) <../clouds/cluster_intermediate_2>
    Warnings <../advanced/warnings>
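
As a companion to the ``torchrun`` commands above, the snippet below sketches what the ``train.py`` they launch might contain. This is an illustrative assumption, not part of the patch: ``LitModel`` and ``train_loader`` are hypothetical placeholders, and the ``devices``/``num_nodes`` arguments must agree with ``--nproc_per_node`` and ``--nnodes``.

.. code-block:: python

    # train.py -- illustrative sketch only; LitModel and train_loader are hypothetical placeholders
    from lightning.pytorch import Trainer

    from my_project import LitModel, train_loader


    if __name__ == "__main__":
        # Mirror the torchrun flags used above:
        #   --nproc_per_node=8  ->  devices=8
        #   --nnodes=2          ->  num_nodes=2
        # torchrun exports MASTER_ADDR, MASTER_PORT, RANK, LOCAL_RANK, WORLD_SIZE, etc.,
        # and Lightning picks them up automatically.
        trainer = Trainer(accelerator="gpu", devices=8, num_nodes=2, strategy="ddp")
        trainer.fit(LitModel(), train_loader)
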