Update the Multi-GPU docs (#19525)

awaelchli 2024-02-27 04:29:26 +01:00 committed by GitHub
parent a89ea11799
commit e461e90f84
6 changed files with 105 additions and 152 deletions

.gitignore

@ -23,6 +23,7 @@ docs/source-pytorch/_static/images/course_UvA-DL
docs/source-pytorch/_static/images/lightning_examples
docs/source-pytorch/_static/fetched-s3-assets
docs/source-pytorch/integrations/hpu
docs/source-pytorch/integrations/strategies/Hivemind.rst
docs/source-fabric/*/generated


@ -8,18 +8,19 @@ GPU training (Intermediate)
----
Distributed training strategies
-------------------------------
Lightning supports multiple ways of doing distributed training.
- Regular (``strategy='ddp'``)
- Spawn (``strategy='ddp_spawn'``)
- Notebook/Fork (``strategy='ddp_notebook'``)
.. video:: https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/yt/Trainer+flags+4-+multi+node+training_3.mp4
:poster: https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/yt_thumbs/thumb_multi_gpus.png
:width: 400
.. note::
If you request multiple GPUs or nodes without setting a strategy, DDP will be automatically used.
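For instance (a minimal illustration; the device count is arbitrary), requesting several GPUs with no explicit strategy is enough for Lightning to pick DDP:

.. code-block:: python

    from lightning.pytorch import Trainer

    # No strategy specified: with multiple devices requested, DDP is selected automatically
    trainer = Trainer(accelerator="gpu", devices=4)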
@ -28,22 +29,22 @@ For a deeper understanding of what Lightning is doing, feel free to read this
`guide <https://medium.com/@_willfalcon/9-tips-for-training-lightning-fast-neural-networks-in-pytorch-8e63a502f565>`_.
----
Distributed Data Parallel
^^^^^^^^^^^^^^^^^^^^^^^^^
:class:`~torch.nn.parallel.DistributedDataParallel` (DDP) works as follows:
1. Each GPU across each node gets its own process.
2. Each GPU gets visibility into a subset of the overall dataset. It will only ever see that subset.
3. Each process inits the model.
4. Each process performs a full forward and backward pass in parallel.
5. The gradients are synced and averaged across all processes.
6. Each process updates its optimizer.
|
.. code-block:: python
# train on 8 GPUs (same machine (ie: node))
trainer = Trainer(accelerator="gpu", devices=8, strategy="ddp")
@ -59,34 +60,31 @@ variables:
# example for 3 GPUs DDP
MASTER_ADDR=localhost MASTER_PORT=random() WORLD_SIZE=3 NODE_RANK=0 LOCAL_RANK=0 python my_file.py --accelerator 'gpu' --devices 3 --etc
MASTER_ADDR=localhost MASTER_PORT=random() WORLD_SIZE=3 NODE_RANK=0 LOCAL_RANK=1 python my_file.py --accelerator 'gpu' --devices 3 --etc
MASTER_ADDR=localhost MASTER_PORT=random() WORLD_SIZE=3 NODE_RANK=0 LOCAL_RANK=2 python my_file.py --accelerator 'gpu' --devices 3 --etc
Using DDP this way has a few advantages over ``torch.multiprocessing.spawn()``:

1. All processes (including the main process) participate in training and have the updated state of the model and Trainer state.
2. No multiprocessing pickle errors.
3. Easily scales to multi-node training.
It is NOT possible to use DDP in interactive environments like Jupyter Notebook, Google Colab, Kaggle, etc.
In these situations you should use the `ddp_notebook` strategy instead.
----
Distributed Data Parallel Spawn
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The `ddp_spawn` strategy is similar to `ddp` except that it uses ``torch.multiprocessing.spawn()`` to start the training processes.
Use this for debugging only, or if you are converting a code base to Lightning that relies on spawn.

.. warning:: It is STRONGLY recommended to use DDP for speed and performance.
.. code-block:: python

    # train with the spawn launcher on 8 GPUs
    trainer = Trainer(accelerator="gpu", devices=8, strategy="ddp_spawn")
@ -95,54 +93,12 @@ project module) you can use the following method:
We STRONGLY discourage this use because it has limitations (due to Python and PyTorch):
1. After ``.fit()``, only the model's weights get restored to the main process, but no other state of the Trainer (see the sketch below).
2. Does not support multi-node training.
3. It is generally slower than DDP.
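As a sketch of limitation 1 (``MyModel`` is a hypothetical LightningModule used only for illustration):

.. code-block:: python

    from lightning.pytorch import Trainer

    model = MyModel()  # hypothetical LightningModule
    trainer = Trainer(accelerator="gpu", devices=8, strategy="ddp_spawn")
    trainer.fit(model)

    # The main process now holds the trained weights in `model`,
    # but other Trainer state from the worker processes is not carried back.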
`ddp` is MUCH faster than `ddp_spawn`. If your script does not support being called from the command line (e.g., it is nested without a root package), we recommend you:
1. Install a top-level module for your project using setup.py
.. code-block:: python
# setup.py
#!/usr/bin/env python
from setuptools import setup, find_packages
setup(
name="src",
version="0.0.1",
description="Describe Your Cool Project",
author="",
author_email="",
url="https://github.com/YourSeed", # REPLACE WITH YOUR OWN GITHUB PROJECT LINK
install_requires=["lightning"],
packages=find_packages(),
)
2. Set up your project like so:
.. code-block:: bash
/project
/src
some_file.py
/or_a_folder
setup.py
3. Install as a root-level package
.. code-block:: bash
cd /project
pip install -e .
You can then call your scripts from anywhere:
.. code-block:: bash
cd /project/src
python some_file.py --accelerator 'gpu' --devices 8 --strategy 'ddp'
----
Distributed Data Parallel in Notebooks
@ -165,8 +121,11 @@ The Trainer enables it by default when such environments are detected.
Among the native distributed strategies, regular DDP (``strategy="ddp"``) is still recommended as the go-to strategy over Spawn and Fork/Notebook for its speed and stability, but it can only be used with scripts.
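As a quick illustration (assuming an interactive session with at least two GPUs available), the notebook-friendly variant is selected like this:

.. code-block:: python

    from lightning.pytorch import Trainer

    # Inside a Jupyter/Colab/Kaggle session, use the fork-based notebook strategy
    trainer = Trainer(accelerator="gpu", devices=2, strategy="ddp_notebook")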
----
Comparison of DDP variants and tradeoffs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. list-table:: DDP variants and their tradeoffs
:widths: 40 20 20 20
@ -202,68 +161,23 @@ Comparison of DDP variants and tradeoffs
- Fast
Distributed and 16-bit precision
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Below are the possible configurations we support.
+-------+---------+-----+--------+-----------------------------------------------------------------------+
| 1 GPU | 1+ GPUs | DDP | 16-bit | command |
+=======+=========+=====+========+=======================================================================+
| Y | | | | `Trainer(accelerator="gpu", devices=1)` |
+-------+---------+-----+--------+-----------------------------------------------------------------------+
| Y | | | Y | `Trainer(accelerator="gpu", devices=1, precision=16)` |
+-------+---------+-----+--------+-----------------------------------------------------------------------+
| | Y | Y | | `Trainer(accelerator="gpu", devices=k, strategy='ddp')` |
+-------+---------+-----+--------+-----------------------------------------------------------------------+
| | Y | Y | Y | `Trainer(accelerator="gpu", devices=k, strategy='ddp', precision=16)` |
+-------+---------+-----+--------+-----------------------------------------------------------------------+
DDP can also be used with 1 GPU, but there's no reason to do so other than debugging distributed-related issues.
----
Implement Your Own Distributed (DDP) training
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If you need your own way to init PyTorch DDP you can override :meth:`lightning.pytorch.strategies.ddp.DDPStrategy.setup_distributed`.
If you also need to use your own DDP implementation, override :meth:`lightning.pytorch.strategies.ddp.DDPStrategy.configure_ddp`.
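A minimal sketch of such an override is shown below; the class name ``CustomDDPStrategy`` is illustrative, and both methods simply defer to the defaults where your own logic would go:

.. code-block:: python

    from lightning.pytorch import Trainer
    from lightning.pytorch.strategies import DDPStrategy


    class CustomDDPStrategy(DDPStrategy):
        def setup_distributed(self):
            # Initialize the process group your own way here (custom backend,
            # init method, timeouts, ...), or fall back to the default.
            super().setup_distributed()

        def configure_ddp(self):
            # Wrap the model with your own DistributedDataParallel setup here.
            super().configure_ddp()


    trainer = Trainer(accelerator="gpu", devices=8, strategy=CustomDDPStrategy())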
----------
TorchRun (TorchElastic)
-----------------------
Lightning supports the use of TorchRun (previously known as TorchElastic) to enable fault-tolerant and elastic distributed job scheduling.
To use it, specify the DDP strategy and the number of GPUs you want to use in the Trainer.
.. code-block:: python
Trainer(accelerator="gpu", devices=8, strategy="ddp")
Then simply launch your script with the :doc:`torchrun <../clouds/cluster_intermediate_2>` command.
For example, to launch a fault-tolerant job, run the following on all nodes.
.. code-block:: bash
python -m torch.distributed.run \
    --nnodes=NUM_NODES \
    --nproc_per_node=TRAINERS_PER_NODE \
    --rdzv_id=JOB_ID \
    --rdzv_backend=c10d \
    --rdzv_endpoint=HOST_NODE_ADDR \
    YOUR_LIGHTNING_TRAINING_SCRIPT.py (--arg1 ... train script args...)
----
To launch an elastic job, run the following on at least ``MIN_SIZE`` nodes and at most ``MAX_SIZE`` nodes.
.. code-block:: bash
python -m torch.distributed.run \
    --nnodes=MIN_SIZE:MAX_SIZE \
    --nproc_per_node=TRAINERS_PER_NODE \
    --rdzv_id=JOB_ID \
    --rdzv_backend=c10d \
    --rdzv_endpoint=HOST_NODE_ADDR \
    YOUR_LIGHTNING_TRAINING_SCRIPT.py (--arg1 ... train script args...)
See the official `Torch Distributed Elastic documentation <https://pytorch.org/docs/stable/distributed.elastic.html>`_ for details
on installation and more use cases.
Optimize multi-machine communication
------------------------------------


@ -15,6 +15,7 @@ schedules the resources and time for which the job is allowed to run.
----
***************************
Design your training script
***************************


@ -5,13 +5,15 @@ Run on an on-prem cluster (intermediate)
########################################
**Audience**: Users who need to run on an academic or enterprise private cluster.
----
.. _non-slurm:
******************
Set up the cluster
******************
This guide shows how to run a training job on a general-purpose cluster. We recommend that beginners try this method
first because it requires the least amount of configuration and changes to the code.
To set up a multi-node computing cluster you need:
@ -29,11 +31,13 @@ PyTorch Lightning follows the design of `PyTorch distributed communication packa
.. _training_script_setup:
----
**************************
Set up the training script
**************************
To train a model using multiple nodes, do the following:
1. Design your :ref:`lightning_module` (no need to add anything specific here).
@ -45,8 +49,10 @@ To train a model using multiple nodes, do the following:
# train on 32 GPUs across 4 nodes
trainer = Trainer(accelerator="gpu", devices=8, num_nodes=4, strategy="ddp")
----
***************************
Submit a job to the cluster
***************************
@ -57,8 +63,10 @@ This means that you need to:
2. Copy all your import dependencies and the script itself to each node.
3. Run the script on each node, as sketched below.
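For example (a sketch only; the address, port, and script name are placeholders, and your cluster environment may require additional variables such as ``WORLD_SIZE``), launching the 4-node example above by hand means running the script once per node with that node's rank:

.. code-block:: bash

    # Repeat on every node, changing only NODE_RANK (0, 1, 2, 3 for the 4-node example)
    MASTER_ADDR=10.10.10.16 MASTER_PORT=29500 NODE_RANK=0 python train.py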
----
******************
Debug on a cluster
******************


@ -4,34 +4,63 @@ Run on an on-prem cluster (intermediate)
.. _torch_distributed_run:
********************************
Run with TorchRun (TorchElastic)
********************************

`TorchRun <https://pytorch.org/docs/stable/elastic/run.html>`__ (previously known as TorchElastic) provides helper functions to set up distributed environment variables from the `PyTorch distributed communication package <https://pytorch.org/docs/stable/distributed.html#environment-variable-initialization>`__ that need to be defined on each node.
Once the script is set up as described in :ref:`Training Script Setup <training_script_setup>`, you can run the command below across your nodes to start multi-node training.
As with a custom cluster, you have to ensure that there is network connectivity between the nodes, with firewall rules that allow traffic flow on the specified *MASTER_PORT*.
Finally, you'll need to decide which node you'd like to be the main node (*MASTER_ADDR*) and the rank of each node (*NODE_RANK*).
For example:
* **MASTER_ADDR:** 10.10.10.16
* **MASTER_PORT:** 29500
* **NODE_RANK:** 0 for the first node, 1 for the second node, etc.
Run the command below with the appropriate variables set on each node.
.. code-block:: bash
torchrun \
    --nproc_per_node=<GPUS_PER_NODE> \
    --nnodes=<NUM_NODES> \
    --node_rank <NODE_RANK> \
    --master_addr <MASTER_ADDR> \
    --master_port <MASTER_PORT> \
    train.py --arg1 --arg2
- **--nproc_per_node:** Number of processes that will be launched per node (default 1). This number must match the number set in ``Trainer(devices=...)`` if specified in Trainer.
- **--nnodes:** Number of nodes/machines (default 1). This number must match the number set in ``Trainer(num_nodes=...)`` if specified in Trainer.
- **--node_rank:** The index of the node/machine.
- **--master_addr:** The IP address of the main node with node rank 0.
- **--master_port:** The port that will be used for communication between the nodes. Must be open in the firewall on each node to permit TCP traffic.
For more advanced TorchRun configuration options, such as elastic and fault-tolerant training, see the `official documentation <https://pytorch.org/docs/stable/elastic/run.html>`_.
|
**Example running on 2 nodes with 8 GPUs each:**
Assume the main node has the IP address 10.10.10.16.
On the first node, you would run this command:
.. code-block:: bash
torchrun \
--nproc_per_node=8 --nnodes=2 --node_rank 0 \
--master_addr 10.10.10.16 --master_port 50000 \
train.py
On the second node, you would run this command:
.. code-block:: bash
torchrun \
--nproc_per_node=8 --nnodes=2 --node_rank 1 \
--master_addr 10.10.10.16 --master_port 50000 \
train.py
Note that the only difference between the two commands is the node rank!
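To match these launcher flags, the ``Trainer`` inside ``train.py`` would be configured for the same topology (a sketch, assuming the script constructs the Trainer directly):

.. code-block:: python

    from lightning.pytorch import Trainer

    # 8 processes per node and 2 nodes, matching --nproc_per_node=8 and --nnodes=2
    trainer = Trainer(accelerator="gpu", devices=8, num_nodes=2, strategy="ddp")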


@ -44,7 +44,7 @@
SLURM <../clouds/cluster_advanced>
Transfer learning <../advanced/transfer_learning>
Trainer <../common/trainer>
TorchRun (TorchElastic) <../clouds/cluster_intermediate_2>
Warnings <../advanced/warnings>