Update the Multi-GPU docs (#19525)
This commit is contained in:
parent
a89ea11799
commit
e461e90f84
@@ -23,6 +23,7 @@ docs/source-pytorch/_static/images/course_UvA-DL
docs/source-pytorch/_static/images/lightning_examples
docs/source-pytorch/_static/fetched-s3-assets
docs/source-pytorch/integrations/hpu
docs/source-pytorch/integrations/strategies/Hivemind.rst
docs/source-fabric/*/generated
@@ -8,18 +8,19 @@ GPU training (Intermediate)

----

Distributed training strategies
-------------------------------
Lightning supports multiple ways of doing distributed training with ``DistributedDataParallel`` (multiple GPUs, across one or many machines):

- Regular (``strategy='ddp'``)
- Spawn (``strategy='ddp_spawn'``)
- Notebook/Fork (``strategy='ddp_notebook'``)

.. video:: https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/yt/Trainer+flags+4-+multi+node+training_3.mp4
    :poster: https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/yt_thumbs/thumb_multi_gpus.png
    :width: 400

.. note::
    If you request multiple GPUs or nodes without setting a strategy, DDP will be automatically used.
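
For example, on a single machine with 4 GPUs the following two Trainer configurations are equivalent (a minimal sketch of the selection behaviour described in the note above):

.. code-block:: python

    from lightning.pytorch import Trainer

    # DDP is picked automatically when several devices are requested
    trainer = Trainer(accelerator="gpu", devices=4)

    # the same, with the strategy spelled out explicitly
    trainer = Trainer(accelerator="gpu", devices=4, strategy="ddp")
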
@@ -28,22 +29,22 @@ For a deeper understanding of what Lightning is doing, feel free to read this
`guide <https://medium.com/@_willfalcon/9-tips-for-training-lightning-fast-neural-networks-in-pytorch-8e63a502f565>`_.


----


Distributed Data Parallel
^^^^^^^^^^^^^^^^^^^^^^^^^
:class:`~torch.nn.parallel.DistributedDataParallel` (DDP) works as follows (a raw-PyTorch sketch of these steps follows the list):

1. Each GPU across each node gets its own process.

2. Each GPU gets visibility into a subset of the overall dataset. It will only ever see that subset.

3. Each process initializes the model.

4. Each process performs a full forward and backward pass in parallel.

5. The gradients are synced and averaged across all processes.

6. Each process updates its optimizer.
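
For intuition, this is roughly what those six steps look like in raw PyTorch. Lightning does all of this for you, so none of the code below is needed in a Lightning script; the tiny linear model and random dataset are placeholders, and the sketch assumes the process was launched by a tool such as ``torchrun`` so that the rank environment variables exist.

.. code-block:: python

    import os

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel
    from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

    # 1. one process per GPU; each process learns its rank from the environment
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # 2. each process only ever sees its own shard of the dataset
    dataset = TensorDataset(torch.randn(64, 32), torch.randn(64, 2))
    loader = DataLoader(dataset, batch_size=8, sampler=DistributedSampler(dataset))

    # 3. each process builds the model; DDP then syncs and averages gradients
    model = DistributedDataParallel(torch.nn.Linear(32, 2).cuda(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # 4-6. full forward/backward in every process, then each process steps its own optimizer
    for x, y in loader:
        loss = torch.nn.functional.mse_loss(model(x.cuda(local_rank)), y.cuda(local_rank))
        loss.backward()  # gradients are all-reduced across processes here
        optimizer.step()
        optimizer.zero_grad()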

.. code-block:: python

    # train on 8 GPUs (same machine (ie: node))
@@ -59,34 +60,31 @@ variables:

    # example for 3 GPUs DDP
    MASTER_ADDR=localhost MASTER_PORT=random() WORLD_SIZE=3 NODE_RANK=0 LOCAL_RANK=0 python my_file.py --accelerator 'gpu' --devices 3 --etc
    MASTER_ADDR=localhost MASTER_PORT=random() WORLD_SIZE=3 NODE_RANK=1 LOCAL_RANK=0 python my_file.py --accelerator 'gpu' --devices 3 --etc
    MASTER_ADDR=localhost MASTER_PORT=random() WORLD_SIZE=3 NODE_RANK=2 LOCAL_RANK=0 python my_file.py --accelerator 'gpu' --devices 3 --etc
    MASTER_ADDR=localhost MASTER_PORT=random() WORLD_SIZE=3 NODE_RANK=0 LOCAL_RANK=1 python my_file.py --accelerator 'gpu' --devices 3 --etc
    MASTER_ADDR=localhost MASTER_PORT=random() WORLD_SIZE=3 NODE_RANK=0 LOCAL_RANK=2 python my_file.py --accelerator 'gpu' --devices 3 --etc
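
Inside the training script, each of these processes can read back its position in the world through the LightningModule's rank properties, for example (a minimal sketch; ``MyModel`` is just an illustrative name):

.. code-block:: python

    from lightning.pytorch import LightningModule


    class MyModel(LightningModule):
        def on_train_start(self):
            # every DDP process knows which GPU and which node it is running on
            print(f"global_rank={self.global_rank}, local_rank={self.local_rank}")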

We launch DDP this way, rather than with ``torch.multiprocessing.spawn()``, because it has a few advantages:

1. All processes (including the main process) participate in training and have the updated state of the model and the Trainer.
2. There are no multiprocessing pickle errors.
3. It easily scales to multi-node training.

There are cases in which it is NOT possible to use DDP. Examples are:

- Jupyter Notebook, Google COLAB, Kaggle, etc.
- You have a nested script without a root package

In these situations you should use ``ddp_notebook`` instead.
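
In a notebook cell this is as simple as selecting the strategy (a minimal sketch; ``MyModel`` is a placeholder for your LightningModule):

.. code-block:: python

    # inside a Jupyter / Colab / Kaggle cell
    trainer = Trainer(accelerator="gpu", devices=2, strategy="ddp_notebook")
    trainer.fit(MyModel())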

----


Distributed Data Parallel Spawn
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. warning:: It is STRONGLY recommended to use DDP for speed and performance.

The `ddp_spawn` strategy is similar to `ddp` except that it uses ``torch.multiprocessing.spawn()`` to start the training processes, roughly:

.. code-block:: python

    mp.spawn(self.ddp_train, nprocs=self.num_processes, args=(model,))

Use it for debugging only, or if you are converting a code base to Lightning that relies on spawn.
If your script does not support being called from the command line (ie: it is nested without a root
project module) you can use the following method:

.. code-block:: python
@@ -95,54 +93,12 @@ project module) you can use the following method:

We STRONGLY discourage this use because it has limitations (due to Python and PyTorch):

1. After ``.fit()``, only the model's weights get restored to the main process; the rest of the Trainer state is not. If you need it, save a checkpoint and restore from there.
2. It does not support multi-node training.
3. It is generally slower than DDP.
4. ``DataLoader(num_workers=N)`` with a large ``N`` bottlenecks training; it can be VERY slow or not work at all (a PyTorch limitation of spawn).
5. It forces everything to be picklable.
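
Because only the weights come back to the main process, persist anything else you need via a checkpoint, for example (a sketch; ``MyModel`` and the checkpoint path are placeholders):

.. code-block:: python

    from lightning.pytorch.callbacks import ModelCheckpoint

    checkpoint = ModelCheckpoint(dirpath="checkpoints/", save_last=True)
    trainer = Trainer(accelerator="gpu", devices=8, strategy="ddp_spawn", callbacks=[checkpoint])
    trainer.fit(MyModel())

    # back in the main process, a new run can restore the full training state from disk
    trainer = Trainer(accelerator="gpu", devices=8, strategy="ddp_spawn")
    trainer.fit(MyModel(), ckpt_path="checkpoints/last.ckpt")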

`ddp` is MUCH faster than `ddp_spawn`. If the only thing keeping you on `ddp_spawn` is that your script cannot be called from the command line, we recommend that you:

1. Install a top-level module for your project using setup.py

.. code-block:: python

    # setup.py
    #!/usr/bin/env python

    from setuptools import setup, find_packages

    setup(
        name="src",
        version="0.0.1",
        description="Describe Your Cool Project",
        author="",
        author_email="",
        url="https://github.com/YourSeed",  # REPLACE WITH YOUR OWN GITHUB PROJECT LINK
        install_requires=["lightning"],
        packages=find_packages(),
    )

2. Set up your project like so:

.. code-block:: bash

    /project
        /src
            some_file.py
            /or_a_folder
        setup.py

3. Install it as a root-level package:

.. code-block:: bash

    cd /project
    pip install -e .

You can then call your scripts from anywhere:

.. code-block:: bash

    cd /project/src
    python some_file.py --accelerator 'gpu' --devices 8 --strategy 'ddp'

----


Distributed Data Parallel in Notebooks
@@ -165,8 +121,11 @@ The Trainer enables it by default when such environments are detected.
Among the native distributed strategies, regular DDP (``strategy="ddp"``) is still recommended as the go-to strategy over Spawn and Fork/Notebook for its speed and stability, but it can only be used with scripts.


----


Comparison of DDP variants and tradeoffs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. list-table:: DDP variants and their tradeoffs
   :widths: 40 20 20 20
@@ -202,68 +161,23 @@ Comparison of DDP variants and tradeoffs
     - Fast


Distributed and 16-bit precision
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Below are the possible configurations we support.

+-------+---------+-----+--------+-----------------------------------------------------------------------+
| 1 GPU | 1+ GPUs | DDP | 16-bit | command                                                               |
+=======+=========+=====+========+=======================================================================+
| Y     |         |     |        | `Trainer(accelerator="gpu", devices=1)`                               |
+-------+---------+-----+--------+-----------------------------------------------------------------------+
| Y     |         |     | Y      | `Trainer(accelerator="gpu", devices=1, precision=16)`                 |
+-------+---------+-----+--------+-----------------------------------------------------------------------+
|       | Y       | Y   |        | `Trainer(accelerator="gpu", devices=k, strategy='ddp')`               |
+-------+---------+-----+--------+-----------------------------------------------------------------------+
|       | Y       | Y   | Y      | `Trainer(accelerator="gpu", devices=k, strategy='ddp', precision=16)` |
+-------+---------+-----+--------+-----------------------------------------------------------------------+

DDP can also be used with 1 GPU, but there's no reason to do so other than debugging distributed-related issues.

----


Implement Your Own Distributed (DDP) training
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you need your own way to init PyTorch DDP you can override :meth:`lightning.pytorch.strategies.ddp.DDPStrategy.setup_distributed`.

If you also need to use your own DDP implementation, override :meth:`lightning.pytorch.strategies.ddp.DDPStrategy.configure_ddp`.
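
As a rough illustration, a custom strategy could wrap the model itself. The attribute names below (``self.model``, ``self._ddp_kwargs``) follow the current ``DDPStrategy`` internals and may change between versions, so treat this as a sketch rather than a drop-in implementation:

.. code-block:: python

    from torch.nn.parallel import DistributedDataParallel

    from lightning.pytorch import Trainer
    from lightning.pytorch.strategies import DDPStrategy


    class MyDDPStrategy(DDPStrategy):
        def configure_ddp(self):
            # wrap the already-placed model with your own DDP settings
            self.model = DistributedDataParallel(self.model, **self._ddp_kwargs)


    trainer = Trainer(accelerator="gpu", devices=8, strategy=MyDDPStrategy())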

----


TorchRun (TorchElastic)
-----------------------
Lightning supports the use of TorchRun (previously known as TorchElastic) to enable fault-tolerant and elastic distributed job scheduling.
To use it, specify the DDP strategy and the number of GPUs you want to use in the Trainer.

.. code-block:: python

    Trainer(accelerator="gpu", devices=8, strategy="ddp")

Then simply launch your script with the :doc:`torchrun <../clouds/cluster_intermediate_2>` command.
To launch a fault-tolerant job, run the following on all nodes:

.. code-block:: bash

    python -m torch.distributed.run \
        --nnodes=NUM_NODES \
        --nproc_per_node=TRAINERS_PER_NODE \
        --rdzv_id=JOB_ID \
        --rdzv_backend=c10d \
        --rdzv_endpoint=HOST_NODE_ADDR \
        YOUR_LIGHTNING_TRAINING_SCRIPT.py (--arg1 ... train script args...)

To launch an elastic job, run the following on at least ``MIN_SIZE`` nodes and at most ``MAX_SIZE`` nodes:

.. code-block:: bash

    python -m torch.distributed.run \
        --nnodes=MIN_SIZE:MAX_SIZE \
        --nproc_per_node=TRAINERS_PER_NODE \
        --rdzv_id=JOB_ID \
        --rdzv_backend=c10d \
        --rdzv_endpoint=HOST_NODE_ADDR \
        YOUR_LIGHTNING_TRAINING_SCRIPT.py (--arg1 ... train script args...)

See the official `Torch Distributed Elastic documentation <https://pytorch.org/docs/stable/distributed.elastic.html>`_ for details on installation and more use cases.


Optimize multi-machine communication
------------------------------------
@@ -15,6 +15,7 @@ schedules the resources and time for which the job is allowed to run.

----


***************************
Design your training script
***************************
@@ -5,13 +5,15 @@ Run on an on-prem cluster (intermediate)
########################################
**Audience**: Users who need to run on an academic or enterprise private cluster.


----


.. _non-slurm:

******************
Set up the cluster
******************
This guide shows how to run a training job on a general purpose cluster. We recommend that beginners try this method
first because it requires the least amount of configuration and changes to the code.
To set up a multi-node computing cluster you need:
@@ -29,11 +31,13 @@ PyTorch Lightning follows the design of `PyTorch distributed communication packa

.. _training_script_setup:


----


**************************
Set up the training script
**************************
To train a model using multiple nodes, do the following:

1. Design your :ref:`lightning_module` (no need to add anything specific here).
@@ -45,8 +49,10 @@ To train a model using multiple nodes, do the following:

    # train on 32 GPUs across 4 nodes
    trainer = Trainer(accelerator="gpu", devices=8, num_nodes=4, strategy="ddp")


----


***************************
Submit a job to the cluster
***************************
@@ -57,8 +63,10 @@ This means that you need to:
2. Copy all your import dependencies and the script itself to each node.
3. Run the script on each node.


----


******************
Debug on a cluster
******************
@@ -4,34 +4,63 @@ Run on an on-prem cluster (intermediate)

.. _torch_distributed_run:

********************************
Run with TorchRun (TorchElastic)
********************************

`TorchRun <https://pytorch.org/docs/stable/elastic/run.html>`__ (previously known as TorchElastic) provides helper functions to set up distributed environment variables from the `PyTorch distributed communication package <https://pytorch.org/docs/stable/distributed.html#environment-variable-initialization>`__ that need to be defined on each node.
Once the script is set up as described in :ref:`Training Script Setup <training_script_setup>`, you can run the below command across your nodes to start multi-node training.
Like a custom cluster, you have to ensure that there is network connectivity between the nodes with firewall rules that allow traffic flow on a specified *MASTER_PORT*.

Finally, you'll need to decide which node you'd like to be the main node (*MASTER_ADDR*), and the ranks of each node (*NODE_RANK*).

For example:

* **MASTER_ADDR:** 10.10.10.16
* **MASTER_PORT:** 29500
* **NODE_RANK:** 0 for the first node, 1 for the second node, etc.

Run the below command with the appropriate variables set on each node.

.. code-block:: bash

    torchrun \
        --nproc_per_node=<GPUS_PER_NODE> \
        --nnodes=<NUM_NODES> \
        --node_rank <NODE_RANK> \
        --master_addr <MASTER_ADDR> \
        --master_port <MASTER_PORT> \
        train.py --arg1 --arg2

.. note::

    - **--nproc_per_node:** Number of processes that will be launched per node (default 1). This number must match the number set in ``Trainer(devices=...)`` if it is specified in the Trainer.
    - **--nnodes:** Number of nodes/machines (default 1). This number must match the number set in ``Trainer(num_nodes=...)`` if it is specified in the Trainer.
    - **--node_rank:** The index of the node/machine.
    - **--master_addr:** The IP address of the main node with node rank 0.
    - **--master_port:** The port that will be used for communication between the nodes. It must be open in the firewall on each node to permit TCP traffic.

For more advanced configuration options of TorchRun, such as elastic and fault-tolerant training, see the `official documentation <https://pytorch.org/docs/stable/elastic/run.html>`_.

**Example running on 2 nodes with 8 GPUs each:**

Assume the main node has the IP address 10.10.10.16.
On the first node, you would run this command:

.. code-block:: bash

    torchrun \
        --nproc_per_node=8 --nnodes=2 --node_rank 0 \
        --master_addr 10.10.10.16 --master_port 50000 \
        train.py

On the second node, you would run this command:

.. code-block:: bash

    torchrun \
        --nproc_per_node=8 --nnodes=2 --node_rank 1 \
        --master_addr 10.10.10.16 --master_port 50000 \
        train.py

Note that the only difference between the two commands is the node rank!
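
To match this launch configuration, the Trainer inside ``train.py`` would request the same device and node counts (a minimal sketch):

.. code-block:: python

    # train.py
    trainer = Trainer(accelerator="gpu", devices=8, num_nodes=2, strategy="ddp")
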
@@ -44,7 +44,7 @@
    SLURM <../clouds/cluster_advanced>
    Transfer learning <../advanced/transfer_learning>
    Trainer <../common/trainer>
    TorchRun (TorchElastic) <../clouds/cluster_intermediate_2>
    Warnings <../advanced/warnings>