:orphan:

.. _gpu_intermediate:

GPU training (Intermediate)
===========================
**Audience:** Users looking to train across machines or experiment with different scaling techniques.

----

Distributed Training strategies
-------------------------------
Lightning supports multiple ways of doing distributed training.

.. raw:: html

    <video width="50%" max-width="400px" controls
    poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/yt_thumbs/thumb_multi_gpus.png"
    src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/yt/Trainer+flags+4-+multi+node+training_3.mp4"></video>

|

- DistributedDataParallel (multiple GPUs across many machines)
    - Regular (``strategy='ddp'``)
    - Spawn (``strategy='ddp_spawn'``)
    - Notebook/Fork (``strategy='ddp_notebook'``)

.. note::
    If you request multiple GPUs or nodes without setting a strategy, DDP will be automatically used.

For a deeper understanding of what Lightning is doing, feel free to read this
`guide <https://medium.com/@_willfalcon/9-tips-for-training-lightning-fast-neural-networks-in-pytorch-8e63a502f565>`_.


Distributed Data Parallel
^^^^^^^^^^^^^^^^^^^^^^^^^
:class:`~torch.nn.parallel.DistributedDataParallel` (DDP) works as follows:

1. Each GPU across each node gets its own process.

2. Each GPU gets visibility into a subset of the overall dataset. It will only ever see that subset.

3. Each process initializes the model.

4. Each process performs a full forward and backward pass in parallel.

5. The gradients are synced and averaged across all processes.

6. Each process updates its own copy of the model with its optimizer.
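
As a rough illustration of steps 2 to 6, the following toy simulation runs four "ranks" inside a single process, with no GPUs and no PyTorch. The helper names (``shard_indices``, ``local_gradient``, ``all_reduce_mean``) are hypothetical and are not part of Lightning; real DDP averages gradients with a collective all-reduce across separate processes.

.. code-block:: python

    # Toy, single-process illustration of the DDP steps (hypothetical helpers).


    def shard_indices(dataset_size, world_size, rank):
        """Give each rank a disjoint, strided subset of the dataset (step 2)."""
        return list(range(rank, dataset_size, world_size))


    def local_gradient(weight, samples):
        """Gradient of the mean squared error for y = weight * x with target 2 * x."""
        return sum(2 * (weight * x - 2 * x) * x for x in samples) / len(samples)


    def all_reduce_mean(values):
        """Stand-in for the gradient sync: every rank receives the average (step 5)."""
        mean = sum(values) / len(values)
        return [mean] * len(values)


    world_size = 4
    data = [float(i) for i in range(1, 9)]
    shards = [[data[i] for i in shard_indices(len(data), world_size, rank)] for rank in range(world_size)]
    weights = [0.0] * world_size  # step 3: every rank starts from the same weights

    for _ in range(50):
        grads = [local_gradient(weights[rank], shards[rank]) for rank in range(world_size)]  # step 4
        synced = all_reduce_mean(grads)  # step 5
        weights = [w - 0.01 * g for w, g in zip(weights, synced)]  # step 6

Because every rank applies the same averaged gradient to the same starting weights, all copies of the model stay identical even though each rank only ever sees its own shard.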

.. code-block:: python

    # train on 8 GPUs on the same machine (i.e., one node)
    trainer = Trainer(accelerator="gpu", devices=8, strategy="ddp")

    # train on 32 GPUs (4 nodes)
    trainer = Trainer(accelerator="gpu", devices=8, strategy="ddp", num_nodes=4)

This Lightning implementation of DDP calls your script under the hood multiple times with the correct environment
variables:

.. code-block:: bash

    # example for 3 GPUs DDP on a single machine (one node, three local ranks)
    MASTER_ADDR=localhost MASTER_PORT=random() WORLD_SIZE=3 NODE_RANK=0 LOCAL_RANK=0 python my_file.py --accelerator 'gpu' --devices 3 --etc
    MASTER_ADDR=localhost MASTER_PORT=random() WORLD_SIZE=3 NODE_RANK=0 LOCAL_RANK=1 python my_file.py --accelerator 'gpu' --devices 3 --etc
    MASTER_ADDR=localhost MASTER_PORT=random() WORLD_SIZE=3 NODE_RANK=0 LOCAL_RANK=2 python my_file.py --accelerator 'gpu' --devices 3 --etc
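
Inside each launched process, these variables can be read back to find out which part of the job the process is responsible for. The sketch below is only an illustration: ``describe_process`` and its ``devices_per_node`` argument are hypothetical names, and the global-rank formula assumes that every node runs the same number of processes.

.. code-block:: python

    import os


    def describe_process(env, devices_per_node):
        """Derive rank information from DDP-style environment variables."""
        node_rank = int(env.get("NODE_RANK", 0))
        local_rank = int(env.get("LOCAL_RANK", 0))
        world_size = int(env.get("WORLD_SIZE", 1))
        # Assumption: ranks are laid out node by node, device by device.
        global_rank = node_rank * devices_per_node + local_rank
        return {
            "global_rank": global_rank,
            "local_rank": local_rank,
            "world_size": world_size,
            "is_rank_zero": global_rank == 0,
        }


    # A plain dict keeps the example reproducible: second process on the second of two nodes.
    example_env = {"WORLD_SIZE": "4", "NODE_RANK": "1", "LOCAL_RANK": "1"}
    info = describe_process(example_env, devices_per_node=2)
    print(info["global_rank"], "of", info["world_size"])  # prints: 3 of 4

    # In a real script, the same helper would be fed the actual environment.
    current = describe_process(os.environ, devices_per_node=1)

A check such as ``info["is_rank_zero"]`` is a common way to restrict logging or file writes to a single process.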

We use DDP this way because ``ddp_spawn`` has a few limitations (due to Python and PyTorch):

1. Since ``.spawn()`` trains the model in subprocesses, the model on the main process does not get updated.
2. ``DataLoader(num_workers=N)``, where N is large, bottlenecks training with DDP, i.e., it will be VERY slow or won't work at all. This is a PyTorch limitation.
3. It forces everything to be picklable.
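
To see what the picklability requirement means in practice, the standard-library check below tests whether an object survives ``pickle.dumps``. ``is_picklable`` is a hypothetical helper written for this example; typical offenders in training scripts are lambdas and locally defined functions.

.. code-block:: python

    import math
    import pickle


    def is_picklable(obj):
        """Return True if ``obj`` can be serialized with ``pickle.dumps``."""
        try:
            pickle.dumps(obj)
        except (pickle.PicklingError, AttributeError, TypeError):
            return False
        return True


    print(is_picklable(math.sqrt))  # True: importable by name
    print(is_picklable(lambda x: x + 1))  # False: a lambda has no importable name

Replacing a lambda with a named, module-level function is usually enough to make such an object picklable.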

There are cases in which it is NOT possible to use DDP. Examples are:

- Jupyter Notebook, Google Colab, Kaggle, etc.
- You have a nested script without a root package

In these situations you should use ``ddp_notebook`` or ``dp`` instead.

Distributed Data Parallel Spawn
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
``ddp_spawn`` is exactly like ``ddp`` except that it uses ``.spawn()`` to start the training processes.

.. warning:: It is STRONGLY recommended to use ``ddp`` for speed and performance.

.. code-block:: python

    mp.spawn(self.ddp_train, nprocs=self.num_processes, args=(model,))

If your script does not support being called from the command line (i.e., it is nested without a root
project module), you can use the following method:

.. code-block:: python

    # train on 8 GPUs on the same machine (i.e., one node)
    trainer = Trainer(accelerator="gpu", devices=8, strategy="ddp_spawn")

We STRONGLY discourage this use because it has limitations (due to Python and PyTorch):

1. The model you pass in will not update. Please save a checkpoint and restore from there.
2. Set ``DataLoader(num_workers=0)`` or it will bottleneck training.
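
The first limitation follows from how spawned workers receive their inputs: each worker gets a pickled copy of the model, so updates made in the worker never reach the object in the main process. The toy example below imitates this with a pickle round trip instead of real subprocesses. The dict-based "model" and ``train_in_worker`` are hypothetical stand-ins, not Lightning APIs.

.. code-block:: python

    import pickle


    def train_in_worker(pickled_model):
        """Pretend to be a spawned worker: it only ever sees a copy of the model."""
        model = pickle.loads(pickled_model)
        model["weight"] += 1.0  # "training" changes the worker's copy only
        return pickle.dumps(model)  # stands in for a checkpoint written by the worker


    parent_model = {"weight": 0.0}
    checkpoint = train_in_worker(pickle.dumps(parent_model))

    print(parent_model["weight"])  # 0.0: the object in the main process is unchanged
    restored = pickle.loads(checkpoint)
    print(restored["weight"])  # 1.0: the trained state has to be loaded explicitly

In a real project, the counterpart of ``restored`` is a model loaded from a checkpoint that was saved during training.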

``ddp`` is MUCH faster than ``ddp_spawn``. We recommend that you:

1. Install a top-level module for your project using ``setup.py``:

.. code-block:: python

    #!/usr/bin/env python
    # setup.py

    from setuptools import setup, find_packages

    setup(
        name="src",
        version="0.0.1",
        description="Describe Your Cool Project",
        author="",
        author_email="",
        url="https://github.com/YourSeed",  # REPLACE WITH YOUR OWN GITHUB PROJECT LINK
        install_requires=["pytorch-lightning"],
        packages=find_packages(),
    )

2. Set up your project like so:

.. code-block:: bash

    /project
        /src
            some_file.py
            /or_a_folder
        setup.py

3. Install it as a root-level package:

.. code-block:: bash

    cd /project
    pip install -e .

You can then call your scripts from anywhere:

.. code-block:: bash

    cd /project/src
    python some_file.py --accelerator 'gpu' --devices 8 --strategy 'ddp'
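
For the command above to work, ``some_file.py`` has to turn those flags into Trainer arguments. One possible way to do this with the standard library is sketched below. ``parse_trainer_args`` and its defaults are assumptions made for this example; only the flag names come from the command line shown above, plus ``--num_nodes`` from the earlier Trainer examples.

.. code-block:: python

    import argparse


    def parse_trainer_args(argv=None):
        """Parse the flags used above into a dict of Trainer keyword arguments."""
        parser = argparse.ArgumentParser(description="Hypothetical training entry point")
        parser.add_argument("--accelerator", default="cpu")
        parser.add_argument("--devices", type=int, default=1)
        parser.add_argument("--strategy", default="ddp")
        parser.add_argument("--num_nodes", type=int, default=1)
        return vars(parser.parse_args(argv))


    # Passing an explicit list mirrors the command line shown above.
    trainer_kwargs = parse_trainer_args(["--accelerator", "gpu", "--devices", "8", "--strategy", "ddp"])
    print(trainer_kwargs)

    # In some_file.py, the result would typically be forwarded as Trainer(**trainer_kwargs).

Calling ``parse_trainer_args()`` without arguments reads ``sys.argv``, which is what a script launched by DDP would do in each process.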


Distributed Data Parallel in Notebooks
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

DDP Notebook/Fork is an alternative to Spawn that can be used in interactive Python and Jupyter notebooks, Google Colab, Kaggle notebooks, and so on.
The Trainer enables it by default when such environments are detected.

.. code-block:: python

    # train on 8 GPUs in a Jupyter notebook
    trainer = Trainer(accelerator="gpu", devices=8)

    # can be set explicitly
    trainer = Trainer(accelerator="gpu", devices=8, strategy="ddp_notebook")

    # can also be used in non-interactive environments
    trainer = Trainer(accelerator="gpu", devices=8, strategy="ddp_fork")

Among the native distributed strategies, regular DDP (``strategy="ddp"``) is still recommended as the go-to strategy over Spawn and Fork/Notebook for its speed and stability, but it can only be used with scripts.


Comparison of DDP variants and tradeoffs
****************************************

.. list-table:: DDP variants and their tradeoffs
   :widths: 40 20 20 20
   :header-rows: 1

   * -
     - DDP
     - DDP Spawn
     - DDP Notebook/Fork
   * - Works in Jupyter notebooks / IPython environments
     - No
     - No
     - Yes
   * - Supports multi-node
     - Yes
     - Yes
     - Yes
   * - Supported platforms
     - Linux, Mac, Win
     - Linux, Mac, Win
     - Linux, Mac
   * - Requires all objects to be picklable
     - No
     - Yes
     - No
   * - Limitations in the main process
     - None
     - The state of objects is not up-to-date after returning to the main process (``Trainer.fit()`` etc). Only the model parameters get transferred over.
     - GPU operations such as moving tensors to the GPU or calling ``torch.cuda`` functions before invoking ``Trainer.fit`` are not allowed.
   * - Process creation time
     - Slow
     - Slow
     - Fast
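
The first and third rows of the table can be condensed into a small decision helper. This is a hypothetical sketch and not how the Trainer makes its choice internally: ``pick_ddp_variant`` is an invented name, and the platform strings follow the values of ``sys.platform``.

.. code-block:: python

    import sys


    def pick_ddp_variant(interactive, platform):
        """Suggest a strategy name based on the comparison table (hypothetical helper)."""
        if not interactive:
            # Scripts can use regular DDP, the recommended variant.
            return "ddp"
        if platform.startswith("win"):
            # The table lists the fork-based variant for Linux and Mac only.
            raise RuntimeError("DDP Notebook/Fork is not listed as supported on Windows")
        return "ddp_notebook"


    print(pick_ddp_variant(interactive=False, platform=sys.platform))  # prints: ddp
    print(pick_ddp_variant(interactive=True, platform="linux"))  # prints: ddp_notebook

The returned string can be passed as the ``strategy`` argument of the Trainer.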


Distributed and 16-bit precision
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Below are the possible configurations we support.

+-------+---------+-----+--------+-----------------------------------------------------------------------+
| 1 GPU | 1+ GPUs | DDP | 16-bit | command                                                               |
+=======+=========+=====+========+=======================================================================+
| Y     |         |     |        | `Trainer(accelerator="gpu", devices=1)`                               |
+-------+---------+-----+--------+-----------------------------------------------------------------------+
| Y     |         |     | Y      | `Trainer(accelerator="gpu", devices=1, precision=16)`                 |
+-------+---------+-----+--------+-----------------------------------------------------------------------+
|       | Y       | Y   |        | `Trainer(accelerator="gpu", devices=k, strategy='ddp')`               |
+-------+---------+-----+--------+-----------------------------------------------------------------------+
|       | Y       | Y   | Y      | `Trainer(accelerator="gpu", devices=k, strategy='ddp', precision=16)` |
+-------+---------+-----+--------+-----------------------------------------------------------------------+

DDP can also be used with 1 GPU, but there's no reason to do so other than debugging distributed-related issues.


Implement Your Own Distributed (DDP) training
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If you need your own way to initialize PyTorch DDP, you can override :meth:`pytorch_lightning.strategies.ddp.DDPStrategy.setup_distributed`.

If you also need to use your own DDP implementation, override :meth:`pytorch_lightning.strategies.ddp.DDPStrategy.configure_ddp`.
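
A minimal sketch of such a subclass is shown below. It assumes that both hooks take no arguments besides ``self``, and it only adds logging around the default behavior. ``VerboseDDPStrategy`` is a hypothetical name, and a real override would replace the ``super()`` calls with its own initialization or wrapping logic.

.. code-block:: python

    import logging

    from pytorch_lightning.strategies import DDPStrategy

    log = logging.getLogger(__name__)


    class VerboseDDPStrategy(DDPStrategy):
        """Hypothetical strategy that logs around the default DDP setup."""

        def setup_distributed(self):
            log.info("Initializing the distributed process group")
            # Replace this call to initialize the process group in your own way.
            super().setup_distributed()

        def configure_ddp(self):
            # Replace this call to wrap the model with your own DDP implementation.
            super().configure_ddp()
            log.info("Model wrapped as %s", type(self.model).__name__)


    # Usage (needs GPUs, so it is left as a comment):
    # trainer = Trainer(strategy=VerboseDDPStrategy(), accelerator="gpu", devices=8)

Passing an instance of the subclass as ``strategy`` makes the Trainer use the overridden hooks.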

----------

Torch Distributed Elastic
-------------------------
Lightning supports the use of Torch Distributed Elastic to enable fault-tolerant and elastic distributed job scheduling. To use it, specify the ``ddp`` strategy and the number of GPUs you want to use in the Trainer.

.. code-block:: python

    Trainer(accelerator="gpu", devices=8, strategy="ddp")

To launch a fault-tolerant job, run the following on all nodes.

.. code-block:: bash

    python -m torch.distributed.run \
            --nnodes=NUM_NODES \
            --nproc_per_node=TRAINERS_PER_NODE \
            --rdzv_id=JOB_ID \
            --rdzv_backend=c10d \
            --rdzv_endpoint=HOST_NODE_ADDR \
            YOUR_LIGHTNING_TRAINING_SCRIPT.py (--arg1 ... train script args...)

To launch an elastic job, run the following on at least ``MIN_SIZE`` nodes and at most ``MAX_SIZE`` nodes.

.. code-block:: bash

    python -m torch.distributed.run \
            --nnodes=MIN_SIZE:MAX_SIZE \
            --nproc_per_node=TRAINERS_PER_NODE \
            --rdzv_id=JOB_ID \
            --rdzv_backend=c10d \
            --rdzv_endpoint=HOST_NODE_ADDR \
            YOUR_LIGHTNING_TRAINING_SCRIPT.py (--arg1 ... train script args...)

See the official `Torch Distributed Elastic documentation <https://pytorch.org/docs/stable/distributed.elastic.html>`_ for details
on installation and more use cases.
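
When jobs are launched from Python rather than from a shell, the same command can be assembled as an argument list. The helper below is a sketch under assumptions: ``build_elastic_command`` is a hypothetical name, and the script name, job ID, and endpoint are placeholder values. Only the flags shown in the commands above are used.

.. code-block:: python

    import sys


    def build_elastic_command(script, nnodes, nproc_per_node, job_id, endpoint, script_args=()):
        """Assemble the argument list for ``python -m torch.distributed.run``."""
        return [
            sys.executable,
            "-m",
            "torch.distributed.run",
            f"--nnodes={nnodes}",  # "NUM_NODES" or "MIN_SIZE:MAX_SIZE"
            f"--nproc_per_node={nproc_per_node}",
            f"--rdzv_id={job_id}",
            "--rdzv_backend=c10d",
            f"--rdzv_endpoint={endpoint}",
            script,
            *script_args,
        ]


    # Placeholder values for an elastic job that accepts between 1 and 4 nodes.
    command = build_elastic_command(
        "train.py",
        nnodes="1:4",
        nproc_per_node=8,
        job_id="job-42",
        endpoint="node0.example.com:29400",
    )
    print(" ".join(command))

    # subprocess.run(command, check=True) would start the job on the current node.

Building the command as a list avoids shell quoting issues and keeps the rendezvous settings identical on every node.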

Optimize multi-machine communication
------------------------------------

By default, Lightning will select the ``nccl`` backend over ``gloo`` when running on GPUs.
Find more information about PyTorch's supported backends `here <https://pytorch.org/docs/stable/distributed.html>`__.

You can also set the backend explicitly via the ``process_group_backend`` constructor argument on the relevant Strategy classes. If it is not set, Lightning selects the process group backend based on the hardware used.

.. code-block:: python

    from pytorch_lightning import Trainer
    from pytorch_lightning.strategies import DDPStrategy

    # Explicitly specify the process group backend if you choose to
    ddp = DDPStrategy(process_group_backend="nccl")

    # Configure the strategy on the Trainer
    trainer = Trainer(strategy=ddp, accelerator="gpu", devices=8)