###########################
Launch distributed training
###########################
To run your code distributed across many devices and many machines, you need to do two things:

1. Configure Fabric with the number of devices and number of machines you want to use
2. Launch your code in multiple processes
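
Both steps fit in a short script. Here is a minimal sketch, runnable on CPU, where the linear model and random batch are placeholders for your own code:

.. code-block:: python

    import torch
    from lightning.fabric import Fabric


    def main(fabric):
        # placeholder model and data; swap in your own
        model = torch.nn.Linear(32, 2)
        optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
        model, optimizer = fabric.setup(model, optimizer)

        batch = torch.randn(8, 32, device=fabric.device)
        loss = model(batch).sum()
        fabric.backward(loss)
        optimizer.step()


    if __name__ == "__main__":
        # 1. Configure the number of devices (and, for multi-machine runs, num_nodes)
        fabric = Fabric(accelerator="cpu", devices=2)
        # 2. Launch ``main`` in one process per device
        fabric.launch(main)
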
----

*************
Simple Launch
*************

.. video:: https://pl-public-data.s3.amazonaws.com/assets_lightning/fabric/animations/launch.mp4
    :width: 800
    :autoplay:
    :loop:
    :muted:

You can configure and launch processes on your machine directly with Fabric's :meth:`~lightning.fabric.fabric.Fabric.launch` method:

.. code-block:: python

    # train.py
    ...

    # Configure accelerator, devices, num_nodes, etc.
    fabric = Fabric(devices=4, ...)

    # This launches itself into multiple processes
    fabric.launch()

In the command line, you run this like any other Python script:

.. code-block:: bash

    python train.py

This is the recommended way for running on a single machine and is the most convenient method for development and debugging.

It is also possible to use Fabric in a Jupyter notebook (including Google Colab, Kaggle, etc.) and launch multiple processes there.
You can learn more about it :ref:`here <Fabric in Notebooks>`.
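
For example, inside a notebook there is no script to re-launch, so the code to run goes into a function that gets passed to :meth:`~lightning.fabric.fabric.Fabric.launch`. A minimal sketch (``train`` is just an example name for your own function):

.. code-block:: python

    from lightning.fabric import Fabric


    def train(fabric):
        # Fabric passes itself as the first argument to the launched function
        print(f"Hello from process {fabric.global_rank} of {fabric.world_size}")


    fabric = Fabric(accelerator="cpu", devices=2)
    fabric.launch(train)  # runs ``train`` once per device, in separate processes
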
----

*******************
Launch with the CLI
*******************

.. video:: https://pl-public-data.s3.amazonaws.com/assets_lightning/fabric/animations/launch-cli.mp4
    :width: 800
    :autoplay:
    :loop:
    :muted:

An alternative way to launch your Python script in multiple processes is to use the dedicated command line interface (CLI):

.. code-block:: bash

    lightning run model path/to/your/script.py

This is essentially the same as running ``python path/to/your/script.py``, but it also lets you configure the following settings externally without changing your code:

- ``--accelerator``: The accelerator to use
- ``--devices``: The number of devices to use (per machine)
- ``--num_nodes``: The number of machines (nodes) to use
- ``--precision``: Which type of precision to use
- ``--strategy``: The strategy (communication layer between processes)

.. code-block:: bash

    lightning run model --help

    Usage: lightning run model [OPTIONS] SCRIPT [SCRIPT_ARGS]...

      Run a Lightning Fabric script.

      SCRIPT is the path to the Python script with the code to run. The script
      must contain a Fabric object.

      SCRIPT_ARGS are the remaining arguments that you can pass to the script
      itself and are expected to be parsed there.

    Options:
      --accelerator [cpu|gpu|cuda|mps|tpu]
                                      The hardware accelerator to run on.
      --strategy [ddp|dp|deepspeed]   Strategy for how to run across multiple
                                      devices.
      --devices TEXT                  Number of devices to run on (``int``), which
                                      devices to run on (``list`` or ``str``), or
                                      ``'auto'``. The value applies per node.
      --num-nodes, --num_nodes INTEGER
                                      Number of machines (nodes) for distributed
                                      execution.
      --node-rank, --node_rank INTEGER
                                      The index of the machine (node) this command
                                      gets started on. Must be a number in the
                                      range 0, ..., num_nodes - 1.
      --main-address, --main_address TEXT
                                      The hostname or IP address of the main
                                      machine (usually the one with node_rank =
                                      0).
      --main-port, --main_port INTEGER
                                      The main port to connect to the main
                                      machine.
      --precision [16-mixed|bf16-mixed|32-true|64-true|64|32|16|bf16]
                                      Double precision (``64-true`` or ``64``),
                                      full precision (``32-true`` or ``32``), half
                                      precision (``16-mixed`` or ``16``) or
                                      bfloat16 precision (``bf16-mixed`` or
                                      ``bf16``)
      --help                          Show this message and exit.

Here is how you run DDP with 8 GPUs and `torch.bfloat16 <https://pytorch.org/docs/1.10.0/generated/torch.Tensor.bfloat16.html>`_ precision:

.. code-block:: bash

    lightning run model ./path/to/train.py \
        --strategy=ddp \
        --devices=8 \
        --accelerator=cuda \
        --precision="bf16"

Or `DeepSpeed Zero3 <https://www.deepspeed.ai/2021/03/07/zero3-offload.html>`_ with mixed precision:

.. code-block:: bash

    lightning run model ./path/to/train.py \
        --strategy=deepspeed_stage_3 \
        --devices=8 \
        --accelerator=cuda \
        --precision=16

:class:`~lightning.fabric.fabric.Fabric` can also figure out these settings automatically for you:

.. code-block:: bash

    lightning run model ./path/to/train.py \
        --devices=auto \
        --accelerator=auto \
        --precision=16

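The CLI flags mirror the arguments of the :class:`~lightning.fabric.fabric.Fabric` constructor, so the same configuration can also be written in code. Here is a sketch of the in-code equivalent of the DDP example above (using the ``bf16-mixed`` spelling for bfloat16 mixed precision):

.. code-block:: python

    from lightning.fabric import Fabric

    # in-code equivalent of the CLI flags --strategy, --accelerator, --devices, --precision
    fabric = Fabric(strategy="ddp", accelerator="cuda", devices=8, precision="bf16-mixed")
    fabric.launch()
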
----

.. _Fabric Cluster:

*******************
Launch on a Cluster
*******************

Fabric enables distributed training across multiple machines in several ways.
Choose from the following options based on your expertise level and available infrastructure.
.. raw:: html

    <div class="display-card-container">
        <div class="row">

.. displayitem::
    :header: Lightning Cloud
    :description: The easiest way to scale models in the cloud. No infrastructure setup required.
    :col_css: col-md-4
    :button_link: ../guide/multi_node/cloud.html
    :height: 160
    :tag: basic

.. displayitem::
    :header: SLURM Managed Cluster
    :description: Most popular for academic and private enterprise clusters.
    :col_css: col-md-4
    :button_link: ../guide/multi_node/slurm.html
    :height: 160
    :tag: intermediate

.. displayitem::
    :header: Bare Bones Cluster
    :description: Train across machines on a network using ``torchrun``.
    :col_css: col-md-4
    :button_link: ../guide/multi_node/barebones.html
    :height: 160
    :tag: advanced

.. displayitem::
    :header: Other Cluster Environments
    :description: MPI, LSF, Kubeflow
    :col_css: col-md-4
    :button_link: ../guide/multi_node/other.html
    :height: 160
    :tag: advanced

.. raw:: html

        </div>
    </div>

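For the bare-bones option, the same training script is typically started once per machine with ``torchrun``. A sketch, assuming two machines with 8 GPUs each; the address, port, and script name are placeholders:

.. code-block:: bash

    # on the first machine (node_rank 0); repeat on the second machine with --node_rank=1
    torchrun --nnodes=2 --nproc_per_node=8 --node_rank=0 \
        --master_addr=10.10.10.1 --master_port=29400 train.py
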
----

**********
Next steps
**********

.. raw:: html

    <div class="display-card-container">
        <div class="row">

.. displayitem::
    :header: Mixed Precision Training
    :description: Save memory and speed up training using mixed precision
    :col_css: col-md-4
    :button_link: ../fundamentals/precision.html
    :height: 160
    :tag: basic

.. displayitem::
    :header: Distributed Communication
    :description: Learn all about communication primitives for distributed operation. Gather, reduce, broadcast, etc.
    :col_css: col-md-4
    :button_link: ../advanced/distributed_communication.html
    :height: 160
    :tag: advanced

.. raw:: html

        </div>
    </div>