###########################
Launch distributed training
###########################
To run your code distributed across many devices and many machines, you need to do two things:
1. Configure Fabric with the number of devices and number of machines you want to use
2. Launch your code in multiple processes
----

*************
Simple Launch
*************

.. video:: https://pl-public-data.s3.amazonaws.com/assets_lightning/fabric/animations/launch.mp4
    :width: 800
    :autoplay:
    :loop:
    :muted:
    :nocontrols:

You can configure and launch processes on your machine directly with Fabric's :meth:`~lightning.fabric.fabric.Fabric.launch` method:

.. code-block:: python

    # train.py
    ...

    # Configure accelerator, devices, num_nodes, etc.
    fabric = Fabric(devices=4, ...)

    # This launches itself into multiple processes
    fabric.launch()

In the command line, you run this like any other Python script:

.. code-block:: bash

    python train.py

This is the recommended way for running on a single machine and is the most convenient method for development and debugging.
It is also possible to use Fabric in a Jupyter notebook (including Google Colab, Kaggle, etc.) and launch multiple processes there.
You can learn more about it :ref:`here <Fabric in Notebooks>`.

----

*******************
Launch with the CLI
*******************

.. video:: https://pl-public-data.s3.amazonaws.com/assets_lightning/fabric/animations/launch-cli.mp4
    :width: 800
    :autoplay:
    :loop:
    :muted:
    :nocontrols:

An alternative way to launch your Python script in multiple processes is to use the dedicated command line interface (CLI):

.. code-block:: bash

    fabric run path/to/your/script.py

This is essentially the same as running ``python path/to/your/script.py``, but it also lets you configure the following settings externally without changing your code:

- ``--accelerator``: The accelerator to use
- ``--devices``: The number of devices to use (per machine)
- ``--num_nodes``: The number of machines (nodes) to use
- ``--precision``: Which type of precision to use
- ``--strategy``: The strategy (communication layer between processes)

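
Because ``--devices`` applies per machine, the total number of processes is the product of ``--devices`` and ``--num_nodes``.
The following is a minimal sketch of that arithmetic, not Fabric's own implementation.
The helper names ``world_size`` and ``global_rank`` are hypothetical, and the node-major rank layout is an assumption based on the usual convention for distributed launchers.

.. code-block:: python

    def world_size(devices: int, num_nodes: int) -> int:
        """Total number of processes across all machines."""
        return devices * num_nodes


    def global_rank(node_rank: int, local_rank: int, devices: int) -> int:
        """Global index of a process, assuming node-major ordering (an assumption)."""
        if not 0 <= local_rank < devices:
            raise ValueError("local_rank must be in the range 0, ..., devices - 1")
        return node_rank * devices + local_rank


    # Two machines with 4 devices each
    print(world_size(devices=4, num_nodes=2))  # 8
    print(global_rank(node_rank=1, local_rank=0, devices=4))  # 4
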
.. code-block:: bash

    fabric run --help

    Usage: fabric run [OPTIONS] SCRIPT [SCRIPT_ARGS]...

      Run a Lightning Fabric script.

      SCRIPT is the path to the Python script with the code to run. The script
      must contain a Fabric object.

      SCRIPT_ARGS are the remaining arguments that you can pass to the script
      itself and are expected to be parsed there.

    Options:
      --accelerator [cpu|gpu|cuda|mps|tpu]
                                      The hardware accelerator to run on.
      --strategy [ddp|dp|deepspeed]   Strategy for how to run across multiple
                                      devices.
      --devices TEXT                  Number of devices to run on (``int``), which
                                      devices to run on (``list`` or ``str``), or
                                      ``'auto'``. The value applies per node.
      --num-nodes, --num_nodes INTEGER
                                      Number of machines (nodes) for distributed
                                      execution.
      --node-rank, --node_rank INTEGER
                                      The index of the machine (node) this command
                                      gets started on. Must be a number in the
                                      range 0, ..., num_nodes - 1.
      --main-address, --main_address TEXT
                                      The hostname or IP address of the main
                                      machine (usually the one with node_rank =
                                      0).
      --main-port, --main_port INTEGER
                                      The main port to connect to the main
                                      machine.
      --precision [16-mixed|bf16-mixed|32-true|64-true|64|32|16|bf16]
                                      Double precision (``64-true`` or ``64``),
                                      full precision (``32-true`` or ``32``), half
                                      precision (``16-mixed`` or ``16``) or
                                      bfloat16 precision (``bf16-mixed`` or
                                      ``bf16``)
      --help                          Show this message and exit.

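
The help text notes that ``SCRIPT_ARGS`` are passed on to the script and are expected to be parsed there.
A minimal sketch of such parsing with ``argparse`` from the standard library could look like this.
The ``--lr`` and ``--epochs`` options and the ``parse_script_args`` helper are hypothetical and serve only as an illustration.

.. code-block:: python

    import argparse


    def parse_script_args(argv=None):
        """Parse the script's own options (hypothetical ``--lr`` and ``--epochs``)."""
        parser = argparse.ArgumentParser(description="Hypothetical training script options")
        parser.add_argument("--lr", type=float, default=1e-3)
        parser.add_argument("--epochs", type=int, default=1)
        # With argv=None, argparse reads the arguments from sys.argv
        return parser.parse_args(argv)


    # Explicit list for demonstration; inside a real script, call parse_script_args()
    args = parse_script_args(["--lr", "0.01", "--epochs", "3"])
    print(f"lr={args.lr}, epochs={args.epochs}")

Following the usage line above, such options would be placed after the script path, for example ``fabric run --devices=2 path/to/your/script.py --lr 0.01 --epochs 3``.
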
Here is how you run DDP with 8 GPUs and `torch.bfloat16 <https://pytorch.org/docs/1.10.0/generated/torch.Tensor.bfloat16.html>`_ precision:

.. code-block:: bash

    fabric run ./path/to/train.py \
        --strategy=ddp \
        --devices=8 \
        --accelerator=cuda \
        --precision="bf16"

Or `DeepSpeed Zero3 <https://www.deepspeed.ai/2021/03/07/zero3-offload.html>`_ with mixed precision:

.. code-block:: bash

    fabric run ./path/to/train.py \
        --strategy=deepspeed_stage_3 \
        --devices=8 \
        --accelerator=cuda \
        --precision=16

:class:`~lightning.fabric.fabric.Fabric` can also detect the accelerator and the number of devices automatically:

.. code-block:: bash

    fabric run ./path/to/train.py \
        --devices=auto \
        --accelerator=auto \
        --precision=16

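
If you start runs from another Python program, such as a job submission script, the same flags can be assembled programmatically.
The sketch below is one possible way to do that; ``build_fabric_run_command`` is a hypothetical helper and not part of Fabric.
It places the options before the script path, following the usage line ``fabric run [OPTIONS] SCRIPT [SCRIPT_ARGS]...``.

.. code-block:: python

    import shlex


    def build_fabric_run_command(script, script_args=(), **options):
        """Assemble the argument list for ``fabric run`` (hypothetical helper)."""
        command = ["fabric", "run"]
        # Each keyword becomes a ``--name=value`` flag, e.g. devices=8 -> --devices=8
        for name, value in options.items():
            command.append(f"--{name}={value}")
        command.append(script)
        # Anything after the script path is left for the script itself to parse
        command.extend(script_args)
        return command


    command = build_fabric_run_command(
        "./path/to/train.py",
        strategy="ddp",
        devices=8,
        accelerator="cuda",
        precision="bf16",
    )
    print(shlex.join(command))

The resulting list can be passed to ``subprocess.run(command, check=True)`` on a machine where the ``fabric`` command is installed.
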

----


.. _Fabric Cluster:

*******************
Launch on a Cluster
*******************

Fabric enables distributed training across multiple machines in several ways.
Choose from the following options based on your expertise level and available infrastructure.

.. raw:: html

    <div class="display-card-container">
        <div class="row">

.. displayitem::
    :header: Run single or multi-node on Lightning Studios
    :description: The easiest way to scale models in the cloud. No infrastructure setup required.
    :col_css: col-md-4
    :button_link: ../guide/multi_node/cloud.html
    :height: 160
    :tag: basic

.. displayitem::
    :header: SLURM Managed Cluster
    :description: Most popular for academic and private enterprise clusters.
    :col_css: col-md-4
    :button_link: ../guide/multi_node/slurm.html
    :height: 160
    :tag: intermediate

.. displayitem::
    :header: Bare Bones Cluster
    :description: Train across machines on a network using `torchrun`.
    :col_css: col-md-4
    :button_link: ../guide/multi_node/barebones.html
    :height: 160
    :tag: advanced

.. displayitem::
    :header: Other Cluster Environments
    :description: MPI, LSF, Kubeflow
    :col_css: col-md-4
    :button_link: ../guide/multi_node/other.html
    :height: 160
    :tag: advanced

.. raw:: html

        </div>
    </div>

----

**********
Next steps
**********

.. raw:: html

    <div class="display-card-container">
        <div class="row">

.. displayitem::
    :header: Mixed Precision Training
    :description: Save memory and speed up training using mixed precision
    :col_css: col-md-4
    :button_link: ../fundamentals/precision.html
    :height: 160
    :tag: basic

.. displayitem::
    :header: Distributed Communication
    :description: Learn all about communication primitives for distributed operation. Gather, reduce, broadcast, etc.
    :button_link: ../advanced/distributed_communication.html
    :col_css: col-md-4
    :height: 160
    :tag: advanced

.. raw:: html

        </div>
    </div>