270 lines
8.3 KiB
ReStructuredText
270 lines
8.3 KiB
ReStructuredText
###########################################
|
|
Communication between distributed processes
|
|
###########################################
|
|
|
|
With Fabric, you can easily access information about a process or send data between processes with a standardized API and agnostic to the distributed strategy.
|
|
|
|
|
|
----
|
|
|
|
|
|
*******************
|
|
Rank and world size
|
|
*******************
|
|
|
|
The rank assigned to a process is a zero-based index in the range of *0, ..., world size - 1*, where *world size* is the total number of distributed processes.
|
|
If you are using multi-GPU, think of the rank as the *GPU ID* or *GPU index*, although rank generally extends to distributed processing.
|
|
|
|
The rank is unique across all processes, regardless of how they are distributed across machines, and it is therefore also called **global rank**.
|
|
We can also identify processes by their **local rank**, which is unique among processes running on the same machine but is not unique globally across all machines.
|
|
Finally, each process is associated with a **node rank** in the range *0, ..., num nodes - 1*, which identifies which machine (node) the process is running on.
|
|
|
|
.. figure:: https://pl-public-data.s3.amazonaws.com/assets_lightning/fabric_collectives_ranks.jpeg
|
|
:alt: The different type of process ranks: Local, global, node.
|
|
:width: 100%
|
|
|
|
Here is how you launch multiple processes in Fabric:
|
|
|
|
.. code-block:: python
|
|
|
|
from lightning.fabric import Fabric
|
|
|
|
# Devices and num_nodes determine how many processes there are
|
|
fabric = Fabric(devices=2, num_nodes=3)
|
|
fabric.launch()
|
|
|
|
Learn more about :doc:`launching distributed training <../fundamentals/launch>`.
|
|
And here is how you access all rank and world size information:
|
|
|
|
.. code-block:: python
|
|
|
|
# The total number of processes running across all devices and nodes
|
|
fabric.world_size # 2 * 3 = 6
|
|
|
|
# The global index of the current process across all devices and nodes
|
|
fabric.global_rank # -> {0, 1, 2, 3, 4, 5}
|
|
|
|
# The index of the current process among the processes running on the local node
|
|
fabric.local_rank # -> {0, 1}
|
|
|
|
# The index of the current node
|
|
fabric.node_rank # -> {0, 1, 2}
|
|
|
|
# Do something only on rank 0
|
|
if fabric.global_rank == 0:
|
|
...
|
|
|
|
|
|
.. _race conditions:
|
|
|
|
Avoid race conditions
|
|
=====================
|
|
|
|
Access to the rank information helps you avoid *race conditions* which could crash your script or lead to corrupted data.
|
|
Such conditions can occur when multiple processes try to write to the same file simultaneously, for example, writing a checkpoint file or downloading a dataset.
|
|
Avoid this from happening by guarding your logic with a rank check:
|
|
|
|
.. code-block:: python
|
|
|
|
# Only write files from one process (rank 0) ...
|
|
if fabric.global_rank == 0:
|
|
with open("output.txt", "w") as file:
|
|
file.write(...)
|
|
|
|
# ... or save from all processes but don't write to the same file
|
|
with open(f"output-{fabric.global_rank}.txt", "w") as file:
|
|
file.write(...)
|
|
|
|
# Multi-node: download a dataset, the filesystem between nodes is shared
|
|
if fabric.global_rank == 0:
|
|
download_dataset()
|
|
|
|
# Multi-node: download a dataset, the filesystem between nodes is NOT shared
|
|
if fabric.local_rank == 0:
|
|
download_dataset()
|
|
|
|
----
|
|
|
|
|
|
*******
|
|
Barrier
|
|
*******
|
|
|
|
The barrier forces every process to wait until all processes have reached it.
|
|
In other words, it is a **synchronization**.
|
|
|
|
.. figure:: https://pl-public-data.s3.amazonaws.com/assets_lightning/fabric_collectives_barrier.jpeg
|
|
:alt: The barrier for process synchronization
|
|
:width: 100%
|
|
|
|
A barrier is needed when processes do different amounts of work and as a result fall out of sync.
|
|
|
|
.. code-block:: python
|
|
|
|
fabric = Fabric(accelerator="cpu", devices=4)
|
|
fabric.launch()
|
|
|
|
# Simulate each process taking a different amount of time
|
|
sleep(2 * fabric.global_rank)
|
|
print(f"Process {fabric.global_rank} is done.")
|
|
|
|
# Wait for all processes to reach the barrier
|
|
fabric.barrier()
|
|
print("All processes reached the barrier!")
|
|
|
|
|
|
A more realistic scenario is when downloading data.
|
|
Here, we need to ensure that processes only start to load the data once it has completed downloading.
|
|
Since downloading should be done on rank 0 only to :ref:`avoid race conditions <race conditions>`, we need a barrier:
|
|
|
|
.. code-block:: python
|
|
|
|
if fabric.global_rank == 0:
|
|
print("Downloading dataset. This can take a while ...")
|
|
download_dataset("http://...")
|
|
|
|
# All other processes wait here until rank 0 is done with downloading:
|
|
fabric.barrier()
|
|
|
|
# After everyone reached the barrier, they can access the downloaded files:
|
|
load_dataset()
|
|
|
|
|
|
----
|
|
|
|
.. _broadcast collective:
|
|
|
|
*********
|
|
Broadcast
|
|
*********
|
|
|
|
.. figure:: https://pl-public-data.s3.amazonaws.com/assets_lightning/fabric_collectives_broadcast.jpeg
|
|
:alt: The broadcast collective operation
|
|
:width: 100%
|
|
|
|
The broadcast operation sends a tensor of data from one process to all other processes so that all end up with the same data.
|
|
|
|
.. code-block:: python
|
|
|
|
fabric = Fabric(...)
|
|
|
|
# Transfer a tensor from one process to all the others
|
|
result = fabric.broadcast(tensor)
|
|
|
|
# By default, the source is the process rank 0 ...
|
|
result = fabric.broadcast(tensor, src=0)
|
|
|
|
# ... which can be change to a different rank
|
|
result = fabric.broadcast(tensor, src=3)
|
|
|
|
|
|
Full example:
|
|
|
|
.. code-block:: python
|
|
|
|
fabric = Fabric(devices=4, accelerator="cpu")
|
|
fabric.launch()
|
|
|
|
# Data is different on each process
|
|
learning_rate = torch.rand(1)
|
|
print("Before broadcast:", learning_rate)
|
|
|
|
# Transfer the tensor from one process to all the others
|
|
learning_rate = fabric.broadcast(learning_rate)
|
|
print("After broadcast:", learning_rate)
|
|
|
|
|
|
----
|
|
|
|
|
|
******
|
|
Gather
|
|
******
|
|
|
|
.. figure:: https://pl-public-data.s3.amazonaws.com/assets_lightning/fabric_collectives_all-gather.jpeg
|
|
:alt: The All-gather collective operation
|
|
:width: 100%
|
|
|
|
The gather operation transfers the tensors from each process to every other process and stacks the results.
|
|
As opposed to the :ref:`broadcast <broadcast collective>`, every process gets the data from every other process, not just from a particular rank.
|
|
|
|
.. code-block:: python
|
|
|
|
fabric = Fabric(...)
|
|
|
|
# Gather the data from
|
|
result = fabric.all_gather(tensor)
|
|
|
|
# Tip: Turn off gradient syncing if you don't need to back-propagate through it
|
|
with torch.no_grad():
|
|
result = fabric.all_gather(tensor)
|
|
|
|
# Also works with a (nested) collection of tensors (dict, list, tuple):
|
|
collection = {"loss": torch.tensor(...), "data": ...}
|
|
gathered_collection = fabric.all_gather(collection)
|
|
|
|
|
|
Full example:
|
|
|
|
.. code-block:: python
|
|
|
|
fabric = Fabric(devices=4, accelerator="cpu")
|
|
fabric.launch()
|
|
|
|
# Data is different in each process
|
|
data = torch.tensor(10 * fabric.global_rank)
|
|
|
|
# Every process gathers the tensors from all other processes
|
|
# and stacks the result:
|
|
result = fabric.all_gather(data)
|
|
print("Result of all-gather:", result) # tensor([ 0, 10, 20, 30])
|
|
|
|
|
|
----
|
|
|
|
|
|
******
|
|
Reduce
|
|
******
|
|
|
|
.. figure:: https://pl-public-data.s3.amazonaws.com/assets_lightning/fabric_collectives_all-reduce.jpeg
|
|
:alt: The All-reduce collective operation
|
|
:width: 100%
|
|
|
|
|
|
The reduction is an operation that takes multiple values (tensors) as input and returns a single value.
|
|
An example of a reduction is *summation*, e.g., ``torch.sum()``.
|
|
The :meth:`~lightning.fabric.fabric.Fabric.all_reduce` operation allows you to apply a reduction across multiple processes:
|
|
|
|
.. code-block:: python
|
|
|
|
fabric = Fabric(...)
|
|
|
|
# Compute the mean of a tensor across processes:
|
|
result = fabric.all_reduce(tensor, reduce_op="mean")
|
|
|
|
# Or the sum:
|
|
result = fabric.all_reduce(tensor, reduce_op="sum")
|
|
|
|
# Also works with a (nested) collection of tensors (dict, list, tuple):
|
|
collection = {"loss": torch.tensor(...), "data": ...}
|
|
reduced_collection = fabric.all_reduce(collection)
|
|
|
|
The support of options for ``reduce_op`` depends on the strategy used, but all strategies support *sum* and *mean*.
|
|
|
|
Full example:
|
|
|
|
.. code-block:: python
|
|
|
|
fabric = Fabric(devices=4, accelerator="cpu")
|
|
fabric.launch()
|
|
|
|
# Data is different in each process
|
|
data = torch.tensor(10 * fabric.global_rank)
|
|
|
|
# Sum the tensors from every process
|
|
result = fabric.all_reduce(data, reduce_op="sum")
|
|
|
|
# sum(0 + 10 + 20 + 30) = tensor(60)
|
|
print("Result of all-reduce:", result)
|