Generally, the bigger your model is, the longer it takes to save a checkpoint to disk.
With distributed checkpoints (sometimes called sharded checkpoints), you can save and load the state of your training script across multiple GPUs or nodes more efficiently: each process writes and reads only its own shard of the state, avoiding the memory cost of gathering the full model on a single rank.
----
*****************************
Save a distributed checkpoint
*****************************
The distributed checkpoint format is the default when you train with the :doc:`FSDP strategy <../../advanced/model_parallel/fsdp>`.
.. code-block:: python

    import lightning as L
    from lightning.fabric.strategies import FSDPStrategy

    strategy = FSDPStrategy(state_dict_type="sharded")  # the default
    fabric = L.Fabric(devices=2, strategy=strategy)
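Below is a fuller, self-contained sketch of saving a sharded checkpoint. It uses the ``Transformer`` and ``WikiText2`` demo classes from ``lightning.pytorch.demos``; the model size, checkpoint name, and training loop are illustrative.

.. code-block:: python

    import torch
    import lightning as L
    from lightning.fabric.strategies import FSDPStrategy
    from lightning.pytorch.demos import Transformer, WikiText2

    fabric = L.Fabric(devices=2, strategy=FSDPStrategy(state_dict_type="sharded"))
    fabric.launch()

    # Download the dataset on rank 0 first to avoid a race between processes
    with fabric.rank_zero_first():
        dataset = WikiText2()

    # FSDP shards the model parameters across processes during setup
    model = fabric.setup(Transformer(vocab_size=dataset.vocab_size))
    optimizer = fabric.setup_optimizers(torch.optim.Adam(model.parameters()))

    # ... train ...

    # Saves a checkpoint folder with one shard file per process
    state = {"model": model, "optimizer": optimizer, "iteration": 0}
    fabric.save("my-checkpoint.ckpt", state)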
----

*****************************
Load a distributed checkpoint
*****************************

To load a distributed checkpoint, pass the same ``state`` dictionary you used for saving to ``fabric.load()``.
Note that you can load the distributed checkpoint even if the world size has changed, i.e., you are running on a different number of GPUs than when you saved the checkpoint.
.. collapse:: Full example

    .. code-block:: python

        import torch
        import lightning as L
        from lightning.fabric.strategies import FSDPStrategy
        from lightning.pytorch.demos import Transformer, WikiText2

        fabric = L.Fabric(devices=2, strategy=FSDPStrategy(state_dict_type="sharded"))
        fabric.launch()

        with fabric.rank_zero_first():
            dataset = WikiText2()

        # Recreate the same model and optimizer that were used for saving
        model = fabric.setup(Transformer(vocab_size=dataset.vocab_size))
        optimizer = fabric.setup_optimizers(torch.optim.Adam(model.parameters()))

        state = {"model": model, "optimizer": optimizer, "iteration": 0}
        fabric.print("Loading checkpoint ...")
        fabric.load("my-checkpoint.ckpt", state)
.. important::

    If you want to load a distributed checkpoint into a script that doesn't use FSDP (or Fabric at all), you will have to :ref:`convert it to a single-file checkpoint first <Convert dist-checkpoint>`.

----

.. _Convert dist-checkpoint:

********************************
Convert a distributed checkpoint
********************************

It is possible to convert a distributed checkpoint into a regular, single-file checkpoint.
You will need to do this, for example, if you want to load the checkpoint into a script that doesn't use FSDP, or if you need to export the checkpoint to a different format for deployment, evaluation, etc.
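A sketch of the conversion step, assuming the consolidation utility shipped with recent Lightning releases under ``lightning.fabric.utilities.consolidate_checkpoint`` (check ``--help`` in your installed version for the exact options and output path):

.. code-block:: bash

    # Consolidate the sharded checkpoint folder into a single file
    python -m lightning.fabric.utilities.consolidate_checkpoint path/to/my/checkpoint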
.. note::

    All tensors in the checkpoint will be converted to CPU tensors, and no GPUs are required to run the conversion command.
    The conversion assumes you have enough free CPU memory to hold the entire checkpoint in memory.
.. collapse:: Full example

    Assuming you have saved a checkpoint ``my-checkpoint.ckpt`` using the examples above, run the following command to convert it:
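    Under the same assumption about the utility's module path (the name of the consolidated output file may differ between Lightning versions):

    .. code-block:: bash

        python -m lightning.fabric.utilities.consolidate_checkpoint my-checkpoint.ckpt

    The result is a regular, single-file PyTorch checkpoint that can be loaded with ``torch.load`` in a script that doesn't use FSDP or Fabric.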