Update Lightning Lite docs (3/n) (#16245)
This commit is contained in: parent 0a928e8ead, commit a913db8e88

@@ -169,33 +169,34 @@ Furthermore, you can access the current device from ``fabric.device`` or rely on
----------

*******************
Fabric in Notebooks
*******************

Fabric works exactly the same way in notebooks (Jupyter, Google Colab, Kaggle, etc.) if you only run in a single process or on a single GPU.
If you want to use multiprocessing, for example multi-GPU, you can put your code in a function and pass that function to the
:meth:`~lightning_fabric.fabric.Fabric.launch` method:

.. code-block:: python

    # Notebook Cell
    def train(fabric):
        model = ...
        optimizer = ...
        model, optimizer = fabric.setup(model, optimizer)
        ...


    # Notebook Cell
    fabric = Fabric(accelerator="cuda", devices=2)
    fabric.launch(train)  # Launches the `train` function on two GPUs

As you can see, this function accepts one argument, the ``Fabric`` object, and it gets launched on as many devices as specified.


----------

Distributed Training Pitfalls
=============================

The :class:`~lightning_fabric.fabric.Fabric` provides you with the tools to scale your training, but there are several major challenges ahead of you:

.. list-table::
   :widths: 50 50
   :header-rows: 0

   * - Process divergence
     - This happens when processes execute different sections of the code due to different if/else conditions, race conditions on existing files, and so on, resulting in hanging.
   * - Cross-process reduction
     - Miscalculated metrics or gradients due to errors in their reduction.
   * - Large sharded models
     - Instantiation, materialization, and state management of large models.
   * - Rank 0 only actions
     - Logging, profiling, and so on.
   * - Checkpointing / Early stopping / Callbacks / Logging
     - The ability to easily customize your training behavior and make it stateful.
   * - Fault-tolerant training
     - The ability to resume from a failure as if it never happened.

If you are facing one of these challenges, you are already reaching the limits of :class:`~lightning_fabric.fabric.Fabric`.
We recommend you convert to :doc:`Lightning <../starter/introduction>` so you never have to worry about them.
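The cross-process reduction pitfall listed above is easy to hit in practice. Here is a hypothetical, Lightning-free sketch (the "ranks" are plain Python lists, the numbers purely illustrative) of the classic mistake: averaging per-rank mean losses weights a small batch as heavily as a large one, while reducing sums and counts separately recovers the true global mean.

```python
# Simulated per-rank loss values; the uneven sizes mimic a dataset that
# does not divide evenly across processes (illustrative numbers only).
per_rank_losses = [
    [2.0, 4.0, 6.0],  # "rank 0" saw three samples
    [10.0],           # "rank 1" saw one sample
]

# Pitfall: averaging the per-rank means treats both ranks equally,
# even though rank 0 contributed three times as many samples.
naive_mean = sum(sum(l) / len(l) for l in per_rank_losses) / len(per_rank_losses)

# Correct: reduce the sums and the sample counts, then divide once.
total = sum(sum(l) for l in per_rank_losses)
count = sum(len(l) for l in per_rank_losses)
true_mean = total / count

print(naive_mean)  # 7.0
print(true_mean)   # 5.5
```

In real distributed code the two reductions would each be an all-reduce over the sum and the count; the arithmetic mistake is the same either way.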
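The "rank 0 only actions" pitfall can be sketched the same way. The guard below mirrors the ``global_rank`` and ``is_global_zero`` attributes that ``Fabric`` exposes, but ``FakeFabric`` and ``log_line`` are invented stand-ins so the snippet runs as a plain script, without any distributed launch:

```python
# FakeFabric is a hypothetical stand-in mirroring Fabric's `global_rank`
# and `is_global_zero` attributes, so this runs without Lightning installed.
class FakeFabric:
    def __init__(self, global_rank: int) -> None:
        self.global_rank = global_rank
        self.is_global_zero = global_rank == 0


def log_line(fabric: FakeFabric, message: str, sink: list) -> None:
    # Guard side effects: only the rank-0 process writes, so a multi-GPU
    # run produces one log entry instead of one per process.
    if fabric.is_global_zero:
        sink.append(message)


records: list = []
for rank in range(4):  # simulate a launch across four processes
    log_line(FakeFabric(rank), "epoch 1 done", records)

print(records)  # ['epoch 1 done'], not four copies
```

The same guard prevents duplicated log files and racing writes when several processes share a filesystem.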
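The fault-tolerant training pitfall boils down to one idea: persist the complete training state often enough that resuming after a crash replays nothing and skips nothing. A hypothetical, Lightning-free sketch (all names invented, a counter standing in for real training):

```python
import json
import os
import tempfile


def train(state_path, total_steps=10, fail_at=None):
    # Resume from the checkpoint if one exists, otherwise start fresh.
    if os.path.exists(state_path):
        with open(state_path) as f:
            state = json.load(f)
    else:
        state = {"step": 0, "loss_sum": 0.0}

    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated crash")
        state["loss_sum"] += 1.0 / (state["step"] + 1)  # stand-in for a training step
        state["step"] += 1
        with open(state_path, "w") as f:  # checkpoint after every step
            json.dump(state, f)
    return state


ckpt = os.path.join(tempfile.mkdtemp(), "state.json")
try:
    train(ckpt, fail_at=4)  # crashes partway through
except RuntimeError:
    pass
resumed = train(ckpt)  # picks up at step 4 and finishes
fresh = train(os.path.join(tempfile.mkdtemp(), "state.json"))  # never crashed
print(resumed == fresh)  # True: the failure left no trace
```

Everything the loop touches lives in ``state``, which is why the resumed run is indistinguishable from the uninterrupted one; any mutable value kept outside the checkpoint would break that equivalence.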
----------