Update Lightning Lite docs (3/n) (#16245)
parent 0a928e8ead
commit a913db8e88
@@ -169,33 +169,34 @@ Furthermore, you can access the current device from ``fabric.device`` or rely on
 ----------

-Distributed Training Pitfalls
-=============================
-
-The :class:`~lightning_fabric.fabric.Fabric` provides you with the tools to scale your training, but there are several major challenges ahead of you now:
-
-.. list-table::
-   :widths: 50 50
-   :header-rows: 0
-
-   * - Processes divergence
-     - This happens when processes execute different sections of the code due to different if/else conditions, race conditions on existing files, and so on, resulting in hanging.
-   * - Cross-process reduction
-     - Miscalculated metrics or gradients due to errors in their reduction.
-   * - Large sharded models
-     - Instantiation, materialization, and state management of large models.
-   * - Rank 0 only actions
-     - Logging, profiling, and so on.
-   * - Checkpointing / Early stopping / Callbacks / Logging
-     - Ability to easily customize your training behavior and make it stateful.
-   * - Fault-tolerant training
-     - Ability to resume from a failure as if it never happened.
-
-If you are facing one of these challenges, you are already reaching the limits of :class:`~lightning_fabric.fabric.Fabric`.
-We recommend you convert to :doc:`Lightning <../starter/introduction>` so you never have to worry about them.
+*******************
+Fabric in Notebooks
+*******************
+
+Fabric works exactly the same way in notebooks (Jupyter, Google Colab, Kaggle, etc.) if you only run in a single process or on a single GPU.
+If you want to use multiprocessing, for example multi-GPU, you can put your code in a function and pass that function to the
+:meth:`~lightning_fabric.fabric.Fabric.launch` method:
+
+.. code-block:: python
+
+    # Notebook Cell
+    def train(fabric):
+        model = ...
+        optimizer = ...
+        model, optimizer = fabric.setup(model, optimizer)
+        ...
+
+
+    # Notebook Cell
+    fabric = Fabric(accelerator="cuda", devices=2)
+    fabric.launch(train)  # Launches the `train` function on two GPUs
+
+As you can see, this function accepts one argument, the ``Fabric`` object, and it gets launched on as many devices as specified.

 ----------
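The "pass a function to ``launch``" pattern in the added section can be illustrated with a framework-free toy. The ``ToyLauncher`` class below is invented for illustration only (it is not Fabric's implementation, and it passes an extra ``rank`` argument that the real ``fabric.launch`` does not); it just shows how a launcher can invoke the same user function once per simulated device, handing itself in as the first argument the way ``fabric.launch(train)`` hands in the ``Fabric`` object:

```python
# Toy sketch of the launch pattern, NOT Fabric's actual implementation.
# Hypothetical names: ToyLauncher, and the extra `rank` parameter.
from concurrent.futures import ThreadPoolExecutor


class ToyLauncher:
    def __init__(self, devices: int) -> None:
        self.devices = devices

    def launch(self, fn):
        # Call `fn` once per simulated device, passing the launcher itself,
        # mirroring how `fabric.launch(train)` passes the Fabric object.
        with ThreadPoolExecutor(max_workers=self.devices) as pool:
            return list(pool.map(lambda rank: fn(self, rank), range(self.devices)))


def train(launcher: ToyLauncher, rank: int) -> str:
    return f"training on device {rank} of {launcher.devices}"


launcher = ToyLauncher(devices=2)
print(launcher.launch(train))
```

The real Fabric launcher spawns OS processes (one per device) rather than threads, but the control flow the user sees is the same: define a function, hand it to ``launch``, and receive the launcher object back as its argument.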