Document limitations of multi-GPU in Jupyter notebooks (#18132)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
@ -26,3 +26,62 @@ If you want to use multiprocessing, for example, multi-GPU, you can put your cod
As you can see, this function accepts one argument, the ``Fabric`` object, and it gets launched on as many devices as specified.
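
For reference, a minimal sketch of this pattern might look as follows (the function body is illustrative):

.. code-block:: python

    from lightning.fabric import Fabric


    def train(fabric):
        # Runs once per device; `fabric` exposes the process's rank and world size
        print(f"Hello from process {fabric.global_rank} of {fabric.world_size}")


    fabric = Fabric(accelerator="cuda", devices=2)
    fabric.launch(train)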

----


*********************
Multi-GPU Limitations
*********************

The multi-GPU capabilities in Jupyter are enabled by launching processes using the ``fork`` start method.
It is the only supported way of multiprocessing in notebooks, but it also brings some limitations that you should be aware of.
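
For background, the start method comes from Python's standard :mod:`multiprocessing` module. A rough sketch of the difference (not Lightning-specific code, and note that ``fork`` is only available on POSIX systems):

.. code-block:: python

    import multiprocessing as mp

    # "fork" clones the running interpreter, so child processes inherit the
    # notebook's in-memory state (functions defined in cells, loaded variables, ...)
    fork_ctx = mp.get_context("fork")

    # "spawn" starts a fresh interpreter that re-imports the __main__ module,
    # which does not exist as an importable file in a notebook session
    spawn_ctx = mp.get_context("spawn")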

Avoid initializing CUDA before launch
=====================================

Don't run torch CUDA functions in any notebook cell before calling ``fabric.launch(train)``; otherwise your code may hang or crash.

.. code-block:: python

    import torch
    from lightning.fabric import Fabric

    # BAD: Don't run CUDA-related code before `.launch()`
    # x = torch.tensor(1).cuda()
    # torch.cuda.empty_cache()
    # torch.cuda.is_available()


    def train(fabric):
        # GOOD: Move CUDA calls into the training function
        x = torch.tensor(1).cuda()
        torch.cuda.empty_cache()
        torch.cuda.is_available()
        ...


    fabric = Fabric(accelerator="cuda", devices=2)
    fabric.launch(train)

Move data loading code inside the function
==========================================

If you define or load your data in the main process before calling ``fabric.launch(train)``, you may see a slowdown or crashes (segmentation fault, SIGSEGV, etc.).
The best practice is to move your data loading code inside the training function to avoid these issues:

.. code-block:: python

    # BAD: Don't load data in the main process
    # dataset = MyDataset("data/")
    # dataloader = torch.utils.data.DataLoader(dataset)


    def train(fabric):
        # GOOD: Move data loading code into the training function
        dataset = MyDataset("data/")
        dataloader = torch.utils.data.DataLoader(dataset)
        ...


    fabric = Fabric(accelerator="cuda", devices=2)
    fabric.launch(train)

@ -48,6 +48,7 @@ To use multiple GPUs on notebooks, use the *DDP_NOTEBOOK* mode.

    Trainer(accelerator="gpu", devices=4, strategy="ddp_notebook")

If you want to use other strategies, launch your training from the command line instead.
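
For example, a sketch under assumptions (the file name ``train.py``, the module import, and the ``ddp`` strategy choice are illustrative): put your code in a script and run it from a terminal with ``python train.py``:

.. code-block:: python

    # train.py -- run from a terminal with: python train.py
    import lightning as L

    from my_project import MyLightningModule  # hypothetical import

    if __name__ == "__main__":
        model = MyLightningModule()
        trainer = L.Trainer(accelerator="gpu", devices=4, strategy="ddp")
        trainer.fit(model)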

See also: :doc:`../../common/notebooks`


----

@ -138,6 +138,13 @@ How-to Guides

   :col_css: col-md-4
   :height: 180

.. displayitem::
   :header: Train in a notebook
   :description: Train models in interactive notebooks (Jupyter, Colab, Kaggle, etc.)
   :col_css: col-md-4
   :button_link: ../common/notebooks.html
   :height: 180

.. displayitem::
   :header: Train on single or multiple GPUs
   :description: Train models faster with GPU accelerators

@ -0,0 +1,124 @@
:orphan:

.. _jupyter_notebooks:

##############################################
Interactive Notebooks (Jupyter, Colab, Kaggle)
##############################################

**Audience:** Users looking to train models in interactive notebooks (Jupyter, Colab, Kaggle, etc.).


----


**********************
Lightning in notebooks
**********************

You can use the Lightning Trainer in interactive notebooks just like in a regular Python script, including multi-GPU training!

.. code-block:: python

    import lightning as L

    # Works in Jupyter, Colab and Kaggle!
    trainer = L.Trainer(accelerator="auto", devices="auto")


You can find many notebook examples on our :doc:`tutorials page <../tutorials>` too!


----


.. _jupyter_notebook_example:

************
Full example
************

Paste the following code block into a notebook cell:

.. code-block:: python

    import lightning as L
    from torch import nn, optim, utils
    import torchvision

    encoder = nn.Sequential(nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 3))
    decoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 28 * 28))


    class LitAutoEncoder(L.LightningModule):
        def __init__(self, encoder, decoder):
            super().__init__()
            self.encoder = encoder
            self.decoder = decoder

        def training_step(self, batch, batch_idx):
            x, y = batch
            x = x.view(x.size(0), -1)
            z = self.encoder(x)
            x_hat = self.decoder(z)
            loss = nn.functional.mse_loss(x_hat, x)
            self.log("train_loss", loss)
            return loss

        def configure_optimizers(self):
            return optim.Adam(self.parameters(), lr=1e-3)

        def prepare_data(self):
            torchvision.datasets.MNIST(".", download=True)

        def train_dataloader(self):
            dataset = torchvision.datasets.MNIST(".", transform=torchvision.transforms.ToTensor())
            return utils.data.DataLoader(dataset, batch_size=64)


    autoencoder = LitAutoEncoder(encoder, decoder)
    trainer = L.Trainer(max_epochs=2, devices="auto")
    trainer.fit(model=autoencoder)

----

*********************
Multi-GPU Limitations
*********************

The multi-GPU capabilities in Jupyter are enabled by launching processes using the ``fork`` start method.
It is the only supported way of multiprocessing in notebooks, but it also brings some limitations that you should be aware of.

Avoid initializing CUDA before .fit()
=====================================

Don't run torch CUDA functions in any notebook cell before calling ``trainer.fit()``; otherwise your code may hang or crash.

.. code-block:: python

    # BAD: Don't run CUDA-related code before `.fit()`
    x = torch.tensor(1).cuda()
    torch.cuda.empty_cache()
    torch.cuda.is_available()

    trainer = L.Trainer(accelerator="cuda", devices=2)
    trainer.fit(model)

Move data loading code inside the hooks
=======================================

If you define or load your data in the main process before calling ``trainer.fit()``, you may see a slowdown or crashes (segmentation fault, SIGSEGV, etc.).

.. code-block:: python

    # BAD: Don't load data in the main process
    dataset = MyDataset("data/")
    train_dataloader = torch.utils.data.DataLoader(dataset)

    trainer = L.Trainer(accelerator="cuda", devices=2)
    trainer.fit(model, train_dataloader)

The best practice is to move your data loading code inside the ``*_dataloader()`` hooks in the :class:`~lightning.pytorch.core.module.LightningModule` or :class:`~lightning.pytorch.core.datamodule.LightningDataModule` as shown in the :ref:`example above <jupyter_notebook_example>`.
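
In condensed form, a minimal sketch of that pattern with a ``LightningDataModule`` (``MyDataset`` stands in for your own dataset class):

.. code-block:: python

    import torch
    import lightning as L


    class MyDataModule(L.LightningDataModule):
        def train_dataloader(self):
            # GOOD: The dataset is created inside the hook, after the processes fork
            dataset = MyDataset("data/")
            return torch.utils.data.DataLoader(dataset)


    trainer = L.Trainer(accelerator="cuda", devices=2)
    trainer.fit(model, datamodule=MyDataModule())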
@ -112,6 +112,13 @@ Customize and extend Lightning for things like custom hardware or distributed st

   :button_link: advanced/model_parallel.html
   :height: 100

.. displayitem::
   :header: Train in a notebook
   :description: Train models in interactive notebooks (Jupyter, Colab, Kaggle, etc.)
   :col_css: col-md-12
   :button_link: common/notebooks.html
   :height: 100

.. displayitem::
   :header: Train on single or multiple GPUs
   :description: Train models faster with GPUs.