Some docs update (#3794)
* docs update
* docs update
* suggestions
* Update docs/source/introduction_guide.rst

Co-authored-by: William Falcon <waf2107@columbia.edu>

parent a677833f84, commit 62320632d4
@@ -34,7 +34,7 @@ Move the model architecture and forward pass to your :class:`~pytorch_lightning.

 2. Move the optimizer(s) and schedulers
 =======================================
-Move your optimizers to :func:`pytorch_lightning.core.LightningModule.configure_optimizers` hook. Make sure to use the hook parameters (self in this case).
+Move your optimizers to the :func:`~pytorch_lightning.core.LightningModule.configure_optimizers` hook.

 .. testcode::
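The body of the ``testcode`` block is elided in this hunk; a minimal sketch of what a ``configure_optimizers`` hook typically contains (the ``LitModel`` name, layer, and learning rate are illustrative, not part of the diff):

.. code-block:: python

    import torch
    from torch import nn
    import pytorch_lightning as pl

    class LitModel(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = nn.Linear(28 * 28, 10)

        def configure_optimizers(self):
            # return one or more optimizers (and, optionally, LR schedulers)
            return torch.optim.Adam(self.parameters(), lr=1e-3)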
@@ -46,7 +46,8 @@ Move your optimizers to :func:`pytorch_lightning.core.LightningModule.configure_

 3. Find the train loop "meat"
 =============================
-Lightning automates most of the trining for you, the epoch and batch iterations, all you need to keep is the training step logic. This should go into :func:`pytorch_lightning.core.LightningModule.training_step` hook (make sure to use the hook parameters, self in this case):
+Lightning automates most of the training for you, the epoch and batch iterations, all you need to keep is the training step logic.
+This should go into the :func:`~pytorch_lightning.core.LightningModule.training_step` hook (make sure to use the hook parameters, ``batch`` and ``batch_idx`` in this case):

 .. testcode::
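Again the ``testcode`` body is not shown in the diff; a rough sketch of a ``training_step`` hook, continuing the ``LitModel`` example above (names and sizes are assumptions):

.. code-block:: python

    import torch
    from torch import nn
    from torch.nn import functional as F
    import pytorch_lightning as pl

    class LitModel(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = nn.Linear(28 * 28, 10)

        def training_step(self, batch, batch_idx):
            x, y = batch
            y_hat = self.layer(x.view(x.size(0), -1))
            loss = F.cross_entropy(y_hat, y)
            return loss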
@@ -60,7 +61,8 @@ Lightning automates most of the trining for you, the epoch and batch iterations,

 4. Find the val loop "meat"
 ===========================
-To add an (optional) validation loop add logic to :func:`pytorch_lightning.core.LightningModule.validation_step` hook (make sure to use the hook parameters, self in this case).
+To add an (optional) validation loop add logic to the
+:func:`~pytorch_lightning.core.LightningModule.validation_step` hook (make sure to use the hook parameters, ``batch`` and ``batch_idx`` in this case).

 .. testcode::
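The elided block presumably mirrors the training step; as a fragment added to the ``LitModel`` sketched above:

.. code-block:: python

    def validation_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.layer(x.view(x.size(0), -1))
        val_loss = F.cross_entropy(y_hat, y)
        return val_loss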
@@ -72,11 +74,12 @@ To add an (optional) validation loop add logic to :func:`pytorch_lightning.core.
         val_loss = F.cross_entropy(y_hat, y)
         return val_loss

-.. note:: model.eval() and torch.no_grad() are called automatically for validation
+.. note:: ``model.eval()`` and ``torch.no_grad()`` are called automatically for validation

 5. Find the test loop "meat"
 ============================
-To add an (optional) test loop add logic to :func:`pytorch_lightning.core.LightningModule.test_step` hook (make sure to use the hook parameters, self in this case).
+To add an (optional) test loop add logic to the
+:func:`~pytorch_lightning.core.LightningModule.test_step` hook (make sure to use the hook parameters, ``batch`` and ``batch_idx`` in this case).

 .. testcode::
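The test hook has the same shape as the validation hook; a fragment added to the same ``LitModel`` sketch:

.. code-block:: python

    def test_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.layer(x.view(x.size(0), -1))
        loss = F.cross_entropy(y_hat, y)
        return loss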
@@ -88,7 +91,7 @@ To add an (optional) test loop add logic to :func:`pytorch_lightning.core.Lightn
         loss = F.cross_entropy(y_hat, y)
         return loss

-.. note:: model.eval() and torch.no_grad() are called automatically for testing.
+.. note:: ``model.eval()`` and ``torch.no_grad()`` are called automatically for testing.

 The test loop will not be used until you call.

@@ -96,7 +99,7 @@ The test loop will not be used until you call.

     trainer.test()

-.. note:: .test() loads the best checkpoint automatically
+.. tip:: .test() loads the best checkpoint automatically

 6. Remove any .cuda() or to.device() calls
 ==========================================
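The body under step 6 is not part of this diff; a rough sketch of what removing explicit device calls looks like in practice (the names and shapes are illustrative, and ``self.device`` is the LightningModule's device attribute):

.. code-block:: python

    import torch
    import pytorch_lightning as pl

    class LitModel(pl.LightningModule):
        def training_step(self, batch, batch_idx):
            x, y = batch                      # Lightning already moved the batch to the right device
            # instead of torch.randn(...).cuda(), create new tensors device-agnostically
            noise = torch.randn(x.size(0), 10, device=self.device)
            ...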
@@ -98,8 +98,8 @@ Let's first start with the model. In this case we'll design a 3-layer neural net
             x = F.log_softmax(x, dim=1)
             return x

-Notice this is a :class:`~pytorch_lightning.core.LightningModule` instead of a `torch.nn.Module`. A LightningModule is
-equivalent to a pure PyTorch Module except it has added functionality. However, you can use it EXACTLY the same as you would a PyTorch Module.
+Notice this is a :class:`~pytorch_lightning.core.LightningModule` instead of a ``torch.nn.Module``. A LightningModule is
+equivalent to a pure PyTorch Module except it has added functionality. However, you can use it **EXACTLY** the same as you would a PyTorch Module.

 .. testcode::
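Only the tail of the 3-layer model survives in the hunk context; a sketch of the kind of LightningModule the surrounding text describes (the ``LitMNIST`` name and layer sizes are assumptions):

.. code-block:: python

    import torch
    from torch import nn
    from torch.nn import functional as F
    import pytorch_lightning as pl

    class LitMNIST(pl.LightningModule):
        def __init__(self):
            super().__init__()
            # three fully connected layers (sizes are illustrative)
            self.layer_1 = nn.Linear(28 * 28, 128)
            self.layer_2 = nn.Linear(128, 256)
            self.layer_3 = nn.Linear(256, 10)

        def forward(self, x):
            x = x.view(x.size(0), -1)
            x = torch.relu(self.layer_1(x))
            x = torch.relu(self.layer_2(x))
            x = self.layer_3(x)
            x = F.log_softmax(x, dim=1)
            return x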
@@ -274,8 +274,8 @@ Using DataModules allows easier sharing of full dataset definitions.
     model = LitModel(num_classes=imagenet_dm.num_classes)
     trainer.fit(model, imagenet_dm)

-.. note:: `prepare_data` is called only one 1 GPU in distributed training (automatically)
-.. note:: `setup` is called on every GPU (automatically)
+.. note:: ``prepare_data()`` is called on only one GPU in distributed training (automatically)
+.. note:: ``setup()`` is called on every GPU (automatically)

 Models defined by data
 ^^^^^^^^^^^^^^^^^^^^^^
@@ -292,10 +292,12 @@ When your models need to know about the data, it's best to process the data befo
     trainer.fit(model, dm)

-1. use `prepare_data` to download and process the dataset.
-2. use `setup` to do splits, and build your model internals
+1. use ``prepare_data()`` to download and process the dataset.
+2. use ``setup()`` to do splits, and build your model internals

-An alternative to using a DataModule is to defer initialization of the models modules to the `setup` method of your LightningModule as follows:
+An alternative to using a DataModule is to defer initialization of the models modules to the ``setup`` method of your LightningModule as follows:

 .. testcode::
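The ``testcode`` body is elided; a minimal sketch of deferring layer construction to ``setup`` (the hard-coded ``num_classes`` only keeps the sketch runnable, in practice it would come from the dataset):

.. code-block:: python

    from torch import nn
    import pytorch_lightning as pl

    class LitMNIST(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.classifier = None

        def setup(self, stage=None):
            # called on every process once the data is available,
            # so layer sizes can depend on the dataset
            num_classes = 10
            self.classifier = nn.Linear(28 * 28, num_classes)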
@@ -326,7 +328,7 @@ In PyTorch we do it as follows:
     optimizer = Adam(LitMNIST().parameters(), lr=1e-3)

-In Lightning we do the same but organize it under the configure_optimizers method.
+In Lightning we do the same but organize it under the :func:`~pytorch_lightning.core.LightningModule.configure_optimizers` method.

 .. testcode::
@@ -379,8 +381,8 @@ In the case of MNIST we do the following
         optimizer.step()
         optimizer.zero_grad()

-In Lightning, everything that is in the training step gets organized under the `training_step` function
-in the LightningModule
+In Lightning, everything that is in the training step gets organized under the
+:func:`~pytorch_lightning.core.LightningModule.training_step` function in the LightningModule.

 .. testcode::
@@ -546,7 +548,7 @@ Or multiple nodes

 Refer to the :ref:`distributed computing guide for more details <multi_gpu>`.

-train on TPUs
+Train on TPUs
 ^^^^^^^^^^^^^
 Did you know you can use PyTorch on TPUs? It's very hard to do, but we've
 worked with the xla team to use their awesome library to get this to work
@@ -578,11 +580,11 @@ In distributed training (multiple GPUs and multiple TPU cores) each GPU or TPU c
 of this program. This means that without taking any care you will download the dataset N times which
 will cause all sorts of issues.

-To solve this problem, make sure your download code is in the `prepare_data` method in the DataModule.
+To solve this problem, make sure your download code is in the ``prepare_data`` method in the DataModule.
 In this method we do all the preparation we need to do once (instead of on every gpu).

-`prepare_data` can be called in two ways, once per node or only on the root node
-(`Trainer(prepare_data_per_node=False)`).
+``prepare_data`` can be called in two ways, once per node or only on the root node
+(``Trainer(prepare_data_per_node=False)``).

 .. code-block:: python
@@ -619,7 +621,7 @@ In this method we do all the preparation we need to do once (instead of on every
         def test_dataloader(self):
             return DataLoader(self.test_dataset, batch_size=self.batch_size)

-The `prepare_data` method is also a good place to do any data processing that needs to be done only
+The ``prepare_data`` method is also a good place to do any data processing that needs to be done only
 once (ie: download or tokenize, etc...).

 .. note:: Lightning inserts the correct DistributedSampler for distributed training. No need to add yourself!
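Only the ``test_dataloader`` tail of the DataModule survives in the hunk; a sketch of the full pattern the text describes, with downloads in ``prepare_data`` and splits in ``setup`` (class name, paths, and split sizes are assumptions):

.. code-block:: python

    from torch.utils.data import DataLoader, random_split
    from torchvision import transforms
    from torchvision.datasets import MNIST
    import pytorch_lightning as pl

    class MNISTDataModule(pl.LightningDataModule):
        def __init__(self, data_dir='./', batch_size=32):
            super().__init__()
            self.data_dir = data_dir
            self.batch_size = batch_size

        def prepare_data(self):
            # download only; runs on a single process
            MNIST(self.data_dir, train=True, download=True)
            MNIST(self.data_dir, train=False, download=True)

        def setup(self, stage=None):
            # splits and transforms; runs on every process
            mnist_full = MNIST(self.data_dir, train=True, transform=transforms.ToTensor())
            self.train_dataset, self.val_dataset = random_split(mnist_full, [55000, 5000])
            self.test_dataset = MNIST(self.data_dir, train=False, transform=transforms.ToTensor())

        def train_dataloader(self):
            return DataLoader(self.train_dataset, batch_size=self.batch_size)

        def val_dataloader(self):
            return DataLoader(self.val_dataset, batch_size=self.batch_size)

        def test_dataloader(self):
            return DataLoader(self.test_dataset, batch_size=self.batch_size)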
@@ -657,7 +659,7 @@ Validating
 For most cases, we stop training the model when the performance on a validation
 split of the data reaches a minimum.

-Just like the `training_step`, we can define a `validation_step` to check whatever
+Just like the ``training_step``, we can define a ``validation_step`` to check whatever
 metrics we care about, generate samples or add more to our logs.

 .. code-block:: python
@@ -676,7 +678,7 @@ Now we can train with a validation loop as well.
     trainer = Trainer(tpu_cores=8)
     trainer.fit(model, train_loader, val_loader)

-You may have noticed the words `Validation sanity check` logged. This is because Lightning runs 2 batches
+You may have noticed the words **Validation sanity check** logged. This is because Lightning runs 2 batches
 of validation before starting to train. This is a kind of unit test to make sure that if you have a bug
 in the validation loop, you won't need to potentially wait a full epoch to find out.
@@ -744,7 +746,7 @@ Just like the validation loop, we define a test loop

 However, to make sure the test set isn't used inadvertently, Lightning has a separate API to run tests.
-Once you train your model simply call `.test()`.
+Once you train your model simply call ``.test()``.

 .. code-block:: python
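The code block under this sentence is elided; a short usage sketch of the separate test API (the model and datamodule names refer to the earlier sketches and are assumptions):

.. code-block:: python

    from pytorch_lightning import Trainer

    model = LitMNIST()           # any LightningModule, e.g. the one sketched earlier
    dm = MNISTDataModule()       # the DataModule sketched earlier

    trainer = Trainer(max_epochs=3)
    trainer.fit(model, dm)

    # the test split is only touched when you explicitly ask for it
    trainer.test()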
@@ -794,8 +796,8 @@ and use it for prediction.
     x = torch.randn(1, 1, 28, 28)
     out = model(x)

-On the surface, it looks like `forward` and `training_step` are similar. Generally, we want to make sure that
-what we want the model to do is what happens in the `forward`. whereas the `training_step` likely calls forward from
+On the surface, it looks like ``forward`` and ``training_step`` are similar. Generally, we want to make sure that
+what we want the model to do is what happens in the ``forward``. whereas the ``training_step`` likely calls forward from
 within it.

 .. testcode::
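The ``testcode`` body is elided; a sketch of the split the text describes, where ``forward`` does the prediction-time work and ``training_step`` calls it as part of computing the loss (the autoencoder and its sizes are assumptions):

.. code-block:: python

    import torch
    from torch import nn
    from torch.nn import functional as F
    import pytorch_lightning as pl

    class LitAutoEncoder(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.encoder = nn.Linear(28 * 28, 3)
            self.decoder = nn.Linear(3, 28 * 28)

        def forward(self, x):
            # what the model "does" at prediction time: produce embeddings
            return self.encoder(x)

        def training_step(self, batch, batch_idx):
            x, _ = batch
            x = x.view(x.size(0), -1)
            z = self(x)                  # training_step calls forward internally
            x_hat = self.decoder(z)
            loss = F.mse_loss(x_hat, x)
            return loss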
@@ -879,7 +881,7 @@ Or maybe we have a model that we use to do generation
     z = sample_noise()
     generated_imgs = model(z)

-How you split up what goes in `forward` vs `training_step` depends on how you want to use this model for
+How you split up what goes in ``forward`` vs ``training_step`` depends on how you want to use this model for
 prediction.

 ----------------
@@ -977,7 +979,7 @@ And pass the callbacks into the trainer
     Starting to init trainer!
     Trainer is init now

-.. note::
+.. tip::
     See full list of 12+ hooks in the :ref:`callbacks`.
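The callback definition itself is not in the hunk; a sketch of a callback consistent with the printed lines shown above (the class name and choice of hooks are assumptions):

.. code-block:: python

    from pytorch_lightning import Trainer
    from pytorch_lightning.callbacks import Callback

    class MyPrintingCallback(Callback):
        def on_init_start(self, trainer):
            print('Starting to init trainer!')

        def on_init_end(self, trainer):
            print('Trainer is init now')

    trainer = Trainer(callbacks=[MyPrintingCallback()])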
@@ -1142,4 +1144,4 @@ the data to build your models.

 In Lightning this code is organized inside a :ref:`datamodules`.

-.. note:: DataModules are optional but encouraged, otherwise you can use standard DataModules
+.. tip:: DataModules are optional but encouraged, otherwise you can use standard DataLoaders
@@ -286,7 +286,7 @@ a forward method or trace only the sub-models you need.

 ********************
 Using CPUs/GPUs/TPUs
 ********************
-It's trivial to use CPUs, GPUs or TPUs in Lightning. There's NO NEED to change your code, simply change the :class:`~pytorch_lightning.trainer.Trainer` options.
+It's trivial to use CPUs, GPUs or TPUs in Lightning. There's **NO NEED** to change your code, simply change the :class:`~pytorch_lightning.trainer.Trainer` options.

 .. code-block:: python
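The options themselves are elided from the hunk; a short sketch of the kind of Trainer flags the sentence refers to (the particular counts are illustrative):

.. code-block:: python

    from pytorch_lightning import Trainer

    # CPU
    trainer = Trainer()

    # a single GPU
    trainer = Trainer(gpus=1)

    # 8 GPUs per node, 4 nodes
    trainer = Trainer(gpus=8, num_nodes=4)

    # 8 TPU cores
    trainer = Trainer(tpu_cores=8)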
@@ -377,6 +377,7 @@ If you prefer to do it manually, here's the equivalent

 Data flow
 *********
 Each loop (training, validation, test) has three hooks you can implement:

 - x_step
 - x_step_end
 - x_epoch_end
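As a sketch of the three hooks for the training loop (the validation and test loops follow the same pattern; the model internals are assumptions):

.. code-block:: python

    from torch import nn
    from torch.nn import functional as F
    import pytorch_lightning as pl

    class LitModel(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = nn.Linear(28 * 28, 10)

        def training_step(self, batch, batch_idx):          # x_step
            x, y = batch
            return F.cross_entropy(self.layer(x.view(x.size(0), -1)), y)

        def training_step_end(self, step_output):           # x_step_end
            # with DP/DDP2, combine the partial results coming from each GPU
            return step_output

        def training_epoch_end(self, outputs):              # x_epoch_end
            # `outputs` is a list of whatever training_step returned over the epoch
            pass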
@@ -434,7 +435,7 @@ The lightning equivalent is:
         gpu_1_loss = losses[1]
         return (gpu_0_loss + gpu_1_loss) * 1/2

-The validation and test loops have the same structure.
+.. tip:: The validation and test loops have the same structure.

 -----------------
@@ -467,6 +468,10 @@ you can override the default behavior by manually setting the flags
     def training_step(self, batch, batch_idx):
         self.log('my_loss', loss, on_step=True, on_epoch=True, prog_bar=True, logger=True)

+.. note::
+    The loss value shown in the progress bar is smoothed (averaged) over the last values,
+    so it differs from the actual loss returned in train/validation step.

 You can also use any method of your logger directly:

 .. code-block:: python
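The code block that follows is elided; a sketch of calling the logger backend directly, assuming the default TensorBoard logger (where ``self.logger.experiment`` is the underlying ``SummaryWriter``):

.. code-block:: python

    import pytorch_lightning as pl

    class LitModel(pl.LightningModule):
        def training_step(self, batch, batch_idx):
            ...
            # any SummaryWriter method can be used directly
            self.logger.experiment.add_histogram('input', batch[0], global_step=self.global_step)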
@@ -481,6 +486,10 @@ Once your training starts, you can view the logs by using your favorite logger o

     tensorboard --logdir ./lightning_logs

+.. note::
+    Lightning automatically shows the loss value returned from ``training_step`` in the progress bar.
+    So, no need to explicitly log like this ``self.log('loss', loss, prog_bar=True)``.

 Read more about :ref:`loggers`.

 ----------------
@@ -668,8 +677,9 @@ Or read our :ref:`introduction_guide` to learn more!

 **********
 Community
 **********
-Out community of core maintainers and thousands of expert researchers is active on our Slack and Forum. Drop by to
-hang out, ask Lightning questions or even discuss research!
+Our community of core maintainers and thousands of expert researchers is active on our
+`Slack <https://join.slack.com/t/pytorch-lightning/shared_invite/zt-f6bl2l0l-JYMK3tbAgAmGRrlNr00f1A>`_
+and `Forum <https://forums.pytorchlightning.ai/>`_. Drop by to hang out, ask Lightning questions or even discuss research!

 Masterclass
 ===========
@@ -8,7 +8,7 @@ Here are some best practices to increase your performance.

 Dataloaders
 -----------
-When building your Dataloader set `num_workers` > 0 and `pin_memory=True` (only for GPUs).
+When building your DataLoader set ``num_workers > 0`` and ``pin_memory=True`` (only for GPUs).

 .. code-block:: python
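The code block is elided here; a minimal sketch of such a DataLoader (the dataset and worker count are illustrative):

.. code-block:: python

    from torch.utils.data import DataLoader
    from torchvision import transforms
    from torchvision.datasets import MNIST

    dataset = MNIST('./', train=True, download=True, transform=transforms.ToTensor())

    # num_workers > 0 loads batches in background worker processes;
    # pin_memory=True speeds up host-to-GPU copies
    train_loader = DataLoader(dataset, batch_size=32, num_workers=8, pin_memory=True)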
|
@ -16,23 +16,23 @@ When building your Dataloader set `num_workers` > 0 and `pin_memory=True` (only
|
|||
|
||||
num_workers
|
||||
^^^^^^^^^^^
|
||||
The question of how many `num_workers` is tricky. Here's a summary of
|
||||
The question of how many ``num_workers`` is tricky. Here's a summary of
|
||||
some references, [`1 <https://discuss.pytorch.org/t/guidelines-for-assigning-num-workers-to-dataloader/813>`_], and our suggestions.
|
||||
|
||||
1. num_workers=0 means ONLY the main process will load batches (that can be a bottleneck).
|
||||
2. num_workers=1 means ONLY one worker (just not the main process) will load data but it will still be slow.
|
||||
3. The num_workers depends on the batch size and your machine.
|
||||
4. A general place to start is to set `num_workers` equal to the number of CPUs on that machine.
|
||||
1. ``num_workers=0`` means ONLY the main process will load batches (that can be a bottleneck).
|
||||
2. ``num_workers=1`` means ONLY one worker (just not the main process) will load data but it will still be slow.
|
||||
3. The ``num_workers`` depends on the batch size and your machine.
|
||||
4. A general place to start is to set ``num_workers`` equal to the number of CPUs on that machine.
|
||||
|
||||
.. warning:: Increasing num_workers will ALSO increase your CPU memory consumption.
|
||||
.. warning:: Increasing ``num_workers`` will ALSO increase your CPU memory consumption.
|
||||
|
||||
The best thing to do is to increase the `num_workers` slowly and stop once you see no more improvement in your training speed.
|
||||
The best thing to do is to increase the ``num_workers`` slowly and stop once you see no more improvement in your training speed.
|
||||
|
||||
Spawn
|
||||
^^^^^
|
||||
When using `distributed_backend=ddp_spawn` (the ddp default) or TPU training, the way multiple GPUs/TPU cores are used is by calling `.spawn()` under the hood.
|
||||
The problem is that PyTorch has issues with `num_workers` > 0 when using .spawn(). For this reason we recommend you
|
||||
use `distributed_backend=ddp` so you can increase the `num_workers`, however your script has to be callable like so:
|
||||
When using ``distributed_backend=ddp_spawn`` (the ddp default) or TPU training, the way multiple GPUs/TPU cores are used is by calling ``.spawn()`` under the hood.
|
||||
The problem is that PyTorch has issues with ``num_workers > 0`` when using ``.spawn()``. For this reason we recommend you
|
||||
use ``distributed_backend=ddp`` so you can increase the ``num_workers``, however your script has to be callable like so:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
|
@@ -42,7 +42,7 @@ use `distributed_backend=ddp` so you can increase the `num_workers`, however you

 .item(), .numpy(), .cpu()
 -------------------------
-Don't call .item() anywhere on your code. Use `.detach()` instead to remove the connected graph calls. Lightning
+Don't call ``.item()`` anywhere in your code. Use ``.detach()`` instead to remove the connected graph calls. Lightning
 takes a great deal of care to be optimized for this.

 ----------
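A tiny sketch of the difference (the loss construction is only there to make the snippet runnable):

.. code-block:: python

    import torch

    loss = (torch.randn(8, requires_grad=True) ** 2).mean()

    # keep the value as a tensor, detached from the graph (no host sync needed)
    running_loss = loss.detach()

    # avoid: .item() copies to the CPU and forces a synchronization point
    # running_loss = loss.item()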
@@ -67,7 +67,7 @@ LightningModules know what device they are on! Construct tensors on the device d

 For tensors that need to be model attributes, it is best practice to register them as buffers in the modules's
-`__init__` method:
+``__init__`` method:

 .. code-block:: python
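The code block is elided; a sketch of registering a buffer so it follows the module across devices (the ``sigma`` buffer is illustrative):

.. code-block:: python

    import torch
    from torch import nn
    import pytorch_lightning as pl

    class LitModel(pl.LightningModule):
        def __init__(self):
            super().__init__()
            # buffers are moved to the right device together with the module
            self.register_buffer('sigma', torch.eye(3))

        def forward(self, x):
            return x @ self.sigma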
@@ -87,25 +87,27 @@ DP performs three GPU transfers for EVERY batch:
 2. Copy data to device.
 3. Copy outputs of each device back to master.

 Whereas DDP only performs 1 transfer to sync gradients. Because of this, DDP is MUCH faster than DP.

 ----------

 16-bit precision
 ----------------
-Use 16-bit to decrease the memory (and thus increase your batch size). On certain GPUs (V100s, 2080tis), 16-bit calculations are also faster.
+Use 16-bit to decrease the memory consumption (and thus increase your batch size). On certain GPUs (V100s, 2080tis), 16-bit calculations are also faster.
 However, know that 16-bit and multi-processing (any DDP) can have issues. Here are some common problems.

 1. `CUDA error: an illegal memory access was encountered <https://github.com/pytorch/pytorch/issues/21819>`_.
    The solution is likely setting a specific CUDA, CUDNN, PyTorch version combination.
-2. `CUDA error: device-side assert triggered`. This is a general catch-all error. To see the actual error run your script like so:
+2. ``CUDA error: device-side assert triggered``. This is a general catch-all error. To see the actual error run your script like so:

-.. code-block:: bash
+   .. code-block:: bash

-    # won't see what the error is
-    python main.py
+      # won't see what the error is
+      python main.py

-    # will see what the error is
-    CUDA_LAUNCH_BLOCKING=1 python main.py
+      # will see what the error is
+      CUDA_LAUNCH_BLOCKING=1 python main.py

-We also recommend using 16-bit native found in PyTorch 1.6. Just install this version and Lightning will automatically use it.
+.. tip:: We also recommend using 16-bit native found in PyTorch 1.6. Just install this version and Lightning will automatically use it.
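As a short usage sketch of enabling 16-bit in the Trainer (the GPU count is illustrative):

.. code-block:: python

    from pytorch_lightning import Trainer

    # native AMP (PyTorch 1.6+) is picked up automatically when available
    trainer = Trainer(gpus=1, precision=16)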