Minor doc fixes (#5139)
* minor doc fix
* minor doc fix
* Apply suggestions from code review
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
* suggestions
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
(cherry picked from commit 8d8098c04e)
@@ -58,10 +58,10 @@ This will make your code scale to any arbitrary number of GPUs or TPUs with Ligh

     z = torch.Tensor(2, 3)
     z = z.type_as(x)

-The :class:`~pytorch_lightning.core.lightning.LightningModule` knows what device it is on. You can access the reference via `self.device`.
+The :class:`~pytorch_lightning.core.lightning.LightningModule` knows what device it is on. You can access the reference via ``self.device``.
 Sometimes it is necessary to store tensors as module attributes. However, if they are not parameters they will
 remain on the CPU even if the module gets moved to a new device. To prevent that and remain device agnostic,
-register the tensor as a buffer in your modules's `__init__` method with :meth:`~torch.nn.Module.register_buffer`.
+register the tensor as a buffer in your module's ``__init__`` method with :meth:`~torch.nn.Module.register_buffer`.

 .. testcode::
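For context, a minimal sketch of the device-agnostic pattern this hunk documents; the ``LitModel`` name and layer sizes are illustrative:

.. code-block:: python

    import torch
    from torch import nn
    import pytorch_lightning as pl

    class LitModel(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = nn.Linear(32, 2)
            # Buffers move with the module, so this tensor follows the
            # model to whatever device Lightning places it on.
            self.register_buffer("sigma", torch.eye(3))

        def forward(self, x):
            # Create new tensors with the device/dtype of the input via
            # type_as, instead of hard-coding .cuda() or .to() calls.
            z = torch.Tensor(2, 3)
            z = z.type_as(x)  # demonstration only; z is unused below
            return self.layer(x)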
@@ -75,8 +75,8 @@ register the tensor as a buffer in your modules's `__init__` method with :meth:`

 Remove samplers
 ^^^^^^^^^^^^^^^
-In PyTorch, you must use `torch.nn.DistributedSampler` for multi-node or TPU training. The
-sampler makes sure each GPU sees the appropriate part of your data.
+In PyTorch, you must use :class:`~torch.utils.data.distributed.DistributedSampler`
+for multi-node or TPU training. The sampler makes sure each GPU sees the appropriate part of your data.

 .. testcode::
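For context, this is roughly the boilerplate Lightning removes; a minimal raw-PyTorch sketch, where the toy dataset, batch size, and helper name are illustrative:

.. code-block:: python

    import torch
    from torch.utils.data import DataLoader, TensorDataset
    from torch.utils.data.distributed import DistributedSampler

    def make_distributed_loader(rank: int, world_size: int) -> DataLoader:
        # Toy dataset; in practice this is your real dataset.
        dataset = TensorDataset(torch.randn(1024, 32))
        # Each process reads only the shard of data matching its rank.
        sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank, shuffle=True)
        return DataLoader(dataset, batch_size=32, sampler=sampler)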
@@ -99,7 +99,11 @@ Lightning adds the correct samplers when needed, so no need to explicitly add sa
         dataset = MNIST(...)
         return DataLoader(dataset)

-.. note:: You can disable this behavior with `Trainer(replace_sampler_ddp=False)`
+.. note::
+    By default it will add ``shuffle=True`` for the train sampler and ``shuffle=False`` for the val/test sampler.
+    ``drop_last`` in :class:`~torch.utils.data.distributed.DistributedSampler` will be set to its default value in PyTorch.
+
+.. note:: You can disable this behavior with ``Trainer(replace_sampler_ddp=False)``

 .. note:: For iterable datasets, we don't do this automatically.
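A short sketch of the two behaviours these notes describe, assuming a standard ``Trainer`` setup:

.. code-block:: python

    from pytorch_lightning import Trainer

    # Return a plain DataLoader from your LightningModule/DataModule;
    # Lightning injects the DistributedSampler (shuffle=True for train,
    # shuffle=False for val/test) when running distributed.
    trainer = Trainer(gpus=2, accelerator='ddp')

    # Opt out and manage the samplers yourself:
    trainer = Trainer(gpus=2, accelerator='ddp', replace_sampler_ddp=False)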
@@ -108,7 +112,7 @@ Synchronize validation and test logging
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 When running in distributed mode, we have to ensure that the validation and test step logging calls are synchronized across processes.
-This is done by adding `sync_dist=True` to all `self.log` calls in the validation and test step.
+This is done by adding ``sync_dist=True`` to all ``self.log`` calls in the validation and test step.
 This ensures that each GPU worker has the same behaviour when tracking model checkpoints, which is important for later downstream tasks such as testing the best checkpoint across all workers.

 Note if you use any built-in metrics or custom metrics that use the :ref:`Metrics API <metrics>`, these do not need to be updated and are automatically handled for you.
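A minimal sketch of the logging calls this hunk describes; the model internals are illustrative:

.. code-block:: python

    from torch import nn
    import torch.nn.functional as F
    import pytorch_lightning as pl

    class LitClassifier(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.model = nn.Linear(32, 4)

        def forward(self, x):
            return self.model(x)

        def validation_step(self, batch, batch_idx):
            x, y = batch
            loss = F.cross_entropy(self(x), y)
            # sync_dist=True reduces the value across workers before logging,
            # so checkpoint tracking behaves identically on every process.
            self.log('val_loss', loss, sync_dist=True)

        def test_step(self, batch, batch_idx):
            x, y = batch
            loss = F.cross_entropy(self(x), y)
            self.log('test_loss', loss, sync_dist=True)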
@@ -229,8 +233,8 @@ Note in particular the difference between `gpus=0`, `gpus=[0]` and `gpus="0"`.

 .. note::

-    When specifying number of gpus as an integer `gpus=k`, setting the trainer flag
-    `auto_select_gpus=True` will automatically help you find `k` gpus that are not
+    When specifying the number of gpus as an integer ``gpus=k``, setting the trainer flag
+    ``auto_select_gpus=True`` will automatically help you find ``k`` gpus that are not
     occupied by other processes. This is especially useful when GPUs are configured
     to be in "exclusive mode", such that only one process at a time can access them.
     For more details see the :ref:`Trainer guide <trainer>`.
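For example, the flag combination the note describes:

.. code-block:: python

    from pytorch_lightning import Trainer

    # Ask for any 2 GPUs that are not occupied by other processes,
    # rather than a fixed set of device indices.
    trainer = Trainer(gpus=2, auto_select_gpus=True)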
@@ -258,12 +262,12 @@ Distributed modes
 -----------------
 Lightning allows multiple ways of training

-- Data Parallel (`accelerator='dp'`) (multiple-gpus, 1 machine)
-- DistributedDataParallel (`accelerator='ddp'`) (multiple-gpus across many machines (python script based)).
-- DistributedDataParallel (`accelerator='ddp_spawn'`) (multiple-gpus across many machines (spawn based)).
-- DistributedDataParallel 2 (`accelerator='ddp2'`) (DP in a machine, DDP across machines).
-- Horovod (`accelerator='horovod'`) (multi-machine, multi-gpu, configured at runtime)
-- TPUs (`tpu_cores=8|x`) (tpu or TPU pod)
+- Data Parallel (``accelerator='dp'``) (multiple-gpus, 1 machine)
+- DistributedDataParallel (``accelerator='ddp'``) (multiple-gpus across many machines (python script based)).
+- DistributedDataParallel (``accelerator='ddp_spawn'``) (multiple-gpus across many machines (spawn based)).
+- DistributedDataParallel 2 (``accelerator='ddp2'``) (DP in a machine, DDP across machines).
+- Horovod (``accelerator='horovod'``) (multi-machine, multi-gpu, configured at runtime)
+- TPUs (``tpu_cores=8|x``) (tpu or TPU pod)

 .. note::
     If you request multiple GPUs or nodes without setting a mode, DDP will be automatically used.
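For reference, a sketch of how each listed mode is selected; the node and GPU counts are illustrative:

.. code-block:: python

    from pytorch_lightning import Trainer

    trainer = Trainer(gpus=2, accelerator='dp')                      # 1 machine, multiple GPUs
    trainer = Trainer(gpus=8, num_nodes=4, accelerator='ddp')        # script-based, multi-node
    trainer = Trainer(gpus=8, num_nodes=4, accelerator='ddp_spawn')  # spawn-based, multi-node
    trainer = Trainer(gpus=8, num_nodes=4, accelerator='ddp2')       # DP in a machine, DDP across machines
    trainer = Trainer(accelerator='horovod')                         # configured at runtime via horovodrun
    trainer = Trainer(tpu_cores=8)                                   # single TPU device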
@@ -275,7 +279,7 @@ For a deeper understanding of what Lightning is doing, feel free to read this

 Data Parallel
 ^^^^^^^^^^^^^
-`DataParallel <https://pytorch.org/docs/stable/nn.html#torch.nn.DataParallel>`_ (DP) splits a batch across k GPUs.
+:class:`~torch.nn.DataParallel` (DP) splits a batch across k GPUs.
 That is, if you have a batch of 32 and use DP with 2 gpus, each GPU will process 16 samples,
 after which the root node will aggregate the results.
@@ -289,7 +293,7 @@ after which the root node will aggregate the results.

 Distributed Data Parallel
 ^^^^^^^^^^^^^^^^^^^^^^^^^
-`DistributedDataParallel <https://pytorch.org/docs/stable/nn.html#distributeddataparallel>`_ (DDP) works as follows:
+:class:`~torch.nn.parallel.DistributedDataParallel` (DDP) works as follows:

 1. Each GPU across each node gets its own process.
@@ -576,26 +580,26 @@ not allow 16-bit and DP training. We tried to get this to work, but it's an issu

 Below are the possible configurations we support.

-+-------+---------+----+-----+---------+------------------------------------------------------------+
-| 1 GPU | 1+ GPUs | DP | DDP | 16-bit  | command                                                    |
-+=======+=========+====+=====+=========+============================================================+
-| Y     |         |    |     |         | `Trainer(gpus=1)`                                          |
-+-------+---------+----+-----+---------+------------------------------------------------------------+
-| Y     |         |    |     | Y       | `Trainer(gpus=1, precision=16)`                            |
-+-------+---------+----+-----+---------+------------------------------------------------------------+
-|       | Y       | Y  |     |         | `Trainer(gpus=k, accelerator='dp')`                        |
-+-------+---------+----+-----+---------+------------------------------------------------------------+
-|       | Y       |    | Y   |         | `Trainer(gpus=k, accelerator='ddp')`                       |
-+-------+---------+----+-----+---------+------------------------------------------------------------+
-|       | Y       |    | Y   | Y       | `Trainer(gpus=k, accelerator='ddp', precision=16)`         |
-+-------+---------+----+-----+---------+------------------------------------------------------------+
++-------+---------+----+-----+--------+------------------------------------------------------------+
+| 1 GPU | 1+ GPUs | DP | DDP | 16-bit | command                                                    |
++=======+=========+====+=====+========+============================================================+
+| Y     |         |    |     |        | `Trainer(gpus=1)`                                          |
++-------+---------+----+-----+--------+------------------------------------------------------------+
+| Y     |         |    |     | Y      | `Trainer(gpus=1, precision=16)`                            |
++-------+---------+----+-----+--------+------------------------------------------------------------+
+|       | Y       | Y  |     |        | `Trainer(gpus=k, accelerator='dp')`                        |
++-------+---------+----+-----+--------+------------------------------------------------------------+
+|       | Y       |    | Y   |        | `Trainer(gpus=k, accelerator='ddp')`                       |
++-------+---------+----+-----+--------+------------------------------------------------------------+
+|       | Y       |    | Y   | Y      | `Trainer(gpus=k, accelerator='ddp', precision=16)`         |
++-------+---------+----+-----+--------+------------------------------------------------------------+


 Implement Your Own Distributed (DDP) training
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 If you need your own way to init PyTorch DDP you can override :meth:`pytorch_lightning.plugins.ddp_plugin.DDPPlugin.init_ddp_connection`.

-If you also need to use your own DDP implementation, override: :meth:`pytorch_lightning.plugins.ddp_plugin.DDPPlugin.configure_ddp`.
+If you also need to use your own DDP implementation, override :meth:`pytorch_lightning.plugins.ddp_plugin.DDPPlugin.configure_ddp`.


 ----------
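A rough sketch of what such an override can look like. The exact hook signatures vary across Lightning versions, so this defers to ``super()`` and should be checked against the ``DDPPlugin`` source you run:

.. code-block:: python

    from pytorch_lightning import Trainer
    from pytorch_lightning.plugins.ddp_plugin import DDPPlugin

    class MyDDPPlugin(DDPPlugin):
        def init_ddp_connection(self, *args, **kwargs):
            # Custom process-group setup (e.g. a different init_method
            # or backend) could go here before deferring to the default.
            return super().init_ddp_connection(*args, **kwargs)

        def configure_ddp(self, model, device_ids):
            # Customize how the model is wrapped for DDP; by default
            # Lightning applies its own DistributedDataParallel wrapper.
            return super().configure_ddp(model, device_ids)

    trainer = Trainer(gpus=2, accelerator='ddp', plugins=[MyDDPPlugin()])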
@@ -694,9 +698,7 @@ Reference: https://arxiv.org/abs/1811.06965

 .. note:: DDPSequentialPlugin is currently supported only for PyTorch 1.6.

-To get started, install FairScale through extras using with ``pip install pytorch-lightning["extra"]``
-
-or directly using
+To get started, install FairScale using the command below.

 .. code-block:: bash
@@ -141,9 +141,9 @@ So you can run it like so:

 .. note::
     If you want to stop a training run early, you can press "Ctrl + C" on your keyboard.
-    The trainer will catch the `KeyboardInterrupt` and attempt a graceful shutdown, including
-    running callbacks such as `on_train_end`. The trainer object will also set an attribute
-    `interrupted` to `True` in such cases. If you have a callback which shuts down compute
+    The trainer will catch the ``KeyboardInterrupt`` and attempt a graceful shutdown, including
+    running callbacks such as ``on_train_end``. The trainer object will also set an attribute
+    ``interrupted`` to ``True`` in such cases. If you have a callback which shuts down compute
     resources, for example, you can conditionally run the shutdown logic for only uninterrupted runs.

 ------------
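A minimal sketch of the conditional-shutdown callback the note suggests; ``release_compute_resources`` is a hypothetical helper:

.. code-block:: python

    from pytorch_lightning.callbacks import Callback

    class TeardownCallback(Callback):
        def on_train_end(self, trainer, pl_module):
            if trainer.interrupted:
                # Run was stopped with Ctrl+C; skip the shutdown logic.
                return
            release_compute_resources()  # hypothetical helper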
@@ -220,13 +220,13 @@ accelerator

 The accelerator backend to use (previously known as distributed_backend).

-- (```dp```) is DataParallel (split batch among GPUs of same machine)
-- (```ddp```) is DistributedDataParallel (each gpu on each node trains, and syncs grads)
-- (```ddp_cpu```) is DistributedDataParallel on CPU (same as `ddp`, but does not use GPUs.
+- (``'dp'``) is DataParallel (split batch among GPUs of same machine)
+- (``'ddp'``) is DistributedDataParallel (each gpu on each node trains, and syncs grads)
+- (``'ddp_cpu'``) is DistributedDataParallel on CPU (same as ``'ddp'``, but does not use GPUs.
   Useful for multi-node CPU training or single-node debugging. Note that this will **not** give
   a speedup on a single node, since Torch already makes efficient use of multiple CPUs on a single
   machine.)
-- (```ddp2```) dp on node, ddp across nodes. Useful for things like increasing
+- (``'ddp2'``) dp on node, ddp across nodes. Useful for things like increasing
   the number of negative samples

 .. testcode::
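For instance, a sketch of exercising the distributed code path on CPU, assuming the ``num_processes`` Trainer flag:

.. code-block:: python

    from pytorch_lightning import Trainer

    # DDP semantics without GPUs; useful for multi-node CPU training or
    # single-node debugging (no speedup on one node).
    trainer = Trainer(accelerator='ddp_cpu', num_processes=2)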
@@ -245,7 +245,7 @@ Example::
     # ddp2 = DistributedDataParallel + dp
     trainer = Trainer(gpus=2, num_nodes=2, accelerator='ddp2')

-.. note:: This option does not apply to TPU. TPUs use ```ddp``` by default (over each core)
+.. note:: This option does not apply to TPU. TPUs use ``'ddp'`` by default (over each core)

 You can also modify hardware behavior by subclassing an existing accelerator to adjust for your needs.
@@ -619,7 +619,7 @@ will need to be set up to use remote filepaths.

 distributed_backend
 ^^^^^^^^^^^^^^^^^^^
-This has been renamed "accelerator".
+Deprecated: This has been renamed ``accelerator``.

 fast_dev_run
 ^^^^^^^^^^^^
@@ -818,7 +818,7 @@ Options:
     # log only the min and max memory on the master node
     trainer = Trainer(log_gpu_memory='min_max')

-.. note:: Might slow performance because it uses the output of nvidia-smi.
+.. note:: Might slow performance because it uses the output of ``nvidia-smi``.

 flush_logs_every_n_steps
 ^^^^^^^^^^^^^^^^^^^^^^^^
@@ -1099,7 +1099,9 @@ as you request.

 Your effective batch size is batch_size * total tpu cores.

-.. note:: No need to add a DistributedDataSampler, Lightning automatically does it for you.
+.. note::
+    No need to add a :class:`~torch.utils.data.distributed.DistributedSampler`,
+    Lightning automatically does it for you.

 This parameter can be either 1 or 8.
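A short example of the batch-size arithmetic described above:

.. code-block:: python

    from pytorch_lightning import Trainer

    # tpu_cores can be 1 or 8. With batch_size=32 in your DataLoader,
    # 8 cores give an effective batch size of 32 * 8 = 256 per step.
    trainer = Trainer(tpu_cores=8)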
@@ -734,7 +734,7 @@ class LightningModule(
                 out = self(x)
                 return out

-            def validation_epoch_end(self, val_step_outputs):
+            def validation_step_end(self, val_step_outputs):
                 for out in val_step_outputs:
                     # do something with these
@@ -742,9 +742,7 @@ class LightningModule(
         See the :ref:`multi_gpu` guide for more details.
         """

-    def validation_epoch_end(
-        self, outputs: List[Any]
-    ) -> None:
+    def validation_epoch_end(self, outputs: List[Any]) -> None:
         """
         Called at the end of the validation epoch with the outputs of all validation steps.
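A minimal sketch of the reflowed hook in use, assuming each ``validation_step`` returned a dict with a ``'val_loss'`` entry:

.. code-block:: python

    from typing import Any, List

    import torch
    import pytorch_lightning as pl

    class LitModel(pl.LightningModule):
        def validation_epoch_end(self, outputs: List[Any]) -> None:
            # `outputs` holds the return value of every validation_step;
            # aggregate them into one epoch-level metric.
            epoch_mean = torch.stack([o['val_loss'] for o in outputs]).mean()
            self.log('val_loss_epoch', epoch_mean)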
@@ -911,7 +909,7 @@ class LightningModule(
             out = self.encoder(x)
             return out

-        def test_epoch_end(self, output_results):
+        def test_step_end(self, output_results):
             # this out is now the full size of the batch
             all_test_step_outs = output_results.out
             loss = nce_loss(all_test_step_outs)