From 965cf4e1b1a92c5c67e21f5ded43bf83afadb25f Mon Sep 17 00:00:00 2001 From: Rohit Gupta Date: Mon, 17 Jan 2022 00:57:36 +0530 Subject: [PATCH] Update speed docs (#11044) Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com> Co-authored-by: Aki Nitta --- docs/source/advanced/training_tricks.rst | 3 + docs/source/common/early_stopping.rst | 69 ++++---- docs/source/guides/speed.rst | 204 ++++++++++++----------- 3 files changed, 145 insertions(+), 131 deletions(-) diff --git a/docs/source/advanced/training_tricks.rst b/docs/source/advanced/training_tricks.rst index 2221167a62..150c1acd8a 100644 --- a/docs/source/advanced/training_tricks.rst +++ b/docs/source/advanced/training_tricks.rst @@ -272,6 +272,9 @@ Refer to :doc:`Advanced GPU Optimized Training <../advanced/advanced_gpu>` for m ---------- + +.. _ddp_spawn_shared_memory: + ****************************************** Sharing Datasets Across Process Boundaries ****************************************** diff --git a/docs/source/common/early_stopping.rst b/docs/source/common/early_stopping.rst index 594ab7cdc5..d0c6427e94 100644 --- a/docs/source/common/early_stopping.rst +++ b/docs/source/common/early_stopping.rst @@ -5,9 +5,10 @@ .. _early_stopping: -************** -Early stopping -************** + +############## +Early Stopping +############## .. raw:: html @@ -15,27 +16,28 @@ Early stopping poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/yt_thumbs/thumb_earlystop.png" src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/yt/Trainer+flags+19-+early+stopping_1.mp4"> -| -Stopping an epoch early -======================= -You can stop an epoch early by overriding :meth:`~pytorch_lightning.core.hooks.ModelHooks.on_train_batch_start` to return ``-1`` when some condition is met. 
+*********************** +Stopping an Epoch Early +*********************** -If you do this repeatedly, for every epoch you had originally requested, then this will stop your entire run. +You can stop the current epoch early and skip the rest of it by overriding :meth:`~pytorch_lightning.core.hooks.ModelHooks.on_train_batch_start` to return ``-1`` when some condition is met. ----------- +If you do this repeatedly, for every epoch you had originally requested, then this will stop your entire training. -Early stopping based on metric using the EarlyStopping Callback -=============================================================== -The -:class:`~pytorch_lightning.callbacks.early_stopping.EarlyStopping` -callback can be used to monitor a validation metric and stop the training when no improvement is observed. + +********************** +EarlyStopping Callback +********************** + +The :class:`~pytorch_lightning.callbacks.early_stopping.EarlyStopping` callback can be used to monitor a metric and stop the training when no improvement is observed. To enable it: - Import :class:`~pytorch_lightning.callbacks.early_stopping.EarlyStopping` callback. -- Log the metric you want to monitor using :func:`~pytorch_lightning.core.lightning.LightningModule.log` method. -- Init the callback, and set `monitor` to the logged metric of your choice. +- Log the metric you want to monitor using :meth:`~pytorch_lightning.core.lightning.LightningModule.log` method. +- Init the callback, and set ``monitor`` to the logged metric of your choice. +- Set the ``mode`` (``"min"`` or ``"max"``) based on whether the monitored metric should decrease or increase. - Pass the :class:`~pytorch_lightning.callbacks.early_stopping.EarlyStopping` callback to the :class:`~pytorch_lightning.trainer.trainer.Trainer` callbacks flag. .. 
code-block:: python @@ -43,11 +45,15 @@ To enable it: from pytorch_lightning.callbacks.early_stopping import EarlyStopping - def validation_step(self): - self.log("val_loss", loss) + class LitModel(LightningModule): + def validation_step(self, batch, batch_idx): + loss = ... + self.log("val_loss", loss) - trainer = Trainer(callbacks=[EarlyStopping(monitor="val_loss")]) + model = LitModel() + trainer = Trainer(callbacks=[EarlyStopping(monitor="val_loss", mode="min")]) + trainer.fit(model) You can customize the callbacks behaviour by changing its parameters. @@ -62,8 +68,11 @@ Additional parameters that stop training at extreme points: - ``stopping_threshold``: Stops training immediately once the monitored quantity reaches this threshold. It is useful when we know that going beyond a certain optimal value does not further benefit us. - ``divergence_threshold``: Stops training as soon as the monitored quantity becomes worse than this threshold. - When reaching a value this bad, we believe the model cannot recover anymore and it is better to stop early and run with different initial conditions. -- ``check_finite``: When turned on, we stop training if the monitored metric becomes NaN or infinite. + When reaching a value this bad, we believe the model cannot recover anymore and it is better to stop early and run with different initial conditions. +- ``check_finite``: When turned on, it stops training if the monitored metric becomes NaN or infinite. +- ``check_on_train_epoch_end``: When turned on, it checks the metric at the end of a training epoch. Use this only when you are monitoring a metric logged within + training-specific hooks at the epoch level. 
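The threshold parameters above can be read as a sequence of checks on the monitored value. The following is a minimal, pure-Python sketch of that decision logic for ``mode="min"``. The function name ``should_stop`` and its return format are hypothetical and are not part of Lightning's API; the real callback additionally tracks ``patience`` and ``min_delta``, and its exact comparison operators may differ.

```python
import math
from typing import Optional, Tuple


def should_stop(
    current: float,
    stopping_threshold: Optional[float] = None,
    divergence_threshold: Optional[float] = None,
    check_finite: bool = True,
) -> Tuple[bool, str]:
    """Hypothetical helper mirroring the documented checks for mode="min"."""
    # check_finite: stop if the metric is NaN or infinite
    if check_finite and not math.isfinite(current):
        return True, "metric is not finite"
    # stopping_threshold: the metric is already good enough
    if stopping_threshold is not None and current <= stopping_threshold:
        return True, "stopping threshold reached"
    # divergence_threshold: the metric is worse than we are willing to accept
    if divergence_threshold is not None and current > divergence_threshold:
        return True, "divergence threshold crossed"
    return False, "keep training"


# Illustrative values only: a normal loss, a very good loss, a diverged loss, NaN
for val_loss in (0.8, 0.04, 12.0, float("nan")):
    print(val_loss, should_stop(val_loss, stopping_threshold=0.05, divergence_threshold=10.0))
```

For ``mode="max"`` the two threshold comparisons would be flipped, since a larger value is then the better one.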
+ In case you need early stopping in a different part of training, subclass :class:`~pytorch_lightning.callbacks.early_stopping.EarlyStopping` and change where it is called: @@ -77,21 +86,15 @@ and change where it is called: def on_train_end(self, trainer, pl_module): # instead, do it at the end of training loop - self._run_early_stopping_check(trainer, pl_module) + self._run_early_stopping_check(trainer) .. note:: The :class:`~pytorch_lightning.callbacks.early_stopping.EarlyStopping` callback runs - at the end of every validation epoch, - which, under the default configuration, happen after every training epoch. - However, the frequency of validation can be modified by setting various parameters - in the :class:`~pytorch_lightning.trainer.trainer.Trainer`, + at the end of every validation epoch by default. However, the frequency of validation + can be modified by setting various parameters in the :class:`~pytorch_lightning.trainer.trainer.Trainer`, for example :paramref:`~pytorch_lightning.trainer.trainer.Trainer.check_val_every_n_epoch` and :paramref:`~pytorch_lightning.trainer.trainer.Trainer.val_check_interval`. - It must be noted that the `patience` parameter counts the number of - validation epochs with no improvement, and not the number of training epochs. - Therefore, with parameters `check_val_every_n_epoch=10` and `patience=3`, the trainer + It must be noted that the ``patience`` parameter counts the number of + validation checks with no improvement, and not the number of training epochs. + Therefore, with parameters ``check_val_every_n_epoch=10`` and ``patience=3``, the trainer will perform at least 40 training epochs before being stopped. - -.. 
seealso:: - - :class:`~pytorch_lightning.trainer.trainer.Trainer` - - :class:`~pytorch_lightning.callbacks.early_stopping.EarlyStopping` diff --git a/docs/source/guides/speed.rst b/docs/source/guides/speed.rst index 9b8f9dbe9d..ffc75bd3ee 100644 --- a/docs/source/guides/speed.rst +++ b/docs/source/guides/speed.rst @@ -4,41 +4,26 @@ from pytorch_lightning.callbacks.early_stopping import EarlyStopping from pytorch_lightning.core.lightning import LightningModule -.. _speed: +.. _training-speedup: + ####################### -Speed up model training +Speed up Model Training ####################### -There are multiple ways you can speed up your model's time to convergence: +When resources are limited, it becomes hard to speed up model training and reduce the training time +without affecting the model's performance. There are multiple ways you can speed up your model's time to convergence. -* ``_ -* ``_ - -* ``_ - -* ``_ - -* ``_ - -* ``_ - -* ``_ - -* ``_ - -* ``_ - -**************** -GPU/TPU training -**************** +************************ +Training on Accelerators +************************ **Use when:** Whenever possible! -With Lightning, running on GPUs, TPUs or multiple node is a simple switch of a flag. +With Lightning, running on GPUs, TPUs, or IPUs, or on multiple nodes, is a simple switch of a flag. -GPU training +GPU Training ============ Lightning supports a variety of plugins to further speed up distributed GPU training. Most notably: @@ -67,19 +52,37 @@ Refer to :doc:`Advanced GPU Optimized Training for more details <../advanced/adv Prefer DDP over DP ^^^^^^^^^^^^^^^^^^ -:class:`~pytorch_lightning.strategies.DataParallelStrategy` performs three GPU transfers for EVERY batch: +:class:`~pytorch_lightning.strategies.dp.DataParallelStrategy` performs 3 GPU transfers for EVERY batch: -1. Copy model to device. -2. Copy data to device. -3. Copy outputs of each device back to main device. +1. Copy the model to the device. +2. Copy the data to the device. +3. 
Copy the outputs of each device back to the main device. -Whereas :class:`~pytorch_lightning.strategies.DDPStrategy` only performs 1 transfer to sync gradients, making DDP MUCH faster than DP. +.. image:: https://pl-public-data.s3.amazonaws.com/docs/static/images/distributed_training/dp.gif + :alt: Animation showing DP execution. + :width: 500 + :align: center + +| + +Whereas :class:`~pytorch_lightning.strategies.ddp.DDPStrategy` only performs 2 transfer operations, making DDP much faster than DP: + +1. Moving data to the device. +2. Transfer and sync gradients. + +.. image:: https://pl-public-data.s3.amazonaws.com/docs/static/images/distributed_training/ddp.gif + :alt: Animation showing DDP execution. + :width: 500 + :align: center + +| -When using DDP strategies, set find_unused_parameters=False -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -By default we have set ``find_unused_parameters`` to True for compatibility reasons that have been observed in the past (see the `discussion `_ for more details). -This by default comes with a performance hit, and can be disabled in most cases. +When using DDP Plugins, set find_unused_parameters=False +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +By default, we have set ``find_unused_parameters=True`` for compatibility reasons that have been observed in the past (see the `discussion `_ for more details). +When enabled, it can result in a performance hit, and can be disabled in most cases. Read more about it `here `_. .. tip:: It applies to all DDP strategies that support ``find_unused_parameters`` as input. @@ -102,7 +105,7 @@ This by default comes with a performance hit, and can be disabled in most cases. 
strategy=DDPSpawnStrategy(find_unused_parameters=False), ) -When using DDP on a multi-node cluster, set NCCL parameters +When using DDP on a Multi-node Cluster, set NCCL Parameters ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ `NCCL `__ is the NVIDIA Collective Communications Library which is used under the hood by PyTorch to handle communication across nodes and GPUs. There are reported benefits in terms of speedups when adjusting NCCL parameters as seen in this `issue `__. In the issue we see a 30% speed improvement when training the Transformer XLM-RoBERTa and a 15% improvement in training with Detectron2. @@ -124,22 +127,22 @@ NCCL parameters can be adjusted via environment variables. Dataloaders ^^^^^^^^^^^ -When building your DataLoader set ``num_workers > 0`` and ``pin_memory=True`` (only for GPUs). + +When building your DataLoader set ``num_workers>0`` and ``pin_memory=True`` (only for GPUs). .. code-block:: python Dataloader(dataset, num_workers=8, pin_memory=True) num_workers -""""""""""" +^^^^^^^^^^^ -The question of how many workers to specify in ``num_workers`` is tricky. Here's a summary of -some references, [`1 `_], and our suggestions: +The question of how many workers to specify in ``num_workers`` is tricky. Here's a summary of `some references `_, and our suggestions: 1. ``num_workers=0`` means ONLY the main process will load batches (that can be a bottleneck). 2. ``num_workers=1`` means ONLY one worker (just not the main process) will load data but it will still be slow. -3. The ``num_workers`` depends on the batch size and your machine. -4. A general place to start is to set ``num_workers`` equal to the number of CPU cores on that machine. You can get the number of CPU cores in python using `os.cpu_count()`, but note that depending on your batch size, you may overflow RAM memory. +3. The performance of high ``num_workers`` depends on the batch size and your machine. +4. 
A general place to start is to set ``num_workers`` equal to the number of CPU cores on that machine. You can get the number of CPU cores in Python using ``os.cpu_count()``, but note that depending on your batch size, you may run out of RAM. .. warning:: Increasing ``num_workers`` will ALSO increase your CPU memory consumption. @@ -159,26 +162,40 @@ For debugging purposes or for dataloaders that load very small datasets, it is d warnings.filterwarnings("ignore", category=PossibleUserWarning) Spawn -""""" -When using ``strategy=ddp_spawn`` or training on TPUs, the way multiple GPUs/TPU cores are used is by calling ``.spawn()`` under the hood. -The problem is that PyTorch has issues with ``num_workers > 0`` when using ``.spawn()``. For this reason we recommend you -use ``strategy=ddp`` so you can increase the ``num_workers``, however your script has to be callable like so: +^^^^^ + +When using ``strategy="ddp_spawn"`` or training on TPUs, the way multiple GPUs/TPU cores are used is by calling :obj:`torch.multiprocessing` +``.spawn()`` under the hood. The problem is that PyTorch has issues with ``num_workers>0`` when using ``.spawn()``. For this reason, we recommend you +use ``strategy="ddp"`` so you can increase the ``num_workers``. However, since DDP doesn't work in an interactive environment like IPython/Jupyter notebooks, +your script has to be callable like so: .. code-block:: bash python my_program.py +However, using ``strategy="ddp_spawn"`` lets you reduce memory usage with in-memory datasets and shared-memory tensors. For more info, check out the +:ref:`Sharing Datasets Across Process Boundaries ` section. -TPU training +Persistent Workers +^^^^^^^^^^^^^^^^^^ + +When using ``strategy="ddp_spawn"`` and ``num_workers>0``, consider setting ``persistent_workers=True`` inside your DataLoader, since otherwise the workers are recreated for every epoch, which can result in data-loading bottlenecks and slowdowns. +This is a limitation of Python ``.spawn()`` and PyTorch. 
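The data-loading recommendations above can be collected in one place. The sketch below is a hypothetical helper (``dataloader_kwargs`` is not a Lightning or PyTorch function) that only builds keyword arguments. It assumes the CPU-count starting point for ``num_workers`` suggested earlier, so tune the values for your batch size and available memory.

```python
import os


def dataloader_kwargs(strategy: str = "ddp", on_gpu: bool = True) -> dict:
    """Hypothetical helper: build DataLoader keyword arguments from the tips above."""
    # Starting point: one worker per CPU core (os.cpu_count() can return None)
    num_workers = os.cpu_count() or 1
    return {
        "num_workers": num_workers,
        # pin_memory only helps when batches are copied to a GPU
        "pin_memory": on_gpu,
        # keep workers alive between epochs when processes are spawned
        "persistent_workers": strategy == "ddp_spawn" and num_workers > 0,
    }


print(dataloader_kwargs(strategy="ddp_spawn"))
```

The result can be unpacked into :class:`torch.utils.data.DataLoader`, for example ``DataLoader(dataset, **dataloader_kwargs("ddp_spawn"))``.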
+ + +TPU Training ============ -You can set the ``tpu_cores`` trainer flag to 1 or 8 cores. +You can set the ``tpu_cores`` trainer flag to 1, [7] (specific core) or 8 cores. .. code-block:: python # train on 1 TPU core trainer = Trainer(tpu_cores=1) + # train on 7th TPU core + trainer = Trainer(tpu_cores=[7]) + # train on 8 TPU cores trainer = Trainer(tpu_cores=8) @@ -199,12 +216,26 @@ Read more in our :ref:`accelerators` and :ref:`plugins` guides. ----------- +************** +Early Stopping +************** + +Usually, long training epochs can lead to either overfitting or no major improvements in your metrics due to limited or no convergence. +Here, the :class:`~pytorch_lightning.callbacks.early_stopping.EarlyStopping` callback can help you stop the training entirely by monitoring a metric of your choice. + +You can read more about it :ref:`here `. + +---------- + .. _speed_amp: ********************************* -Mixed precision (16-bit) training +Mixed Precision (16-bit) Training ********************************* +Lower precision, such as 16-bit floating point, enables the training and deployment of large neural networks since it requires less memory, speeds up data transfer operations since it requires +less memory bandwidth, and runs math operations much faster on GPUs that support Tensor Cores. + **Use when:** * You want to optimize for memory usage on a GPU. @@ -220,7 +251,6 @@ Mixed precision (16-bit) training | - Mixed precision combines the use of both 32 and 16 bit floating points to reduce memory footprint during model training, resulting in improved performance, achieving +3X speedups on modern GPUs. Lightning offers mixed precision training for GPUs and CPUs, as well as bfloat16 mixed precision training for TPUs. @@ -233,6 +263,9 @@ Lightning offers mixed precision training for GPUs and CPUs, as well as bfloat16 trainer = Trainer(precision=16, gpus=4) +Read more about :ref:`mixed-precision training `. 
+ + ---------------- @@ -243,7 +276,9 @@ Control Training Epochs **Use when:** You run a hyperparameter search to find good initial parameters and want to save time, cost (money), or power (environment). It can allow you to be more cost efficient and also run more experiments at the same time. -You can use Trainer flags to force training for a minimum number of epochs or limit to a max number of epochs. Use the `min_epochs` and `max_epochs` Trainer flags to set the number of epochs to run. +You can use Trainer flags to force training for a minimum number of epochs or limit it to a max number of epochs. Use the ``min_epochs`` and ``max_epochs`` Trainer flags to set the number of epochs to run. +Setting ``min_epochs=N`` makes sure that the training will run for at least ``N`` epochs. Setting ``max_epochs=N`` will ensure that training won't happen after +``N`` epochs. .. testcode:: @@ -251,7 +286,7 @@ You can use Trainer flags to force training for a minimum number of epochs or li trainer = Trainer(min_epochs=1, max_epochs=1000) -If running iteration based training, i.e. infinite / iterable dataloader, you can also control the number of steps with the `min_steps` and `max_steps` flags: +If running iteration based training, i.e. infinite / iterable dataloader, you can also control the number of steps with the ``min_steps`` and ``max_steps`` flags: .. testcode:: @@ -283,67 +318,41 @@ Check validation every n epochs **Use when:** You have a small dataset, and want to run less validation checks. -You can limit validation check to only run every n epochs using the `check_val_every_n_epoch` Trainer flag. +You can limit validation check to only run every n epochs using the ``check_val_every_n_epoch`` Trainer flag. .. 
testcode:: - # DEFAULT + # default trainer = Trainer(check_val_every_n_epoch=1) + # runs validation after every 7th Epoch + trainer = Trainer(check_val_every_n_epoch=7) -Set validation check frequency within 1 training epoch -====================================================== + +Validation within Training Epoch +================================ **Use when:** You have a large training dataset, and want to run mid-epoch validation checks. -For large datasets, it's often desirable to check validation multiple times within a training loop. -Pass in a float to check that often within 1 training epoch. Pass in an int `k` to check every `k` training batches. -Must use an `int` if using an `IterableDataset`. +For large datasets, it's often desirable to check validation multiple times within a training epoch. +Pass in a float to check that often within 1 training epoch. Pass in an int ``K`` to check every ``K`` training batches. +Must use an ``int`` if using an :class:`~torch.utils.data.IterableDataset`. .. testcode:: - # DEFAULT - trainer = Trainer(val_check_interval=0.95) + # default + trainer = Trainer(val_check_interval=1.0) - # check every .25 of an epoch + # check every 1/4 th of an epoch trainer = Trainer(val_check_interval=0.25) - # check every 100 train batches (ie: for `IterableDatasets` or fixed frequency) + # check every 100 train batches (ie: for IterableDatasets or fixed frequency) trainer = Trainer(val_check_interval=100) Learn more in our :ref:`trainer_flags` guide. ---------------- -****************** -Limit Dataset Size -****************** - -Use data subset for training, validation, and test -================================================== - -**Use when:** Debugging or running huge datasets. - -If you don't want to check 100% of the training/validation/test set set these flags: - -.. 
testcode:: - - # DEFAULT - trainer = Trainer(limit_train_batches=1.0, limit_val_batches=1.0, limit_test_batches=1.0) - - # check 10%, 20%, 30% only, respectively for training, validation and test set - trainer = Trainer(limit_train_batches=0.1, limit_val_batches=0.2, limit_test_batches=0.3) - -If you also pass ``shuffle=True`` to the dataloader, a different random subset of your dataset will be used for each epoch; otherwise the same subset will be used for all epochs. - -.. note:: ``limit_train_batches`` will be overwritten by ``overfit_batches`` if ``overfit_batches > 0`` and will turn off validation. - -.. note:: If you set ``limit_val_batches=0``, validation will be disabled. - -Learn more in our :ref:`trainer_flags` guide. - ------ - ********************* Preload Data Into RAM ********************* @@ -480,19 +489,18 @@ Things to avoid .item(), .numpy(), .cpu() ========================= + Don't call ``.item()`` anywhere in your code. Use ``.detach()`` instead to remove the connected graph calls. Lightning takes a great deal of care to be optimized for this. ----------- +Clear Cache +=========== -empty_cache() -============= -Don't call this unnecessarily! Every time you call this ALL your GPUs have to wait to sync. +Don't call :func:`torch.cuda.empty_cache` unnecessarily! Every time you call this ALL your GPUs have to wait to sync. ----------- +Transferring tensors to device +============================== -Tranfering tensors to device -============================ LightningModules know what device they are on! Construct tensors on the device directly to avoid CPU->Device transfer. .. code-block:: python