.. testsetup:: *

    from pytorch_lightning.trainer.trainer import Trainer

.. _training_tricks:

Training Tricks
================

Lightning implements various tricks to help during training.

----------

Accumulate gradients
--------------------

Gradient accumulation runs K small batches of size N before doing a backward pass.
The effect is a larger effective batch size of K x N, while using only the memory required for a single batch of size N.

.. seealso:: :class:`~pytorch_lightning.trainer.trainer.Trainer`

.. testcode::

    # DEFAULT (ie: no accumulated grads)
    trainer = Trainer(accumulate_grad_batches=1)
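
``accumulate_grad_batches`` also takes values larger than 1, and the Trainer has historically accepted a dictionary mapping epochs to accumulation factors; treat the dictionary form below as an assumption to verify against your version's documentation:

.. code-block:: python

    # accumulate gradients over 4 batches (effective batch size is 4 x N)
    trainer = Trainer(accumulate_grad_batches=4)

    # no accumulation for epochs 0-3, accumulate 4 batches from epoch 4 onwards
    trainer = Trainer(accumulate_grad_batches={0: 1, 4: 4})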

----------

Gradient Clipping
-----------------

Gradient clipping may be enabled to avoid exploding gradients. By default, this will `clip the gradient norm
<https://pytorch.org/docs/stable/nn.html#torch.nn.utils.clip_grad_norm_>`_ computed over all model parameters together.
If the ``gradient_clip_algorithm`` option (``norm`` by default) is set to ``value``, this will
`clip the gradient value <https://pytorch.org/docs/stable/nn.html#torch.nn.utils.clip_grad_value_>`_ for each parameter instead.

.. seealso:: :class:`~pytorch_lightning.trainer.trainer.Trainer`

.. testcode::

    # DEFAULT (ie: don't clip)
    trainer = Trainer(gradient_clip_val=0)

    # clip gradients with norm above 0.5
    trainer = Trainer(gradient_clip_val=0.5)

    # clip gradients with value above 0.5
    # gradient_clip_algorithm types => :class:`~pytorch_lightning.utilities.enums.GradClipAlgorithmType`
    trainer = Trainer(gradient_clip_val=0.5, gradient_clip_algorithm='value')
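
For intuition, the two algorithms roughly correspond to the two PyTorch utilities linked above; this is only a sketch of what is applied for you after the backward pass, not Lightning's internal code:

.. code-block:: python

    import torch

    # 'model' is your LightningModule (an nn.Module)

    # 'norm': rescale all gradients together so their total norm is at most 0.5
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)

    # 'value': clamp each gradient element to the range [-0.5, 0.5]
    torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)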

----------

Stochastic Weight Averaging
---------------------------

Stochastic Weight Averaging (SWA) can make your models generalize better at virtually no additional cost.
This can be used with both non-trained and trained models. The SWA procedure smooths the loss landscape, thus making
it harder to end up in a local minimum during optimization.

For a more detailed explanation of SWA and how it works,
read `this <https://pytorch.org/blog/pytorch-1.6-now-includes-stochastic-weight-averaging>`__ post by the PyTorch team.

.. seealso:: :class:`~pytorch_lightning.callbacks.StochasticWeightAveraging` (Callback)

.. testcode::

    # Enable Stochastic Weight Averaging
    trainer = Trainer(stochastic_weight_avg=True)
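
If you need more control over when averaging starts or which learning rate it uses, the :class:`~pytorch_lightning.callbacks.StochasticWeightAveraging` callback can be passed explicitly; the argument names below (``swa_epoch_start``, ``swa_lrs``) are assumptions based on the callback's signature and should be checked against your installed version:

.. code-block:: python

    from pytorch_lightning.callbacks import StochasticWeightAveraging

    # start averaging at 80% of training and anneal to a fixed SWA learning rate
    swa = StochasticWeightAveraging(swa_epoch_start=0.8, swa_lrs=0.05)
    trainer = Trainer(callbacks=[swa])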

----------

Auto scaling of batch size
--------------------------

Auto scaling of batch size may be enabled to find the largest batch size that fits into
memory. A larger batch size often yields better estimates of gradients, but may also result in
longer training time. Inspired by https://github.com/BlackHC/toma.

.. seealso:: :class:`~pytorch_lightning.trainer.trainer.Trainer`

.. code-block:: python

    # DEFAULT (ie: don't scale batch size automatically)
    trainer = Trainer(auto_scale_batch_size=None)

    # Autoscale batch size ('power' or 'binsearch')
    trainer = Trainer(auto_scale_batch_size='power')

    # find the batch size
    trainer.tune(model)

Currently, this feature supports two modes: `'power'` scaling and `'binsearch'`
scaling. In `'power'` scaling, the batch size starts at 1 and is repeatedly doubled
(1, 2, 4, 8, ...) until an out-of-memory (OOM) error is encountered. Setting the
argument to `'binsearch'` will also start by doubling the batch size until
it encounters an OOM, after which it performs a binary search to fine-tune the
batch size. Additionally, it should be noted that the batch size scaler cannot
search for batch sizes larger than the size of the training dataset.

.. note::

    This feature expects that a `batch_size` field exists either as a model attribute,
    i.e. `model.batch_size`, or as a field in your `hparams`, i.e. `model.hparams.batch_size`.
    The field must exist and will be overridden by the result of this algorithm.
    Additionally, your `train_dataloader()` method should depend on this field
    for this feature to work, i.e.

    .. code-block:: python

        def train_dataloader(self):
            return DataLoader(train_dataset, batch_size=self.batch_size|self.hparams.batch_size)

.. warning::

    Due to these constraints, this feature does *NOT* work when passing dataloaders directly
    to `.fit()`.
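
To make the requirement above concrete, a minimal sketch of a model set up for the batch size finder might look like the following; ``MyDataset`` and ``MyModel`` are hypothetical and only illustrate where ``batch_size`` has to live:

.. code-block:: python

    from pytorch_lightning import LightningModule
    from torch.utils.data import DataLoader

    class MyModel(LightningModule):
        def __init__(self, batch_size=32):
            super().__init__()
            # the batch size scaler reads and overwrites this attribute
            self.batch_size = batch_size

        def train_dataloader(self):
            # the dataloader must be built from the tunable field
            return DataLoader(MyDataset(), batch_size=self.batch_size)

        # training_step, configure_optimizers, etc. omitted for brevity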

The scaling algorithm has a number of parameters that the user can control by
invoking the :meth:`~pytorch_lightning.tuner.tuning.Tuner.scale_batch_size` method:

.. code-block:: python

    from pytorch_lightning.tuner.tuning import Tuner

    # Use default in trainer construction
    trainer = Trainer()
    tuner = Tuner(trainer)

    # Invoke method
    new_batch_size = tuner.scale_batch_size(model, *extra_parameters_here)

    # Override old batch size (this is done automatically)
    model.hparams.batch_size = new_batch_size

    # Fit as normal
    trainer.fit(model)
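
Several of these parameters are referenced in the algorithm description below; the keyword names used here (``mode``, ``steps_per_trial``, ``init_val``, ``max_trials``) are based on the method's signature at the time of writing and should be verified against your version:

.. code-block:: python

    # a possible invocation with the main parameters spelled out
    new_batch_size = tuner.scale_batch_size(
        model,
        mode='power',        # 'power' or 'binsearch'
        steps_per_trial=3,   # number of training steps per trial (default 3)
        init_val=2,          # batch size the search starts from
        max_trials=25,       # maximum number of trials (default 25)
    )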

In short, the algorithm works by (a simplified sketch follows the list):

1. Dumping the current state of the model and trainer
2. Iterating until convergence or the maximum number of tries, `max_trials` (default 25), is reached:

   - Call the `fit()` method of the trainer. This evaluates `steps_per_trial` (default 3)
     training steps. Each training step can trigger an OOM error if the tensors
     (training batch, weights, gradients, etc.) allocated during the steps have
     too large a memory footprint.
   - If an OOM error is encountered, decrease the batch size; otherwise, increase it.
     How much the batch size is increased/decreased is determined by the chosen
     strategy.

3. Saving the found batch size to either `model.batch_size` or `model.hparams.batch_size`
4. Restoring the initial state of the model and trainer
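
The following is a deliberately simplified sketch of that loop for the `'power'` strategy; it is not Lightning's actual implementation, and ``run_trial_steps`` stands in for the short ``fit()`` call described above:

.. code-block:: python

    def find_batch_size(model, run_trial_steps, init_val=2, max_trials=25):
        """Keep doubling the batch size until a short trial run hits an OOM error."""
        new_size = init_val
        last_successful = init_val
        for _ in range(max_trials):
            try:
                model.batch_size = new_size
                run_trial_steps(model)        # a few training steps at the current size
                last_successful = new_size    # trial fit into memory
                new_size *= 2                 # try a larger batch size next
            except RuntimeError:              # assume OOM surfaces as a RuntimeError
                break                         # stop and keep the last working size
        return last_successful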

.. warning:: The batch size finder is not yet supported for DDP; it is coming soon.

----------

Sequential Model Parallelism with Checkpointing
-----------------------------------------------

PyTorch Lightning provides integration for Sequential Model Parallelism using `FairScale <https://github.com/facebookresearch/fairscale>`_.
Sequential Model Parallelism splits a sequential module onto multiple GPUs, reducing peak GPU memory requirements substantially.

For more information, refer to :ref:`sequential-parallelism`.