
.. testsetup:: *

    from pytorch_lightning.trainer.trainer import Trainer


Training Tricks
================
Lightning implements various tricks to help during training.

Accumulate gradients
-------------------------------------
Gradient accumulation runs K small batches of size N before performing a single optimizer step.
The effect is a large effective batch size of K x N.

.. seealso:: :class:`~pytorch_lightning.trainer.trainer.Trainer`

.. testcode::

    # DEFAULT (ie: no accumulated grads)
    trainer = Trainer(accumulate_grad_batches=1)
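
The equivalence to a larger batch can be checked by hand. The following is a minimal,
pure-Python sketch and not Lightning's implementation: it assumes a mean-reduced loss and
averages the K per-batch gradients, which is an assumption made for illustration. The names
`mse_grad`, `data`, `accumulated` and `full_batch` are hypothetical.

.. code-block:: python

    def mse_grad(w, batch):
        """Gradient of mean((w * x - y) ** 2) with respect to the scalar weight w."""
        return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

    data = [(float(i), 3.0 * i + 1.0) for i in range(8)]  # 8 (x, y) samples
    w = 0.5
    K, N = 4, 2  # 4 small batches of size 2

    # Accumulate: average the K per-batch gradients before a single update
    accumulated = 0.0
    for k in range(K):
        small_batch = data[k * N:(k + 1) * N]
        accumulated += mse_grad(w, small_batch) / K

    # Reference: one gradient computed over all K * N samples at once
    full_batch = mse_grad(w, data)

Under these assumptions, `accumulated` agrees with `full_batch` up to floating-point error.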


Gradient Clipping
-------------------------------------
Gradient clipping may be enabled to avoid exploding gradients. Specifically, this will `clip the gradient
norm <https://pytorch.org/docs/stable/nn.html#torch.nn.utils.clip_grad_norm_>`_ computed over all model parameters together.

.. seealso:: :class:`~pytorch_lightning.trainer.trainer.Trainer`

.. testcode::

    # DEFAULT (ie: don't clip)
    trainer = Trainer(gradient_clip_val=0)

    # clip gradients with norm above 0.5
    trainer = Trainer(gradient_clip_val=0.5)
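
To illustrate what clipping the norm "over all model parameters together" means, here is a
simplified pure-Python sketch. It is not the PyTorch implementation (which, among other
details, adds a small epsilon to the norm), and the helper name `clip_by_global_norm` is
hypothetical.

.. code-block:: python

    import math

    def clip_by_global_norm(grads, max_norm):
        """Rescale all gradients so that their joint L2 norm is at most max_norm."""
        # One norm over every gradient entry of every parameter
        total_norm = math.sqrt(sum(g * g for param in grads for g in param))
        if total_norm > max_norm:
            scale = max_norm / total_norm
            grads = [[g * scale for g in param] for param in grads]
        return grads, total_norm

    # Two "parameters" whose gradients have a joint norm of 5.0
    grads = [[3.0, 0.0], [0.0, 4.0]]
    clipped, norm_before = clip_by_global_norm(grads, max_norm=0.5)
    norm_after = math.sqrt(sum(g * g for param in clipped for g in param))

Because a single scale factor is applied to every gradient, the direction of the overall
update is preserved and only its length shrinks.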

Auto scaling of batch size
--------------------------
Auto scaling of batch size may be enabled to find the largest batch size that fits into
memory. A larger batch size often yields a better estimate of the gradient, but may also
result in longer training time.

.. seealso:: :class:`~pytorch_lightning.trainer.trainer.Trainer`

.. code-block:: python

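    # DEFAULT (ie: don't scale batch size automatically)
    trainer = Trainer(auto_scale_batch_size=None)

    # Autoscale batch size by repeated doubling
    trainer = Trainer(auto_scale_batch_size='power')

    # Autoscale batch size by doubling, then refine with a binary search
    trainer = Trainer(auto_scale_batch_size='binsearch')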

Currently, this feature supports two modes: `'power'` scaling and `'binsearch'`
scaling. In `'power'` scaling, the algorithm starts from a batch size of 1 and keeps
doubling the batch size until an out-of-memory (OOM) error is encountered. Setting the
argument to `'binsearch'` continues from there and fine-tunes the batch size by
performing a binary search.
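
The difference between the two modes can be sketched in pure Python. This is an
illustration only, not Lightning's implementation: `fits_in_memory` is a hypothetical
stand-in for "a few training steps ran without an OOM error", and the limit of 100 is
made up.

.. code-block:: python

    def fits_in_memory(batch_size):
        """Hypothetical memory check: pretend anything above 100 triggers an OOM."""
        return batch_size <= 100

    def power_search(fits, max_trials=25):
        """Double the batch size until it no longer fits; return the last size that did."""
        batch_size, largest_ok = 1, None
        for _ in range(max_trials):
            if not fits(batch_size):
                break
            largest_ok = batch_size
            batch_size *= 2
        return largest_ok

    def binsearch(fits, max_trials=25):
        """Run the doubling phase, then binary-search between the last success and its double."""
        low = power_search(fits, max_trials)
        if low is None:
            return None
        high = low * 2  # treated as failing (assumed if the doubling phase hit max_trials)
        while high - low > 1:
            mid = (low + high) // 2
            if fits(mid):
                low = mid
            else:
                high = mid
        return low

    power_result = power_search(fits_in_memory)
    binsearch_result = binsearch(fits_in_memory)

In this sketch the doubling phase can only land on powers of two, while the binary search
narrows the result down to the largest size that still passes the check.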

.. note:: 

    This feature expects a `batch_size` field to exist in the `hparams` of your model, i.e.,
    `model.hparams.batch_size`. The field will be overridden by the result of this
    algorithm. Additionally, your `train_dataloader()` method should depend on this field
    for this feature to work, i.e.

    .. code-block:: python
        
        def train_dataloader(self):
            return DataLoader(train_dataset, batch_size=self.hparams.batch_size)

.. warning::
            
    Due to these constraints, this feature does *NOT* work when passing dataloaders directly
    to `.fit()`.

The scaling algorithm has a number of parameters that users can control by
invoking the trainer method `.scale_batch_size` themselves (see description below).

.. code-block:: python

    # Use default in trainer construction
    trainer = Trainer()

    # Invoke method
    new_batch_size = trainer.scale_batch_size(model, ...)

    # Override old batch size
    model.hparams.batch_size = new_batch_size
    
    # Fit as normal
    trainer.fit(model)

The algorithm in short works by:
    1. Dump the current state of the model and trainer.
    2. Iterate until convergence or until the maximum number of tries `max_trials` (default 25) has been reached:
        - Call the `fit()` method of the trainer. This evaluates `steps_per_trial` (default 3)
          training steps. Each training step can trigger an OOM error if the tensors
          (training batch, weights, gradients, etc.) allocated during the steps have
          too large a memory footprint.
        - If an OOM error is encountered, decrease the batch size; otherwise increase it.
          How much the batch size is increased or decreased is determined by the chosen
          strategy.
    3. Save the found batch size to `model.hparams.batch_size`.
    4. Restore the initial state of the model and trainer.
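
The steps above can be sketched with plain dictionaries standing in for the model and
trainer. Everything here is hypothetical and simplified: `SimulatedOOMError`, `run_trial`
and `find_batch_size` are illustrative names, the memory limit is made up, and the sketch
simply stops at the first OOM instead of decreasing the batch size.

.. code-block:: python

    class SimulatedOOMError(RuntimeError):
        """Stands in for the out-of-memory error a real training step would raise."""

    def run_trial(trainer_state, batch_size, steps_per_trial=3, memory_limit=48):
        """Pretend to run a few training steps; fail if the batch is too large."""
        for _ in range(steps_per_trial):
            if batch_size > memory_limit:
                raise SimulatedOOMError(f"batch size {batch_size} does not fit")
            trainer_state["global_step"] += 1

    def find_batch_size(hparams, trainer_state, max_trials=25):
        saved_state = dict(trainer_state)       # 1. dump the current state
        batch_size, largest_ok = hparams["batch_size"], None
        for _ in range(max_trials):             # 2. try until max_trials is reached
            try:
                run_trial(trainer_state, batch_size)
            except SimulatedOOMError:
                break                           # simplification: stop at the first OOM
            largest_ok = batch_size
            batch_size *= 2                     # no OOM, so increase the batch size
        if largest_ok is not None:
            hparams["batch_size"] = largest_ok  # 3. save the found batch size
        trainer_state.clear()
        trainer_state.update(saved_state)       # 4. restore the initial state
        return largest_ok

    hparams = {"batch_size": 1}
    trainer_state = {"global_step": 0}
    found = find_batch_size(hparams, trainer_state)

Restoring the saved state matters because the trial steps advance counters such as
`global_step` in this sketch, and the subsequent real training run should start from the
original values.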

.. autoclass:: pytorch_lightning.trainer.training_tricks.TrainerTrainingTricksMixin
   :members: scale_batch_size
   :noindex:

.. warning:: Batch size finder is not supported for DDP yet; support is coming soon.