.. _performance:

Fast performance tips
=====================
Lightning builds in all the micro-optimizations we can find to increase your performance.
But we can only automate so much.
Here are some additional things you can do to increase your performance.

----------

Dataloaders
-----------

When building your ``DataLoader``, set ``num_workers > 0`` and ``pin_memory=True`` (only for GPUs).

.. code-block:: python

    DataLoader(dataset, num_workers=8, pin_memory=True)

num_workers
^^^^^^^^^^^

The question of how many workers to use in ``num_workers`` is tricky. Here's a summary of
some references, [`1 <https://discuss.pytorch.org/t/guidelines-for-assigning-num-workers-to-dataloader/813>`_], and our suggestions:

1. ``num_workers=0`` means ONLY the main process will load batches (that can be a bottleneck).
2. ``num_workers=1`` means ONLY one worker (just not the main process) will load data, but it will still be slow.
3. The best ``num_workers`` value depends on the batch size and your machine.
4. A general place to start is to set ``num_workers`` equal to the number of CPU cores on that machine.

.. warning:: Increasing ``num_workers`` will ALSO increase your CPU memory consumption.

The best thing to do is to increase ``num_workers`` slowly and stop once you see no more improvement in your training speed.
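
As a rough starting point, ``num_workers`` can be derived from the machine's CPU count and tuned from there. Below is a minimal sketch; ``dataset`` stands in for your own ``Dataset`` and the batch size is arbitrary.

.. code-block:: python

    import os

    from torch.utils.data import DataLoader

    # start with one worker per CPU core, then adjust while watching training speed
    num_workers = os.cpu_count()

    train_loader = DataLoader(
        dataset,                  # your own Dataset instance
        batch_size=64,            # arbitrary; tune for your model and GPU memory
        num_workers=num_workers,
        pin_memory=True,          # only useful when training on GPUs
    )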

Spawn
^^^^^

When using ``accelerator=ddp_spawn`` (the ddp default) or TPU training, multiple GPUs/TPU cores are used by calling ``.spawn()`` under the hood.
The problem is that PyTorch has issues with ``num_workers > 0`` when using ``.spawn()``. For this reason, we recommend you
use ``accelerator=ddp`` so you can increase ``num_workers``; however, your script has to be callable like so:

.. code-block:: bash

    python my_program.py --gpus X
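
For example, the script launched above might look roughly like this. This is only a sketch: ``MyModel`` is a placeholder for your own LightningModule.

.. code-block:: python

    # my_program.py -- a script that is safe to launch with accelerator=ddp
    from argparse import ArgumentParser

    import pytorch_lightning as pl


    def main(args):
        model = MyModel()  # placeholder for your own LightningModule
        trainer = pl.Trainer(gpus=args.gpus, accelerator="ddp")
        trainer.fit(model)


    if __name__ == "__main__":
        parser = ArgumentParser()
        parser.add_argument("--gpus", type=int, default=1)
        main(parser.parse_args())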

----------

.item(), .numpy(), .cpu()
-------------------------

Don't call ``.item()``, ``.numpy()``, or ``.cpu()`` anywhere in your code; each one forces a GPU-to-CPU transfer and a synchronization point. Use ``.detach()`` instead to remove the connected graph calls. Lightning
takes a great deal of care to be optimized for this.
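
For example, when accumulating the loss for logging (a sketch inside a ``training_step``; ``losses`` is just an illustrative list):

.. code-block:: python

    # bad: .item() forces a GPU -> CPU transfer and a synchronization point
    losses.append(loss.item())

    # good: keeps the value on the device, detached from the graph
    losses.append(loss.detach())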

----------

empty_cache()
-------------

Don't call this unnecessarily! Every time you call this, ALL your GPUs have to wait to sync.
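
In other words, avoid sprinkling calls like the following into your training loop; reserve them for the rare case where you truly need to release cached memory:

.. code-block:: python

    import torch

    # bad (when called routinely): forces every GPU to wait and sync
    torch.cuda.empty_cache()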

----------

Construct tensors directly on the device
----------------------------------------

LightningModules know what device they are on! Construct tensors on the device directly to avoid CPU->Device transfer.

.. code-block:: python

    # bad
    t = torch.rand(2, 2).cuda()

    # good (self is a LightningModule)
    t = torch.rand(2, 2, device=self.device)

For tensors that need to be model attributes, it is best practice to register them as buffers in the module's
``__init__`` method:

.. code-block:: python

    # bad
    self.t = torch.rand(2, 2, device=self.device)

    # good
    self.register_buffer("t", torch.rand(2, 2))
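
Registered buffers move together with the module and are saved in its ``state_dict``, so they always end up on the correct device. A small illustration, assuming ``MyLightningModule`` registers the buffer ``t`` as shown above:

.. code-block:: python

    model = MyLightningModule()
    model.to("cuda:0")
    print(model.t.device)  # cuda:0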

----------

Use DDP not DP
--------------

DP performs three GPU transfers for EVERY batch:

1. Copy the model to the device.
2. Copy the data to the device.
3. Copy the outputs of each device back to the master.

|

Whereas DDP only performs one transfer per batch, to sync gradients. Because of this, DDP is MUCH faster than DP.
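
Switching between the two is a single ``Trainer`` argument (a minimal sketch, using 4 GPUs as an example):

.. code-block:: python

    import pytorch_lightning as pl

    # bad: DP re-copies the model and scatters/gathers data every batch
    trainer = pl.Trainer(gpus=4, accelerator="dp")

    # good: DDP only syncs gradients
    trainer = pl.Trainer(gpus=4, accelerator="ddp")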

----------

16-bit precision
----------------

Use 16-bit precision to decrease memory consumption (and thus increase your batch size). On certain GPUs (V100s, 2080 Tis), 16-bit calculations are also faster.
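
Enabling it is a single ``Trainer`` flag (a minimal sketch):

.. code-block:: python

    import pytorch_lightning as pl

    # train with 16-bit (mixed) precision
    trainer = pl.Trainer(gpus=1, precision=16)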

However, know that 16-bit and multi-processing (any DDP) can have issues. Here are some common problems:

1. `CUDA error: an illegal memory access was encountered <https://github.com/pytorch/pytorch/issues/21819>`_.
   The solution is likely setting a specific combination of CUDA, cuDNN, and PyTorch versions.
2. ``CUDA error: device-side assert triggered``. This is a general catch-all error. To see the actual error, run your script like so:

.. code-block:: bash

    # won't see what the error is
    python main.py

    # will see what the error is
    CUDA_LAUNCH_BLOCKING=1 python main.py

.. tip:: We also recommend using the native 16-bit support found in PyTorch 1.6+. Just install this version and Lightning will use it automatically.

----------

Use Sharded DDP for GPU memory and scaling optimization
--------------------------------------------------------

Sharded DDP is a Lightning integration of `DeepSpeed ZeRO <https://arxiv.org/abs/1910.02054>`_ and
`ZeRO-2 <https://www.microsoft.com/en-us/research/blog/zero-2-deepspeed-shattering-barriers-of-deep-learning-speed-scale/>`_,
provided by `FairScale <https://github.com/facebookresearch/fairscale>`_.

When training on multiple GPUs, sharded DDP can substantially increase memory efficiency, and in some cases multi-node performance can be better than with traditional DDP.
This is due to efficient communication and parallelization under the hood.

To use Optimizer Sharded Training, refer to :ref:`model-parallelism`.

Sharded DDP can work across all DDP variants by adding the additional ``--plugins ddp_sharded`` flag.

Refer to the :ref:`distributed computing guide for more details <multi_gpu>`.
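
For example, enabling sharded training from the ``Trainer`` might look roughly like this (a sketch only; it assumes the string plugin shorthand described above is available in your Lightning version):

.. code-block:: python

    import pytorch_lightning as pl

    # sketch: sharded DDP layered on top of the regular ddp accelerator
    trainer = pl.Trainer(gpus=4, accelerator="ddp", plugins="ddp_sharded")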

----------

Sequential Model Parallelism with Checkpointing
-----------------------------------------------

PyTorch Lightning integrates Sequential Model Parallelism provided by `FairScale <https://github.com/facebookresearch/fairscale>`_.
Sequential Model Parallelism splits a sequential module onto multiple GPUs, reducing peak GPU memory requirements substantially.

For more information, refer to :ref:`sequential-parallelism`.