.. _performance:

Fast performance tips
=====================
Lightning builds in all the micro-optimizations we can find to increase your performance.
But we can only automate so much.
Here are some additional things you can do to increase your performance.

----------

Dataloaders
-----------

When building your ``DataLoader``, set ``num_workers > 0`` and ``pin_memory=True`` (only for GPUs).

.. code-block:: python

    DataLoader(dataset, num_workers=8, pin_memory=True)

num_workers
^^^^^^^^^^^

The question of how many workers to use in ``num_workers`` is tricky. Here's a summary of
some references, [`1 <https://discuss.pytorch.org/t/guidelines-for-assigning-num-workers-to-dataloader/813>`_], and our suggestions:

1. ``num_workers=0`` means ONLY the main process will load batches (that can be a bottleneck).
2. ``num_workers=1`` means ONLY one worker (just not the main process) will load data, but it will still be slow.
3. The best ``num_workers`` value depends on the batch size and your machine.
4. A general place to start is to set ``num_workers`` equal to the number of CPU cores on that machine.

.. warning:: Increasing ``num_workers`` will ALSO increase your CPU memory consumption.

The best thing to do is to increase ``num_workers`` slowly and stop once you see no more improvement in your training speed.
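
As a rough starting point, ``num_workers`` can be derived from the machine's CPU count and tuned from there. Below is a minimal sketch; ``dataset`` stands in for your own ``Dataset`` and the batch size is arbitrary.

.. code-block:: python

    import os

    from torch.utils.data import DataLoader

    # start with one worker per CPU core, then adjust while watching training speed
    num_workers = os.cpu_count()

    train_loader = DataLoader(
        dataset,                  # your own Dataset instance
        batch_size=64,            # arbitrary; tune for your model and GPU memory
        num_workers=num_workers,
        pin_memory=True,          # only useful when training on GPUs
    )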

Spawn
^^^^^

When using ``accelerator=ddp_spawn`` (the ddp default) or TPU training, multiple GPUs/TPU cores are used by calling ``.spawn()`` under the hood.
The problem is that PyTorch has issues with ``num_workers > 0`` when using ``.spawn()``. For this reason, we recommend you
use ``accelerator=ddp`` so you can increase ``num_workers``; however, your script has to be callable like so:

.. code-block:: bash

    python my_program.py --gpus X
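
For example, the script launched above might look roughly like this. This is only a sketch: ``MyModel`` is a placeholder for your own LightningModule.

.. code-block:: python

    # my_program.py -- a script that is safe to launch with accelerator=ddp
    from argparse import ArgumentParser

    import pytorch_lightning as pl


    def main(args):
        model = MyModel()  # placeholder for your own LightningModule
        trainer = pl.Trainer(gpus=args.gpus, accelerator="ddp")
        trainer.fit(model)


    if __name__ == "__main__":
        parser = ArgumentParser()
        parser.add_argument("--gpus", type=int, default=1)
        main(parser.parse_args())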

----------

.item(), .numpy(), .cpu()
-------------------------

Don't call ``.item()``, ``.numpy()``, or ``.cpu()`` anywhere in your code; each one forces a GPU-to-CPU transfer and a synchronization point. Use ``.detach()`` instead to remove the connected graph calls. Lightning
takes a great deal of care to be optimized for this.
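
For example, when accumulating the loss for logging (a sketch inside a ``training_step``; ``losses`` is just an illustrative list):

.. code-block:: python

    # bad: .item() forces a GPU -> CPU transfer and a synchronization point
    losses.append(loss.item())

    # good: keeps the value on the device, detached from the graph
    losses.append(loss.detach())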

----------

empty_cache()
-------------

Don't call this unnecessarily! Every time you call this, ALL your GPUs have to wait to sync.
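
In other words, avoid sprinkling calls like the following into your training loop; reserve them for the rare case where you truly need to release cached memory:

.. code-block:: python

    import torch

    # bad (when called routinely): forces every GPU to wait and sync
    torch.cuda.empty_cache()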

----------

Construct tensors directly on the device
----------------------------------------

LightningModules know what device they are on! Construct tensors on the device directly to avoid CPU->Device transfer.

.. code-block:: python

    # bad
    t = torch.rand(2, 2).cuda()

    # good (self is a LightningModule)
    t = torch.rand(2, 2, device=self.device)

For tensors that need to be model attributes, it is best practice to register them as buffers in the module's
``__init__`` method:

.. code-block:: python

    # bad
    self.t = torch.rand(2, 2, device=self.device)

    # good
    self.register_buffer("t", torch.rand(2, 2))
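
Registered buffers move together with the module and are saved in its ``state_dict``, so they always end up on the correct device. A small illustration, assuming ``MyLightningModule`` registers the buffer ``t`` as shown above:

.. code-block:: python

    model = MyLightningModule()
    model.to("cuda:0")
    print(model.t.device)  # cuda:0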

----------

Use DDP not DP
--------------

DP performs three GPU transfers for EVERY batch:

1. Copy the model to the device.
2. Copy the data to the device.
3. Copy the outputs of each device back to the master.

|

Whereas DDP only performs one transfer per batch, to sync gradients. Because of this, DDP is MUCH faster than DP.
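
Switching between the two is a single ``Trainer`` argument (a minimal sketch, using 4 GPUs as an example):

.. code-block:: python

    import pytorch_lightning as pl

    # bad: DP re-copies the model and scatters/gathers data every batch
    trainer = pl.Trainer(gpus=4, accelerator="dp")

    # good: DDP only syncs gradients
    trainer = pl.Trainer(gpus=4, accelerator="ddp")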

----------

16-bit precision
----------------

Use 16-bit precision to decrease memory consumption (and thus increase your batch size). On certain GPUs (V100s, 2080 Tis), 16-bit calculations are also faster.
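
Enabling it is a single ``Trainer`` flag (a minimal sketch):

.. code-block:: python

    import pytorch_lightning as pl

    # train with 16-bit (mixed) precision
    trainer = pl.Trainer(gpus=1, precision=16)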

However, know that 16-bit and multi-processing (any DDP) can have issues. Here are some common problems:

1. `CUDA error: an illegal memory access was encountered <https://github.com/pytorch/pytorch/issues/21819>`_.
   The solution is likely setting a specific combination of CUDA, cuDNN, and PyTorch versions.
2. ``CUDA error: device-side assert triggered``. This is a general catch-all error. To see the actual error, run your script like so:

.. code-block:: bash

    # won't see what the error is
    python main.py

    # will see what the error is
    CUDA_LAUNCH_BLOCKING=1 python main.py

.. tip:: We also recommend using the native 16-bit support found in PyTorch 1.6+. Just install this version and Lightning will use it automatically.

----------

Use Sharded DDP for GPU memory and scaling optimization
--------------------------------------------------------

Sharded DDP is a Lightning integration of `DeepSpeed ZeRO <https://arxiv.org/abs/1910.02054>`_ and
`ZeRO-2 <https://www.microsoft.com/en-us/research/blog/zero-2-deepspeed-shattering-barriers-of-deep-learning-speed-scale/>`_,
provided by `FairScale <https://github.com/facebookresearch/fairscale>`_.

When training on multiple GPUs, sharded DDP can substantially increase memory efficiency, and in some cases multi-node performance can be better than with traditional DDP.
This is due to efficient communication and parallelization under the hood.

To use Optimizer Sharded Training, refer to :ref:`model-parallelism`.

Sharded DDP can work across all DDP variants by adding the additional ``--plugins ddp_sharded`` flag.

Refer to the :ref:`distributed computing guide for more details <multi_gpu>`.
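
For example, enabling sharded training from the ``Trainer`` might look roughly like this (a sketch only; it assumes the string plugin shorthand described above is available in your Lightning version):

.. code-block:: python

    import pytorch_lightning as pl

    # sketch: sharded DDP layered on top of the regular ddp accelerator
    trainer = pl.Trainer(gpus=4, accelerator="ddp", plugins="ddp_sharded")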

----------

Sequential Model Parallelism with Checkpointing
-----------------------------------------------

PyTorch Lightning integrates Sequential Model Parallelism provided by `FairScale <https://github.com/facebookresearch/fairscale>`_.
Sequential Model Parallelism splits a sequential module onto multiple GPUs, reducing peak GPU memory requirements substantially.

For more information, refer to :ref:`sequential-parallelism`.