.. testcode::

    # train on multiple GPUs across nodes (uses 8 GPUs in total)
    trainer = Trainer(gpus=2, num_nodes=4)
GPU Training Speedup Tips
-------------------------
When training on single or multiple GPU machines, Lightning offers a host of advanced optimizations to improve throughput, memory efficiency, and model scaling.
Refer to :doc:`Advanced GPU Optimized Training for more details <../advanced/advanced_gpu>`.
Prefer DDP over DP
^^^^^^^^^^^^^^^^^^
:class:`~pytorch_lightning.plugins.training_type.DataParallelPlugin` performs three GPU transfers for EVERY batch:
1. Copy model to device.
2. Copy data to device.
3. Copy outputs of each device back to master.
Whereas :class:`~pytorch_lightning.plugins.training_type.DDPPlugin` performs only one transfer per batch, to sync gradients, making DDP MUCH faster than DP.
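To select DDP, choose it when constructing the Trainer. A minimal sketch (the flag name depends on your Lightning version: older releases use ``accelerator="ddp"``, newer ones use ``strategy="ddp"``):

.. code-block:: python

    # train with 2 GPUs per node using DDP
    trainer = Trainer(gpus=2, accelerator="ddp")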
By default, we set ``find_unused_parameters`` to ``True`` for compatibility reasons observed in the past (see the `discussion <https://github.com/PyTorchLightning/pytorch-lightning/discussions/6219>`_ for more details), which adds overhead to every pass.
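If you are sure your model has no unused parameters, you can disable the check to recover that overhead. A sketch for Lightning versions where the plugin is passed via ``plugins`` (newer releases accept it via ``strategy`` instead):

.. code-block:: python

    from pytorch_lightning.plugins import DDPPlugin

    # skip the unused-parameter search on every pass
    trainer = Trainer(gpus=2, plugins=DDPPlugin(find_unused_parameters=False))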
`NCCL <https://developer.nvidia.com/nccl>`__ is the NVIDIA Collective Communications Library, used under the hood by PyTorch to handle communication across nodes and GPUs. Adjusting NCCL parameters has been reported to yield speedups, as seen in this `issue <https://github.com/PyTorchLightning/pytorch-lightning/issues/7179>`__: a 30% speed improvement when training the Transformer XLM-RoBERTa and a 15% improvement when training Detectron2.
NCCL parameters can be adjusted via environment variables.
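For example, such variables can be exported before launching training (the values below are purely illustrative; optimal settings are workload- and cluster-specific, so benchmark before adopting any):

.. code-block:: bash

    export NCCL_NSOCKS_PERTHREAD=2
    export NCCL_SOCKET_NTHREADS=2
    export NCCL_MIN_NCHANNELS=32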
.. note::
    AWS and GCP already set default values for these on their clusters. This is typically useful for custom cluster setups.
num_workers
^^^^^^^^^^^

The question of how many workers to specify in ``num_workers`` is tricky. Here's a summary of
some references, [`1 <https://discuss.pytorch.org/t/guidelines-for-assigning-num-workers-to-dataloader/813>`_], and our suggestions:
1. ``num_workers=0`` means ONLY the main process will load batches (this can be a bottleneck).
2. ``num_workers=1`` means ONLY one worker (not the main process) will load data, which will still be slow.
3. The best ``num_workers`` value depends on the batch size and your machine.
4. A general place to start is to set ``num_workers`` equal to the number of CPU cores on that machine. You can get the number of CPU cores in Python using ``os.cpu_count()``, but note that, depending on your batch size, you may run out of RAM.
.. warning:: Increasing ``num_workers`` will ALSO increase your CPU memory consumption.
The best thing to do is to increase the ``num_workers`` slowly and stop once you see no more improvement in your training speed.
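As a starting point, a minimal sketch (``train_dataset`` stands in for your own dataset object):

.. code-block:: python

    import os

    from torch.utils.data import DataLoader

    # begin with one worker per CPU core, then tune empirically
    train_loader = DataLoader(train_dataset, batch_size=64, num_workers=os.cpu_count())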
-----

*********************************
Mixed Precision (16-bit) Training
*********************************

Mixed precision combines the use of both 32-bit and 16-bit floating point numbers to reduce the memory footprint during model training, resulting in improved performance and achieving up to 3x speedups on modern GPUs.
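For example, 16-bit precision can be enabled with a single Trainer flag:

.. code-block:: python

    # enable mixed (16-bit) precision training
    trainer = Trainer(precision=16, gpus=1)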
-----

***********************
Control Training Epochs
***********************

**Use when:** You run a hyperparameter search to find good initial parameters and want to save time, cost (money), or power (environment).
It can allow you to be more cost efficient and also run more experiments at the same time.

You can use Trainer flags to force training for a minimum number of epochs or limit it to a maximum number of epochs. Use the ``min_epochs`` and ``max_epochs`` Trainer flags to set the number of epochs to run.
.. testcode::

    # DEFAULT
    trainer = Trainer(min_epochs=1, max_epochs=1000)
If running iteration-based training, i.e. with an infinite / iterable dataloader, you can also control the number of steps with the ``min_steps`` and ``max_steps`` flags:
.. testcode::

    trainer = Trainer(max_steps=1000)

    trainer = Trainer(min_steps=100)
You can also interrupt training based on training time:

.. testcode::

    # Stop after 12 hours of training or when reaching 10 epochs (string)
    trainer = Trainer(max_time="00:12:00:00", max_epochs=10)
When limiting batches, if you also pass ``shuffle=True`` to the dataloader, a different random subset of your dataset will be used for each epoch; otherwise the same subset will be used for all epochs.
.. note:: ``limit_train_batches``, ``limit_val_batches`` and ``limit_test_batches`` will be overwritten by ``overfit_batches`` if ``overfit_batches > 0``. ``limit_val_batches`` will be ignored if ``fast_dev_run=True``.
.. note:: If you set ``limit_val_batches=0``, validation will be disabled.
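For example, these flags accept either a fraction of the dataset or a fixed number of batches:

.. code-block:: python

    # use only 25% of the training set and 100 validation batches per epoch
    trainer = Trainer(limit_train_batches=0.25, limit_val_batches=100)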
Learn more in our :ref:`trainer_flags` guide.
-----
*********************
Preload Data Into RAM
*********************
**Use when:** You need access to all samples in a dataset at once.
When your training or preprocessing requires many operations to be performed on entire dataset(s), it can
sometimes be beneficial to store all data in RAM given there is enough space.
However, loading all data at the beginning of the training script has the disadvantage that it can take a long
time, and hence it slows down the development process. Another downside is that in multiprocessing (e.g. DDP)
the data would get copied in each process.
One can overcome these problems by copying the data into shared memory in advance.
Most UNIX-based operating systems provide direct access to tmpfs, a RAM-backed filesystem, through a mount point typically named ``/dev/shm``.
0. Increase shared memory if necessary. Refer to the documentation of your OS for how to do this.
1. Copy training data to shared memory:
.. code-block:: bash

    cp -r /path/to/data/on/disk /dev/shm/
2. Refer to the new data root in your script or command line arguments:
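For example (``MyDataModule`` and its ``data_root`` argument are hypothetical placeholders for your own code; the directory name under ``/dev/shm`` matches whatever was copied above):

.. code-block:: python

    # point the data pipeline at the copy in shared memory
    datamodule = MyDataModule(data_root="/dev/shm/disk")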