shard doc improvements (#4993)
* Rewording
* Update fairscale install link to include bucket fix, add benchmark results
* Added percentage gain
* Address codereview
* Update docs/source/multi_gpu.rst
* Update multi_gpu.rst

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
parent fedc0d1360
commit 68ba4931f0
@@ -611,30 +611,62 @@ This is useful when dealing with large Transformer based models, or in environme
Lightning currently offers the following methods to leverage model parallelism:

-- Optimizer Sharded Training (partitioning your gradients and optimizer state across multiple GPUs, for reduced memory overhead)
+- Sharded Training (partitioning your gradients and optimizer state across multiple GPUs, for reduced memory overhead with **no performance loss**)

-Optimizer Sharded Training
-^^^^^^^^^^^^^^^^^^^^^^^^^^
-Lightning integration of optimizer sharded training provided by `Fairscale <https://github.com/facebookresearch/fairscale>`_.
+Sharded Training
+^^^^^^^^^^^^^^^^
+Lightning integration of optimizer sharded training provided by `FairScale <https://github.com/facebookresearch/fairscale>`_.
The technique can be found within `DeepSpeed ZeRO <https://arxiv.org/abs/1910.02054>`_ and
`ZeRO-2 <https://www.microsoft.com/en-us/research/blog/zero-2-deepspeed-shattering-barriers-of-deep-learning-speed-scale/>`_,
however the implementation is built from the ground up to be PyTorch compatible and standalone.
+Sharded Training allows you to maintain GPU scaling efficiency, whilst reducing memory overhead drastically. In short, expect normal linear scaling and significantly reduced memory usage when training large models.

-Optimizer Sharded Training still utilizes Data Parallel Training under the hood, except the optimizer state and gradients which are sharded across GPUs.
+Sharded Training still utilizes Data Parallel Training under the hood, except that optimizer states and gradients are sharded across GPUs.
This means the memory overhead per GPU is lower, as each GPU only has to maintain a partition of your optimizer state and gradients.
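
To make the memory figures concrete, here is a rough back-of-the-envelope sketch (not part of the original docs): it assumes Adam, which keeps two fp32 state tensors per parameter, and uses the 930M-parameter model from the benchmark tables below.

.. code-block:: python

    # Rough estimate of Adam optimizer-state memory per GPU, with and without sharding.
    # Assumes two fp32 state tensors (exp_avg, exp_avg_sq) per parameter.

    def adam_state_gb(num_params: int, num_gpus: int, sharded: bool) -> float:
        bytes_per_param = 2 * 4  # two fp32 tensors, 4 bytes each
        total_bytes = num_params * bytes_per_param
        per_gpu = total_bytes / num_gpus if sharded else total_bytes
        return per_gpu / 1024 ** 3

    params = 930_000_000  # the 930M model used in the benchmarks below
    for sharded in (False, True):
        print(f"sharded={sharded}: {adam_state_gb(params, num_gpus=8, sharded=sharded):.2f} GB per GPU")

With 8 GPUs this puts the Adam state at roughly 6.9 GB per GPU unsharded versus under 1 GB sharded, before counting parameters, gradients and activations.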

The benefits vary by model and parameter sizes, but we've recorded up to a 63% memory reduction per GPU allowing us to double our model sizes. Because of extremely efficient communication,
these benefits in multi-GPU setups are almost free and throughput scales well with multi-node setups.

-It is highly recommended to use Optimizer Sharded Training in multi-GPU environments where memory is limited, or where training larger models are beneficial (rough minimum of 500+ million parameter models).
-Optimizer Sharded Training is typically not suited for smaller models, or where large batch sizes are important.
-This is primarily because with larger batch sizes, storing activations for the backwards pass becomes the bottleneck in training. Sharding optimizer state as a result becomes less impactful.
+Below we use the `NeMo Transformer Lightning Language Modeling example <https://github.com/NVIDIA/NeMo/tree/main/examples/nlp/language_modeling>`_ to benchmark the maximum batch size and model size that can be fit on 8 A100 GPUs for DDP vs Sharded Training.
+Note that the benefits can still be obtained using 2 or more GPUs, and for even larger batch sizes you can scale to multiple nodes.

-To use Optimizer Sharded Training, you need to first install Fairscale using the command below or install all extras using ``pip install pytorch-lightning["extra"]``.
+**Increase Your Batch Size**
+
+Use Sharded Training to scale your batch size further using the same compute. This will reduce your overall epoch time.
+
++----------------------+-----------------------+----------------+---------------------+
+| Distributed Training | Model Size (Millions) | Max Batch Size | Percentage Gain (%) |
++======================+=======================+================+=====================+
+| Native DDP           | 930                   | 32             | -                   |
++----------------------+-----------------------+----------------+---------------------+
+| Sharded DDP          | 930                   | **52**         | **48%**             |
++----------------------+-----------------------+----------------+---------------------+
+
+**Increase Your Model Size**
+
+Use Sharded Training to scale your model size further using the same compute.
+
++----------------------+------------+---------------------------+---------------------+
+| Distributed Training | Batch Size | Max Model Size (Millions) | Percentage Gain (%) |
++======================+============+===========================+=====================+
+| Native DDP           | 32         | 930                       | -                   |
++----------------------+------------+---------------------------+---------------------+
+| Sharded DDP          | 32         | **1404**                  | **41%**             |
++----------------------+------------+---------------------------+---------------------+
+| Native DDP           | 8          | 1572                      | -                   |
++----------------------+------------+---------------------------+---------------------+
+| Sharded DDP          | 8          | **2872**                  | **59%**             |
++----------------------+------------+---------------------------+---------------------+
+
+It is highly recommended to use Sharded Training in multi-GPU environments where memory is limited, or where training larger models is beneficial (500M+ parameter models).
+A technical note: as batch size scales, storing activations for the backwards pass becomes the bottleneck in training. As a result, sharding optimizer state and gradients becomes less impactful.
+Future work will bring optional sharding of activations and model parameters to reduce memory further, but it will come with a speed cost.
+
+To use Sharded Training, you need to first install FairScale using the command below or install all extras using ``pip install pytorch-lightning["extra"]``.

.. code-block:: bash

-    pip install fairscale
+    pip install https://github.com/facebookresearch/fairscale/archive/bb468670838b98dc8f8d67be4eabf195042a7994.zip

.. code-block:: python
@@ -642,7 +674,7 @@ To use Optimizer Sharded Training, you need to first install Fairscale using the

    # train using Sharded DDP
    trainer = Trainer(accelerator='ddp', plugins='ddp_sharded')
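
For readers who want to run the snippet end to end, below is a minimal self-contained sketch; the toy model, random data and ``gpus=2`` setting are illustrative assumptions, not part of the example above.

.. code-block:: python

    import torch
    from torch.utils.data import DataLoader, TensorDataset
    import pytorch_lightning as pl

    class ToyModel(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(32, 2)

        def training_step(self, batch, batch_idx):
            x, y = batch
            return torch.nn.functional.cross_entropy(self.layer(x), y)

        def configure_optimizers(self):
            return torch.optim.Adam(self.parameters(), lr=1e-3)

    # random toy data, purely for illustration
    dataset = TensorDataset(torch.randn(256, 32), torch.randint(0, 2, (256,)))
    train_loader = DataLoader(dataset, batch_size=16)

    # train using Sharded DDP on 2 GPUs; no changes to the model are needed
    trainer = pl.Trainer(gpus=2, accelerator='ddp', plugins='ddp_sharded')
    trainer.fit(ToyModel(), train_loader)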

-Optimizer Sharded Training can work across all DDP variants by adding the additional ``--plugins ddp_sharded`` flag.
+Sharded Training can work across all DDP variants by adding the additional ``--plugins ddp_sharded`` flag.
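
The ``--plugins ddp_sharded`` flag above is a command-line argument of the example training scripts; the sketch below shows one way a script might expose such a flag and pass it through to the ``Trainer`` (the argument names and defaults here are illustrative assumptions).

.. code-block:: python

    from argparse import ArgumentParser
    from pytorch_lightning import Trainer

    parser = ArgumentParser()
    # illustrative flags; real example scripts may define or parse these differently
    parser.add_argument("--plugins", type=str, default=None, help="e.g. ddp_sharded")
    parser.add_argument("--gpus", type=int, default=2)
    parser.add_argument("--accelerator", type=str, default="ddp")
    args = parser.parse_args()

    trainer = Trainer(gpus=args.gpus, accelerator=args.accelerator, plugins=args.plugins)

Invoked as ``python train.py --plugins ddp_sharded`` (where ``train.py`` is your own script), the same code switches to sharded training without any model changes.
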
Internally we re-initialize your optimizers and shard them across your machines and processes. We handle all communication using PyTorch distributed, so no code changes are required.
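
For the curious, the re-initialisation described above corresponds roughly to wrapping the optimizer in FairScale's ``OSS`` class and the model in its ``ShardedDataParallel``; the plain-PyTorch sketch below illustrates that idea under those assumptions and is not how Lightning is normally used. It creates a single-process ``gloo`` group only so the sketch can run standalone; a real job would launch one process per GPU.

.. code-block:: python

    import torch
    import torch.distributed as dist
    from fairscale.optim.oss import OSS
    from fairscale.nn.data_parallel import ShardedDataParallel

    # single-process group so the sketch runs standalone; a distributed launcher
    # would normally create one process per GPU with the proper rank/world size
    if not dist.is_initialized():
        dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29500", rank=0, world_size=1)

    model = torch.nn.Linear(32, 2)

    # OSS shards the wrapped optimizer's state across the ranks in the group
    optimizer = OSS(params=model.parameters(), optim=torch.optim.Adam, lr=1e-3)

    # ShardedDataParallel reduces each gradient to the rank that owns its shard
    model = ShardedDataParallel(model, optimizer)

    loss = model(torch.randn(8, 32)).sum()
    loss.backward()
    optimizer.step()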