shard doc improvements (#4993)

* Rewording

* Update fairscale install link to include bucket fix, add benchmark results

* Added percentage gain

* Address code review

* Update docs/source/multi_gpu.rst

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update multi_gpu.rst

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Sean Naren 2020-12-07 18:01:59 +00:00 committed by GitHub
parent fedc0d1360
commit 68ba4931f0
1 changed file with 43 additions and 11 deletions


@@ -611,30 +611,62 @@ This is useful when dealing with large Transformer based models, or in environme
Lightning currently offers the following methods to leverage model parallelism:
- Sharded Training (partitioning your gradients and optimizer state across multiple GPUs, for reduced memory overhead with **no performance loss**)
Sharded Training
^^^^^^^^^^^^^^^^
Lightning integrates the optimizer sharded training provided by `FairScale <https://github.com/facebookresearch/fairscale>`_.
The technique can be found within `DeepSpeed ZeRO <https://arxiv.org/abs/1910.02054>`_ and
`ZeRO-2 <https://www.microsoft.com/en-us/research/blog/zero-2-deepspeed-shattering-barriers-of-deep-learning-speed-scale/>`_;
however, the implementation is built from the ground up to be PyTorch-compatible and standalone.
Sharded Training allows you to maintain GPU scaling efficiency whilst drastically reducing memory overhead. In short, expect normal linear scaling and significantly reduced memory usage when training large models.
Sharded Training still utilizes Data Parallel Training under the hood, except that optimizer states and gradients are sharded across GPUs.
This means the memory overhead per GPU is lower, as each GPU only has to maintain a partition of your optimizer state and gradients.
The benefits vary by model and parameter sizes, but we've recorded up to a 63% memory reduction per GPU, allowing us to double our model sizes. Because communication is extremely efficient,
these benefits come almost for free in multi-GPU setups, and throughput scales well across multi-node setups.
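If you're curious what the wrapping looks like outside of Lightning, below is a minimal, standalone sketch using FairScale's ``OSS`` optimizer wrapper, the optimizer-state sharding building block FairScale provides. The single-process ``gloo`` process group and the toy linear model are assumptions made purely so the snippet runs end-to-end; Lightning performs the equivalent wrapping for you when the plugin is enabled.

.. code-block:: python

    import os

    import torch
    import torch.distributed as dist
    from fairscale.optim.oss import OSS

    # a single-process group, only so this sketch is runnable on its own;
    # in real training each DDP rank joins the group and holds one shard
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(backend="gloo", rank=0, world_size=1)

    model = torch.nn.Linear(32, 32)

    # wrap a regular torch optimizer; each rank materializes only its
    # partition of the Adam state instead of a full replica
    optimizer = OSS(params=model.parameters(), optim=torch.optim.Adam, lr=1e-3)

    loss = model(torch.randn(4, 32)).sum()
    loss.backward()
    optimizer.step()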
Below we use the `NeMo Transformer Lightning Language Modeling example <https://github.com/NVIDIA/NeMo/tree/main/examples/nlp/language_modeling>`_ to benchmark the maximum batch size and model size that can fit on 8 A100 GPUs for DDP vs Sharded Training.
Note that the benefits can still be obtained using 2 or more GPUs, and for even larger batch sizes you can scale to multiple nodes.
**Increase Your Batch Size**
Use Sharded Training to scale your batch size further using the same compute. This will reduce your overall epoch time.
+----------------------+-----------------------+----------------+---------------------+
| Distributed Training | Model Size (Millions) | Max Batch Size | Percentage Gain (%) |
+======================+=======================+================+=====================+
| Native DDP | 930 | 32 | - |
+----------------------+-----------------------+----------------+---------------------+
| Sharded DDP | 930 | **52** | **48%** |
+----------------------+-----------------------+----------------+---------------------+
**Increase Your Model Size**
Use Sharded Training to scale your model size further using the same compute.
+----------------------+------------+---------------------------+---------------------+
| Distributed Training | Batch Size | Max Model Size (Millions) | Percentage Gain (%) |
+======================+============+===========================+=====================+
| Native DDP | 32 | 930 | - |
+----------------------+------------+---------------------------+---------------------+
| Sharded DDP | 32 | **1404** | **41%** |
+----------------------+------------+---------------------------+---------------------+
| Native DDP | 8 | 1572 | - |
+----------------------+------------+---------------------------+---------------------+
| Sharded DDP | 8 | **2872** | **59%** |
+----------------------+------------+---------------------------+---------------------+
It is highly recommended to use Sharded Training in multi-GPU environments where memory is limited, or where training larger models is beneficial (500M+ parameter models).
A technical note: as batch size scales, storing activations for the backwards pass becomes the bottleneck in training. As a result, sharding optimizer state and gradients becomes less impactful.
Future work will bring optional sharding of activations and model parameters to reduce memory further, at the cost of some speed.
To use Sharded Training, you need to first install FairScale using the command below or install all extras using ``pip install pytorch-lightning["extra"]``.
.. code-block:: bash

    pip install https://github.com/facebookresearch/fairscale/archive/bb468670838b98dc8f8d67be4eabf195042a7994.zip
.. code-block:: python
@@ -642,7 +674,7 @@ To use Optimizer Sharded Training, you need to first install Fairscale using the
    from pytorch_lightning import Trainer

    # train using Sharded DDP
    trainer = Trainer(accelerator='ddp', plugins='ddp_sharded')
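The plugin composes with the usual distributed Trainer arguments. As a hedged sketch of the multi-node scaling mentioned above (the GPU and node counts below are placeholders, not a recommendation):

.. code-block:: python

    # placeholder values: 2 nodes with 8 GPUs each, sharded across all 16 processes
    trainer = Trainer(
        gpus=8,
        num_nodes=2,
        accelerator='ddp',
        plugins='ddp_sharded',
    )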
Sharded Training works across all DDP variants by adding the ``--plugins ddp_sharded`` flag.
Internally we re-initialize your optimizers and shard them across your machines and processes. We handle all communication using PyTorch distributed, so no code changes are required.
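For example, if your script builds its ``Trainer`` from argparse via ``Trainer.add_argparse_args`` and ``Trainer.from_argparse_args``, the plugin can be toggled from the command line without touching the model code. The snippet below is a hypothetical minimal sketch of that setup (``train.py`` is an assumed script name):

.. code-block:: python

    # hypothetical train.py exposing Trainer flags through argparse,
    # invoked e.g. as: python train.py --gpus 2 --accelerator ddp --plugins ddp_sharded
    from argparse import ArgumentParser

    from pytorch_lightning import Trainer

    parser = ArgumentParser()
    parser = Trainer.add_argparse_args(parser)
    args = parser.parse_args()

    trainer = Trainer.from_argparse_args(args)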