From 68ba4931f0f588c2795e77f0cb858de5e72cc526 Mon Sep 17 00:00:00 2001
From: Sean Naren
Date: Mon, 7 Dec 2020 18:01:59 +0000
Subject: [PATCH] shard doc improvements (#4993)

* Rewording

* Update fairscale install link to include bucket fix, add benchmark results

* Added percentage gain

* Address codereview

* Update docs/source/multi_gpu.rst

Co-authored-by: Jirka Borovec

* Update multi_gpu.rst

Co-authored-by: Jirka Borovec
---
 docs/source/multi_gpu.rst | 54 +++++++++++++++++++++++++++++++--------
 1 file changed, 43 insertions(+), 11 deletions(-)

diff --git a/docs/source/multi_gpu.rst b/docs/source/multi_gpu.rst
index 5555e53409..b6ffad20e7 100644
--- a/docs/source/multi_gpu.rst
+++ b/docs/source/multi_gpu.rst
@@ -611,30 +611,62 @@ This is useful when dealing with large Transformer based models, or in environme
 Lightning currently offers the following methods to leverage model parallelism:
 
-- Optimizer Sharded Training (partitioning your gradients and optimizer state across multiple GPUs, for reduced memory overhead)
+- Sharded Training (partitioning your gradients and optimizer state across multiple GPUs, for reduced memory overhead with **no performance loss**)
 
-Optimizer Sharded Training
-^^^^^^^^^^^^^^^^^^^^^^^^^^
-Lightning integration of optimizer sharded training provided by `Fairscale `_.
+Sharded Training
+^^^^^^^^^^^^^^^^
+Lightning integration of optimizer sharded training provided by `FairScale `_.
 The technique can be found within `DeepSpeed ZeRO `_ and `ZeRO-2 `_,
 however the implementation is built from the ground up to be pytorch compatible and standalone.
+Sharded Training allows you to maintain GPU scaling efficiency, whilst reducing memory overhead drastically. In short, expect normal linear scaling, and significantly reduced memory usage when training large models.
 
-Optimizer Sharded Training still utilizes Data Parallel Training under the hood, except the optimizer state and gradients which are sharded across GPUs.
+Sharded Training still utilizes Data Parallel Training under the hood, except optimizer states and gradients are sharded across GPUs.
 This means the memory overhead per GPU is lower, as each GPU only has to maintain a partition of your optimizer state and gradients.
 
 The benefits vary by model and parameter sizes, but we've recorded up to a 63% memory reduction per GPU allowing us to double our model sizes. Because of extremely efficient communication,
 these benefits in multi-GPU setups are almost free and throughput scales well with multi-node setups.
 
-It is highly recommended to use Optimizer Sharded Training in multi-GPU environments where memory is limited, or where training larger models are beneficial (rough minimum of 500+ million parameter models).
-Optimizer Sharded Training is typically not suited for smaller models, or where large batch sizes are important.
-This is primarily because with larger batch sizes, storing activations for the backwards pass becomes the bottleneck in training. Sharding optimizer state as a result becomes less impactful.
+Below we use the `NeMo Transformer Lightning Language Modeling example `_ to benchmark the maximum batch size and model size that can be fit on 8 A100 GPUs for DDP vs Sharded Training.
+Note that the benefits can still be obtained using 2 or more GPUs, and for even larger batch sizes you can scale to multiple nodes.
 
-To use Optimizer Sharded Training, you need to first install Fairscale using the command below or install all extras using ``pip install pytorch-lightning["extra"]``.
+**Increase Your Batch Size**
+
+Use Sharded Training to scale your batch size further using the same compute. This will reduce your overall epoch time.
+
++----------------------+-----------------------+----------------+---------------------+
+| Distributed Training | Model Size (Millions) | Max Batch Size | Percentage Gain (%) |
++======================+=======================+================+=====================+
+| Native DDP           | 930                   | 32             | -                   |
++----------------------+-----------------------+----------------+---------------------+
+| Sharded DDP          | 930                   | **52**         | **48%**             |
++----------------------+-----------------------+----------------+---------------------+
+
+**Increase Your Model Size**
+
+Use Sharded Training to scale your model size further using the same compute.
+
++----------------------+------------+---------------------------+---------------------+
+| Distributed Training | Batch Size | Max Model Size (Millions) | Percentage Gain (%) |
++======================+============+===========================+=====================+
+| Native DDP           | 32         | 930                       | -                   |
++----------------------+------------+---------------------------+---------------------+
+| Sharded DDP          | 32         | **1404**                  | **41%**             |
++----------------------+------------+---------------------------+---------------------+
+| Native DDP           | 8          | 1572                      | -                   |
++----------------------+------------+---------------------------+---------------------+
+| Sharded DDP          | 8          | **2872**                  | **59%**             |
++----------------------+------------+---------------------------+---------------------+
+
+It is highly recommended to use Sharded Training in multi-GPU environments where memory is limited, or where training larger models is beneficial (500M+ parameter models).
+A technical note: as batch size scales, storing activations for the backwards pass becomes the bottleneck in training. As a result, sharding optimizer state and gradients becomes less impactful.
+Future work will bring optional sharding of activations and model parameters to reduce memory further, but this will come with a speed cost.
+
+To use Sharded Training, you need to first install FairScale using the command below or install all extras using ``pip install pytorch-lightning["extra"]``.
 
 .. code-block:: bash
 
-    pip install fairscale
+    pip install https://github.com/facebookresearch/fairscale/archive/bb468670838b98dc8f8d67be4eabf195042a7994.zip
 
 .. code-block:: python
 
@@ -642,7 +674,7 @@ To use Optimizer Sharded Training, you need to first install Fairscale using the
     # train using Sharded DDP
     trainer = Trainer(accelerator='ddp', plugins='ddp_sharded')
 
-Optimizer Sharded Training can work across all DDP variants by adding the additional ``--plugins ddp_sharded`` flag.
+Sharded Training can work across all DDP variants by adding the additional ``--plugins ddp_sharded`` flag.
 
 Internally we re-initialize your optimizers and shard them across your machines and processes. We handle all communication using PyTorch distributed,
 so no code changes are required.
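
For readers who want to try the feature end to end, below is a minimal sketch of the workflow this patch describes. The ``BoringClassifier`` module and the random dataset are hypothetical placeholders rather than anything from the Lightning docs, and the ``Trainer`` flags assume the 1.1-era API shown above (``accelerator='ddp'``, ``plugins='ddp_sharded'``) with FairScale installed as instructed.

.. code-block:: python

    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset
    import pytorch_lightning as pl


    class BoringClassifier(pl.LightningModule):
        """Hypothetical placeholder model, used only to illustrate the Trainer flags."""

        def __init__(self):
            super().__init__()
            self.layer = nn.Linear(32, 2)

        def forward(self, x):
            return self.layer(x)

        def training_step(self, batch, batch_idx):
            x, y = batch
            return nn.functional.cross_entropy(self(x), y)

        def configure_optimizers(self):
            return torch.optim.Adam(self.parameters(), lr=1e-3)


    if __name__ == "__main__":
        # Random data purely for illustration.
        dataset = TensorDataset(torch.randn(512, 32), torch.randint(0, 2, (512,)))
        train_loader = DataLoader(dataset, batch_size=32)

        # 'ddp_sharded' shards optimizer state and gradients across the GPUs;
        # the model definition and training loop are unchanged.
        trainer = pl.Trainer(gpus=8, accelerator='ddp', plugins='ddp_sharded', max_epochs=1)
        trainer.fit(BoringClassifier(), train_loader)

Only the ``plugins`` argument differs from a plain DDP run, which is what "no code changes are required" refers to above.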