From 20b806a94552a24aaf82861bf9b021674ef4bcf0 Mon Sep 17 00:00:00 2001
From: chaton
Date: Wed, 9 Dec 2020 16:31:18 +0000
Subject: [PATCH] [feat] 3/n pp (#5036)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* add pp doc

* udpate doc

* update doc

* update doc

* Update docs

* update doc

* udpate

* update doc

* update doc

* Formatting, update sharded zip link

* Update docs/source/multi_gpu.rst

Co-authored-by: Carlos Mocholí

* Apply suggestions from code review

* Reference directly to section

Co-authored-by: SeanNaren
Co-authored-by: Carlos Mocholí
Co-authored-by: Jirka Borovec
---
 .pre-commit-config.yaml         |  2 +-
 docs/source/multi_gpu.rst       | 87 ++++++++++++++++++++++++++++++---
 docs/source/performance.rst     | 10 +++-
 docs/source/training_tricks.rst |  8 +++
 4 files changed, 99 insertions(+), 8 deletions(-)

diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
index 5df6aecd06..1a4cbed8ef 100644
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -32,6 +32,6 @@ repos:
       types: [python]
 
 - repo: https://github.com/pre-commit/mirrors-mypy
-  rev: master
+  rev: v0.790
   hooks:
   - id: mypy
diff --git a/docs/source/multi_gpu.rst b/docs/source/multi_gpu.rst
index b6ffad20e7..2ce66c9a71 100644
--- a/docs/source/multi_gpu.rst
+++ b/docs/source/multi_gpu.rst
@@ -612,6 +612,7 @@ This is useful when dealing with large Transformer based models, or in environme
 Lightning currently offers the following methods to leverage model parallelism:
 
 - Sharded Training (partitioning your gradients and optimizer state across multiple GPUs, for reduced memory overhead with **no performance loss**)
+- Sequential Model Parallelism with Checkpointing (partition your :class:`nn.Sequential ` module across multiple GPUs, leverage checkpointing and micro-batching for further memory improvements and device utilization)
 
 Sharded Training
 ^^^^^^^^^^^^^^^^
@@ -666,7 +667,7 @@ To use Sharded Training, you need to first install FairScale using the command b
 
 .. code-block:: bash
 
-    pip install https://github.com/facebookresearch/fairscale/archive/bb468670838b98dc8f8d67be4eabf195042a7994.zip
+    pip install https://github.com/PyTorchLightning/fairscale/archive/pl_1.1.0.zip
 
 .. code-block:: python
 
@@ -678,6 +679,80 @@ Sharded Training can work across all DDP variants by adding the additional ``--p
 Internally we re-initialize your optimizers and shard them across your machines and processes. We handle all communication using PyTorch distributed, so no code changes are required.
 
+----------
+
+.. _sequential-parallelism:
+
+Sequential Model Parallelism with Checkpointing
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+PyTorch Lightning integration for Sequential Model Parallelism using `FairScale `_.
+Sequential Model Parallelism splits a sequential module onto multiple GPUs, reducing peak GPU memory requirements substantially.
+We also provide auto-balancing techniques through FairScale to find optimal balances for the model across GPUs.
+In addition, we use Gradient Checkpointing to reduce GPU memory requirements further, and micro-batches to minimize device under-utilization automatically.
+
+Reference: https://arxiv.org/abs/1811.06965
+
+.. note:: DDPSequentialPlugin is currently supported only for PyTorch 1.6.
+
+To get started, install FairScale through extras with ``pip install pytorch-lightning["extra"]``
+
+or directly using
+
+.. code-block:: bash
+
+    pip install https://github.com/PyTorchLightning/fairscale/archive/pl_1.1.0.zip
+
+To use Sequential Model Parallelism, you must define a :class:`nn.Sequential ` module that contains the layers you wish to parallelize across GPUs.
+This should be kept within the ``sequential_module`` attribute of your ``LightningModule``, as shown below.
+
+.. code-block:: python
+
+    import torch
+    from pytorch_lightning.plugins.ddp_sequential_plugin import DDPSequentialPlugin
+    from pytorch_lightning import LightningModule, Trainer
+
+    class MyModel(LightningModule):
+        def __init__(self):
+            ...
+            self.sequential_module = torch.nn.Sequential(my_layers)
+
+    # Split my module across 4 GPUs, one layer each
+    model = MyModel()
+    plugin = DDPSequentialPlugin(balance=[1, 1, 1, 1])
+    trainer = Trainer(accelerator='ddp', gpus=4, plugins=[plugin])
+    trainer.fit(model)
+
+
+We provide a minimal example of Sequential Model Parallelism using a convolutional model trained on CIFAR-10, split onto GPUs `here `_.
+To run the example, you need to install `Bolts `_. Install with ``pip install pytorch-lightning-bolts``.
+
+When running the Sequential Model Parallelism example on 2 GPUs, we achieve these memory savings:
+
+.. list-table:: GPU Memory Utilization
+   :widths: 25 25 50
+   :header-rows: 1
+
+   * - GPU
+     - Without Balancing
+     - With Balancing
+   * - GPU 0
+     - 4436 MB
+     - 1554 MB
+   * - GPU 1
+     - ~0
+     - 994 MB
+
+To run the example with Sequential Model Parallelism:
+
+.. code-block:: bash
+
+    python pl_examples/basic_examples/conv_sequential_example.py --batch_size 1024 --gpus 2 --accelerator ddp --use_ddp_sequential
+
+To run the same example without Sequential Model Parallelism:
+
+.. code-block:: bash
+
+    python pl_examples/basic_examples/conv_sequential_example.py --batch_size 1024 --gpus 1
+
 Batch size
 ----------
@@ -728,8 +803,8 @@ Lightning supports the use of TorchElastic to enable fault-tolerant and elastic
 .. code-block:: python
 
     Trainer(gpus=8, accelerator='ddp')
-    
-    
+
+
 Following the `TorchElastic Quickstart documentation `_, you then need to start a single-node etcd server on one of the hosts:
 
 .. code-block:: bash
@@ -737,8 +812,8 @@ Following the `TorchElastic Quickstart documentation `_ for details on installation and more use cases.
diff --git a/docs/source/performance.rst b/docs/source/performance.rst
index 0f97942128..394f6e5f3c 100644
--- a/docs/source/performance.rst
+++ b/docs/source/performance.rst
@@ -131,4 +131,12 @@ To use Optimizer Sharded Training, refer to :ref:`model-parallelism`.
 
 Sharded DDP can work across all DDP variants by adding the additional ``--plugins ddp_sharded`` flag.
 
-Refer to the :ref:`distributed computing guide for more details `.
\ No newline at end of file
+Refer to the :ref:`distributed computing guide for more details `.
+
+
+Sequential Model Parallelism with Checkpointing
+---------------------------------------------------------------------
+PyTorch Lightning integration for Sequential Model Parallelism using `FairScale `_.
+Sequential Model Parallelism splits a sequential module onto multiple GPUs, reducing peak GPU memory requirements substantially.
+
+For more information, refer to :ref:`sequential-parallelism`.
diff --git a/docs/source/training_tricks.rst b/docs/source/training_tricks.rst
index 6ff9dfd0a3..10ee668a97 100644
--- a/docs/source/training_tricks.rst
+++ b/docs/source/training_tricks.rst
@@ -123,3 +123,11 @@ The algorithm in short works by:
     :members: scale_batch_size
 
 .. warning:: Batch size finder is not supported for DDP yet, it is coming soon.
+
+
+Sequential Model Parallelism with Checkpointing
+---------------------------------------------------------------------
+PyTorch Lightning integration for Sequential Model Parallelism using `FairScale `_.
+Sequential Model Parallelism splits a sequential module onto multiple GPUs, reducing peak GPU memory requirements substantially.
+
+For more information, refer to :ref:`sequential-parallelism`.
\ No newline at end of file
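The documentation added in this patch mentions FairScale's auto-balancing but the example hard-codes ``balance=[1, 1, 1, 1]``. Below is a minimal sketch of how a balance could be computed automatically before constructing the plugin. It is illustrative only and rests on assumptions not covered by the patch: that the installed FairScale build ships the torchgpipe-style helper ``balance_by_time`` under ``fairscale.nn.pipe.balance`` with a ``(partitions, module, sample)`` signature; the layer stack and micro-batch shape here are placeholders.

.. code-block:: python

    # Hypothetical sketch: derive a pipeline balance automatically instead of
    # hard-coding it. Assumes ``fairscale.nn.pipe.balance.balance_by_time`` is
    # available in the installed FairScale version; the exact path and signature
    # may differ between releases.
    import torch
    from fairscale.nn.pipe.balance import balance_by_time  # assumed helper location

    from pytorch_lightning import Trainer
    from pytorch_lightning.plugins.ddp_sequential_plugin import DDPSequentialPlugin

    num_gpus = 2

    # Stand-in sequential stack; replace with the layers you actually want to pipeline.
    sequential = torch.nn.Sequential(
        torch.nn.Linear(32, 64),
        torch.nn.ReLU(),
        torch.nn.Linear(64, 64),
        torch.nn.ReLU(),
        torch.nn.Linear(64, 10),
    )

    # Profile one representative micro-batch and split the layers into
    # ``num_gpus`` partitions of roughly equal forward time.
    sample = torch.rand(8, 32)
    balance = balance_by_time(num_gpus, sequential, sample)

    plugin = DDPSequentialPlugin(balance=balance)
    trainer = Trainer(accelerator='ddp', gpus=num_gpus, plugins=[plugin])
    # trainer.fit(model)  # ``model`` is a LightningModule exposing ``sequential_module``

A hand-tuned ``balance`` remains the simpler option when profiling on the target hardware is not practical.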