docs: update references to ext. integrations (#19248)

* drop stale projects
* hpu 1.3.0
* copy & index
* prune

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Jirka Borovec 2024-01-09 13:12:53 +01:00 committed by GitHub
parent 8663460423
commit f62e312185
9 changed files with 57 additions and 338 deletions

View File

@@ -442,9 +442,20 @@ class AssistantCLI:
target_dir: str = "docs/source-pytorch/XXX",
checkout: str = "refs/tags/1.0.0",
source_dir: str = "docs/source",
single_page: Optional[str] = None,
as_orphan: bool = False,
) -> None:
"""Pull docs pages from external source and append to local docs."""
"""Pull docs pages from external source and append to local docs.
Args:
gh_user_repo: standard GitHub user/repo string
target_dir: relative location inside the docs folder
checkout: specific tag or branch to checkout
source_dir: relative location inside the remote / external repo
single_page: copy only a single page from the remote repo and name it after the repo
as_orphan: append orphan statement to the page
"""
import zipfile
zip_url = f"https://github.com/{gh_user_repo}/archive/{checkout}.zip"
@@ -464,6 +475,14 @@ class AssistantCLI:
assert len(zip_dirs) == 1
repo_dir = zip_dirs[0]
if single_page: # special case for copying single page
single_page = os.path.join(repo_dir, source_dir, single_page)
assert os.path.isfile(single_page), f"File '{single_page}' does not exist."
name = re.sub(r"lightning[-_]?", "", gh_user_repo.split("/")[-1])
new_rst = os.path.join(_PROJECT_ROOT, target_dir, f"{name}.rst")
AssistantCLI._copy_rst(single_page, new_rst, as_orphan=as_orphan)
return
# continue with copying all pages
ls_pages = glob.glob(os.path.join(repo_dir, source_dir, "*.rst"))
ls_pages += glob.glob(os.path.join(repo_dir, source_dir, "**", "*.rst"))
for rst in ls_pages:
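As an aside on the ``single_page`` branch above: the copied page is renamed after the integration repo by stripping the ``lightning`` prefix. A quick standalone illustration of that naming (the repo string here is only an example):

import re

# illustrative only: how the single-page file name is derived from the repo name
repo = "Lightning-Universe/lightning-Hivemind".split("/")[-1]  # -> "lightning-Hivemind"
name = re.sub(r"lightning[-_]?", "", repo)  # -> "Hivemind", so the page is written as Hivemind.rst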

View File

@@ -117,41 +117,4 @@ Third-party strategies
**********************
Cutting-edge Lightning strategies are being developed by third parties outside of Lightning.
If you want to try some of the latest and greatest features for model-parallel training, check out these integrations:
.. raw:: html
<div class="display-card-container">
<div class="row">
.. Add callout items below this line
.. displayitem::
:header: Colossal-AI
:description: Has advanced distributed training algorithms and system optimizations
:col_css: col-md-4
:button_link: ../integrations/strategies/colossalai.html
:height: 160
:tag: advanced
.. displayitem::
:header: Bagua
:description: Has advanced distributed training algorithms and system optimizations
:col_css: col-md-4
:button_link: ../integrations/strategies/bagua.html
:height: 160
:tag: advanced
.. displayitem::
:header: Hivemind
:description: For training on unreliable mixed GPUs across the internet
:col_css: col-md-4
:button_link: ../integrations/strategies/hivemind.html
:height: 160
:tag: advanced
.. raw:: html
</div>
</div>
If you want to try some of the latest and greatest features for model-parallel training, check out these :doc:`strategies <../integrations/strategies/index>`.

View File

@@ -88,11 +88,11 @@ _transform_changelog(
os.path.join(_PATH_HERE, _FOLDER_GENERATED, "CHANGELOG.md"),
)
# Copy Accelerator docs
assist_local.AssistantCLI.pull_docs_files(
gh_user_repo="Lightning-AI/lightning-Habana",
target_dir="docs/source-pytorch/integrations/hpu",
checkout="4eca3d9a9744e24e67924ba1534f79b55b59e5cd", # this is post `refs/tags/1.2.0`
checkout="refs/tags/1.3.0",
)
assist_local.AssistantCLI.pull_docs_files(
gh_user_repo="Lightning-AI/lightning-Graphcore",
@@ -107,6 +107,13 @@ for img in ["_static/images/ipu/profiler.png"]:
os.makedirs(os.path.dirname(img_), exist_ok=True)
urllib.request.urlretrieve(f"{URL_RAW_DOCS_GRAPHCORE}/{img}", img_)
# Copy strategies docs as single pages
assist_local.AssistantCLI.pull_docs_files(
gh_user_repo="Lightning-Universe/lightning-Hivemind",
target_dir="docs/source-pytorch/integrations/strategies",
checkout="3b14f766200aff8fe7153be19a7bd92440dea3cf", # this is post release version including moved overview page
single_page="overview.rst",
)
if _FETCH_S3_ASSETS:
fetch_external_assets(

View File

@@ -105,24 +105,7 @@ Third-party Strategies
**********************
There are powerful third-party strategies that integrate well with Lightning but aren't maintained as part of the ``lightning`` package.
.. list-table:: List of third-party strategy implementations
:widths: 20 20 20
:header-rows: 1
* - Name
- Package
- Description
* - ColossalAI
- `Lightning-AI/lightning-colossalai <https://github.com/Lightning-AI/lightning-colossalai>`_
- Colossal-AI provides a collection of parallel components and aims to let you write distributed deep learning models just as you would write a model on your laptop. `Learn more. <https://www.colossalai.org/>`__
* - Bagua
- `Lightning-AI/lightning-Bagua <https://github.com/Lightning-AI/lightning-Bagua>`_
- Bagua is a deep learning training acceleration framework for PyTorch, with advanced distributed training algorithms and system optimizations. `Learn more. <https://tutorials.baguasys.com/>`__
* - hivemind
- `Lightning-AI/lightning-hivemind <https://github.com/Lightning-AI/lightning-hivemind>`_
- Hivemind is a PyTorch library for decentralized deep learning across the Internet. Its intended usage is training one large model on hundreds of computers from different universities, companies, and volunteers. `Learn more. <https://github.com/learning-at-home/hivemind>`__
Check out the gallery :doc:`here <../integrations/strategies/index>`.
----

View File

@@ -38,6 +38,7 @@
Remote filesystem and FSSPEC <../common/remote_fs>
Strategy <../extensions/strategy>
Strategy registry <../advanced/strategy_registry>
Strategy integrations <../integrations/strategies/index>
Style guide <../starter/style_guide>
SWA <../advanced/training_tricks>
SLURM <../clouds/cluster_advanced>

View File

@@ -1,53 +0,0 @@
:orphan:
#####
Bagua
#####
The `Bagua strategy <https://github.com/Lightning-AI/lightning-Bagua>`_ speeds up PyTorch training from a single node to large scale.
Bagua is a deep learning training acceleration framework for PyTorch, with advanced distributed training algorithms and system optimizations.
Bagua currently supports:
- **Advanced Distributed Training Algorithms**: Users can extend training on a single GPU to multiple GPUs (possibly across multiple machines) by adding just a few lines of code (optionally in `elastic mode <https://tutorials.baguasys.com/elastic-training/>`_). One prominent feature of Bagua is its flexible system abstraction that supports state-of-the-art system relaxation techniques for distributed training. So far, Bagua has integrated communication primitives including
- Centralized Synchronous Communication (e.g. `Gradient AllReduce <https://tutorials.baguasys.com/algorithms/gradient-allreduce>`_)
- Decentralized Synchronous Communication (e.g. `Decentralized SGD <https://tutorials.baguasys.com/algorithms/decentralized>`_)
- Low Precision Communication (e.g. `ByteGrad <https://tutorials.baguasys.com/algorithms/bytegrad>`_)
- Asynchronous Communication (e.g. `Async Model Average <https://tutorials.baguasys.com/algorithms/async-model-average>`_)
- `Cached Dataset <https://tutorials.baguasys.com/more-optimizations/cached-dataset>`_: When samples in a dataset need tedious preprocessing, or reading the dataset itself is slow, these steps can become a major bottleneck of the whole training process. Bagua provides a cached dataset that speeds this up by caching data samples in memory, so that reading them after the first pass is much faster.
- `TCP Communication Acceleration (Bagua-Net) <https://tutorials.baguasys.com/more-optimizations/bagua-net>`_: Bagua-Net is a low-level communication acceleration feature provided by Bagua. It can greatly improve AllReduce throughput on TCP networks. You can enable Bagua-Net optimization on any distributed training job that uses NCCL for GPU communication (this includes PyTorch-DDP, Horovod, DeepSpeed, and more).
- `Performance Autotuning <https://tutorials.baguasys.com/performance-autotuning/>`_: Bagua can automatically tune system parameters to achieve the highest throughput.
- `Generic Fused Optimizer <https://tutorials.baguasys.com/more-optimizations/generic-fused-optimizer>`_: Bagua provides a generic fused optimizer which improves optimizer performance by fusing the `.step()` operation across multiple layers. It can be applied to arbitrary PyTorch optimizers, in contrast to `NVIDIA Apex <https://nvidia.github.io/apex/optimizers.html>`_'s approach, where only specific optimizers are implemented.
- `Load Balanced Data Loader <https://tutorials.baguasys.com/more-optimizations/load-balanced-data-loader>`_: When the computational complexity of samples in the training data differs, for example in NLP and speech tasks where each sample has a different length, distributed training throughput can be greatly improved by using Bagua's load-balanced data loader, which distributes samples so that each worker's workload is similar.
You can install the Bagua integration by running
.. code-block:: bash
pip install lightning-bagua
This will install both the `bagua <https://pypi.org/project/bagua/>`_ package and the ``BaguaStrategy`` for the Lightning Trainer:
.. code-block:: python
trainer = Trainer(strategy="bagua", accelerator="gpu", devices=...)
You can tune several settings by instantiating the strategy object and passing options in:
.. code-block:: python
from lightning_bagua import BaguaStrategy
strategy = BaguaStrategy(algorithm="bytegrad")
trainer = Trainer(strategy=strategy, accelerator="gpu", devices=...)
.. note::
* Bagua is only supported on Linux systems with GPU(s).
See `Bagua Tutorials <https://tutorials.baguasys.com/>`_ for more details on installation and advanced features.

View File

@@ -1,112 +0,0 @@
:orphan:
###########
Colossal-AI
###########
The `Colossal-AI strategy <https://github.com/Lightning-AI/lightning-colossalai>`_ implements ZeRO-DP with chunk-based memory management.
With this chunk mechanism, very large models can be trained with a small number of GPUs.
It supports larger trainable model sizes and batch sizes than usual heterogeneous training by reducing CUDA memory fragmentation and CPU memory consumption.
It also speeds up this kind of heterogeneous training by fully utilizing all available resources.
.. warning:: This is an :ref:`experimental <versioning:Experimental API>` feature.
When enabling chunk mechanism, a set of consecutive parameters are stored in a chunk, and then the chunk is sharded across different processes.
This can reduce communication and data transmission frequency and fully utilize communication and PCI-E bandwidth, which makes training faster.
Unlike traditional implementations, which adopt static memory partition, we implemented a dynamic heterogeneous memory management system named Gemini.
During the first training step, the warmup phase samples the maximum non-model data memory (memory usage except parameters, gradients, and optimizer states).
In later training, it will use the collected memory usage information to evict chunks dynamically.
Gemini allows you to fit much larger models with limited GPU memory.
According to our benchmark results, we can train models with up to 24 billion parameters on a single GPU.
You can install the Colossal-AI integration by running
.. code-block:: bash
pip install lightning-colossalai
This will install both the `colossalai <https://colossalai.org/docs/get_started/installation>`_ package and the ``ColossalAIStrategy`` for the Lightning Trainer:
.. code-block:: python
trainer = Trainer(strategy="colossalai", precision=16, devices=...)
You can tune several settings by instantiating the strategy object and passing options in:
.. code-block:: python
from lightning_colossalai import ColossalAIStrategy
strategy = ColossalAIStrategy(...)
trainer = Trainer(strategy=strategy, precision=16, devices=...)
See a full benchmark example with a `GPT-2 model <https://github.com/hpcaitech/ColossalAI-Pytorch-lightning/tree/main/benchmark/gpt>`_ of up to 24 billion parameters.
.. note::
* The only accelerator ColossalAI supports is ``"gpu"``, but CPU resources will be used when the placement policy is set to "auto" or "cpu".
* The only precision which ColossalAI allows is 16-bit mixed precision (FP16).
* It only supports a single optimizer, which currently must be ``colossalai.nn.optimizer.CPUAdam`` or ``colossalai.nn.optimizer.HybridAdam``. You can set ``adamw_mode`` to ``False`` to use plain Adam. Note that ``HybridAdam`` is highly optimized: it uses a fused CUDA kernel and a parallel CPU kernel.
Using ``HybridAdam`` is recommended, since it updates parameters on both GPU and CPU (a minimal sketch follows these notes).
* Your model must be created using the :meth:`~lightning.pytorch.core.LightningModule.configure_model` method.
* ``ColossalAIStrategy`` doesn't support gradient accumulation as of now.
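Below is a minimal sketch of an optimizer definition that satisfies the constraint above. It assumes ``HybridAdam`` accepts the module parameters and a learning rate, as in upstream Colossal-AI; the learning rate is purely illustrative.
.. code-block:: python

    from colossalai.nn.optimizer import HybridAdam


    class MyModel(LightningModule):
        def configure_optimizers(self):
            # HybridAdam updates parameters on both GPU and CPU;
            # pass adamw_mode=False for plain Adam behaviour (assumed keyword)
            return HybridAdam(self.parameters(), lr=1e-3)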
.. _colossal_placement_policy:
Model Definition
================
ColossalAI requires the layers of your model to be created in the special :meth:`~lightning.pytorch.core.LightningModule.configure_model` hook.
This allows the strategy to efficiently shard your model before materializing the weight tensors.
.. code-block:: python
class MyModel(LightningModule):
def __init__(self):
super().__init__()
# don't instantiate layers here
# move the creation of layers to `configure_model`
def configure_model(self):
# create all your layers here
self.layers = nn.Sequential(...)
Placement Policy
================
Placement policies can help users fully exploit their GPU-CPU heterogeneous memory space for better training efficiency.
There are three options for the placement policy: "cpu", "cuda", and "auto".
When the placement policy is set to "cpu", all participating parameters are offloaded to CPU memory immediately at the end of every autograd operation.
This way, the "cpu" placement policy uses the least CUDA memory.
It is the best choice for users who want to dramatically enlarge their model size or training batch size.
With the "cuda" option, all parameters are placed in CUDA memory and no CPU resources are used during training.
It is for users who have plenty of CUDA memory.
The third option, "auto", enables Gemini.
It monitors CUDA memory consumption during the warmup phase and collects the CUDA memory usage of all autograd operations.
In later training steps, Gemini automatically manages data transmission between GPU and CPU according to the collected CUDA memory usage information.
It is the fastest option when CUDA memory is sufficient.
Here's an example of changing the placement policy to "cpu".
.. code-block:: python
from lightning_colossalai import ColossalAIStrategy
model = MyModel()
my_strategy = ColossalAIStrategy(placement_policy="cpu")
trainer = Trainer(accelerator="gpu", devices=4, precision=16, strategy=my_strategy)
trainer.fit(model)

View File

@@ -1,114 +0,0 @@
:orphan:
################################################################
Hivemind - training on unreliable mixed GPUs across the internet
################################################################
Collaborative Training addresses the need for top-tier multi-GPU servers by allowing you to train across unreliable machines,
such as local machines or even preemptible cloud compute across the internet.
Under the hood, we use `Hivemind <https://github.com/learning-at-home/hivemind>`__, which provides decentralized training across the internet.
.. warning:: This is an :ref:`experimental <versioning:Experimental API>` feature.
To use Collaborative Training, you first need to install this extension.
.. code-block:: bash
pip install lightning-hivemind
This will install both the `Hivemind <https://pypi.org/project/hivemind/>`__ package and the ``HivemindStrategy`` for the Lightning Trainer.
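A minimal usage sketch follows; the arguments mirror the fuller examples later on this page, and ``target_batch_size`` is the global batch size the collaboration accumulates before each optimizer step. Treat it as illustrative rather than canonical.
.. code-block:: python

    from lightning import Trainer
    from lightning_hivemind.strategy import HivemindStrategy

    # join (or start) a collaborative run with default communication settings
    trainer = Trainer(
        strategy=HivemindStrategy(target_batch_size=8192),
        accelerator="gpu",
        devices=1,
    )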
Reducing Communication By Overlapping Communication
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
We can reduce the impact of communication across all machines by overlapping communication with our training iterations. In short, we enable communication to happen
in the background of training.
Overlap Gradient and State Averaging
""""""""""""""""""""""""""""""""""""
When the target batch size is reached, all processes that are included in the step send gradients and model states to each other. By enabling some flags through
the strategy, communication can happen in the background. This allows training to continue (with slightly outdated weights) while providing the means
to overlap communication with computation.
.. warning::
Enabling overlapping communication means convergence will be slightly affected.
.. note::
Enabling these flags means that you must pass in a ``scheduler_fn`` to the ``HivemindStrategy`` instead of relying on a scheduler from ``configure_optimizers``.
The optimizer is re-created by Hivemind, and as a result, the scheduler has to be re-created.
.. code-block:: python
import torch
from functools import partial
from lightning import Trainer
from lightning_hivemind.strategy import HivemindStrategy
trainer = Trainer(
strategy=HivemindStrategy(
target_batch_size=8192,
delay_state_averaging=True,
delay_grad_averaging=True,
delay_optimizer_step=True,
offload_optimizer=True, # required to delay averaging
scheduler_fn=partial(torch.optim.lr_scheduler.ExponentialLR, gamma=...),
),
accelerator="gpu",
devices=1,
)
Reducing GPU Memory requirements by re-using buffers & CPU offloading
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
We can also offload the optimizer state to the CPU while reusing gradient buffers to reduce each machine's memory requirements.
Offloading Optimizer State to the CPU
"""""""""""""""""""""""""""""""""""""
Offloading the optimizer state to the CPU works the same way as DeepSpeed's ZeRO-Stage-2 offload, where we save GPU memory by keeping all optimizer states on the CPU.
.. note::
Enabling these flags means that you must pass in a ``scheduler_fn`` to the ``HivemindStrategy`` instead of relying on a scheduler from ``configure_optimizers``.
The optimizer is re-created by Hivemind, and as a result, the scheduler has to be re-created.
We suggest enabling offloading and overlapping communication to hide the additional overhead from having to communicate with the CPU.
.. code-block:: python
import torch
from functools import partial
from lightning import Trainer
from lightning_hivemind.strategy import HivemindStrategy
trainer = Trainer(
strategy=HivemindStrategy(
target_batch_size=8192,
offload_optimizer=True,
scheduler_fn=partial(torch.optim.lr_scheduler.ExponentialLR, gamma=...),
),
accelerator="gpu",
devices=1,
)
Re-using Gradient Buffers
"""""""""""""""""""""""""
By default, Hivemind accumulates gradients in a separate buffer. This means additional GPU memory is required to store gradients. You can enable re-using the model parameter gradient buffers by passing ``reuse_grad_buffers=True`` to the ``HivemindStrategy``.
.. warning::
The ``HivemindStrategy`` will override ``zero_grad`` in your ``LightningModule`` to have no effect. This is because gradients are accumulated in the model
and Hivemind manages when they need to be cleared.
.. code-block:: python
from lightning import Trainer
from lightning_hivemind.strategy import HivemindStrategy
trainer = Trainer(
strategy=HivemindStrategy(target_batch_size=8192, reuse_grad_buffers=True), accelerator="gpu", devices=1
)

View File

@@ -0,0 +1,25 @@
.. _strategy-integrations:
Additional external Strategy integrations
=========================================
.. raw:: html
<div class="display-card-container">
<div class="row">
.. Add callout items below this line
.. displayitem::
:header: Hivemind
:description: Collaborative Training tries to solve the need for top-tier multi-GPU servers by allowing you to train across unreliable machines.
:col_css: col-md-4
:button_link: Hivemind.html
:height: 150
:tag: hivemind
.. raw:: html
</div>
</div>