:orphan:

.. _hivemind_basic:

Training on unreliable mixed GPUs across the internet (Basic)
=============================================================

Collaborative Training addresses the need for top-tier multi-GPU servers by allowing you to train across unreliable machines,
such as local machines or even preemptible cloud compute across the internet.

Under the hood, we use `Hivemind <https://github.com/learning-at-home/hivemind>`_, which provides decentralized training across the internet.

To use Collaborative Training, you first need to install Hivemind.

.. code-block:: bash

    pip install hivemind

The ``HivemindStrategy`` accumulates gradients from all collaborating processes until the ``target_batch_size`` is reached. By default, we use the batch size
of the first batch to determine what each local machine batch contributes towards the ``target_batch_size``. Once the ``target_batch_size`` is reached, an optimizer step
is made on all processes. For example, with ``target_batch_size=8192`` and a local batch size of 32 on each peer, an optimizer step happens once the peers have together processed 256 batches.

.. warning::

    When using ``HivemindStrategy``, note that you cannot use gradient accumulation (``accumulate_grad_batches``). This is because Hivemind manages accumulation internally.

.. code-block:: python

    import pytorch_lightning as pl
    from pytorch_lightning.strategies import HivemindStrategy

    trainer = pl.Trainer(strategy=HivemindStrategy(target_batch_size=8192), accelerator="gpu", devices=1)
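
For reference, a complete ``train.py`` might look like the following sketch; ``LitModel`` and ``train_dataloader``
are placeholders for your own ``LightningModule`` and dataloader, and only the strategy wiring comes from this page:

.. code-block:: python

    import pytorch_lightning as pl
    from pytorch_lightning.strategies import HivemindStrategy

    from my_project import LitModel, train_dataloader  # hypothetical user code

    if __name__ == "__main__":
        model = LitModel()
        trainer = pl.Trainer(
            strategy=HivemindStrategy(target_batch_size=8192),
            accelerator="gpu",
            devices=1,
        )
        # fit() starts (or joins) the collaborative run; gradients are shared via Hivemind.
        trainer.fit(model, train_dataloader)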

.. code-block:: bash

    python train.py
    # Other machines can connect by running the same command:
    # INITIAL_PEERS=... python train.py
    # or by passing the peers to the strategy:
    # HivemindStrategy(initial_peers=...)

A helper message is printed once your training begins, which shows you how to start training on other machines using the same code.
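
If you would rather wire the peers up in code than on the command line, here is a minimal sketch, assuming the
first machine's multiaddresses were exported as a comma-separated ``INITIAL_PEERS`` environment variable
(the variable name mirrors the helper message above and is not required by Lightning itself):

.. code-block:: python

    import os

    import pytorch_lightning as pl
    from pytorch_lightning.strategies import HivemindStrategy

    # The first peer prints its multiaddresses when training starts; we assume they were
    # exported on this machine as a comma-separated INITIAL_PEERS environment variable.
    peers = os.environ["INITIAL_PEERS"].split(",")

    trainer = pl.Trainer(
        strategy=HivemindStrategy(target_batch_size=8192, initial_peers=peers),
        accelerator="gpu",
        devices=1,
    )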