:orphan:

.. _hivemind_basic:

Training on unreliable mixed GPUs across the internet (Basic)
=============================================================

Collaborative Training addresses the need for top-tier multi-GPU servers by allowing you to train across unreliable machines,
such as local machines or even preemptible cloud compute across the internet.

Under the hood, we use `Hivemind <https://github.com/learning-at-home/hivemind>`_, which provides decentralized training across the internet.

To use Collaborative Training, you need to first install Hivemind.

.. code-block:: bash

    pip install hivemind

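To confirm the package is importable before launching a run, a quick check like the following should print the installed version (assuming ``hivemind`` exposes ``__version__``, as released versions do):

.. code-block:: bash

    # Optional sanity check -- prints the installed Hivemind version
    python -c "import hivemind; print(hivemind.__version__)"
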
The ``HivemindStrategy`` accumulates gradients from all processes that are collaborating until they reach a ``target_batch_size``. By default, we use the batch size
of the first batch to determine what each local machine batch contributes towards the ``target_batch_size``. Once the ``target_batch_size`` is reached, an optimizer step
is made on all processes. For example, with a local batch size of 32 on every machine and ``target_batch_size=8192``, roughly 256 local batches are contributed across all peers between optimizer steps.

.. warning::

    When using ``HivemindStrategy``, note that you cannot use gradient accumulation (``accumulate_grad_batches``), because Hivemind manages accumulation internally.

.. code-block:: python

    import pytorch_lightning as pl
    from pytorch_lightning.strategies import HivemindStrategy

    trainer = pl.Trainer(strategy=HivemindStrategy(target_batch_size=8192), accelerator="gpu", devices=1)

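For context, a complete ``train.py`` might look like the following sketch. ``LitModel``, the random dataset, and the local batch size of 32 are illustrative placeholders; only ``HivemindStrategy`` and ``target_batch_size`` come from the snippet above.

.. code-block:: python

    # A hypothetical train.py -- the model, data, and batch size are placeholders.
    import torch
    from torch.utils.data import DataLoader, TensorDataset

    import pytorch_lightning as pl
    from pytorch_lightning.strategies import HivemindStrategy


    class LitModel(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(32, 2)

        def training_step(self, batch, batch_idx):
            x, y = batch
            return torch.nn.functional.cross_entropy(self.layer(x), y)

        def configure_optimizers(self):
            return torch.optim.SGD(self.parameters(), lr=0.1)


    if __name__ == "__main__":
        # Each machine loads its own local data; batches from every peer
        # count towards the shared target_batch_size before an optimizer step.
        dataset = TensorDataset(torch.randn(4096, 32), torch.randint(0, 2, (4096,)))
        train_loader = DataLoader(dataset, batch_size=32)

        trainer = pl.Trainer(
            strategy=HivemindStrategy(target_batch_size=8192),
            accelerator="gpu",
            devices=1,
            max_epochs=1,
        )
        trainer.fit(LitModel(), train_loader)
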
.. code-block:: bash

    python train.py
    # Other machines can connect by running the same command:
    # INITIAL_PEERS=... python train.py
    # or by passing the peers to the strategy:
    # HivemindStrategy(initial_peers=...)

A helper message is printed once your training begins, which shows you how to start training on other machines using the same code.

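If you would rather configure the peers in code than through the ``INITIAL_PEERS`` environment variable, the address printed in that helper message can be handed to the strategy directly. The environment-variable lookup and the single-element list below are illustrative assumptions; check the ``HivemindStrategy`` signature for the exact types it accepts.

.. code-block:: python

    import os

    from pytorch_lightning.strategies import HivemindStrategy

    # The peer address itself is the one printed by the first machine at startup;
    # reading it from INITIAL_PEERS and wrapping it in a list is an assumption here.
    peers = os.environ.get("INITIAL_PEERS")
    strategy = HivemindStrategy(
        target_batch_size=8192,
        initial_peers=[peers] if peers else None,
    )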