lightning/pytorch_lightning/strategies/ddp2.py

# Copyright The PyTorch Lightning team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import Dict

import torch

from pytorch_lightning.strategies.ddp import DDPStrategy
from pytorch_lightning.utilities.apply_func import apply_to_collection
from pytorch_lightning.utilities.types import _METRIC_COLLECTION


class DDP2Strategy(DDPStrategy):
    """DDP2 behaves like DP in one node, but synchronization across nodes behaves like in DDP."""

    strategy_name = "ddp2"

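    @property
    def global_rank(self) -> int:
        # One process drives all devices of a node, so the node index serves as the global rank.
        return self.node_rank

    @property
    def world_size(self) -> int:
        # Likewise, the number of participating processes equals the number of nodes.
        return self.num_nodes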

    def reduce(self, collection: _METRIC_COLLECTION, *args, **kwargs) -> _METRIC_COLLECTION:
        """Reduces each tensor in a collection to its mean. It can also be applied to a single tensor. In DDP2,
        the reduction here is only across local devices within the node.

        Args:
            collection: The collection of tensors to sync and reduce.
            *args: ignored for DDP2
            **kwargs: ignored for DDP2

        Return:
            The collection with every tensor replaced by its mean. Values that are not tensors, or that do not
            contain tensors, are returned unchanged.
        """

        def mean(t: torch.Tensor) -> torch.Tensor:
            original_dtype = t.dtype
            return t.float().mean().to(original_dtype)

        return apply_to_collection(collection, torch.Tensor, mean)

    @property
    def root_device(self):
        return self.parallel_devices[0]

    def model_to_device(self):
        # no need to do anything when model is wrapped in torch.nn.DataParallel
        pass

    @property
    def distributed_sampler_kwargs(self):
        # Data is sharded per node rather than per device: one replica for each node.
        return dict(num_replicas=self.num_nodes, rank=self.global_rank)

    @property
    def _is_single_process_single_device(self) -> bool:
        return False

    def set_world_ranks(self) -> None:
        if self.cluster_environment is None:
            return
        self.cluster_environment.set_global_rank(self.node_rank)
        self.cluster_environment.set_world_size(self.num_nodes)

    @classmethod
    def register_strategies(cls, strategy_registry: Dict) -> None:
        strategy_registry.register(
            cls.strategy_name,
            cls,
            # In a classmethod ``cls`` is the class itself; ``cls.__class__`` would be its metaclass.
            description=cls.__name__,
        )
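
The `distributed_sampler_kwargs` property above passes `num_replicas=self.num_nodes` and `rank=self.global_rank`, so data is split once per node instead of once per device. The following is a minimal sketch of what that means for the indices each replica sees. `shard_indices` is a hypothetical helper, not part of Lightning, and it assumes an unshuffled, padded, strided split similar to what `torch.utils.data.DistributedSampler` does with `shuffle=False` and `drop_last=False`.

```python
from typing import List


def shard_indices(dataset_len: int, num_replicas: int, rank: int) -> List[int]:
    """Return the dataset indices assigned to one replica (hypothetical helper).

    Assumption: indices are padded by wrapping around so that every replica
    gets the same number of samples, then split with a stride of num_replicas.
    """
    if dataset_len <= 0:
        raise ValueError("dataset_len must be positive")
    if not 0 <= rank < num_replicas:
        raise ValueError("rank must be in [0, num_replicas)")

    per_replica = -(-dataset_len // num_replicas)  # ceiling division
    total = per_replica * num_replicas

    # Pad by repeating indices from the start until the list divides evenly.
    indices = list(range(dataset_len))
    padded = (indices * (total // dataset_len + 1))[:total]

    # Each replica takes every num_replicas-th index, starting at its rank.
    return padded[rank:total:num_replicas]


# Illustrative setup: 2 nodes with 4 devices each, 10 samples.
num_nodes, devices_per_node, dataset_len = 2, 4, 10

# DDP2-style sharding: one replica per node, rank is the node rank.
per_node = [shard_indices(dataset_len, num_nodes, node) for node in range(num_nodes)]

# DDP-style sharding for comparison: one replica per device.
per_device = [
    shard_indices(dataset_len, num_nodes * devices_per_node, r)
    for r in range(num_nodes * devices_per_node)
]
```

With per-node sharding each node receives half of the dataset, and splitting a batch across the node's local devices is left to the DP-style wrapper. With per-device sharding each process receives a much smaller slice of its own.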