lightning/tests/models/data/horovod/train_default_model.py


"""
This script is meant to be executed from `../../test_horovod.py`.
Because Horovod uses a parallel programming model similar to MPI, unit tests for collective
ops like allreduce need to be run in parallel. The most common approach for running parallel
Horovod workers is to launch multiple replicas of the training script via the `horovodrun`
command-line tool:
.. code-block:: bash
horovodrun -np 2 python train_default_model.py ...
Individual test parameters are configured by the serialized `--trainer-options` JSON object.
An non-zero exit code from this script on any rank will indicate failure, while a zero exit code
across all ranks indicates success.
"""
import argparse
import json
import os
import sys

# this is needed as e.g. Conda does not use the `PYTHONPATH` env var, unlike pip and/or virtualenv
sys.path = os.getenv('PYTHONPATH').split(':') + sys.path

from pytorch_lightning import Trainer  # noqa: E402
from pytorch_lightning.callbacks import ModelCheckpoint  # noqa: E402
from pytorch_lightning.trainer.states import TrainerState  # noqa: E402
from pytorch_lightning.utilities import _HOROVOD_AVAILABLE  # noqa: E402
if _HOROVOD_AVAILABLE:
    import horovod.torch as hvd  # noqa: E402
else:
    print('You requested to import Horovod which is missing or not supported for your OS.')

from tests.base import EvalModelTemplate  # noqa: E402
from tests.base.develop_pipelines import run_prediction  # noqa: E402
from tests.base.develop_utils import reset_seed, set_random_master_port  # noqa: E402
parser = argparse.ArgumentParser()
parser.add_argument('--trainer-options', required=True)
parser.add_argument('--on-gpu', action='store_true', default=False)
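As a quick standalone sanity check (a sketch, not part of the script — `demo_parser` mirrors the parser defined above with a hypothetical sample argv), the JSON options round-trip through argparse like this:

```python
import argparse
import json

# Mirror of the script's parser, for demonstration only.
demo_parser = argparse.ArgumentParser()
demo_parser.add_argument('--trainer-options', required=True)
demo_parser.add_argument('--on-gpu', action='store_true', default=False)

# Simulate the argv that `horovodrun ... --trainer-options '{"max_epochs": 1}'` would produce.
args = demo_parser.parse_args(['--trainer-options', '{"max_epochs": 1}'])
opts = json.loads(args.trainer_options)
assert opts == {'max_epochs': 1}
assert args.on_gpu is False
```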
def run_test_from_config(trainer_options):
    """Trains the default model with the given config."""
    set_random_master_port()
    reset_seed()

    ckpt_path = trainer_options['weights_save_path']
    trainer_options.update(callbacks=[ModelCheckpoint(dirpath=ckpt_path)])

    model = EvalModelTemplate()
    trainer = Trainer(**trainer_options)
    trainer.fit(model)
    assert trainer.state == TrainerState.FINISHED, f"Training failed with {trainer.state}"

    # Horovod should be initialized following training. If not, this will raise an exception.
    assert hvd.size() == 2

    if trainer.global_rank > 0:
        return

    # test model loading
    pretrained_model = EvalModelTemplate.load_from_checkpoint(trainer.checkpoint_callback.best_model_path)

    # test new model accuracy
    test_loaders = model.test_dataloader()
    if not isinstance(test_loaders, list):
        test_loaders = [test_loaders]
    for dataloader in test_loaders:
        run_prediction(pretrained_model, dataloader)

    # test HPC saving
    trainer.checkpoint_connector.hpc_save(ckpt_path, trainer.logger)
    # test HPC loading
    checkpoint_path = trainer.checkpoint_connector.get_max_ckpt_path_from_folder(ckpt_path)
    trainer.checkpoint_connector.hpc_load(checkpoint_path, on_gpu=args.on_gpu)

    if args.on_gpu:
        trainer = Trainer(gpus=1, accelerator='horovod', max_epochs=1)
        # Test the root_gpu property
        assert trainer.root_gpu == hvd.local_rank()
if __name__ == "__main__":
args = parser.parse_args()
run_test_from_config(json.loads(args.trainer_options))
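The docstring's success criterion — a zero exit code on every rank, failure if any rank is non-zero — reduces to a one-line check on the launcher side. A minimal sketch (the helper name is hypothetical, not part of the real harness):

```python
def all_ranks_succeeded(exit_codes):
    """Return True only when every Horovod rank exited with code 0."""
    return all(code == 0 for code in exit_codes)


# Both ranks exiting cleanly means success; a single failing rank fails the run.
assert all_ranks_succeeded([0, 0])
assert not all_ranks_succeeded([0, 1])
```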