Lightning supports model training on a cluster managed by SLURM in the following cases:
- Training on a single CPU or a single GPU.
- Training on multiple GPUs on the same node using DataParallel or DistributedDataParallel.
- Training across multiple GPUs on multiple nodes via DistributedDataParallel.
Note: a node here means a single machine, typically with multiple GPUs.
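For the multi-node case, a minimal sketch of the Trainer setup could look like the following. It assumes the gpus, nb_gpu_nodes, and distributed_backend Trainer arguments from the test-tube-era Lightning API (argument names have changed in later releases), and MyLightningModel is a placeholder for your own LightningModule:
from pytorch_lightning import Trainer

# MyLightningModel is a placeholder for your own LightningModule
my_model = MyLightningModel()

# 5 nodes with 8 GPUs each, synchronized via DistributedDataParallel
# (argument names are assumptions based on the test-tube-era Trainer API)
trainer = Trainer(
    gpus=8,
    nb_gpu_nodes=5,
    distributed_backend='ddp'
)
trainer.fit(my_model)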
Running grid search on a cluster
To use Lightning to run a hyperparameter search (grid search or random search) on a cluster, do the following four things:
(1). Define the parameters for the grid search
from test_tube import HyperOptArgumentParser
# subclass of argparse
parser = HyperOptArgumentParser(strategy='random_search')
parser.add_argument('--learning_rate', default=0.002, type=float, help='the learning rate')
# let's enable optimizing over the number of layers in the network
parser.opt_list('--nb_layers', default=2, type=int, tunable=True, options=[2, 4, 8])
hparams = parser.parse_args()
NOTE: You must set tunable=True for an argument to be included in the permutation set; otherwise test-tube will use its default value. This flag is useful when you don't want to search over an argument and want to use the default instead.
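As a small illustration (the second argument and its option values are made up for this example), here is how a tunable and a non-tunable argument differ:
from test_tube import HyperOptArgumentParser

parser = HyperOptArgumentParser(strategy='grid_search')
# tunable=True: all 3 options enter the permutation set
parser.opt_list('--nb_layers', default=2, type=int, tunable=True, options=[2, 4, 8])
# tunable=False: never searched, every trial uses the default of 0.002
parser.opt_list('--learning_rate', default=0.002, type=float, tunable=False,
                options=[0.001, 0.002, 0.01])
hparams = parser.parse_args()
# grid_search therefore generates 3 permutations, one per nb_layers option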
(2). Define the cluster options in the SlurmCluster object (over 5 nodes and 8 GPUs)
from test_tube.hpc import SlurmCluster
# hparams is the test-tube hyperparameter object returned by parse_args() in step (1)
# see https://williamfalcon.github.io/test-tube/hyperparameter_optimization/HyperOptArgumentParser/
# init cluster
cluster = SlurmCluster(
    hyperparam_optimizer=hparams,
    log_path='/path/to/log/results/to',
    python_cmd='python3'
)
# let the cluster know where to email for a change in job status (ie: complete, fail, etc...)
cluster.notify_job_status(email='some@email.com', on_done=True, on_fail=True)
# set the job options. In this instance, we'll run 20 different models,
# each with its own set of hyperparameters and its own allocation of 8 GPUs across 5 nodes
cluster.per_experiment_nb_gpus = 8
cluster.per_experiment_nb_nodes = 5
# we'll request 10GB of memory per node
cluster.memory_mb_per_node = 10000
# set a walltime of 10 minutes
cluster.job_time = '10:00'
(3). Make a main function with your model and trainer. Each job will call this function with a particular hparams configuration.
from pytorch_lightning import Trainer
def train_fx(trial_hparams, cluster_manager, _):
    # trial_hparams holds one specific set of hyperparameters for this job
    my_model = MyLightningModel(trial_hparams)
    # cluster_manager is the SlurmCluster object passed in by test-tube (not needed here)
    trainer = Trainer()
    trainer.fit(my_model)
(4). Start the grid/random search
# run the models on the cluster
cluster.optimize_parallel_cluster_gpu(
    train_fx,
    nb_trials=20,
    job_name='my_grid_search_exp_name',
    job_display_name='my_exp')
NOTE: nb_trials specifies how many of the possible permutations to run. With grid_search, trials are taken in depth-first order; with random_search, the first k shuffled permutations are used. FYI, random search has been shown to be just as good as any Bayesian optimization method when using a reasonable number of samples (60); see this paper for more information.
Walltime auto-resubmit
Lightning automatically resubmits a job when it is about to hit its walltime. To enable this, make sure SLURM sends the SIGUSR1 signal to your job by setting it in your SLURM script.
# ask SLURM to send SIGUSR1 90 seconds before the walltime is reached
#SBATCH --signal=SIGUSR1@90
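For reference, a complete submission script for a single job might look like the sketch below. The job name, partition, resource numbers, and train.py entry point are hypothetical; adapt them to your cluster:
#!/bin/bash
#SBATCH --job-name=lightning_exp       # hypothetical job name
#SBATCH --partition=gpu                # cluster-specific partition (hypothetical)
#SBATCH --nodes=1                      # number of nodes
#SBATCH --gres=gpu:8                   # GPUs per node
#SBATCH --mem=10000                    # memory per node in MB
#SBATCH --time=00:10:00                # walltime of 10 minutes
#SBATCH --signal=SIGUSR1@90            # send SIGUSR1 90 seconds before the walltime

# train.py is a placeholder for your own training entry point
srun python3 train.py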
When Lightning receives the SIGUSR1 signal, it will:
- save a checkpoint with 'hpc_ckpt' in the name.
- resubmit the job using the SLURM_JOB_ID.
When the script starts again, Lightning will:
- search for a 'hpc_ckpt' checkpoint.
- restore the model, optimizers, schedulers, epoch, etc...