cleaned up demos

This commit is contained in:
William Falcon 2019-10-05 14:21:12 -04:00
parent 94f89e8e10
commit e739c79819
6 changed files with 14 additions and 395 deletions

View File

@ -1,107 +1,7 @@
# Multi-node examples
Use these templates for multi-node training.
The main complexity around cluster training is how you submit the SLURM jobs.
# Multi-node example
## Test-tube
Lightning uses test-tube to submit SLURM jobs and to run hyperparameter searches on a cluster.
Run this module to launch a job that runs on 2 nodes, each using 2 GPUs.
To run a hyperparameter search, we normally add the values to search over to the hyperparameter optimizer (the `HyperOptArgumentParser`):
```python
import os

from test_tube import HyperOptArgumentParser

from examples.basic_examples.lightning_module_template import LightningTemplateModel

parser = HyperOptArgumentParser(strategy='grid_search')
parser.opt_list('--drop_prob', default=0.2, options=[0.2, 0.5], type=float, tunable=True)
parser.opt_list('--learning_rate', default=0.001, type=float,
                options=[0.0001, 0.0005, 0.001],
                tunable=True)

# give your model a chance to add its own parameters
root_dir = os.path.dirname(os.path.realpath(__file__))
parser = LightningTemplateModel.add_model_specific_args(parser, root_dir)

# parse args
hyperparams = parser.parse_args()
```
The above sets up a grid search over learning rate and drop probability. Pass the resulting `hyperparams`
object to a `SlurmCluster` to run the grid search:
```python
cluster = SlurmCluster(
    hyperparam_optimizer=hyperparams,
    log_path='/path/to/log/slurm/files',
)

# ... configure cluster options

# run grid search on cluster
nb_trials = 6  # (2 drop probs * 3 lrs)
cluster.optimize_parallel_cluster_gpu(
    YourMainFunction,
    nb_trials=nb_trials,
    job_name=hyperparams.experiment_name
)
```
Running the above will launch 6 jobs, one for each combination of drop probability and learning rate.
The `tunable` parameter must be set to `True` to add that argument to the search space; otherwise
Test-Tube will just use the `default` value.
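For example, here is a minimal sketch (it reuses only the parser shown above; `--batch_size` is just an illustrative argument name) of how `tunable` controls whether an option joins the search space:
```python
from test_tube import HyperOptArgumentParser

parser = HyperOptArgumentParser(strategy='grid_search')

# tunable=True: both values become part of the grid search
parser.opt_list('--drop_prob', default=0.2, options=[0.2, 0.5], type=float, tunable=True)

# tunable=False: the options are ignored and the default (32) is always used
parser.opt_list('--batch_size', default=32, options=[32, 64, 128], type=int, tunable=False)

hyperparams = parser.parse_args()
```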
## SLURM Flags
However you decide to submit your jobs, debugging requires a few flags. Without these flags, you'll
see an NCCL error instead of the actual error that caused the bug.
```sh
export NCCL_DEBUG=INFO
export PYTHONFAULTHANDLER=1
```
On some clusters you might need to set the network interface with this flag.
```sh
export NCCL_SOCKET_IFNAME=^docker0,lo
```
You might also need to load the latest version of NCCL
```sh
module load NCCL/2.4.7-1-cuda.10.0
```
Finally, you must set the master port (usually a random number between 12k and 20k).
```sh
# random port between 12k and 20k
export MASTER_PORT=$((12000 + RANDOM % 8000))
```
## Simplest example
1. Modify this script with your own `CoolModel` file (a minimal sketch of such a model is shown after these steps).
2. Update and submit [this bash script](https://github.com/williamFalcon/pytorch-lightning/blob/master/examples/new_project_templates/multi_node_examples/minimal_multi_node_demo_script.sh)
```bash
sbatch minimal_multi_node_demo_script.sh
```
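A rough sketch of what such a `CoolModel` could look like, assuming the `LightningModule` example from the main README (this file is not part of the repo, MNIST is just a stand-in dataset, and hook names and decorators varied across early Lightning versions, so treat it as illustrative):
```python
import os

import torch
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import MNIST

import pytorch_lightning as pl


class CoolModel(pl.LightningModule):
    """Minimal illustrative LightningModule."""

    def __init__(self):
        super(CoolModel, self).__init__()
        self.l1 = torch.nn.Linear(28 * 28, 10)

    def forward(self, x):
        return torch.relu(self.l1(x.view(x.size(0), -1)))

    def training_step(self, batch, batch_nb):
        x, y = batch
        y_hat = self.forward(x)
        return {'loss': F.cross_entropy(y_hat, y)}

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.02)

    @pl.data_loader
    def train_dataloader(self):
        # download MNIST on first use and feed it to the trainer
        return DataLoader(MNIST(os.getcwd(), train=True, download=True,
                                transform=transforms.ToTensor()), batch_size=32)
```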
## Grid search on a cluster
#### Option 1: Run on cluster using your own SLURM script
The trainer and model will work on a cluster if you configure your SLURM script correctly.
1. Update [this demo slurm script](https://github.com/williamFalcon/pytorch-lightning/blob/master/examples/new_project_templates/multi_node_examples/demo_script.sh).
2. Submit the script
```bash
$ sbatch demo_script.sh
```
Most people have their own way of auto-generating SLURM scripts.
To run a grid search this way, you need to generate one script for every combination of hyperparameters
you want to search over, as in the sketch below.
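A minimal sketch of that idea (illustrative only: the SBATCH template, file names, and hyperparameter values are placeholders, and it assumes your training script accepts `--drop_prob` and `--learning_rate`):
```python
import itertools
import subprocess

# hypothetical search space
drop_probs = [0.2, 0.5]
learning_rates = [0.0001, 0.0005, 0.001]

SCRIPT_TEMPLATE = """#!/bin/bash -l
#SBATCH --nodes=2
#SBATCH --gres=gpu:2
#SBATCH --ntasks-per-node=2
#SBATCH --time=0-02:00:00

# random port between 12k and 20k
export MASTER_PORT=$((12000 + RANDOM % 8000))

srun python multi_node_demo.py --drop_prob {drop_prob} --learning_rate {lr}
"""

# write one SLURM script per hyperparameter combination and submit it
for i, (drop_prob, lr) in enumerate(itertools.product(drop_probs, learning_rates)):
    script_path = f'job_{i}.sh'
    with open(script_path, 'w') as f:
        f.write(SCRIPT_TEMPLATE.format(drop_prob=drop_prob, lr=lr))
    subprocess.run(['sbatch', script_path])
```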
#### Option 2: Use test-tube to generate the SLURM scripts
With test-tube we can automatically generate SLURM scripts for different hyperparameter options.
To run this demo:
```bash
source activate YourCondaEnv
python multi_node_cluster_auto_slurm.py --email your@email.com --gpu_partition your_partition --conda_env YourCondaEnv
```
That will submit 6 jobs, one per combination of hyperparameters. Each job runs on 2 nodes,
with 8 GPUs per node.

View File

@ -1,66 +0,0 @@
#!/bin/bash
#
# Auto-generated by test-tube (https://github.com/williamFalcon/test-tube)
#################
# set a job name
#SBATCH --job-name=lightning_test
#################
# a file for job output, you can check job progress
#SBATCH --output=/slurm_output_%j.out
#################
# a file for errors
#SBATCH --error=/slurm_output_%j.err
#################
# time needed for job
#SBATCH --time=01:00:00
#################
# gpus per node
#SBATCH --gres=gpu:8
#################
# cpus per job
#SBATCH --cpus-per-task=10
#################
# number of requested nodes
#SBATCH --nodes=2
#################
# memory per node (0 means all)
#SBATCH --mem=0
#################
# slurm will send a signal this far out before it kills the job
#SBATCH --signal=USR1@300
#################
# comment
#SBATCH --comment=lightning_demo
#################
# 1 task per gpu
#SBATCH --ntasks-per-node=8
#################
source activate YourEnv
# debugging flags (optional)
export NCCL_DEBUG=INFO
export PYTHONFAULTHANDLER=1
# on your cluster you might need these:
# set the network interface
export NCCL_SOCKET_IFNAME=^docker0,lo
# might need the latest NCCL
module load NCCL/2.4.7-1-cuda.10.0
# random port between 12k and 20k
export MASTER_PORT=$((12000 + RANDOM % 8000))
srun python multi_node_own_slurm_script.py

View File

@ -1,9 +1,9 @@
#!/bin/bash -l
# SLURM SUBMIT SCRIPT
#SBATCH --nodes=4
#SBATCH --gres=gpu:4
#SBATCH --ntasks-per-node=4
#SBATCH --nodes=2
#SBATCH --gres=gpu:2
#SBATCH --ntasks-per-node=2
#SBATCH --mem=0
#SBATCH --time=0-02:00:00
@ -23,8 +23,5 @@ conda activate my_env
# module load NCCL/2.4.7-1-cuda.10.0
# -------------------------
# random port between 12k and 20k
export MASTER_PORT=$((12000 + RANDOM % 8000))
# run script from above
python minimal_multi_node_demo.py
python multi_node_demo.py

View File

@ -1,24 +0,0 @@
from pytorch_lightning import Trainer
from test_tube import Experiment
import os
def main():
    # use the cool model from the main README.md
    model = CoolModel()  # noqa: F821
    exp = Experiment(save_dir=os.getcwd())

    # train on 4 GPUs across 4 nodes
    trainer = Trainer(
        experiment=exp,
        distributed_backend='ddp',
        max_nb_epochs=10,
        gpus=4,
        nb_gpu_nodes=4
    )
    trainer.fit(model)


if __name__ == '__main__':
    main()

View File

@ -1,172 +0,0 @@
"""
Multi-node example (GPU)
"""
import os
import numpy as np
from time import sleep
import torch
from test_tube import HyperOptArgumentParser, Experiment, SlurmCluster
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint
from examples.basic_examples.lightning_module_template import LightningTemplateModel
PORT = np.random.randint(12000, 20000, 1)[0]
SEED = 2334
torch.manual_seed(SEED)
np.random.seed(SEED)
def main_local(hparams):
    main(hparams, None)
def main(hparams, cluster):
    """
    Main training routine specific for this project
    :param hparams:
    :return:
    """
    # ------------------------
    # 1 INIT LIGHTNING MODEL
    # ------------------------
    print('loading model...')
    model = LightningTemplateModel(hparams)
    print('model built')

    # ------------------------
    # 2 INIT TEST TUBE EXP
    # ------------------------
    # when using grid search, it's possible for all models to start at once
    # and use the same test tube experiment version
    relative_node_id = int(os.environ['SLURM_NODEID'])
    sleep(relative_node_id + 1)

    # init experiment
    exp = Experiment(
        name=hparams.experiment_name,
        save_dir=hparams.test_tube_save_path,
        autosave=False,
        version=hparams.hpc_exp_number,  # match the slurm job version number
        description='test demo'
    )
    exp.argparse(hparams)
    exp.save()

    # ------------------------
    # 3 INIT TRAINER
    # ------------------------
    trainer = Trainer(
        experiment=exp,
        gpus=hparams.per_experiment_nb_gpus,
        nb_gpu_nodes=hparams.nb_gpu_nodes,
        distributed_backend=hparams.distributed_backend
    )

    # ------------------------
    # 4 START TRAINING
    # ------------------------
    trainer.fit(model)
def optimize_on_cluster(hyperparams):
    # enable cluster training
    # log all scripts to the test tube folder
    cluster = SlurmCluster(
        hyperparam_optimizer=hyperparams,
        log_path=hyperparams.slurm_log_path,
    )

    # email for cluster coms
    cluster.notify_job_status(email=hyperparams.email, on_done=True, on_fail=True)

    # configure cluster
    cluster.per_experiment_nb_gpus = hyperparams.per_experiment_nb_gpus
    cluster.per_experiment_nb_nodes = hyperparams.nb_gpu_nodes
    cluster.job_time = '2:00:00'
    cluster.gpu_type = hyperparams.gpu_type
    cluster.memory_mb_per_node = 0

    # any modules for code to run in env
    cluster.add_command(f'source activate {hyperparams.conda_env}')

    # set DDP master port
    cluster.add_command(f'export MASTER_PORT={PORT}')

    # OPTIONAL for debugging
    # without these flags errors in your code will
    # appear to be nccl errors
    cluster.add_command('export NCCL_DEBUG=INFO')
    cluster.add_command('export PYTHONFAULTHANDLER=1')

    # depending on your cluster config, you probably want
    # to limit the wired connection device
    # cluster.add_command('export NCCL_SOCKET_IFNAME=^docker0,lo')

    # depending on your cluster, you might need to load
    # the latest NCCL version
    # cluster.load_modules(['NCCL/2.4.7-1-cuda.10.0'])

    # run only on 32GB voltas
    cluster.add_slurm_cmd(cmd='partition', value=hyperparams.gpu_partition,
                          comment='your cluster might need this argument')

    # run hopt
    # creates and submits jobs to slurm
    cluster.optimize_parallel_cluster_gpu(
        main,
        nb_trials=hyperparams.num_hyperparam_trials,
        job_name=hyperparams.experiment_name
    )
if __name__ == '__main__':
    # use default args
    root_dir = os.path.dirname(os.path.realpath(__file__))
    demo_log_dir = os.path.join(root_dir, 'pt_lightning_demo_logs')
    checkpoint_dir = os.path.join(demo_log_dir, 'model_weights')
    test_tube_dir = os.path.join(demo_log_dir, 'test_tube_data')
    slurm_out_dir = os.path.join(demo_log_dir, 'slurm_scripts')

    parent_parser = HyperOptArgumentParser(strategy='grid_search', add_help=False)

    # cluster args not defined inside the model
    parent_parser.add_argument('--per_experiment_nb_gpus', type=int,
                               default=8, help='how many gpus to use in a node')
    parent_parser.add_argument('--nb_gpu_nodes', type=int, default=2,
                               help='how many nodes to use in a cluster')
    parent_parser.add_argument('--test_tube_save_path', type=str, default=test_tube_dir,
                               help='where to save logs')
    parent_parser.add_argument('--slurm_log_path', type=str, default=slurm_out_dir,
                               help='where to save slurm meta')
    parent_parser.add_argument('--model_save_path', type=str, default=checkpoint_dir,
                               help='where to save model')
    parent_parser.add_argument('--distributed_backend', type=str, default='ddp',
                               help='ddp or ddp2')
    parent_parser.add_argument('--experiment_name', type=str, default='pt_lightning_exp_a',
                               help='test tube exp name')
    parent_parser.add_argument('--num_hyperparam_trials', type=int, default=6,
                               help='how many grid search trials to run')
    parent_parser.add_argument('--email', type=str, default='add@email.com',
                               help='email for jobs')
    parent_parser.add_argument('--conda_env', type=str, default='base',
                               help='name of the conda environment to activate')
    parent_parser.add_argument('--gpu_partition', type=str, help='consult your cluster manual')
    parent_parser.add_argument('--gpu_type', type=str, default='2080ti', help='consult your cluster manual')

    # allow model to overwrite or extend args
    parser = LightningTemplateModel.add_model_specific_args(parent_parser, root_dir)
    hyperparams = parser.parse_args()

    # ---------------------
    # RUN TRAINING
    # ---------------------
    # run on HPC cluster
    print('RUNNING ON SLURM CLUSTER')
    optimize_on_cluster(hyperparams)

View File

@ -5,7 +5,7 @@ import os
import numpy as np
import torch
from test_tube import HyperOptArgumentParser, Experiment
from argparse import ArgumentParser
from pytorch_lightning import Trainer
from examples.basic_examples.lightning_module_template import LightningTemplateModel
@ -25,42 +25,26 @@ def main(hparams):
    # ------------------------
    model = LightningTemplateModel(hparams)

    # ------------------------
    # 2 INIT TEST TUBE EXP
    # ------------------------
    # init experiment
    exp = Experiment(
        name='test_exp',
        save_dir=hyperparams.log_dir,
        autosave=False,
        description='test demo'
    )

    # ------------------------
    # 2 INIT TRAINER
    # ------------------------
    trainer = Trainer(
        experiment=exp,
        gpus=8,
        gpus=2,
        nb_gpu_nodes=2
    )

    # ------------------------
    # 5 START TRAINING
    # 3 START TRAINING
    # ------------------------
    trainer.fit(model)


if __name__ == '__main__':
    # use current dir for logging
    root_dir = os.path.dirname(os.path.realpath(__file__))
    log_dir = os.path.join(root_dir, 'pt_lightning_demo_logs')

    parent_parser = ArgumentParser(add_help=False)
    parent_parser = HyperOptArgumentParser(strategy='grid_search', add_help=False)
    parent_parser.add_argument('--log_dir', type=str, default=log_dir,
                               help='where to save logs')

    # allow model to overwrite or extend args
    # each LightningModule defines arguments relevant to it
    parser = LightningTemplateModel.add_model_specific_args(parent_parser, root_dir)
    hyperparams = parser.parse_args()