added docs for cluster grid search

William Falcon 2019-09-26 12:02:03 -04:00
parent 3cab3b2f8c
commit acb4ebea56
1 changed file with 86 additions and 21 deletions


@@ -8,6 +8,8 @@ None of the flags below require changing anything about your lightningModel definition.
Lightning supports two backends: DataParallel and DistributedDataParallel. Both can be used for single-node multi-GPU training.
For multi-node training you must use DistributedDataParallel.
**Warning: Your cluster must have NCCL installed and you must load it when submitting your SLURM script**
You can toggle between the two modes by setting this flag.
```python
# DEFAULT (when using single GPU or no GPUs)
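# (the diff cuts this block off here; the rest is an assumed sketch of the toggle,
#  using the 'dp'/'ddp' values shown later in this document)
trainer = Trainer(distributed_backend=None)

# DataParallel (gpus > 1)
trainer = Trainer(gpus=2, distributed_backend='dp')

# DistributedDataParallel (gpus > 1)
trainer = Trainer(gpus=2, distributed_backend='ddp')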
@@ -117,39 +119,50 @@ trainer = Trainer(gpus=8, distributed_backend='ddp')
---
#### Multi-node
Multi-node training is easily done by specifying these flags.
```python
# train on 12*8 GPUs
trainer = Trainer(gpus=8, nb_gpu_nodes=12, distributed_backend='ddp')
```
You must configure your job submission script correctly for the trainer to work. Here is an example
script for the above trainer configuration.
```sh
#!/bin/bash -l

# SLURM SUBMIT SCRIPT
#SBATCH --nodes=12
#SBATCH --gres=gpu:8
#SBATCH --ntasks-per-node=8
#SBATCH --mem=0
#SBATCH --time=0-02:00:00

# activate conda env
conda activate my_env

# REQUIRED: Load the latest NCCL version
# the NCCL version must match the CUDA version used to build your PyTorch distribution
# (ie: which instructions did you follow when installing PyTorch)
# module load NCCL/2.4.7-1-cuda.10.0

# -------------------------
# OPTIONAL
# -------------------------
# debugging flags (optional)
# export NCCL_DEBUG=INFO
# export PYTHONFAULTHANDLER=1

# on your cluster you might need these:
# set the network interface
# export NCCL_SOCKET_IFNAME=^docker0,lo
# -------------------------

# random port between 12k and 20k
export MASTER_PORT=$((12000 + RANDOM % 20000))

# run script from above
python my_main_file.py
```
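For completeness, here is a minimal sketch of what `my_main_file.py` could contain for this configuration. It assumes you already have your own LightningModule (the `MyLightningModule` import is a placeholder, not part of Lightning) and that hyperparameters are parsed with plain argparse.
```python
# my_main_file.py -- minimal sketch, not the official example
from argparse import ArgumentParser

from pytorch_lightning import Trainer

from my_project import MyLightningModule  # placeholder: your own LightningModule


def main(hparams):
    model = MyLightningModule(hparams)

    # must match the SLURM script above: 12 nodes, 8 GPUs per node, 1 task per GPU
    trainer = Trainer(gpus=8, nb_gpu_nodes=12, distributed_backend='ddp')
    trainer.fit(model)


if __name__ == '__main__':
    parser = ArgumentParser()
    parser.add_argument('--learning_rate', default=0.001, type=float)
    main(parser.parse_args())
```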
**NOTE:** When running in DDP mode, any errors in your code will show up as an NCCL issue.
@@ -169,6 +182,58 @@ dist_sampler = torch.utils.data.distributed.DistributedSampler(dataset)
dataloader = DataLoader(dataset, sampler=dist_sampler)
```
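For context, here is a minimal sketch of what the sampler does: each process reads a non-overlapping shard of the dataset, so with 12 nodes and 8 GPUs per node there are 96 replicas. The `TensorDataset` below is made up purely for illustration.
```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# toy dataset purely for illustration
dataset = TensorDataset(torch.randn(1000, 32), torch.randint(0, 10, (1000,)))

# each process passes its own rank; 12 nodes * 8 GPUs = 96 replicas
# (once the process group is initialized, num_replicas and rank can be left as defaults)
dist_sampler = DistributedSampler(dataset, num_replicas=96, rank=0)
dataloader = DataLoader(dataset, batch_size=32, sampler=dist_sampler)
```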
#### Auto-slurm-job-submission
Instead of manually building SLURM scripts, you can use the [SlurmCluster object](https://williamfalcon.github.io/test-tube/hpc/SlurmCluster/) to
do this for you. The SlurmCluster can also run a grid search if you pass in a [HyperOptArgumentParser](https://williamfalcon.github.io/test-tube/hyperparameter_optimization/HyperOptArgumentParser/).
Here is an example that runs a grid search over 9 hyperparameter combinations.
[The full examples are here](https://github.com/williamFalcon/pytorch-lightning/tree/master/examples/new_project_templates/multi_node_examples).
```python
import random

from test_tube import HyperOptArgumentParser
from test_tube.hpc import SlurmCluster

# grid search 3 values of learning rate and 3 values of number of layers for your net
# this generates 9 experiments (lr=1e-3, layers=16), (lr=1e-3, layers=32), (lr=1e-3, layers=64), ... (lr=1e-1, layers=64)
parser = HyperOptArgumentParser(strategy='grid_search', add_help=False)
parser.opt_list('--learning_rate', default=0.001, type=float, options=[1e-3, 1e-2, 1e-1], tunable=True)
parser.opt_list('--layers', default=1, type=int, options=[16, 32, 64], tunable=True)
hyperparams = parser.parse_args()
# Slurm cluster submits 9 jobs, each with a set of hyperparams
cluster = SlurmCluster(
hyperparam_optimizer=hyperparams,
log_path='/some/path/to/save',
)
# OPTIONAL FLAGS WHICH MAY BE CLUSTER DEPENDENT
# which interface your nodes use for communication
cluster.add_command('export NCCL_SOCKET_IFNAME=^docker0,lo')
# see output of the NCCL connection process
# NCCL is how the nodes talk to each other
cluster.add_command('export NCCL_DEBUG=INFO')
# setting a master port here is a good idea
# (any open port works; here a random one between 12k and 20k, as in the comment of the SLURM script above)
PORT = random.randint(12000, 20000)
cluster.add_command('export MASTER_PORT=%d' % PORT)
# ************** DON'T FORGET THIS ***************
# MUST load the latest NCCL version
cluster.load_modules(['NCCL/2.4.7-1-cuda.10.0'])
# configure cluster
cluster.per_experiment_nb_nodes = 12
cluster.per_experiment_nb_gpus = 8
cluster.add_slurm_cmd(cmd='ntasks-per-node', value=8, comment='1 task per gpu')
# submit a script with 9 combinations of hyper params
# (lr=1e-3, layers=16), (lr=1e-3, layers=32), (lr=1e-3, layers=64), ... (lr=1e-1, layers=64)
cluster.optimize_parallel_cluster_gpu(
main,
nb_trials=9, # how many permutations of the grid search to run
job_name='name_for_squeue'
)
```
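Here, `main` is your training entry point; test-tube calls it once per job with one of the 9 hyperparameter combinations. A minimal sketch (any extra positional arguments test-tube passes are simply absorbed by `*args`; `MyLightningModule` is again a placeholder for your own LightningModule):
```python
from pytorch_lightning import Trainer


def main(hparams, *args):
    # one grid-search combination, e.g. hparams.learning_rate and hparams.layers
    model = MyLightningModule(hparams)  # placeholder: your own LightningModule

    # same trainer configuration as the multi-node example above
    trainer = Trainer(gpus=8, nb_gpu_nodes=12, distributed_backend='ddp')
    trainer.fit(model)
```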
Alternatively, you can write the SLURM scripts yourself (for example from a bash template) or use another library.
---
#### Self-balancing architecture
Here, Lightning distributes parts of your module across the available GPUs to optimize for speed and memory.