# Multi-node examples
Use these templates for multi-node training.
The main complexity around cluster training is how you submit the SLURM jobs.
## Test-tube
Lightning uses test-tube to submit SLURM jobs and to run hyperparameter searches on a cluster.
To run a hyperparameter search, first add the values you want to search over to the hyperparameter optimizer:
```python
import os

from test_tube import HyperOptArgumentParser

parser = HyperOptArgumentParser(strategy='grid_search')
parser.opt_list('--drop_prob', default=0.2, options=[0.2, 0.5], type=float, tunable=True)
parser.opt_list('--learning_rate', default=0.001, type=float,
                options=[0.0001, 0.0005, 0.001],
                tunable=True)

# give your model (a LightningModule defined elsewhere) a chance to add its own parameters
root_dir = os.path.dirname(os.path.realpath(__file__))
parser = LightningTemplateModel.add_model_specific_args(parser, root_dir)

# parse args
hyperparams = parser.parse_args()
```
The above sets up a grid search over learning rate and drop probability. Pass the resulting hyperparameter
object to a `SlurmCluster` to run the grid search:
```python
from test_tube.hpc import SlurmCluster

cluster = SlurmCluster(
    hyperparam_optimizer=hyperparams,
    log_path='/path/to/log/slurm/files',
)

# ... configure cluster options (see the sketch below)

# run grid search on cluster
nb_trials = 6  # (2 drop probs * 3 learning rates)
cluster.optimize_parallel_cluster_gpu(
    YourMainFunction,
    nb_trials=nb_trials,
    job_name=hyperparams.experiment_name
)
```
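The `# ... configure cluster options` step above is where the SLURM-specific settings go. The sketch below shows the kinds of options you might set; the attribute and method names follow the test-tube documentation and the values are placeholders, so check both against your installed version and your cluster.

```python
# A sketch of typical SlurmCluster settings (names per the test-tube docs;
# values are placeholders for your cluster).
cluster.per_experiment_nb_nodes = 2      # nodes per trial
cluster.per_experiment_nb_gpus = 8       # GPUs per node
cluster.job_time = '24:00:00'            # SLURM wall-time limit
cluster.memory_mb_per_node = 0           # 0 = use the partition default

# extra #SBATCH flags, e.g. which partition to submit to
cluster.add_slurm_cmd(cmd='partition', value='gpu', comment='which partition to use')
```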
Running the above will launch 6 jobs, each with a different combination of drop probability and learning rate.
The `tunable` parameter must be set to `True` to add that argument to the search space; otherwise
Test-Tube will just use the `default` value.
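`YourMainFunction` above is the training entry point that test-tube runs for each trial with that trial's hyperparameters filled in. Below is a minimal, hedged sketch: the function name, the `*args` catch-all (some test-tube versions also pass a cluster manager object), and the Trainer settings are placeholders, and the Trainer argument names follow the Lightning release this README shipped with.

```python
from pytorch_lightning import Trainer


def your_main_function(hparams, *args):
    # hparams holds this trial's resolved values (drop_prob, learning_rate,
    # plus whatever the model added in add_model_specific_args)
    model = LightningTemplateModel(hparams)

    # placeholder multi-node DDP settings; adjust GPUs/nodes to your cluster
    trainer = Trainer(
        gpus=[0, 1],
        nb_gpu_nodes=2,
        distributed_backend='ddp',
    )
    trainer.fit(model)
```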
## SLURM Flags
However you decide to submit your jobs, debugging requires a few flags. Without them, you'll
see an NCCL error instead of the actual error that caused the failure.
```sh
export NCCL_DEBUG=INFO
export PYTHONFAULTHANDLER=1
```
On some clusters you might need to set the network interface with this flag.
```sh
export NCCL_SOCKET_IFNAME=^docker0,lo
```
You might also need to load the latest version of NCCL.
```sh
module load NCCL/2.4.7-1-cuda.10.0
```
Finally, you must set the master port (usually a random number between 12k and 20k).
```sh
# random port between 12k and 20k
export MASTER_PORT=$((12000 + RANDOM % 8000))
```
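If you submit through the `SlurmCluster` object above, these exports and module loads don't have to be set by hand; they can be written into the generated submission script instead. A hedged sketch (method names per the test-tube docs, so verify against your installed version):

```python
# Bake the debugging flags, master port, and NCCL module into the SLURM
# script that test-tube generates (method names per the test-tube docs).
cluster.add_command('export NCCL_DEBUG=INFO')
cluster.add_command('export PYTHONFAULTHANDLER=1')
cluster.add_command('export MASTER_PORT=$((12000 + RANDOM % 8000))')

# if your cluster uses environment modules
cluster.load_modules(['NCCL/2.4.7-1-cuda.10.0'])
```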
## Simplest example
1. Modify this script with your CoolModel file.