From e339799a0a46bd2171307409d4d61ba07fb6cff3 Mon Sep 17 00:00:00 2001
From: William Falcon
Date: Sat, 14 Sep 2019 09:55:42 -0400
Subject: [PATCH] Update README.md

---
 .../multi_node_examples/README.md | 72 ++++++++++++++++++-
 1 file changed, 71 insertions(+), 1 deletion(-)

diff --git a/examples/new_project_templates/multi_node_examples/README.md b/examples/new_project_templates/multi_node_examples/README.md
index 164e3dbf27..03f1926cfd 100644
--- a/examples/new_project_templates/multi_node_examples/README.md
+++ b/examples/new_project_templates/multi_node_examples/README.md
@@ -1,5 +1,75 @@
 # Multi-node examples
-Use these templates for multi-node training
+Use these templates for multi-node training.
+The main complexity around cluster training is how you submit the SLURM jobs.
+
+## Test-tube
+Lightning uses test-tube to submit SLURM jobs and to run hyperparameter searches on a cluster.
+
+To run a hyperparameter search, we normally add the values to search over to the `HyperOptArgumentParser`:
+```python
+import os
+
+from test_tube import HyperOptArgumentParser
+
+parser = HyperOptArgumentParser(strategy='grid_search')
+parser.add_argument('--experiment_name', default='multi_node_example', type=str)  # used below as the SLURM job name
+parser.opt_list('--drop_prob', default=0.2, options=[0.2, 0.5], type=float, tunable=True)
+parser.opt_list('--learning_rate', default=0.001, type=float,
+                options=[0.0001, 0.0005, 0.001],
+                tunable=True)
+
+# give your model a chance to add its own parameters
+# (LightningTemplateModel is the example model defined in the new_project_templates folder)
+root_dir = os.path.dirname(os.path.realpath(__file__))
+parser = LightningTemplateModel.add_model_specific_args(parser, root_dir)
+
+# parse args
+hyperparams = parser.parse_args()
+```
+
+The above sets up a grid search over learning rate and drop probability. You can now pass this object
+to a `SlurmCluster`, which submits the jobs and runs the grid search:
+```python
+from test_tube.hpc import SlurmCluster
+
+cluster = SlurmCluster(
+    hyperparam_optimizer=hyperparams,
+    log_path='/path/to/log/slurm/files',
+)
+
+# ... configure cluster options
+
+# run grid search on cluster
+# YourMainFunction is your training entry point; test-tube calls it once per trial
+nb_trials = 6  # (2 drop probs * 3 lrs)
+cluster.optimize_parallel_cluster_gpu(
+    YourMainFunction,
+    nb_trials=nb_trials,
+    job_name=hyperparams.experiment_name
+)
+```
+
+Running the above will launch 6 jobs, each with a different drop prob and learning rate combination.
+The `tunable` parameter must be set to `True` to add that argument to the search space; otherwise
+test-tube will just use the `default` value.
+
+
+## SLURM Flags
+However you decide to submit your jobs, debugging requires a few flags. Without these flags, you'll
+see an NCCL error instead of the actual error that caused the failure.
+
+```sh
+export NCCL_DEBUG=INFO
+export PYTHONFAULTHANDLER=1
+```
+
+On some clusters you might need to set the network interface with this environment variable:
+```sh
+export NCCL_SOCKET_IFNAME=^docker0,lo
+```
+
+You might also need to load the latest version of NCCL:
+```sh
+module load NCCL/2.4.7-1-cuda.10.0
+```
+
+Finally, you must set the master port (usually a random number between 12k and 20k):
+```sh
+# random port between 12k and 20k
+export MASTER_PORT=$((12000 + RANDOM % 8000))
+```
 
 ## Simplest example.
 1. Modify this script with your CoolModel file.
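
For reference, here is a minimal sketch of how the `# ... configure cluster options` placeholder in the patch above might be filled in, and of how the exports from the "SLURM Flags" section can be written into the generated job scripts. This is not part of the patch: `train`, the partition name, and the hardware/wall-time values are hypothetical placeholders, and the `SlurmCluster` attributes and methods used here (`per_experiment_nb_gpus`, `per_experiment_nb_nodes`, `job_time`, `add_slurm_cmd`, `add_command`, `load_modules`) should be checked against the installed test-tube version.

```python
import os

from test_tube import HyperOptArgumentParser
from test_tube.hpc import SlurmCluster


def train(hparams, cluster_manager, *_):
    # test-tube calls this once per trial with that trial's hyperparameters
    # (check the exact call signature for your test-tube version; extra args are ignored here).
    # Build your LightningModule / Trainer here using `hparams`.
    print(f'trial: drop_prob={hparams.drop_prob}, lr={hparams.learning_rate}')


if __name__ == '__main__':
    parser = HyperOptArgumentParser(strategy='grid_search')
    parser.add_argument('--experiment_name', default='multi_node_example', type=str)
    parser.opt_list('--drop_prob', default=0.2, options=[0.2, 0.5], type=float, tunable=True)
    parser.opt_list('--learning_rate', default=0.001, type=float,
                    options=[0.0001, 0.0005, 0.001], tunable=True)
    hyperparams = parser.parse_args()

    cluster = SlurmCluster(
        hyperparam_optimizer=hyperparams,
        log_path=os.path.join(os.getcwd(), 'slurm_logs'),
    )

    # hardware requested per trial (placeholder values)
    cluster.per_experiment_nb_gpus = 8
    cluster.per_experiment_nb_nodes = 2
    cluster.job_time = '2:00:00'

    # extra #SBATCH directive, e.g. the partition to submit to (placeholder)
    cluster.add_slurm_cmd(cmd='partition', value='gpu', comment='cluster partition')

    # write the debugging / NCCL exports from the "SLURM Flags" section into every job script
    cluster.add_command('export NCCL_DEBUG=INFO')
    cluster.add_command('export PYTHONFAULTHANDLER=1')
    cluster.add_command('export MASTER_PORT=$((12000 + RANDOM % 8000))')
    cluster.load_modules(['NCCL/2.4.7-1-cuda.10.0'])

    # 6 trials = 2 drop probs * 3 learning rates
    cluster.optimize_parallel_cluster_gpu(
        train,
        nb_trials=6,
        job_name=hyperparams.experiment_name,
    )
```

The intent of the `add_command` and `load_modules` calls is that the exports and module loads land in the generated SLURM script ahead of the training command, so each job starts with the flags from the README already set.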