# Multi-node examples
Use these templates for multi-node training.
The main complexity around cluster training is how you submit the SLURM jobs.
## Test-tube
Lightning uses test-tube to submit SLURM jobs and to run hyperparameter searches on a cluster.
To run a hyperparameter search, first add the values you want to search over to the hyperparameter optimizer:
```python
import os

from test_tube import HyperOptArgumentParser

parser = HyperOptArgumentParser(strategy='grid_search')
parser.opt_list('--drop_prob', default=0.2, options=[0.2, 0.5], type=float, tunable=True)
parser.opt_list('--learning_rate', default=0.001, type=float,
                options=[0.0001, 0.0005, 0.001],
                tunable=True)

# give your model (a LightningModule defined elsewhere) a chance to add its own parameters
root_dir = os.path.dirname(os.path.realpath(__file__))
parser = LightningTemplateModel.add_model_specific_args(parser, root_dir)

# parse args
hyperparams = parser.parse_args()
```
The above sets up a grid search over learning rate and drop probability. Pass the resulting hyperparameter
object to a `SlurmCluster` to run the grid search:
```python
from test_tube.hpc import SlurmCluster

cluster = SlurmCluster(
    hyperparam_optimizer=hyperparams,
    log_path='/path/to/log/slurm/files',
)

# ... configure cluster options (see the sketch below)

# run grid search on cluster
nb_trials = 6  # (2 drop probs * 3 learning rates)
cluster.optimize_parallel_cluster_gpu(
    YourMainFunction,
    nb_trials=nb_trials,
    job_name=hyperparams.experiment_name
)
```
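The `# ... configure cluster options` step above is where the SLURM-specific settings go. The sketch below shows the kinds of options you might set; the attribute and method names follow the test-tube documentation and the values are placeholders, so check both against your installed version and your cluster.

```python
# A sketch of typical SlurmCluster settings (names per the test-tube docs;
# values are placeholders for your cluster).
cluster.per_experiment_nb_nodes = 2      # nodes per trial
cluster.per_experiment_nb_gpus = 8       # GPUs per node
cluster.job_time = '24:00:00'            # SLURM wall-time limit
cluster.memory_mb_per_node = 0           # 0 = use the partition default

# extra #SBATCH flags, e.g. which partition to submit to
cluster.add_slurm_cmd(cmd='partition', value='gpu', comment='which partition to use')
```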
Running the above will launch 6 jobs, each with a different combination of drop probability and learning rate.
The `tunable` parameter must be set to `True` to add that argument to the search space; otherwise
Test-Tube will just use the `default` value.
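`YourMainFunction` above is the training entry point that test-tube runs for each trial with that trial's hyperparameters filled in. Below is a minimal, hedged sketch: the function name, the `*args` catch-all (some test-tube versions also pass a cluster manager object), and the Trainer settings are placeholders, and the Trainer argument names follow the Lightning release this README shipped with.

```python
from pytorch_lightning import Trainer


def your_main_function(hparams, *args):
    # hparams holds this trial's resolved values (drop_prob, learning_rate,
    # plus whatever the model added in add_model_specific_args)
    model = LightningTemplateModel(hparams)

    # placeholder multi-node DDP settings; adjust GPUs/nodes to your cluster
    trainer = Trainer(
        gpus=[0, 1],
        nb_gpu_nodes=2,
        distributed_backend='ddp',
    )
    trainer.fit(model)
```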
## SLURM Flags
However you decide to submit your jobs, debugging requires a few flags. Without them, you'll
see an NCCL error instead of the actual error that caused the failure.
```sh
export NCCL_DEBUG=INFO
export PYTHONFAULTHANDLER=1
```
On some clusters you might need to set the network interface with this flag.
```sh
export NCCL_SOCKET_IFNAME=^docker0,lo
```
You might also need to load the latest version of NCCL.
```sh
module load NCCL/2.4.7-1-cuda.10.0
```
Finally, you must set the master port (usually a random number between 12k and 20k).
```sh
# random port between 12k and 20k
export MASTER_PORT=$((12000 + RANDOM % 8000))
```
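If you submit through the `SlurmCluster` object above, these exports and module loads don't have to be set by hand; they can be written into the generated submission script instead. A hedged sketch (method names per the test-tube docs, so verify against your installed version):

```python
# Bake the debugging flags, master port, and NCCL module into the SLURM
# script that test-tube generates (method names per the test-tube docs).
cluster.add_command('export NCCL_DEBUG=INFO')
cluster.add_command('export PYTHONFAULTHANDLER=1')
cluster.add_command('export MASTER_PORT=$((12000 + RANDOM % 8000))')

# if your cluster uses environment modules
cluster.load_modules(['NCCL/2.4.7-1-cuda.10.0'])
```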
## Simplest example
1. Modify this script with your CoolModel file.