# SLURM Managed Cluster
Lightning supports model training on a cluster managed by SLURM in the following cases:
1. Training on a single CPU or single GPU.
2. Training on multiple GPUs on the same node using DataParallel or DistributedDataParallel.
3. Training across multiple GPUs on multiple different nodes via DistributedDataParallel.

**Note: a node means a machine with multiple GPUs**
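For reference, a minimal sketch of case 3 (multi-node DDP). The argument names (```gpus```, ```nb_gpu_nodes```, ```distributed_backend```) match the Lightning API of this era and may differ in later versions; ```MyLightningModel``` is a placeholder for your own LightningModule:

```{.python}
from pytorch_lightning import Trainer

# request 8 GPUs on each of 5 nodes, coordinated via DistributedDataParallel
trainer = Trainer(
    gpus=8,
    nb_gpu_nodes=5,
    distributed_backend='ddp'
)
trainer.fit(MyLightningModel())
```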
---
#### Running grid search on a cluster
To run a hyperparameter search (grid search or random search) on a cluster with Lightning, do 4 things:
(1). Define the parameters for the grid search
```{.python}
from test_tube import HyperOptArgumentParser
# subclass of argparse
parser = HyperOptArgumentParser(strategy='random_search')
parser.add_argument('--learning_rate', default=0.002, type=float, help='the learning rate')
# let's enable optimizing over the number of layers in the network
parser.opt_list('--nb_layers', default=2, type=int, tunable=True, options=[2, 4, 8])
hparams = parser.parse_args()
```
**NOTE** You must set ```tunable=True``` for an argument to be included in the permutation set. Otherwise
test-tube will use the default value. This flag is useful when you don't want to search over an argument and
want to use the default instead.
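Besides discrete options, test-tube can also sample from continuous ranges via ```opt_range```. A short sketch, assuming the same parser as above; ```--drop_prob``` and its bounds are illustrative only:

```{.python}
# sample 10 dropout values between 0.1 and 0.5
parser.opt_range('--drop_prob', default=0.2, type=float, tunable=True,
                 low=0.1, high=0.5, nb_samples=10)
```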
(2). Define the cluster options in the [SlurmCluster object](https://williamfalcon.github.io/test-tube/hpc/SlurmCluster/) (here: 5 nodes with 8 GPUs each)
```{.python}
from test_tube.hpc import SlurmCluster

# hyperparams is the test-tube hyperparameter object parsed in step (1)
# see https://williamfalcon.github.io/test-tube/hyperparameter_optimization/HyperOptArgumentParser/
hyperparams = parser.parse_args()

# init cluster
cluster = SlurmCluster(
    hyperparam_optimizer=hyperparams,
    log_path='/path/to/log/results/to',
    python_cmd='python3'
)

# let the cluster know where to email for a change in job status (ie: complete, fail, etc...)
cluster.notify_job_status(email='some@email.com', on_done=True, on_fail=True)

# set the job options. In this instance, we'll run 20 different models,
# each trial getting 5 nodes with 8 GPUs apiece
cluster.per_experiment_nb_gpus = 8
cluster.per_experiment_nb_nodes = 5

# we'll request 10GB of memory per node
cluster.memory_mb_per_node = 10000

# set a walltime of 10 minutes
cluster.job_time = '10:00'
```
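The SlurmCluster object can also inject extra SLURM flags and environment setup into the submission scripts it generates. A short sketch; the partition name, module names, and conda environment below are placeholders for your own cluster's setup:

```{.python}
# add any custom #SBATCH flag the cluster needs
cluster.add_slurm_cmd(cmd='partition', value='gpu', comment='which queue to run on')

# load environment modules and activate an environment before the job starts
cluster.load_modules(['python-3', 'anaconda3'])
cluster.add_command('source activate my_env')
```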
(3). Define a training function that builds your model and trainer. Each job will call this function with a particular
hparams configuration.
```{.python}
from pytorch_lightning import Trainer

def train_fx(trial_hparams, cluster_manager, _):
    # trial_hparams holds one specific permutation of the hyperparameters
    my_model = MyLightningModel(trial_hparams)

    # init the trainer
    trainer = Trainer()
    trainer.fit(my_model)
```
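Inside ```train_fx```, the sampled values are available as attributes on ```trial_hparams``` (for example ```trial_hparams.learning_rate``` and ```trial_hparams.nb_layers```), so a model that reads its hyperparameters from the parsed namespace needs no other per-trial changes.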
(4). Start the grid/random search
```{.python}
# run the models on the cluster
cluster.optimize_parallel_cluster_gpu(
    train_fx,
    nb_trials=20,
    job_name='my_grid_search_exp_name',
    job_display_name='my_exp'
)
```
**NOTE** ```nb_trials``` specifies how many of the possible permutations to run. If using ```grid_search``` it will
take the first ```nb_trials``` permutations in depth-first order. If using ```random_search``` it will take the first
```nb_trials``` shuffled options. FYI, random search has been shown to be just as good as any Bayesian optimization
method when using a reasonable number of samples (~60),
[see this paper for more information](http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf).
---
#### Walltime auto-resubmit
Lightning automatically resubmits jobs when they reach the walltime. To enable this, make sure your SLURM
script asks SLURM to send the SIGUSR1 signal before the walltime is reached:
```bash
# send SIGUSR1 90 seconds before the job hits its walltime
#SBATCH --signal=SIGUSR1@90
```
When Lightning receives the SIGUSR1 signal it will:
1. Save a checkpoint with 'hpc_ckpt' in the name.
2. Resubmit the job using the SLURM_JOB_ID.

When the script starts again, Lightning will:
1. Search for a 'hpc_ckpt' checkpoint.
2. Restore the model, optimizers, schedulers, epoch, etc...
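If you manage your own SLURM scripts (rather than letting test-tube generate them), a minimal submission script might look like the sketch below; the node counts, resource flags, and script name are illustrative only:

```bash
#!/bin/bash
#SBATCH --nodes=5
#SBATCH --gres=gpu:8
#SBATCH --time=10:00

# send SIGUSR1 90 seconds before the walltime so Lightning can checkpoint and resubmit
#SBATCH --signal=SIGUSR1@90

# launch the training script (illustrative path)
srun python3 my_main.py
```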