updated docs

William Falcon 2019-07-21 08:17:12 -04:00
parent 2357815640
commit 9311812829
1 changed file with 21 additions and 2 deletions

@@ -40,13 +40,32 @@ In this setting, the model will run on all 8 GPUs at once using DataParallel under the hood.
```python
import os
from pytorch_lightning import Trainer

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,4,5,6,7"

# DEFAULT: DataParallel across all 8 visible GPUs
trainer = Trainer(gpus=[0,1,2,3,4,5,6,7])
```
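Since `CUDA_VISIBLE_DEVICES` remaps device indices, PyTorch always numbers the visible devices from 0, and the `gpus` list refers to that visible ordering. A minimal sketch (the device choice is illustrative) that trains on physical GPUs 2 and 3 only:
```python
import os
from pytorch_lightning import Trainer

# expose only physical GPUs 2 and 3; PyTorch sees them as devices 0 and 1
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "2,3"

trainer = Trainer(gpus=[0, 1])
```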
---
#### Multi-node
Multi-node training is easily done by specifying these flags.
```python
from pytorch_lightning import Trainer

# train on 12 nodes x 8 GPUs each = 96 GPUs total
trainer = Trainer(gpus=[0,1,2,3,4,5,6,7], nb_gpu_nodes=12)
```
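Under SLURM, every task runs your training script, so a common pattern is to build the Trainer inside a function that test-tube invokes with the sampled hyperparameters. A sketch under that assumption (`MyLightningModule` and the `train` name are placeholders):
```python
from pytorch_lightning import Trainer

def train(hparams, *args):
    # MyLightningModule is a hypothetical stand-in for your own LightningModule
    model = MyLightningModule(hparams)

    # 12 nodes x 8 GPUs per node = 96 GPUs total
    trainer = Trainer(gpus=[0,1,2,3,4,5,6,7], nb_gpu_nodes=12)
    trainer.fit(model)
```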
In addition, make sure to set up your SLURM job correctly via the [SlurmCluster](https://williamfalcon.github.io/test-tube/hpc/SlurmCluster/) object. In particular, specify the correct number of tasks per node (one per GPU).
```python
import test_tube
from test_tube.hpc import SlurmCluster

cluster = SlurmCluster(
    hyperparam_optimizer=test_tube.HyperOptArgumentParser(),
    log_path='/some/path/to/save',
)

# configure cluster
cluster.per_experiment_nb_nodes = 12
cluster.per_experiment_nb_gpus = 8

# one SLURM task per GPU
cluster.add_slurm_cmd(cmd='ntasks-per-node', value=8, comment='1 task per gpu')
```
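The job can then be submitted through the cluster object. This sketch assumes test-tube's documented `optimize_parallel_cluster_gpu` entry point; the trial count and job name are placeholders:
```python
# submit the SLURM job; `train` is the function sketched above
cluster.optimize_parallel_cluster_gpu(
    train,
    nb_trials=1,
    job_name='lightning_multi_node',
)
```
Each generated SLURM script then requests the nodes, GPUs, and tasks configured above.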
---
#### Self-balancing architecture