updated docs
parent babaa088d7, commit d273271b4b
@@ -3,6 +3,26 @@ Lightning makes multi-gpu training and 16 bit training trivial.
*Note:*
None of the flags below require changing anything about your LightningModule definition.

---
#### Choosing a backend
Lightning supports two backends: DataParallel and DistributedDataParallel. Both can be used for single-node multi-GPU training.
For multi-node training, you must use DistributedDataParallel.

You can toggle between the two modes by setting this flag.
``` {.python}
# DEFAULT uses DataParallel
trainer = Trainer(distributed_backend='dp')

# change to DistributedDataParallel
trainer = Trainer(distributed_backend='ddp')
```
If you request multiple nodes, the backend will auto-switch to ddp.
We recommend using DistributedDataParallel even for single-node multi-GPU training. It is MUCH faster than DP (each GPU gets its own process, so there is no per-batch scatter/gather overhead), but *may* have configuration issues depending on your cluster.

For a deeper understanding of what Lightning is doing, feel free to read [this guide](https://medium.com/@_willfalcon/9-tips-for-training-lightning-fast-neural-networks-in-pytorch-8e63a502f565).

---
#### 16-bit mixed precision
16-bit precision can cut your memory footprint in half. If you are using Volta-architecture GPUs, it can give a dramatic training speed-up as well.
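For context, here is a rough, hand-rolled sketch of what mixed-precision training involves when wired up manually with NVIDIA's apex library; the toy model, optimizer, and `opt_level` below are only illustrative and are not part of Lightning's API.

```python
import torch
from apex import amp  # NVIDIA apex, installed separately

# illustrative model and optimizer, not part of Lightning
model = torch.nn.Linear(10, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# cast model and optimizer so most ops run in fp16
model, optimizer = amp.initialize(model, optimizer, opt_level='O2')

x = torch.randn(4, 10).cuda()
loss = model(x).sum()

# scale the loss before backward to avoid fp16 gradient underflow
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()
```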
@@ -67,6 +87,19 @@ cluster.per_experiment_nb_gpus = 8
cluster.add_slurm_cmd(cmd='ntasks-per-node', value=8, comment='1 task per gpu')
```
Finally, make sure to add a distributed sampler to your dataset, so that each process trains on its own non-overlapping shard of the data.
```python
import torch
from torch.utils.data import DataLoader

# ie: this (myDataset is your own Dataset subclass):
dataset = myDataset()
dataloader = DataLoader(dataset)

# becomes:
dataset = myDataset()
dist_sampler = torch.utils.data.distributed.DistributedSampler(dataset)
dataloader = DataLoader(dataset, sampler=dist_sampler)
```
---
#### Self-balancing architecture
Here Lightning distributes parts of your module across available GPUs to optimize for speed and memory.
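Lightning's interface for this is not shown in this diff. Purely as an illustration of the general idea, here is a hand-written sketch in plain PyTorch that places two halves of a module on different GPUs; the class name, layer sizes, and device ids are made up for the example.

```python
import torch
import torch.nn as nn


class TwoDeviceNet(nn.Module):
    """Hypothetical example: the first half of the network lives on GPU 0,
    the second half on GPU 1, so neither device holds the full model."""

    def __init__(self):
        super().__init__()
        self.part1 = nn.Linear(1024, 512).to('cuda:0')
        self.part2 = nn.Linear(512, 10).to('cuda:1')

    def forward(self, x):
        x = self.part1(x.to('cuda:0'))
        x = self.part2(x.to('cuda:1'))  # move activations to the second device
        return x


net = TwoDeviceNet()
out = net(torch.randn(4, 1024))  # requires a machine with at least 2 GPUs
```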