Lightning makes multi-GPU and 16-bit training trivial.

*Note:*

None of the flags below require changing anything about your LightningModule definition.
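
For example, the same model object works with every configuration below. A minimal sketch (assuming a hypothetical `MyLightningModule` that you have already defined):

```python
from pytorch_lightning import Trainer

model = MyLightningModule()  # your existing module, unchanged

# pick any of the configurations described below; the model definition stays the same
trainer = Trainer(gpus=8, distributed_backend='ddp')
trainer.fit(model)
```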

---

#### Choosing a backend

Lightning supports two backends: DataParallel and DistributedDataParallel. Both can be used for single-node multi-GPU training.

For multi-node training you must use DistributedDataParallel.

**Warning: your cluster must have NCCL installed, and you must load it when submitting your SLURM script.**

You can toggle between the modes by setting the `distributed_backend` flag:

```python
# DEFAULT (when using a single GPU or no GPUs)
trainer = Trainer(distributed_backend=None)

# change to DataParallel (gpus > 1)
trainer = Trainer(distributed_backend='dp')

# change to DistributedDataParallel (gpus > 1)
trainer = Trainer(distributed_backend='ddp')
```

If you request multiple nodes, the backend will auto-switch to ddp.

We recommend DistributedDataParallel even for single-node multi-GPU training: it is MUCH faster than DP but *may* have configuration issues depending on your cluster.

For a deeper understanding of what Lightning is doing, feel free to read [this guide](https://medium.com/@_willfalcon/9-tips-for-training-lightning-fast-neural-networks-in-pytorch-8e63a502f565).

---

#### Distributed and 16-bit precision

Due to an issue with apex and DataParallel (a PyTorch/NVIDIA issue), Lightning does not allow 16-bit and DP training. We tried to get this to work, but it's an issue on their end.

Below are the possible configurations we support.

| 1 GPU | 1+ GPUs | DP | DDP | 16-bit | command |
|---|---|---|---|---|---|
| Y | | | | | `Trainer(gpus=1)` |
| Y | | | | Y | `Trainer(gpus=1, use_amp=True)` |
| | Y | Y | | | `Trainer(gpus=k, distributed_backend='dp')` |
| | Y | | Y | | `Trainer(gpus=k, distributed_backend='ddp')` |
| | Y | | Y | Y | `Trainer(gpus=k, distributed_backend='ddp', use_amp=True)` |

You also have the option of specifying which GPUs to use by passing a list or a string:

```python
# DEFAULT (int)
Trainer(gpus=k)

# you specify which GPUs to use (don't use if running on a cluster)
Trainer(gpus=[0, 1])

# can also be a string
Trainer(gpus='0, 1')
```

---

#### CUDA flags

CUDA flags make certain GPUs visible to your script.
Lightning sets these for you automatically; there's NO NEED to do this yourself.

```python
# Lightning will set these according to what you give the Trainer
# os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
# os.environ["CUDA_VISIBLE_DEVICES"] = "0"
```

However, when using a cluster, Lightning will NOT set these flags (and neither should you).
SLURM will set them for you.

---

#### 16-bit mixed precision

16-bit precision can cut your memory footprint in half. If you're using Volta-architecture GPUs it can give a dramatic training speed-up as well.
First, install apex (if the install fails, look [here](https://github.com/NVIDIA/apex)):

```bash
git clone https://github.com/NVIDIA/apex
cd apex

# ------------------------
# OPTIONAL: on your cluster you might need to load cuda 10 or 9
# depending on how you installed PyTorch

# see available modules
module avail

# load the correct cuda before install
module load cuda-10.0
# ------------------------

# make sure a recent gcc is loaded (needed to compile the CUDA/C++ extensions)
module load gcc-9.1

pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```
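
As a quick sanity check (a minimal sketch, not an official apex test), make sure the amp module imports after the build:

```python
# run inside the environment you installed apex into
from apex import amp
print('apex amp imported OK')
```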

Then set `use_amp=True` in the Trainer:

```python
# DEFAULT
trainer = Trainer(amp_level='O2', use_amp=False)

# turn on 16-bit precision
trainer = Trainer(amp_level='O2', use_amp=True)
```

---

#### Single GPU

Make sure you're on a GPU machine.

```python
# DEFAULT
trainer = Trainer(gpus=1)
```

---

#### Multi-GPU

Make sure you're on a GPU machine. You can request as many GPUs as you want.
In the examples below, the model will run on all 8 GPUs at once using either DataParallel or DistributedDataParallel under the hood.

```python
# to use DataParallel
trainer = Trainer(gpus=8, distributed_backend='dp')

# RECOMMENDED: use DistributedDataParallel
trainer = Trainer(gpus=8, distributed_backend='ddp')
```

---

#### Multi-node

Multi-node training is easily done by specifying these flags.

```python
# train on 12*8 GPUs
trainer = Trainer(gpus=8, nb_gpu_nodes=12, distributed_backend='ddp')
```

You must configure your job submission script correctly for the trainer to work. Here is an example script for the above trainer configuration.

```sh
#!/bin/bash -l

# SLURM SUBMIT SCRIPT
#SBATCH --nodes=12
#SBATCH --gres=gpu:8
#SBATCH --ntasks-per-node=8
#SBATCH --mem=0
#SBATCH --time=0-02:00:00

# activate conda env
conda activate my_env

# REQUIRED: load the latest NCCL version
# the NCCL version must match the cuda used to build your PyTorch distribution
# (i.e. which instructions did you follow when installing PyTorch)
# module load NCCL/2.4.7-1-cuda.10.0

# -------------------------
# OPTIONAL
# -------------------------
# debugging flags (optional)
# export NCCL_DEBUG=INFO
# export PYTHONFAULTHANDLER=1

# on your cluster you might need these:
# set the network interface
# export NCCL_SOCKET_IFNAME=^docker0,lo
# -------------------------

# random port between 12k and 20k
export MASTER_PORT=$((12000 + RANDOM % 8000))

# run the script from above
python my_main_file.py
```

**NOTE:** When running in DDP mode, any errors in your code will show up as an NCCL issue.
Set the `NCCL_DEBUG=INFO` flag to see the ACTUAL error.

Finally, make sure to add a distributed sampler to your dataset. The distributed sampler hands each process its own
portion of your dataset, so every GPU trains on different samples (`world_size = gpus_per_node * nb_nodes`; 8 * 12 = 96 processes in the example above).

```python
import torch
from torch.utils.data import DataLoader

# i.e. this:
dataset = MyDataset()
dataloader = DataLoader(dataset)

# becomes:
dataset = MyDataset()
dist_sampler = torch.utils.data.distributed.DistributedSampler(dataset)
dataloader = DataLoader(dataset, sampler=dist_sampler)
```

---

#### Auto SLURM job submission

Instead of manually building SLURM scripts, you can use the [SlurmCluster object](https://williamfalcon.github.io/test-tube/hpc/SlurmCluster/) to
do this for you. The SlurmCluster can also run a grid search if you pass in a [HyperOptArgumentParser](https://williamfalcon.github.io/test-tube/hyperparameter_optimization/HyperOptArgumentParser/).

Here is an example that runs a grid search over 9 combinations of hyperparameters.
[The full examples are here](https://github.com/williamFalcon/pytorch-lightning/tree/master/examples/new_project_templates/multi_node_examples).

```python
from test_tube import HyperOptArgumentParser
from test_tube.hpc import SlurmCluster

# grid search 3 values of learning rate and 3 values of number of layers for your net
# this generates 9 experiments (lr=1e-3, layers=16), (lr=1e-3, layers=32), (lr=1e-3, layers=64), ... (lr=1e-1, layers=64)
parser = HyperOptArgumentParser(strategy='grid_search', add_help=False)
parser.opt_list('--learning_rate', default=0.001, type=float, options=[1e-3, 1e-2, 1e-1], tunable=True)
parser.opt_list('--layers', default=1, type=int, options=[16, 32, 64], tunable=True)
hyperparams = parser.parse_args()

# SlurmCluster submits 9 jobs, each with a different set of hyperparams
cluster = SlurmCluster(
    hyperparam_optimizer=hyperparams,
    log_path='/some/path/to/save',
)

# OPTIONAL FLAGS WHICH MAY BE CLUSTER DEPENDENT
# which interface your nodes use for communication
cluster.add_command('export NCCL_SOCKET_IFNAME=^docker0,lo')

# see the output of the NCCL connection process
# NCCL is how the nodes talk to each other
cluster.add_command('export NCCL_DEBUG=INFO')

# setting a master port here is a good idea
# (PORT is an int you choose, e.g. somewhere between 12000 and 20000)
cluster.add_command('export MASTER_PORT=%r' % PORT)

# ************** DON'T FORGET THIS ***************
# MUST load the latest NCCL version
cluster.load_modules(['NCCL/2.4.7-1-cuda.10.0'])

# configure cluster
cluster.per_experiment_nb_nodes = 12
cluster.per_experiment_nb_gpus = 8

cluster.add_slurm_cmd(cmd='ntasks-per-node', value=8, comment='1 task per gpu')

# submit a script with 9 combinations of hyperparams
# (lr=1e-3, layers=16), (lr=1e-3, layers=32), (lr=1e-3, layers=64), ... (lr=1e-1, layers=64)
cluster.optimize_parallel_cluster_gpu(
    main,  # your training function, which receives the hyperparams
    nb_trials=9,  # how many permutations of the grid search to run
    job_name='name_for_squeue'
)
```

Alternatively, you can generate the scripts on your own (e.g. from a small bash script) or use another library.
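
For example, a minimal do-it-yourself sketch (hypothetical sweep values and paths; reuse the #SBATCH header from the script above):

```bash
#!/bin/bash
# generate and submit one SLURM job per learning rate
for lr in 1e-3 1e-2 1e-1; do
  script="submit_lr_${lr}.sh"
  cat > "$script" <<EOF
#!/bin/bash -l
#SBATCH --nodes=12
#SBATCH --gres=gpu:8
#SBATCH --ntasks-per-node=8
#SBATCH --time=0-02:00:00

conda activate my_env
export MASTER_PORT=\$((12000 + RANDOM % 8000))
python my_main_file.py --learning_rate ${lr}
EOF
  sbatch "$script"
done
```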

---

#### Self-balancing architecture

Here, Lightning distributes parts of your module across available GPUs to optimize for speed and memory.

COMING SOON.