# Multi-node example

This demo launches a job using 2 GPUs on each of 2 different nodes (4 GPUs total).

To run this demo, do the following:

1. Log into the jumphost node of your SLURM-managed cluster.
2. Create a conda environment with Lightning and a GPU-enabled build of PyTorch (a setup sketch follows this list).
3. Choose one of the scripts below to submit.
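One possible environment setup for step 2 (the environment name, Python version, and CUDA toolkit version are placeholders; match them to your cluster):

```bash
# create and activate a fresh environment (names/versions are illustrative)
conda create --name YourEnv python=3.7 -y
conda activate YourEnv

# install a GPU build of PyTorch plus Lightning
conda install pytorch torchvision cudatoolkit=10.0 -c pytorch -y
pip install pytorch-lightning
```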
#### DDP

Submit this job to run with DistributedDataParallel (2 nodes, 2 GPUs each):

```bash
sbatch ddp_job_submit.sh YourEnv
```
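The bundled `ddp_job_submit.sh` is not reproduced here; a SLURM submission script for this layout typically looks something like the sketch below (the partition defaults, time limit, and script name `train.py` are placeholder assumptions):

```bash
#!/bin/bash
#SBATCH --nodes=2              # 2 nodes
#SBATCH --gres=gpu:2           # 2 GPUs per node
#SBATCH --ntasks-per-node=2    # for DDP, one task (process) per GPU
#SBATCH --time=02:00:00        # placeholder time limit
#SBATCH --job-name=lightning_ddp

# activate the conda environment passed as the first argument (YourEnv)
source activate $1

# srun launches one training process per task; Lightning reads the
# SLURM environment variables to set up the process group
srun python train.py
```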
#### DDP2

Submit this job to run with a different implementation of DistributedDataParallel.
In this version, each node acts like DataParallel but syncs across nodes like DDP.
```bash
sbatch ddp2_job_submit.sh YourEnv
```
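Relative to the DDP script, the main change one would expect in `ddp2_job_submit.sh` is the task layout: DDP2 runs a single process per node, and that process drives all of the node's GPUs like DataParallel. A hedged sketch of the relevant SLURM header lines (not the verbatim script):

```bash
#SBATCH --nodes=2              # 2 nodes
#SBATCH --gres=gpu:2           # 2 GPUs per node
#SBATCH --ntasks-per-node=1    # for DDP2, one task per node (not per GPU)
```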