# Multi-node example

This demo launches a job using 2 GPUs on each of 2 nodes (4 GPUs total). To run this demo, do the following:

- Log into the jumphost node of your SLURM-managed cluster.
- Create a conda environment with Lightning and a GPU-enabled PyTorch build.
- Choose a script to submit.
## DDP

Submit this job to run with DistributedDataParallel (2 nodes, 2 GPUs each):

```bash
sbatch ddp_job_submit.sh YourEnv
```
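The submission script itself is not reproduced here; the sketch below shows what a SLURM batch script for this 2-node, 2-GPU-per-node layout might look like. All `#SBATCH` values and the activation command are assumptions for illustration, not the actual contents of `ddp_job_submit.sh`.

```bash
#!/bin/bash
#SBATCH --job-name=ddp-demo
#SBATCH --nodes=2                 # 2 nodes
#SBATCH --ntasks-per-node=2       # one task per GPU
#SBATCH --gres=gpu:2              # 2 GPUs per node
#SBATCH --time=00:30:00           # assumed walltime, adjust as needed

# the conda environment name is passed as the first argument (YourEnv above)
source activate "$1"

# srun launches one copy of the training script per task (4 in total)
srun python multi_node_ddp_demo.py
```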
## DDP2

Submit this job to run with a different implementation of DistributedDataParallel. In this version, each node acts like DataParallel but syncs across nodes like DDP:

```bash
sbatch ddp2_job_submit.sh YourEnv
```
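As a rough illustration of the difference between the two backends (this is not part of the demo code): DDP runs one process per GPU, while DDP2 runs one process per node and uses DataParallel across the GPUs inside each node. The plain-Python sketch below works out the resulting process layout for this demo's 2-node, 2-GPU configuration.

```python
def process_layout(num_nodes: int, gpus_per_node: int, mode: str):
    """Sketch of how many processes each backend launches.

    mode='ddp'  -> one process per GPU; world size = num_nodes * gpus_per_node
    mode='ddp2' -> one process per node; world size = num_nodes
    """
    if mode == "ddp":
        world_size = num_nodes * gpus_per_node
        # global rank = node_rank * gpus_per_node + local_rank
        ranks = {(n, g): n * gpus_per_node + g
                 for n in range(num_nodes) for g in range(gpus_per_node)}
    elif mode == "ddp2":
        world_size = num_nodes
        # one process per node, so the node rank is the global rank
        ranks = {(n, 0): n for n in range(num_nodes)}
    else:
        raise ValueError(f"unknown mode: {mode}")
    return world_size, ranks

# The 2-node, 2-GPUs-per-node job from this demo:
print(process_layout(2, 2, "ddp")[0])   # 4 processes for DDP
print(process_layout(2, 2, "ddp2")[0])  # 2 processes for DDP2
```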