Latest commit 09167efdb5 by Jirka Borovec (2020-03-30 18:37:02 -04:00): Checkpointing interval (#1272). Changes: formatting, fix interval, fix train loop, fix test, parametrize test, apply suggestions from code review, fix calling, flake8, add types. Co-authored-by: Adrian Wälchli <adrian.waelchli@students.unibe.ch>, William Falcon <waf2107@columbia.edu>.
README.md	changes examples to pl_examples for name connflict	2019-10-19 00:41:17 +02:00
__init__.py	changes examples to pl_examples for name connflict	2019-10-19 00:41:17 +02:00
ddp2_job_submit.sh	changes examples to pl_examples for name connflict	2019-10-19 00:41:17 +02:00
ddp_job_submit.sh	changes examples to pl_examples for name connflict	2019-10-19 00:41:17 +02:00
multi_node_ddp2_demo.py	update Docs [links & formatting] (#769)	2020-02-09 17:39:10 -05:00
multi_node_ddp_demo.py	Checkpointing interval (#1272)	2020-03-30 18:37:02 -04:00

README.md

Multi-node example

This demo launches a job on 2 nodes with 2 GPUs each (4 GPUs total). To run it, do the following:

1. Log into the jumphost node of your SLURM-managed cluster.
2. Create a conda environment with Lightning and a GPU-enabled PyTorch build.
3. Choose one of the scripts below to submit.

DDP

Submit this job to run with DistributedDataParallel (2 nodes, 2 GPUs each):

sbatch ddp_job_submit.sh YourEnv
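
To make the 2 nodes x 2 GPUs layout concrete, here is a minimal sketch of how DDP is commonly laid out: one process per GPU, with a global rank derived from the node index and the local GPU index. The names `Worker` and `ddp_layout` are hypothetical and exist only for illustration, and the rank formula is the usual convention, assumed here rather than taken from the demo scripts.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Worker:
    """One DDP process (hypothetical helper type, not part of Lightning)."""
    node_rank: int    # index of the node the process runs on
    local_rank: int   # index of the GPU within that node
    global_rank: int  # unique index of the process across the whole job


def ddp_layout(num_nodes: int, gpus_per_node: int) -> List[Worker]:
    """Enumerate one process per GPU.

    Assumes the common convention:
    global_rank = node_rank * gpus_per_node + local_rank.
    """
    return [
        Worker(node, gpu, node * gpus_per_node + gpu)
        for node in range(num_nodes)
        for gpu in range(gpus_per_node)
    ]


if __name__ == "__main__":
    workers = ddp_layout(num_nodes=2, gpus_per_node=2)
    print(f"world size: {len(workers)}")
    for worker in workers:
        print(worker)
```

With 2 nodes and 2 GPUs per node this yields a world size of 4, which matches the "4 GPUs total" figure above.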

DDP2

Submit this job to run with DDP2, a different implementation of DistributedDataParallel. In this version, each node acts like DataParallel across its own GPUs but syncs across nodes like DDP.

sbatch ddp2_job_submit.sh YourEnv
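
The DDP2 description can be sketched the same way. The assumption here is that "each node acts like DataParallel" means one process per node driving all of that node's GPUs and splitting its batch among them. The helpers `ddp2_layout` and `split_batch` are hypothetical illustrations, not Lightning or PyTorch APIs, and the chunking only approximates what DataParallel does.

```python
from typing import Dict, List


def ddp2_layout(num_nodes: int, gpus_per_node: int) -> Dict[int, List[int]]:
    """Map each process (assumed: one per node) to the local GPU indices it drives."""
    return {node: list(range(gpus_per_node)) for node in range(num_nodes)}


def split_batch(batch: List[int], num_gpus: int) -> List[List[int]]:
    """Split one node's batch into contiguous chunks, one per GPU.

    Uses a ceiling-division chunk size, similar in spirit to how
    DataParallel scatters a batch across devices.
    """
    chunk = max(1, -(-len(batch) // num_gpus))  # ceiling division, at least 1
    return [batch[i:i + chunk] for i in range(0, len(batch), chunk)]


if __name__ == "__main__":
    layout = ddp2_layout(num_nodes=2, gpus_per_node=2)
    print(f"processes syncing across nodes: {len(layout)}")
    for node, gpus in layout.items():
        print(f"node {node} drives GPUs {gpus}")
    print(split_batch(list(range(8)), num_gpus=2))
```

Under these assumptions the same 4 GPUs are used, but only 2 processes take part in the cross-node synchronization, compared with 4 in the DDP layout.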