Basic Examples
Use these examples to test how Lightning works.
Test on CPU
python cpu_template.py
Train on a single GPU
python gpu_template.py --gpus 1
DataParallel (dp)
Train on multiple GPUs using DataParallel.
python gpu_template.py --gpus 2 --distributed_backend dp
DistributedDataParallel (ddp)
Train on multiple GPUs using DistributedDataParallel.
python gpu_template.py --gpus 2 --distributed_backend ddp
DistributedDataParallel+DP (ddp2)
Train on multiple GPUs using DistributedDataParallel + DataParallel. Within a single node, all GPUs are used for one copy of the model (as in DataParallel); gradient information is then shared across nodes (as in DistributedDataParallel).
python gpu_template.py --gpus 2 --distributed_backend ddp2
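The commands above differ only in the `--gpus` and `--distributed_backend` flags. As a minimal sketch of how a script could read those two flags, the snippet below uses `argparse` from the standard library. It is not the actual contents of `gpu_template.py`: the function name `build_parser`, the defaults, and the restriction to the three backend names used in this README are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser for the two flags used in this README."""
    parser = argparse.ArgumentParser(description="Flag-parsing sketch")
    # Number of GPUs to train on; 0 is assumed here to mean CPU-only.
    parser.add_argument("--gpus", type=int, default=0)
    # Backend names taken from this README; None means no backend was given.
    parser.add_argument(
        "--distributed_backend",
        choices=["dp", "ddp", "ddp2"],
        default=None,
    )
    return parser


# Equivalent to: python some_script.py --gpus 2 --distributed_backend ddp
args = build_parser().parse_args(["--gpus", "2", "--distributed_backend", "ddp"])
print(args.gpus, args.distributed_backend)
```

Restricting `--distributed_backend` with `choices` makes a misspelled backend fail at parse time rather than partway through training setup.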
Multi-node example
This demo launches a job that uses 2 GPUs on each of 2 different nodes (4 GPUs total). To run this demo, do the following:
- Log into the jumphost node of your SLURM-managed cluster.
- Create a conda environment with Lightning and a GPU PyTorch version.
- Choose a script to submit.
DDP
Submit this job to run with DistributedDataParallel (2 nodes, 2 GPUs each).
sbatch submit_ddp_job.sh YourEnv
DDP2
Submit this job to run with a different implementation of DistributedDataParallel. In this version, each node acts like DataParallel but syncs across nodes like DDP.
sbatch submit_ddp2_job.sh YourEnv
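To make the difference between the two multi-node modes concrete, here is a small sketch of the arithmetic for the 2-node, 2-GPU-per-node setup described above. It assumes that DDP runs one process per GPU and that DDP2 runs one process per node, with that process driving all GPUs on its node in DataParallel fashion. The function `process_layout` is hypothetical and not part of Lightning.

```python
def process_layout(backend: str, num_nodes: int, gpus_per_node: int) -> dict:
    """Return the assumed process layout for a multi-node job."""
    total_gpus = num_nodes * gpus_per_node
    if backend == "ddp":
        # Assumption: one process per GPU, gradients synced across all of them.
        processes = total_gpus
    elif backend == "ddp2":
        # Assumption: one process per node, each driving that node's GPUs.
        processes = num_nodes
    else:
        raise ValueError(f"unsupported backend: {backend!r}")
    return {
        "total_gpus": total_gpus,
        "processes": processes,
        "gpus_per_process": total_gpus // processes,
    }


print(process_layout("ddp", num_nodes=2, gpus_per_node=2))
print(process_layout("ddp2", num_nodes=2, gpus_per_node=2))
```

Under these assumptions, both jobs use 4 GPUs in total: the DDP job runs 4 processes with 1 GPU each, and the DDP2 job runs 2 processes with 2 GPUs each.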