Basic Examples
Use these examples to test how Lightning works.
MNIST
Trains MNIST where the model is defined inside the LightningModule.
# cpu
python mnist_classifier.py
# gpus (any number)
python mnist_classifier.py --gpus 2
# dataparallel
python mnist_classifier.py --gpus 2 --distributed_backend 'dp'
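The three invocations above differ only in the `--gpus` and `--distributed_backend` flags. As a rough illustration of how such flags can be turned into training settings, here is a minimal standard-library sketch. It assumes the flags are parsed with `argparse` and forwarded as keyword arguments; the example scripts may wire this up differently. `build_parser` and `trainer_kwargs` are hypothetical names, and the defaults shown are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the two flags used in the commands above."""
    parser = argparse.ArgumentParser(description="Sketch of the example CLI")
    # Number of GPUs to train on; 0 is assumed here to mean CPU only.
    parser.add_argument("--gpus", type=int, default=0)
    # Backend name such as 'dp'; None means no backend was requested.
    parser.add_argument("--distributed_backend", type=str, default=None)
    return parser


def trainer_kwargs(argv):
    """Turn a list of CLI tokens into a plain dict of settings."""
    args = build_parser().parse_args(argv)
    return vars(args)


if __name__ == "__main__":
    # Equivalent to running the script with no flags (CPU).
    print(trainer_kwargs([]))
    # Equivalent to: --gpus 2 --distributed_backend 'dp'
    print(trainer_kwargs(["--gpus", "2", "--distributed_backend", "dp"]))
```

The shell strips the quotes around `'dp'`, so the script receives the bare string `dp`.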
MNIST with DALI
The MNIST example above, run with NVIDIA DALI. Requires an NVIDIA DALI build that matches your CUDA version; see the NVIDIA DALI installation instructions.
python mnist_classifier_dali.py
Image classifier
Generic image classifier with an arbitrary backbone (i.e., a simple system).
# cpu
python image_classifier.py
# gpus (any number)
python image_classifier.py --gpus 2
# dataparallel
python image_classifier.py --gpus 2 --distributed_backend 'dp'
Autoencoder
Shows the power of a system: arbitrarily complex training loops.
# cpu
python autoencoder.py
# gpus (any number)
python autoencoder.py --gpus 2
# dataparallel
python autoencoder.py --gpus 2 --distributed_backend 'dp'
Multi-node example
This demo launches a job that uses 2 GPUs on each of 2 nodes (4 GPUs total). To run it:
- Log into the jumphost node of your SLURM-managed cluster.
- Create a conda environment with Lightning and a GPU build of PyTorch.
- Choose a script to submit:
DDP
Submit this job to run with DistributedDataParallel (2 nodes, 2 GPUs each).
sbatch submit_ddp_job.sh YourEnv
DDP2
Submit this job to run with a different implementation of DistributedDataParallel. In this version, each node acts like DataParallel but syncs across nodes like DDP.
sbatch submit_ddp2_job.sh YourEnv
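To make the difference between the two modes concrete, the sketch below works out how many processes take part in cross-node synchronization for the 2-node, 2-GPU job. It rests on an assumption drawn from the descriptions above: `ddp` runs one process per GPU, while `ddp2` runs one process per node that drives all of that node's GPUs. `process_layout` and `global_rank` are hypothetical helpers for illustration, not part of Lightning.

```python
def process_layout(num_nodes, gpus_per_node, backend):
    """Return (world_size, gpus_per_process) for a backend name.

    Assumption: 'ddp' uses one process per GPU; 'ddp2' uses one process
    per node, and that process handles every GPU on its node.
    """
    if backend == "ddp":
        return num_nodes * gpus_per_node, 1
    if backend == "ddp2":
        return num_nodes, gpus_per_node
    raise ValueError(f"unknown backend: {backend!r}")


def global_rank(node_rank, local_rank, procs_per_node):
    """Global rank of a process when ranks are numbered node by node."""
    return node_rank * procs_per_node + local_rank


if __name__ == "__main__":
    for backend in ("ddp", "ddp2"):
        world_size, gpus_per_proc = process_layout(2, 2, backend)
        print(backend, "processes:", world_size, "GPUs per process:", gpus_per_proc)
```

Under these assumptions, the same 4 GPUs are split across 4 synchronizing processes in DDP and 2 in DDP2.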