New project Quick Start
To start a new project, define two files: a LightningModule and a Trainer file.
Keeping the trainer in its own file lets it run many LightningModules. Each LightningModule holds the core logic of a particular research project.
For example, one LightningModule could be an image classifier and another a seq-2-seq model, both (optionally) run by the same trainer file.
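
The split described above can be sketched in plain Python. This is only an illustration of the idea, not the library's API: `ToyModule`, `ToyImageClassifier`, `ToySeq2Seq`, `ToyTrainer`, and `run_all` are hypothetical stand-ins, and the "losses" are made-up arithmetic rather than real training.

```python
# Toy stand-ins only: these are NOT the Lightning classes. They just
# illustrate the "many modules, one trainer file" split.


class ToyModule:
    """Hypothetical base class: holds the research-specific logic."""

    name = "base"

    def training_step(self, batch):
        raise NotImplementedError


class ToyImageClassifier(ToyModule):
    name = "image_classifier"

    def training_step(self, batch):
        # Pretend loss: mean of the batch values.
        return sum(batch) / len(batch)


class ToySeq2Seq(ToyModule):
    name = "seq2seq"

    def training_step(self, batch):
        # Pretend loss: largest value in the batch.
        return max(batch)


class ToyTrainer:
    """Hypothetical trainer: owns the loop, knows nothing about the model."""

    def __init__(self, max_epochs=1):
        self.max_epochs = max_epochs

    def fit(self, module, batches):
        losses = []
        for _ in range(self.max_epochs):
            for batch in batches:
                losses.append(module.training_step(batch))
        return losses


def run_all(modules, batches, max_epochs=1):
    # The "trainer file": one trainer, reused for every module.
    trainer = ToyTrainer(max_epochs=max_epochs)
    return {m.name: trainer.fit(m, batches) for m in modules}


results = run_all(
    [ToyImageClassifier(), ToySeq2Seq()],
    [[1.0, 3.0], [2.0, 4.0]],
)
```

The point of the design is that `ToyTrainer.fit` only calls `training_step`, so adding a third project means writing a new module and leaving the trainer file unchanged.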
Docs shortcuts
Quick start examples
- CPU example
- Hyperparameter search on single GPU
- Hyperparameter search on multiple GPUs on same node
- [Hyperparameter search on a SLURM HPC cluster](<examples/Examples/#Hyperparameter search on a SLURM HPC cluster>)
Checkpointing
Computing cluster (SLURM)
Debugging
- Fast dev run
- Inspect gradient norms
- Log GPU usage
- Make model overfit on subset of data
- Print the parameter count by layer
- Print which gradients are NaN
- Print input and output size of every module in system
Distributed training
Experiment Logging
- Display metrics in progress bar
- Log metric row every k batches
- Process position
- Tensorboard support
- Save a snapshot of all hyperparameters
- Snapshot code for a training run
- Write logs file to csv every k batches
Training loop
- Accumulate gradients
- Force training for min or max epochs
- Force disable early stop
- Gradient Clipping
- Hooks
- Learning rate scheduling
- Use multiple optimizers (like GANs)
- Set how much of the training set to check (1-100%)
- Step optimizers at arbitrary intervals