5.7 KiB
5.7 KiB
New project Quick Start
To start a new project define these two files.
- Define a LightningModule
- Pick a trainer
Docs shortcuts
Quick start examples
- CPU example
- Hyperparameter search on single GPU
- Hyperparameter search on multiple GPUs on same node
- [Hyperparameter search on a SLURM HPC cluster](examples/Examples/#Hyperparameter search on a SLURM HPC cluster)
Checkpointing
Computing cluster (SLURM)
Debugging
- Fast dev run
- Inspect gradient norms
- Log GPU usage
- Make model overfit on subset of data
- Print the parameter count by layer
- Pring which gradients are nan
Distributed training
Experiment Logging
- Display metrics in progress bar
- Log arbitrary metrics
- Log metric row every k batches
- Process position
- Save a snapshot of all hyperparameters
- Snapshot code for a training run
- Write logs file to csv every k batches
Training loop
- Accumulate gradients
- Anneal Learning rate
- Force training for min or max epochs
- Force disable early stop
- Gradient Clipping
- Use multiple optimizers (like GANs)
- Set how much of the training set to check (1-100%)
######Validation loop