Trainer
The Lightning Trainer abstracts the best practices for running your training, validation, and test routines. It calls into your model when it needs to hand over full control, and otherwise makes the training decisions that are now standard practice in AI research.
This is the basic use of the trainer:
from pytorch_lightning import Trainer
model = LightningTemplate()  # your LightningModule subclass
trainer = Trainer()
trainer.fit(model)
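Here LightningTemplate stands in for your own LightningModule subclass. As a rough sketch of what such a module looks like (the class name and the toy random data are illustrative, and hook signatures vary slightly across Lightning versions):

import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

class LightningTemplate(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(28 * 28, 10)

    def forward(self, x):
        return self.layer(x.view(x.size(0), -1))

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.cross_entropy(self(x), y)

    def configure_optimizers(self):
        return optim.Adam(self.parameters(), lr=1e-3)

    def train_dataloader(self):
        # toy random data so the example is self-contained
        x = torch.randn(256, 1, 28, 28)
        y = torch.randint(0, 10, (256,))
        return DataLoader(TensorDataset(x, y), batch_size=32)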
But of course the fun is in all the advanced things it can do:
Checkpointing
- Model saving
- Model loading
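A hedged sketch of the saving and loading workflow above; ModelCheckpoint argument names have shifted across Lightning versions, and callbacks=[...] assumes a 1.x-era API:

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# assumes 'val_loss' is logged somewhere in the validation loop
checkpoint = ModelCheckpoint(monitor='val_loss', save_top_k=1)
trainer = Trainer(callbacks=[checkpoint])
trainer.fit(model)

# later: restore the best weights
model = LightningTemplate.load_from_checkpoint(checkpoint.best_model_path)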
Computing cluster (SLURM)
- Automatic checkpointing
- Automatic saving, loading
- Running grid search on a cluster
- Walltime auto-resubmit
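The training script itself needs nothing SLURM-specific; when launched through sbatch, Lightning reads the SLURM environment to wire up the job. A hedged sketch, with argument names that vary by version:

from pytorch_lightning import Trainer

# Requesting a signal in the submission script, e.g. `#SBATCH --signal=SIGUSR1@90`,
# lets Lightning checkpoint and requeue the job shortly before the walltime limit.
model = LightningTemplate()
trainer = Trainer(gpus=8, num_nodes=4)
trainer.fit(model)  # a resubmitted job resumes from the automatic checkpoint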
Debugging
- Fast dev run
- Inspect gradient norms
- Log GPU usage
- Make model overfit on subset of data
- Print the parameter count by layer
- Print which gradients are NaN
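Most of the debugging aids above are single Trainer flags. A hedged sketch using 1.x-era flag names (several of these were later renamed or moved into callbacks):

from pytorch_lightning import Trainer

trainer = Trainer(
    fast_dev_run=True,       # run a single batch through train and val to smoke-test the code
    track_grad_norm=2,       # log the 2-norm of the gradients
    overfit_batches=0.01,    # try to overfit 1% of the training data
    weights_summary='full',  # print the parameter count for every layer
)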
Distributed training
- 16-bit mixed precision
- Single-gpu
- Multi-gpu
- Multi-node
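Scaling up is likewise driven by Trainer arguments. A hedged sketch; the backend argument in particular has been renamed over time (distributed_backend, then accelerator, then strategy):

from pytorch_lightning import Trainer

trainer = Trainer(gpus=1)                                  # single GPU
trainer = Trainer(gpus=4, precision=16)                    # multi-GPU, 16-bit mixed precision
trainer = Trainer(gpus=8, num_nodes=4, accelerator='ddp')  # multi-node with DDP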
Experiment Logging
- Display metrics in progress bar
- Log arbitrary metrics
- Process position
- Write logs to a csv file every k batches
- Log metric row every k batches
- Save a snapshot of all hyperparameters
- Save a snapshot of the code for a particular model run
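A hedged sketch of how metrics reach the logger and the progress bar, using the self.log API from Lightning 1.x (earlier versions returned 'log' and 'progress_bar' dicts from training_step instead); the class name here is hypothetical:

from torch import nn
import pytorch_lightning as pl
from pytorch_lightning import Trainer

class LoggingTemplate(pl.LightningModule):
    def __init__(self, learning_rate=1e-3):
        super().__init__()
        self.save_hyperparameters()  # snapshot all __init__ arguments with the run
        self.layer = nn.Linear(28 * 28, 10)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.cross_entropy(self.layer(x.view(x.size(0), -1)), y)
        self.log('train_loss', loss, prog_bar=True)  # displayed in the progress bar
        return loss

trainer = Trainer(log_every_n_steps=50)  # write a metric row every 50 batches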
Training loop
- Accumulate gradients
- Anneal learning rate
- Force training for min or max epochs
- Force disable early stop
- Use multiple optimizers (like GANs)
- Set how much of the training set to check (1-100%)
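A hedged sketch of the training-loop controls above (limit_train_batches was called train_percent_check in early versions, and the GAN-style optimizers assume hypothetical self.generator / self.discriminator submodules):

from torch import optim
from pytorch_lightning import Trainer

trainer = Trainer(
    accumulate_grad_batches=4,   # sum gradients over 4 batches before stepping
    min_epochs=1,
    max_epochs=100,
    limit_train_batches=0.25,    # check only 25% of the training set each epoch
)

# inside your LightningModule: multiple optimizers plus learning-rate annealing
def configure_optimizers(self):
    opt_g = optim.Adam(self.generator.parameters(), lr=2e-4)
    opt_d = optim.Adam(self.discriminator.parameters(), lr=2e-4)
    schedulers = [optim.lr_scheduler.StepLR(opt_g, step_size=10)]
    return [opt_g, opt_d], schedulers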
Validation loop