2019-06-26 23:18:41 +00:00
# PYTORCH-LIGHTNING DOCUMENTATION
###### Quick start
- Define a Lightning model
- Set up the trainer
###### Quick start examples
- CPU example
- Single GPU example
- Multi-GPU example
- SLURM cluster example
###### Distributed training
- Single-GPU
- Multi-GPU
- Multi-node
###### Checkpointing
- Model saving
- Model loading
###### Computing cluster (SLURM)
- Automatic checkpointing
- Automatic saving, loading
- Walltime auto-resubmit
###### Common training use cases
- 16-bit mixed precision
- Accumulate gradients
- Check validation multiple times during one training epoch
- Check GPU usage
- Check validation every n epochs
- Check which gradients are NaN
- Inspect gradient norms
- Learning rate annealing
- Make the model overfit on a subset of the data
- Min, max epochs
- Multiple optimizers (like GANs)
- Run a sanity check of the model's validation and training steps
- Set how much of the training, validation, and test sets to check (1-100%)
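
As a rough illustration of the "Accumulate gradients" item in the list above, the sketch below shows the underlying idea in plain Python. It makes no PyTorch or Lightning calls. The toy model `y_hat = w * x`, the helper names `grad_mse` and `accumulated_grad`, and the data are all assumptions made up for this example. It only demonstrates that averaging gradients over equal-sized micro-batches reproduces the gradient of one larger batch.

```python
# Pure-Python sketch of gradient accumulation (hypothetical helpers, not a library API).
# Toy model: y_hat = w * x, loss = mean squared error.

def grad_mse(w, batch):
    """Gradient of mean((w * x - y) ** 2) with respect to w over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_grad(w, micro_batches):
    """Accumulate scaled micro-batch gradients before a single optimizer step."""
    total = 0.0
    for batch in micro_batches:
        # Divide by the number of micro-batches so the sum is an average.
        total += grad_mse(w, batch) / len(micro_batches)
    return total

# Illustrative data following y = 2 * x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
micro_batches = [data[:2], data[2:]]  # two equal-sized micro-batches

w = 0.5
full = grad_mse(w, data)                     # gradient over the full batch
accum = accumulated_grad(w, micro_batches)   # gradient accumulated over micro-batches
```

With equal-sized micro-batches, `full` and `accum` agree, which is why accumulation can emulate a larger batch size when memory is limited. If the micro-batches differ in size, each contribution would need to be weighted by its sample count instead.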