<p align="center">
<a href="https://williamfalcon.github.io/pytorch-lightning/">
<img alt="" src="https://github.com/williamFalcon/pytorch-lightning/blob/master/docs/source/_static/lightning_logo.png" width="50">
</a>
</p>
<h3 align="center">
Pytorch Lightning
</h3>
<p align="center">
The Keras for ML researchers using PyTorch. More control. Less boilerplate.
</p>
<p align="center">
<a href="https://badge.fury.io/py/pytorch-lightning"><img src="https://badge.fury.io/py/pytorch-lightning.svg" alt="PyPI version" height="18"></a>
<!-- <a href="https://travis-ci.org/williamFalcon/test-tube"><img src="https://travis-ci.org/williamFalcon/pytorch-lightning.svg?branch=master"></a> -->
<a href="https://github.com/williamFalcon/pytorch-lightning/blob/master/COPYING"><img src="https://img.shields.io/badge/License-MIT-yellow.svg"></a>
</p>

```bash
pip install pytorch-lightning
```
## Docs
**[View the docs here](https://williamfalcon.github.io/pytorch-lightning/)**
## What is it?
Keras and fast.ai are too abstract for researchers. Lightning abstracts the full training loop but gives you control at the critical points.

## Why do I want to use lightning?
Because you don't want to define a training loop, validation loop, gradient clipping, checkpointing, loading,
GPU training, etc. every time you start a project. Let lightning handle all of that for you! Just define your
data and what happens in the training, testing, and validation loops, and lightning will do the rest.

To use lightning, do two things:
1. [Define a Trainer](https://github.com/williamFalcon/pytorch-lightning/blob/master/examples/new_project_templates/trainer_cpu_template.py).
2. [Define a LightningModule](https://github.com/williamFalcon/pytorch-lightning/blob/master/examples/new_project_templates/lightning_module_template.py).

## What does lightning control for me?
Everything! Except the following three things:
**What happens in the training loop**
```python
# define what happens for training here
def training_step(self, data_batch, batch_nb):
    x, y = data_batch

    # define your own forward and loss calculation
    out = self.forward(x)
    loss = my_loss(out, y)
    return {'loss': loss}
```
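
To make the division of labor concrete, here is a minimal plain-Python sketch of a loop that calls `training_step` once per batch. It is an illustration only, not lightning's implementation: `ToyModule`, `fit`, the `'grad'` key, and the hand-written gradient are all hypothetical, standing in for what PyTorch autograd and an optimizer would normally do.

```python
class ToyModule:
    """Hypothetical stand-in for a model: fits y = w * x with a single weight."""

    def __init__(self):
        self.w = 0.0

    def forward(self, x):
        return self.w * x

    def training_step(self, data_batch, batch_nb):
        x, y = data_batch
        out = self.forward(x)
        loss = (out - y) ** 2
        # No autograd in this sketch, so the module reports d(loss)/dw by hand.
        grad = 2 * (out - y) * x
        return {'loss': loss, 'grad': grad}


def fit(model, batches, nb_epochs=20, lr=0.05):
    """Hypothetical loop: the part you would otherwise rewrite for every project."""
    losses = []
    for epoch in range(nb_epochs):
        for batch_nb, data_batch in enumerate(batches):
            output = model.training_step(data_batch, batch_nb)
            model.w -= lr * output['grad']  # plain SGD update
            losses.append(output['loss'])
    return losses


batches = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # samples of y = 2x
model = ToyModule()
losses = fit(model, batches)
print(model.w, losses[0], losses[-1])
```

The model only says what happens for one batch; everything about iterating, updating, and bookkeeping lives in the loop.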
**What happens in the validation loop**
```python
# define what happens for validation here
def validation_step(self, data_batch, batch_nb):
    x, y = data_batch

    # define your own forward and loss calculation
    out = self.forward(x)
    loss = my_loss(out, y)

    # return the keys that validation_end aggregates;
    # my_accuracy, like my_loss, is a placeholder for your own metric
    acc = my_accuracy(out, y)
    return {'val_loss': loss, 'val_acc': acc}
```
**And what to do with the output of all validation batches**
```python
def validation_end(self, outputs):
    """
    Called at the end of validation to aggregate outputs
    :param outputs: list of individual outputs of each validation step
    :return:
    """
    val_loss_mean = 0
    val_acc_mean = 0
    for output in outputs:
        val_loss_mean += output['val_loss']
        val_acc_mean += output['val_acc']

    val_loss_mean /= len(outputs)
    val_acc_mean /= len(outputs)
    tqdm_dic = {'val_loss': val_loss_mean.item(), 'val_acc': val_acc_mean.item()}
    return tqdm_dic
```
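
The same aggregation can be sketched in plain Python, assuming each validation step returns a dict of numeric metrics. `aggregate_validation_outputs` is a hypothetical helper, not part of lightning, and it works on floats rather than tensors, so no `.item()` call is needed.

```python
def aggregate_validation_outputs(outputs):
    """Average every metric across a list of per-batch output dicts."""
    if not outputs:
        raise ValueError("outputs must contain at least one validation step result")
    keys = outputs[0].keys()
    return {key: sum(output[key] for output in outputs) / len(outputs) for key in keys}


# two fake validation batches with made-up metric values
outputs = [
    {'val_loss': 0.5, 'val_acc': 0.75},
    {'val_loss': 0.25, 'val_acc': 1.0},
]
means = aggregate_validation_outputs(outputs)
print(means)
```

Note that this is an unweighted mean of per-batch values. If the last batch is smaller than the others, the result differs slightly from the mean over all samples.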
## Lightning gives you options to control the following:

**Checkpointing**

- Model saving
- Model loading

**Computing cluster (SLURM)**

- Automatic checkpointing
- Automatic saving, loading
- Running grid search on a cluster
- Walltime auto-resubmit

**Debugging**

- [Fast dev run](https://williamfalcon.github.io/pytorch-lightning/Trainer/debugging/#fast-dev-run)
- [Inspect gradient norms](https://williamfalcon.github.io/pytorch-lightning/Trainer/debugging/#inspect-gradient-norms)
- [Log GPU usage](https://williamfalcon.github.io/pytorch-lightning/Trainer/debugging/#Log-gpu-usage)
- [Make model overfit on subset of data](https://williamfalcon.github.io/pytorch-lightning/Trainer/debugging/#make-model-overfit-on-subset-of-data)
- [Print the parameter count by layer](https://williamfalcon.github.io/pytorch-lightning/Trainer/debugging/#print-the-parameter-count-by-layer)
- [Print which gradients are nan](https://williamfalcon.github.io/pytorch-lightning/Trainer/debugging/#print-which-gradients-are-nan)

**Distributed training**

- [16-bit mixed precision](https://williamfalcon.github.io/pytorch-lightning/Trainer/Distributed%20training/#16-bit-mixed-precision)
- [Multi-GPU](https://williamfalcon.github.io/pytorch-lightning/Trainer/Distributed%20training/#Multi-GPU)
- [Multi-node](https://williamfalcon.github.io/pytorch-lightning/Trainer/Distributed%20training/#Multi-node)
- [Single GPU](https://williamfalcon.github.io/pytorch-lightning/Trainer/Distributed%20training/#single-gpu)
- [Self-balancing architecture](https://williamfalcon.github.io/pytorch-lightning/Trainer/Distributed%20training/#self-balancing-architecture)

**Experiment Logging**

- [Display metrics in progress bar](https://williamfalcon.github.io/pytorch-lightning/Trainer/Logging/#display-metrics-in-progress-bar)
- Log arbitrary metrics
- [Log metric row every k batches](https://williamfalcon.github.io/pytorch-lightning/Trainer/Logging/#log-metric-row-every-k-batches)
- [Process position](Logging/#process-position)
- [Save a snapshot of all hyperparameters](https://williamfalcon.github.io/pytorch-lightning/Trainer/Logging/#save-a-snapshot-of-all-hyperparameters)
- [Snapshot code for a training run](https://williamfalcon.github.io/pytorch-lightning/Trainer/Logging/#snapshot-code-for-a-training-run)
- [Write logs file to csv every k batches](https://williamfalcon.github.io/pytorch-lightning/Trainer/Logging/#write-logs-file-to-csv-every-k-batches)

**Training loop**

- [Accumulate gradients](https://williamfalcon.github.io/pytorch-lightning/Trainer/Training%20Loop/#accumulated-gradients)
- [Anneal Learning rate](https://williamfalcon.github.io/pytorch-lightning/Trainer/Training%20Loop/#anneal-learning-rate)
- [Force training for min or max epochs](https://williamfalcon.github.io/pytorch-lightning/Trainer/Training%20Loop/#force-training-for-min-or-max-epochs)
- [Force disable early stop](https://williamfalcon.github.io/pytorch-lightning/Trainer/Training%20Loop/#force-disable-early-stop)
- [Use multiple optimizers (like GANs)](https://williamfalcon.github.io/pytorch-lightning/Pytorch-Lightning/LightningModule/#configure_optimizers)
- [Set how much of the training set to check (1-100%)](https://williamfalcon.github.io/pytorch-lightning/Trainer/Training%20Loop/#set-how-much-of-the-training-set-to-check)

**Validation loop**

- [Check validation every n epochs](Validation%20Loop/#check-validation-every-n-epochs)
- [Set how much of the validation set to check](Validation%20Loop/#set-how-much-of-the-validation-set-to-check)
- [Set how much of the test set to check](Validation%20Loop/#set-how-much-of-the-test-set-to-check)
- [Set validation check frequency within 1 training epoch](Validation%20Loop/#set-validation-check-frequency-within-1-training-epoch)
- [Set the number of validation sanity steps](Validation%20Loop/#set-the-number-of-validation-sanity-steps)

## Demo
```bash
# install lightning
pip install pytorch-lightning

# clone lightning for the demo
git clone https://github.com/williamFalcon/pytorch-lightning.git
cd pytorch-lightning/examples/new_project_templates/

# run demo (on cpu)
python trainer_cpu_template.py
```

Without changing the model AT ALL, you can run the model on a single gpu, over multiple gpus, or over multiple nodes.
```bash
# run a grid search on two gpus
python fully_featured_trainer.py --gpus "0;1"

# run single model on multiple gpus
python fully_featured_trainer.py --gpus "0;1" --interactive
```