<p align="center">
<a href="https://williamfalcon.github.io/pytorch-lightning/">
<img alt="" src="https://github.com/williamFalcon/pytorch-lightning/blob/master/docs/source/_static/lightning_logo.png" width="50">
</a>
</p>
<h3 align="center">
Pytorch Lightning
</h3>
<p align="center">
The Keras for ML researchers using PyTorch. More control. Less boilerplate.
</p>
<p align="center">
<a href="https://badge.fury.io/py/pytorch-lightning"><img src="https://badge.fury.io/py/pytorch-lightning.svg" alt="PyPI version" height="18"></a>
<!-- <a href="https://travis-ci.org/williamFalcon/test-tube"><img src="https://travis-ci.org/williamFalcon/pytorch-lightning.svg?branch=master"></a> -->
<a href="https://github.com/williamFalcon/pytorch-lightning/blob/master/COPYING"><img src="https://img.shields.io/badge/License-MIT-yellow.svg"></a>
</p>

```bash
pip install pytorch-lightning
```
## Docs
**[View the docs here](https://williamfalcon.github.io/pytorch-lightning/)**
## What is it?
Keras and fast.ai are too abstract for researchers. Lightning abstracts the full training loop but gives you control at the critical points.
## Why do I want to use lightning?
Because you want to follow best practices and get GPU training, multi-node training, checkpointing, mixed precision, and more for free, while keeping granular control over the core of the training, validation, and testing loops.

To use Lightning, do two things:

1. [Define a Trainer](https://github.com/williamFalcon/pytorch-lightning/blob/master/examples/new_project_templates/trainer_cpu_template.py).
2. [Define a LightningModel](https://github.com/williamFalcon/pytorch-lightning/blob/master/examples/new_project_templates/lightning_module_template.py).
## What does lightning control for me?
Everything, except the following three things, which you define yourself:

**Automatic training loop**
```python
# define what happens for training here
def training_step(self, data_batch, batch_nb):
    x, y = data_batch

    # define your own forward and loss calculation
    out = self.forward(x)
    loss = my_loss(out, y)
    return {'loss': loss}
```
**Automatic validation loop**

```python
# define what happens for validation here
def validation_step(self, data_batch, batch_nb):
    x, y = data_batch

    # define your own forward and loss calculation
    out = self.forward(x)
    loss = my_loss(out, y)

    # validation_end receives this dict for every batch, so return every
    # key it reads (add 'val_acc' here if you aggregate accuracy there)
    return {'val_loss': loss}
```
**Collate the output of the validation_step**

```python
def validation_end(self, outputs):
    """
    Called at the end of validation to aggregate outputs
    :param outputs: list of individual outputs of each validation step
    :return:
    """
    val_loss_mean = 0
    val_acc_mean = 0
    for output in outputs:
        val_loss_mean += output['val_loss']
        val_acc_mean += output['val_acc']

    val_loss_mean /= len(outputs)
    val_acc_mean /= len(outputs)
    tqdm_dic = {'val_loss': val_loss_mean.item(), 'val_acc': val_acc_mean.item()}
    return tqdm_dic
```
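To see how the three methods above fit together, here is a minimal, dependency-free sketch of a loop that calls them in order. It is not Lightning's actual Trainer: `ToyModel`, `run_toy_loop`, the hand-computed gradient, and the squared-error loss are all hypothetical stand-ins, and plain floats replace tensors so the snippet runs without PyTorch.

```python
# Hypothetical stand-in for the loop Lightning automates; not the library's code.


class ToyModel:
    """Implements the three user-defined hooks with plain Python floats."""

    def __init__(self):
        self.weight = 0.0  # single learnable parameter

    def forward(self, x):
        return self.weight * x

    def training_step(self, data_batch, batch_nb):
        x, y = data_batch
        out = self.forward(x)
        loss = (out - y) ** 2
        # gradient of the squared error w.r.t. weight, computed by hand
        grad = 2 * (out - y) * x
        return {'loss': loss, 'grad': grad}

    def validation_step(self, data_batch, batch_nb):
        x, y = data_batch
        out = self.forward(x)
        return {'val_loss': (out - y) ** 2}

    def validation_end(self, outputs):
        # average the per-batch losses collected by the loop
        val_loss_mean = sum(o['val_loss'] for o in outputs) / len(outputs)
        return {'val_loss': val_loss_mean}


def run_toy_loop(model, train_batches, val_batches, epochs=20, lr=0.05):
    """Calls the hooks in the order a trainer would: train, validate, collate."""
    metrics = {}
    for _ in range(epochs):
        for batch_nb, batch in enumerate(train_batches):
            result = model.training_step(batch, batch_nb)
            model.weight -= lr * result['grad']  # the update step the loop owns
        outputs = [model.validation_step(batch, batch_nb)
                   for batch_nb, batch in enumerate(val_batches)]
        metrics = model.validation_end(outputs)
    return metrics


train = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # samples of y = 2x
val = [(4.0, 8.0), (5.0, 10.0)]
model = ToyModel()
metrics = run_toy_loop(model, train, val)
print(round(model.weight, 3), metrics)
```

The point of the split is that the model only describes what happens to one batch, while the loop decides when each hook runs and what to do with the results.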
## Lightning gives you options to control the following:

**Checkpointing**

- Model saving
- Model loading

**Computing cluster (SLURM)**

- Automatic checkpointing
- Automatic saving, loading
- Running grid search on a cluster
- Walltime auto-resubmit

**Debugging**

- [Fast dev run](Debugging/#fast-dev-run)
- [Inspect gradient norms](Debugging/#inspect-gradient-norms)
- [Log GPU usage](Debugging/#Log-gpu-usage)
- [Make model overfit on subset of data](Debugging/#make-model-overfit-on-subset-of-data)
- [Print the parameter count by layer](Debugging/#print-the-parameter-count-by-layer)
- [Print which gradients are nan](Debugging/#print-which-gradients-are-nan)

**Distributed training**

- [16-bit mixed precision](Distributed%20training/#16-bit-mixed-precision)
- [Multi-GPU](Distributed%20training/#Multi-GPU)
- [Multi-node](Distributed%20training/#Multi-node)
- [Single GPU](Distributed%20training/#single-gpu)
- [Self-balancing architecture](Distributed%20training/#self-balancing-architecture)

**Experiment Logging**

- [Display metrics in progress bar](Logging/#display-metrics-in-progress-bar)
- Log arbitrary metrics
- [Log metric row every k batches](Logging/#log-metric-row-every-k-batches)
- [Process position](Logging/#process-position)
- [Save a snapshot of all hyperparameters](Logging/#save-a-snapshot-of-all-hyperparameters)
- [Snapshot code for a training run](Logging/#snapshot-code-for-a-training-run)
- [Write logs file to csv every k batches](Logging/#write-logs-file-to-csv-every-k-batches)

**Training loop**

- [Accumulate gradients](Training%20Loop/#accumulated-gradients)
- [Anneal Learning rate](Training%20Loop/#anneal-learning-rate)
- [Force training for min or max epochs](Training%20Loop/#force-training-for-min-or-max-epochs)
- [Force disable early stop](Training%20Loop/#force-disable-early-stop)
- [Use multiple optimizers (like GANs)](../Pytorch-lightning/LightningModule/#configure_optimizers)
- [Set how much of the training set to check (1-100%)](Training%20Loop/#set-how-much-of-the-training-set-to-check)

**Validation loop**

- [Check validation every n epochs](Validation%20Loop/#check-validation-every-n-epochs)
- [Set how much of the validation set to check](Validation%20Loop/#set-how-much-of-the-validation-set-to-check)
- [Set how much of the test set to check](Validation%20Loop/#set-how-much-of-the-test-set-to-check)
- [Set validation check frequency within 1 training epoch](Validation%20Loop/#set-validation-check-frequency-within-1-training-epoch)
- [Set the number of validation sanity steps](Validation%20Loop/#set-the-number-of-validation-sanity-steps)

## Demo
```bash
# install lightning
pip install pytorch-lightning

# clone lightning for the demo
git clone https://github.com/williamFalcon/pytorch-lightning.git
cd pytorch-lightning/docs/source/examples

# run demo (on cpu)
python fully_featured_trainer.py
```

Without changing the model AT ALL, you can run the model on a single gpu, over multiple gpus, or over multiple nodes.
```bash
# run a grid search on two gpus
python fully_featured_trainer.py --gpus "0;1"

# run single model on multiple gpus
python fully_featured_trainer.py --gpus "0;1" --interactive
```
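The demo script's own argument handling is not shown here. As a rough illustration of how a semicolon-separated flag such as `--gpus "0;1"` could be read, the sketch below uses the standard-library `argparse`. The helper names `parse_gpu_ids` and `build_parser`, and the choice that an empty list means CPU, are assumptions for this example and may differ from what `fully_featured_trainer.py` actually does.

```python
import argparse


def parse_gpu_ids(value):
    """Turn a string such as "0;1" into a list of integer GPU ids."""
    return [int(part) for part in value.split(';') if part.strip()]


def build_parser():
    # Hypothetical flags mirroring the demo commands above
    parser = argparse.ArgumentParser(description='hypothetical demo flags')
    # an empty list stands for "run on CPU" in this sketch
    parser.add_argument('--gpus', type=parse_gpu_ids, default=[])
    parser.add_argument('--interactive', action='store_true')
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv
args = build_parser().parse_args(['--gpus', '0;1', '--interactive'])
print(args.gpus, args.interactive)
```

Quoting the value on the command line matters because an unquoted `;` would be treated by the shell as a command separator.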