<p align="center">
<a href="https://williamfalcon.github.io/pytorch-lightning/">
<img alt="" src="https://github.com/williamFalcon/pytorch-lightning/blob/master/docs/source/_static/lightning_logo.png" width="50">
</a>
</p>
<h3 align="center">
PyTorch Lightning
</h3>
<p align="center">
The Keras for ML researchers using PyTorch. More control. Less boilerplate.
</p>
<p align="center">
<a href="https://badge.fury.io/py/pytorch-lightning"><img src="https://badge.fury.io/py/pytorch-lightning.svg" alt="PyPI version" height="18"></a>
<!-- <a href="https://travis-ci.org/williamFalcon/test-tube"><img src="https://travis-ci.org/williamFalcon/pytorch-lightning.svg?branch=master"></a> -->
<a href="https://github.com/williamFalcon/pytorch-lightning/blob/master/COPYING"><img src="https://img.shields.io/badge/License-MIT-yellow.svg"></a>
</p>
```bash
pip install pytorch-lightning
```
## Docs
**[View the docs here](https://williamfalcon.github.io/pytorch-lightning/)**
## What is it?
Keras and fast.ai abstract away too much for research work. Lightning abstracts the full training loop but leaves you in control at the critical points.
## Why do I want to use lightning?
Because you don't want to rewrite the training loop, validation loop, gradient clipping, checkpointing, loading,
GPU training, etc. every time you start a project. Let Lightning handle all of that for you! Just define your
data and what happens in the training, testing, and validation loops, and Lightning will do the rest.

To use Lightning, do two things:

1. [Define a Trainer](https://github.com/williamFalcon/pytorch-lightning/blob/master/examples/new_project_templates/trainer_cpu_template.py).
2. [Define a LightningModule](https://github.com/williamFalcon/pytorch-lightning/blob/master/examples/new_project_templates/lightning_module_template.py).
## What does lightning control for me?
Everything! Except the following three things:

**What happens in the training loop**
```python
# define what happens for training here
def training_step(self, data_batch, batch_nb):
    x, y = data_batch

    # define your own forward and loss calculation
    # (my_loss is a placeholder for your own loss function)
    out = self.forward(x)
    loss = my_loss(out, y)
    return {'loss': loss}
```
**What happens in the validation loop**
```python
# define what happens for validation here
def validation_step(self, data_batch, batch_nb):
    x, y = data_batch

    # define your own forward, loss and accuracy calculation
    # (my_loss and my_accuracy are placeholders for your own functions)
    out = self.forward(x)
    loss = my_loss(out, y)
    acc = my_accuracy(out, y)

    # the keys must match what validation_end reads
    return {'val_loss': loss, 'val_acc': acc}
```
**And what to do with the output of all validation batches**
```python
def validation_end(self, outputs):
    """
    Called at the end of validation to aggregate outputs
    :param outputs: list of individual outputs of each validation step
    :return: dict with the mean validation loss and accuracy
    """
    val_loss_mean = 0
    val_acc_mean = 0
    for output in outputs:
        val_loss_mean += output['val_loss']
        val_acc_mean += output['val_acc']

    val_loss_mean /= len(outputs)
    val_acc_mean /= len(outputs)
    tqdm_dic = {'val_loss': val_loss_mean.item(), 'val_acc': val_acc_mean.item()}
    return tqdm_dic
```
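
To make the division of labour concrete, here is a minimal sketch of how a trainer could drive the two validation hooks above: call `validation_step` once per batch, collect the returned dicts, and hand the list to `validation_end`. It is not Lightning's actual implementation. `ToyModule`, `run_validation`, and the use of plain Python floats in place of tensors are assumptions made so the example runs without PyTorch.

```python
# Hypothetical stand-in for the validation flow; not Lightning's real code.


class ToyModule:
    """Toy model: predicts y = 2 * x, with floats standing in for tensors."""

    def forward(self, x):
        return 2.0 * x

    def validation_step(self, data_batch, batch_nb):
        x, y = data_batch
        out = self.forward(x)
        loss = (out - y) ** 2                      # squared error for this batch
        acc = 1.0 if abs(out - y) < 0.5 else 0.0   # "correct" if close enough
        return {'val_loss': loss, 'val_acc': acc}

    def validation_end(self, outputs):
        # average the per-batch metrics collected by the loop
        n = len(outputs)
        return {
            'val_loss': sum(o['val_loss'] for o in outputs) / n,
            'val_acc': sum(o['val_acc'] for o in outputs) / n,
        }


def run_validation(module, batches):
    """Call validation_step per batch, then aggregate once at the end."""
    outputs = [
        module.validation_step(batch, batch_nb)
        for batch_nb, batch in enumerate(batches)
    ]
    return module.validation_end(outputs)


if __name__ == '__main__':
    batches = [(1.0, 2.0), (2.0, 4.0), (3.0, 7.0)]  # (x, y) pairs
    print(run_validation(ToyModule(), batches))
```

The point of the sketch is the contract: whatever keys `validation_step` returns are exactly the keys `validation_end` gets to read.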
## Lightning gives you options to control the following:
**Checkpointing**

- Model saving
- Model loading

**Computing cluster (SLURM)**

- Automatic checkpointing
- Automatic saving, loading
- Running grid search on a cluster
- Walltime auto-resubmit

**Debugging**

- [Fast dev run](https://williamfalcon.github.io/pytorch-lightning/Trainer/debugging/#fast-dev-run)
- [Inspect gradient norms](https://williamfalcon.github.io/pytorch-lightning/Trainer/debugging/#inspect-gradient-norms)
- [Log GPU usage](https://williamfalcon.github.io/pytorch-lightning/Trainer/debugging/#Log-gpu-usage)
- [Make model overfit on subset of data](https://williamfalcon.github.io/pytorch-lightning/Trainer/debugging/#make-model-overfit-on-subset-of-data)
- [Print the parameter count by layer](https://williamfalcon.github.io/pytorch-lightning/Trainer/debugging/#print-the-parameter-count-by-layer)
- [Print which gradients are NaN](https://williamfalcon.github.io/pytorch-lightning/Trainer/debugging/#print-which-gradients-are-nan)

**Distributed training**

- [16-bit mixed precision](Distributed%20training/#16-bit-mixed-precision)
- [Multi-GPU](Distributed%20training/#Multi-GPU)
- [Multi-node](Distributed%20training/#Multi-node)
- [Single GPU](Distributed%20training/#single-gpu)
- [Self-balancing architecture](Distributed%20training/#self-balancing-architecture)

**Experiment Logging**

- [Display metrics in progress bar](Logging/#display-metrics-in-progress-bar)
- Log arbitrary metrics
- [Log metric row every k batches](Logging/#log-metric-row-every-k-batches)
- [Process position](Logging/#process-position)
- [Save a snapshot of all hyperparameters](Logging/#save-a-snapshot-of-all-hyperparameters)
- [Snapshot code for a training run](Logging/#snapshot-code-for-a-training-run)
- [Write logs file to csv every k batches](Logging/#write-logs-file-to-csv-every-k-batches)

**Training loop**

- [Accumulate gradients](Training%20Loop/#accumulated-gradients)
- [Anneal learning rate](Training%20Loop/#anneal-learning-rate)
- [Force training for min or max epochs](Training%20Loop/#force-training-for-min-or-max-epochs)
- [Force disable early stop](Training%20Loop/#force-disable-early-stop)
- [Use multiple optimizers (like GANs)](../Pytorch-lightning/LightningModule/#configure_optimizers)
- [Set how much of the training set to check (1-100%)](Training%20Loop/#set-how-much-of-the-training-set-to-check)

**Validation loop**

- [Check validation every n epochs](Validation%20Loop/#check-validation-every-n-epochs)
- [Set how much of the validation set to check](Validation%20Loop/#set-how-much-of-the-validation-set-to-check)
- [Set how much of the test set to check](Validation%20Loop/#set-how-much-of-the-test-set-to-check)
- [Set validation check frequency within 1 training epoch](Validation%20Loop/#set-validation-check-frequency-within-1-training-epoch)
- [Set the number of validation sanity steps](Validation%20Loop/#set-the-number-of-validation-sanity-steps)
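
As an illustration of one training-loop option listed above, accumulated gradients, here is a minimal sketch of the idea in plain Python: gradients from several small batches are summed, and the weight is updated once per group. This is not Lightning's implementation. The names `grad_mse`, `train_with_accumulation`, and `accumulate_every`, the scalar model `y_hat = w * x`, and the hand-derived gradient are all assumptions chosen so the example runs without PyTorch.

```python
# Hypothetical sketch of gradient accumulation; not Lightning's real code.


def grad_mse(w, batch):
    """Gradient of the mean squared error for the model y_hat = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def train_with_accumulation(batches, w=0.0, lr=0.05, accumulate_every=2):
    """Update w once every `accumulate_every` batches."""
    accumulated = 0.0
    for batch_nb, batch in enumerate(batches, start=1):
        # scale so the accumulated value is an average over the group
        accumulated += grad_mse(w, batch) / accumulate_every
        if batch_nb % accumulate_every == 0:
            w -= lr * accumulated   # one optimizer-style step per group
            accumulated = 0.0       # reset, like zeroing gradients
    return w


if __name__ == '__main__':
    # data generated by y = 3 * x, split into two small batches
    data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]
    batches = [data[:2], data[2:]] * 50
    print(train_with_accumulation(batches))
```

The effect is a larger effective batch size without holding all the samples in memory at once. In this sketch, any leftover batches that do not fill a complete group are simply dropped.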
## Demo
```bash
# install lightning
pip install pytorch-lightning

# clone lightning for the demo
git clone https://github.com/williamFalcon/pytorch-lightning.git
cd pytorch-lightning/examples/new_project_templates/

# run demo (on cpu)
python trainer_cpu_template.py
```
Without changing the model AT ALL, you can run it on a single GPU, across multiple GPUs, or across multiple nodes.
```bash
# run a grid search on two gpus
python fully_featured_trainer.py --gpus "0;1"

# run single model on multiple gpus
python fully_featured_trainer.py --gpus "0;1" --interactive
```
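
The `--gpus "0;1"` value above looks like a semicolon-separated list of device ids. As a rough sketch of how a script might turn that string into a list of integers, assuming that reading is right: the helper `parse_gpu_ids` and this `argparse` setup are hypothetical and not taken from the template scripts, and `--interactive` is modelled here as a plain boolean switch.

```python
# Hypothetical parsing of a --gpus "0;1" style flag; the real template
# scripts may handle this differently.
import argparse


def parse_gpu_ids(value):
    """Turn a string such as "0;1" into [0, 1]; an empty string gives []."""
    return [int(part) for part in value.split(';') if part.strip()]


def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument('--gpus', type=parse_gpu_ids, default=[],
                        help='semicolon-separated GPU ids, e.g. "0;1"')
    parser.add_argument('--interactive', action='store_true')
    return parser


if __name__ == '__main__':
    args = build_parser().parse_args(['--gpus', '0;1', '--interactive'])
    print(args.gpus, args.interactive)
```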