###### New project Quick Start
To start a new project, define two files: a LightningModule and a Trainer file.
To illustrate Lightning's power and simplicity, here's an example of a typical research flow.
###### Case 1: BERT
Let's say you're working on something like BERT but want to try different ways of training or even different networks.
You would define a single LightningModule and use flags to switch between your different ideas.
```python
import pytorch_lightning as pl

class BERT(pl.LightningModule):
    def __init__(self, model_name, task):
        super().__init__()
        self.task = task

        # Transformer and MyCoolVersion stand for your own network classes
        if model_name == 'transformer':
            self.net = Transformer()
        elif model_name == 'my_cool_version':
            self.net = MyCoolVersion()

    def training_step(self, batch, batch_nb):
        if self.task == 'standard_bert':
            # do standard bert training with self.net...
            # return loss
            ...

        if self.task == 'my_cool_task':
            # do my own version with self.net
            # return loss
            ...
```
###### Case 2: COOLER NOT BERT
But if you wanted to try something **completely** different, you'd define a new module for that.
```python
class CoolerNotBERT(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = ...

    def training_step(self, batch, batch_nb):
        # do some other cool task
        # return loss
        ...
```
###### Rapid research flow
Then you could do rapid research by switching between these two and using the same trainer.
```python
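from pytorch_lightning import Trainer

# use_bert is a flag you define yourself (for example, parsed from the command line)
if use_bert:
    model = BERT(model_name='transformer', task='standard_bert')
else:
    model = CoolerNotBERT()

# train on 4 GPUs with 16-bit mixed precision
trainer = Trainer(gpus=[0, 1, 2, 3], use_amp=True)
trainer.fit(model)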
```
Notice a few things about this flow:
1. You're writing pure PyTorch... no unnecessary abstractions or new libraries to learn.
2. You get free GPU and 16-bit support without writing any of that code in your model.
3. You also get all of the capabilities below (without coding or testing yourself).
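
One way the `use_bert` switch above could be driven is from a command-line flag. The sketch below is only an illustration: `parse_args`, `build_model`, and the `--model`, `--model_name`, and `--task` flags are hypothetical names, and the two classes are plain-Python stand-ins for the LightningModules defined earlier so the snippet runs without Lightning installed.

```python
import argparse


class BERT:
    """Hypothetical stand-in for the BERT LightningModule shown earlier."""

    def __init__(self, model_name, task):
        self.model_name = model_name
        self.task = task


class CoolerNotBERT:
    """Hypothetical stand-in for the CoolerNotBERT LightningModule."""


def parse_args(argv=None):
    # Flag names are assumptions made for this sketch.
    parser = argparse.ArgumentParser()
    parser.add_argument('--model', choices=['bert', 'cooler_not_bert'], default='bert')
    parser.add_argument('--model_name', default='transformer')
    parser.add_argument('--task', default='standard_bert')
    return parser.parse_args(argv)


def build_model(args):
    # Pick which module to train based on the --model flag.
    if args.model == 'bert':
        return BERT(model_name=args.model_name, task=args.task)
    return CoolerNotBERT()
```

In a real project you would pass the result of `build_model(parse_args())` to `trainer.fit(...)`, so switching ideas becomes a change of flag rather than a change of code.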
---
###### Templates
1. [MNIST LightningModule ](https://williamfalcon.github.io/pytorch-lightning/LightningModule/RequiredTrainerInterface/#minimal-example )
2. [Trainer ](https://williamfalcon.github.io/pytorch-lightning/Trainer/ )
- [Basic CPU Trainer Template ](https://github.com/williamFalcon/pytorch-lightning/blob/master/examples/new_project_templates/single_cpu_template.py )
- [Multi-GPU Trainer Template ](https://github.com/williamFalcon/pytorch-lightning/blob/master/examples/new_project_templates/single_gpu_node_template.py )
- [GPU cluster Trainer Template ](https://github.com/williamFalcon/pytorch-lightning/blob/master/examples/new_project_templates/multi_node_cluster_template.py )
###### Docs shortcuts
- [LightningModule ](LightningModule/RequiredTrainerInterface/ )
- [Trainer ](Trainer/ )
###### Quick start examples
- [CPU example ](examples/Examples/#cpu-hyperparameter-search )
- [Hyperparameter search on a single GPU ](examples/Examples/#hyperparameter-search-on-a-single-or-multiple-gpus )
- [Hyperparameter search on multiple GPUs on the same node ](examples/Examples/#hyperparameter-search-on-a-single-or-multiple-gpus )
- [Hyperparameter search on a SLURM HPC cluster ](examples/Examples/#hyperparameter-search-on-a-slurm-hpc-cluster )
###### Checkpointing
- [Checkpoint callback ](https://williamfalcon.github.io/pytorch-lightning/Trainer/Checkpointing/#model-saving )
- [Model saving ](https://williamfalcon.github.io/pytorch-lightning/Trainer/Checkpointing/#model-saving )
- [Model loading ](https://williamfalcon.github.io/pytorch-lightning/LightningModule/methods/#load-from-metrics )
- [Restoring training session ](https://williamfalcon.github.io/pytorch-lightning/Trainer/Checkpointing/#restoring-training-session )
###### Computing cluster (SLURM)
- [Running grid search on a cluster ](https://williamfalcon.github.io/pytorch-lightning/Trainer/SLURM%20Managed%20Cluster#running-grid-search-on-a-cluster )
- [Walltime auto-resubmit ](https://williamfalcon.github.io/pytorch-lightning/Trainer/SLURM%20Managed%20Cluster#walltime-auto-resubmit )
###### Debugging
- [Fast dev run ](https://williamfalcon.github.io/pytorch-lightning/Trainer/debugging/#fast-dev-run )
- [Inspect gradient norms ](https://williamfalcon.github.io/pytorch-lightning/Trainer/debugging/#inspect-gradient-norms )
- [Log GPU usage ](https://williamfalcon.github.io/pytorch-lightning/Trainer/debugging/#Log-gpu-usage )
- [Make model overfit on subset of data ](https://williamfalcon.github.io/pytorch-lightning/Trainer/debugging/#make-model-overfit-on-subset-of-data )
- [Print the parameter count by layer ](https://williamfalcon.github.io/pytorch-lightning/Trainer/debugging/#print-the-parameter-count-by-layer )
- [Print which gradients are NaN ](https://williamfalcon.github.io/pytorch-lightning/Trainer/debugging/#print-which-gradients-are-nan )
- [Print input and output size of every module in system ](https://williamfalcon.github.io/pytorch-lightning/LightningModule/properties/#example_input_array )
###### Distributed training
- [16-bit mixed precision ](https://williamfalcon.github.io/pytorch-lightning/Trainer/Distributed%20training/#16-bit-mixed-precision )
- [Multi-GPU ](https://williamfalcon.github.io/pytorch-lightning/Trainer/Distributed%20training/#Multi-GPU )
- [Multi-node ](https://williamfalcon.github.io/pytorch-lightning/Trainer/Distributed%20training/#Multi-node )
- [Single GPU ](https://williamfalcon.github.io/pytorch-lightning/Trainer/Distributed%20training/#single-gpu )
- [Self-balancing architecture ](https://williamfalcon.github.io/pytorch-lightning/Trainer/Distributed%20training/#self-balancing-architecture )
###### Experiment Logging
- [Display metrics in progress bar ](https://williamfalcon.github.io/pytorch-lightning/Trainer/Logging/#display-metrics-in-progress-bar )
- [Log metric row every k batches ](https://williamfalcon.github.io/pytorch-lightning/Trainer/Logging/#log-metric-row-every-k-batches )
- [Process position ](https://williamfalcon.github.io/pytorch-lightning/Trainer/Logging/#process-position )
- [Tensorboard support ](https://williamfalcon.github.io/pytorch-lightning/Trainer/Logging/#tensorboard-support )
- [Save a snapshot of all hyperparameters ](https://williamfalcon.github.io/pytorch-lightning/Trainer/Logging/#save-a-snapshot-of-all-hyperparameters )
- [Snapshot code for a training run ](https://williamfalcon.github.io/pytorch-lightning/Trainer/Logging/#snapshot-code-for-a-training-run )
- [Write logs file to csv every k batches ](https://williamfalcon.github.io/pytorch-lightning/Trainer/Logging/#write-logs-file-to-csv-every-k-batches )
###### Training loop
- [Accumulate gradients ](https://williamfalcon.github.io/pytorch-lightning/Trainer/Training%20Loop/#accumulated-gradients )
- [Force training for min or max epochs ](https://williamfalcon.github.io/pytorch-lightning/Trainer/Training%20Loop/#force-training-for-min-or-max-epochs )
- [Early stopping callback ](https://williamfalcon.github.io/pytorch-lightning/Trainer/Training%20Loop/#early-stopping )
- [Force disable early stop ](https://williamfalcon.github.io/pytorch-lightning/Trainer/Training%20Loop/#force-disable-early-stop )
- [Gradient Clipping ](https://williamfalcon.github.io/pytorch-lightning/Trainer/Training%20Loop/#gradient-clipping )
- [Hooks ](https://williamfalcon.github.io/pytorch-lightning/Trainer/hooks/ )
- [Learning rate scheduling ](https://williamfalcon.github.io/pytorch-lightning/LightningModule/RequiredTrainerInterface/#configure_optimizers )
- [Use multiple optimizers (like GANs) ](https://williamfalcon.github.io/pytorch-lightning/LightningModule/RequiredTrainerInterface/#configure_optimizers )
- [Set how much of the training set to check (1-100%) ](https://williamfalcon.github.io/pytorch-lightning/Trainer/Training%20Loop/#set-how-much-of-the-training-set-to-check )
- [Step optimizers at arbitrary intervals ](https://williamfalcon.github.io/pytorch-lightning/Trainer/hooks/#optimizer_step )
###### Validation loop
- [Check validation every n epochs ](https://williamfalcon.github.io/pytorch-lightning/Trainer/Validation%20loop/#check-validation-every-n-epochs )
- [Hooks ](https://williamfalcon.github.io/pytorch-lightning/Trainer/hooks/ )
- [Set how much of the validation set to check ](https://williamfalcon.github.io/pytorch-lightning/Trainer/Validation%20loop/#set-how-much-of-the-validation-set-to-check )
- [Set how much of the test set to check ](https://williamfalcon.github.io/pytorch-lightning/Trainer/Validation%20loop/#set-how-much-of-the-test-set-to-check )
- [Set validation check frequency within 1 training epoch ](https://williamfalcon.github.io/pytorch-lightning/Trainer/Validation%20loop/#set-validation-check-frequency-within-1-training-epoch )
- [Set the number of validation sanity steps ](https://williamfalcon.github.io/pytorch-lightning/Trainer/Validation%20loop/#set-the-number-of-validation-sanity-steps )
###### Testing loop
- [Run test set ](https://williamfalcon.github.io/pytorch-lightning/Trainer/Testing%20loop/ )