Lightning can automate saving and loading checkpoints.

---
### Model saving
To enable checkpointing, define the checkpoint callback and give it to the trainer.
``` {.python}
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# save a checkpoint only when the monitored metric improves
checkpoint_callback = ModelCheckpoint(
    filepath='/path/to/store/weights.ckpt',
    save_best_only=True,
    verbose=True,
    monitor='val_loss',
    mode='min'
)

trainer = Trainer(checkpoint_callback=checkpoint_callback)
```
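
As a rough illustration of what `save_best_only=True` combined with `monitor='val_loss'` and `mode='min'` implies, here is a minimal pure-Python sketch. It is not Lightning's actual implementation; `is_improvement` and `epochs_to_save` are hypothetical helpers introduced only for this example, and the sketch assumes that a tie does not count as an improvement.

```python
def is_improvement(current, best, mode='min'):
    """Return True if `current` beats `best` under the given mode."""
    if best is None:
        # nothing saved yet, so the first value always counts
        return True
    if mode == 'min':
        return current < best
    if mode == 'max':
        return current > best
    raise ValueError(f"unknown mode: {mode!r}")


def epochs_to_save(metric_per_epoch, mode='min'):
    """Return the epoch indices at which a best-only checkpoint would be written."""
    best = None
    saved = []
    for epoch, value in enumerate(metric_per_epoch):
        if is_improvement(value, best, mode):
            best = value
            saved.append(epoch)
    return saved


val_losses = [0.9, 0.7, 0.8, 0.6, 0.6]
print(epochs_to_save(val_losses, mode='min'))  # prints [0, 1, 3]
```

With `mode='max'` the comparison flips, which is the setting you would want for a metric such as accuracy.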
---
### Restoring training session
You might want to not only load a model but also continue training it. In that case, restore
the trainer state as well, so that training continues from the epoch and global step where you left off.
However, the dataloaders will start from the first batch again (if you shuffle the data, this shouldn't matter).
Lightning restores the session if you pass an experiment with the same version and a saved checkpoint exists for it.
``` {.python}
from pytorch_lightning import Trainer
from test_tube import Experiment

exp = Experiment(version=a_previous_version_with_a_saved_checkpoint)
trainer = Trainer(experiment=exp)

# this fit call loads the model weights and the trainer state;
# the trainer continues seamlessly from where you left off
# without you having to do anything else.
trainer.fit(model)
```
The trainer restores:
- global_step
- current_epoch
- All optimizers
- All lr_schedulers
- Model weights
You can even change the logic of your model, as long as the weights and "architecture" of
the system stay the same. If you add a layer, for instance, restoring might not work.
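
To make the architecture caveat concrete: a PyTorch state dict maps parameter names to tensors, and by default `load_state_dict` is strict and raises an error when the saved names and the model's names differ. The sketch below only mimics that name comparison with plain dictionaries. The values are placeholder strings rather than tensors, and `compare_keys` is a hypothetical helper, not part of Lightning or PyTorch.

```python
def compare_keys(saved_state, model_state):
    """Report names the model expects but the checkpoint lacks, and vice versa."""
    saved, expected = set(saved_state), set(model_state)
    return {
        'missing': sorted(expected - saved),
        'unexpected': sorted(saved - expected),
    }


# checkpoint saved from a model with a single layer named `l1`
saved_state = {'l1.weight': '...', 'l1.bias': '...'}

# same layers, different forward logic: the parameter names still match
same_arch = {'l1.weight': '...', 'l1.bias': '...'}

# a layer `l2` was added after the checkpoint was written
added_layer = {
    'l1.weight': '...', 'l1.bias': '...',
    'l2.weight': '...', 'l2.bias': '...',
}

print(compare_keys(saved_state, same_arch))
print(compare_keys(saved_state, added_layer))
```

The first comparison reports nothing missing, while the second lists the `l2` parameters as missing from the checkpoint, which is the situation in which a strict load fails.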
At a rough level, here's [what happens inside Trainer](https://github.com/williamFalcon/pytorch-lightning/blob/master/pytorch_lightning/root_module/model_saving.py#L63):
```python
self.global_step = checkpoint['global_step']
self.current_epoch = checkpoint['epoch']

# restore the optimizers
optimizer_states = checkpoint['optimizer_states']
for optimizer, opt_state in zip(self.optimizers, optimizer_states):
    optimizer.load_state_dict(opt_state)

# restore the lr schedulers
lr_schedulers = checkpoint['lr_schedulers']
for scheduler, lrs_state in zip(self.lr_schedulers, lr_schedulers):
    scheduler.load_state_dict(lrs_state)

# uses the model you passed into trainer
model.load_state_dict(checkpoint['state_dict'])
```
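
The snippet above shows only the restore side. Assuming the checkpoint is a plain dictionary with the keys read there (`global_step`, `epoch`, `optimizer_states`, `lr_schedulers`, `state_dict`), a full save and restore round trip might look roughly like the sketch below. `Stateful`, `dump_checkpoint`, and `restore_checkpoint` are hypothetical stand-ins written for this example, not Lightning's API. They rely only on the `state_dict()` / `load_state_dict()` pair that PyTorch models, optimizers, and LR schedulers expose.

```python
import copy


class Stateful:
    """Stand-in for any object exposing state_dict() / load_state_dict()."""

    def __init__(self, **state):
        self.state = dict(state)

    def state_dict(self):
        return copy.deepcopy(self.state)

    def load_state_dict(self, state):
        self.state = copy.deepcopy(state)


def dump_checkpoint(global_step, epoch, model, optimizers, lr_schedulers):
    """Collect everything needed to resume training into one dictionary."""
    return {
        'global_step': global_step,
        'epoch': epoch,
        'optimizer_states': [opt.state_dict() for opt in optimizers],
        'lr_schedulers': [sch.state_dict() for sch in lr_schedulers],
        'state_dict': model.state_dict(),
    }


def restore_checkpoint(checkpoint, model, optimizers, lr_schedulers):
    """Push the saved state back into freshly created objects."""
    for optimizer, opt_state in zip(optimizers, checkpoint['optimizer_states']):
        optimizer.load_state_dict(opt_state)
    for scheduler, lrs_state in zip(lr_schedulers, checkpoint['lr_schedulers']):
        scheduler.load_state_dict(lrs_state)
    model.load_state_dict(checkpoint['state_dict'])
    return checkpoint['global_step'], checkpoint['epoch']


# state at the end of a previous run
ckpt = dump_checkpoint(
    global_step=1200,
    epoch=3,
    model=Stateful(w=[0.1, 0.2]),
    optimizers=[Stateful(lr=0.01, momentum=0.9)],
    lr_schedulers=[Stateful(last_epoch=3)],
)

# fresh objects in a new process, then restore from the checkpoint
model = Stateful(w=[0.0, 0.0])
optimizers = [Stateful(lr=0.01, momentum=0.0)]
lr_schedulers = [Stateful(last_epoch=-1)]
global_step, current_epoch = restore_checkpoint(ckpt, model, optimizers, lr_schedulers)
```

Because `zip` pairs saved states with live objects by position, the optimizers and schedulers must be created in the same order as in the run that wrote the checkpoint.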