219 lines
7.4 KiB
ReStructuredText
219 lines
7.4 KiB
ReStructuredText
.. testsetup:: *
|
|
|
|
import os
|
|
from pytorch_lightning.trainer.trainer import Trainer
|
|
from pytorch_lightning.core.lightning import LightningModule
|
|
|
|
.. _weights_loading:
|
|
|
|
##########################
|
|
Saving and loading weights
|
|
##########################
|
|
|
|
Lightning automates saving and loading checkpoints. Checkpoints capture the exact value of all parameters used by a model.
|
|
|
|
Checkpointing your training allows you to resume a training process in case it was interrupted, fine-tune a model or use a pre-trained model for inference without having to retrain the model.
|
|
|
|
|
|
*****************
|
|
Checkpoint saving
|
|
*****************
|
|
A Lightning checkpoint has everything needed to restore a training session including:
|
|
|
|
- 16-bit scaling factor (apex)
|
|
- Current epoch
|
|
- Global step
|
|
- Model state_dict
|
|
- State of all optimizers
|
|
- State of all learningRate schedulers
|
|
- State of all callbacks
|
|
- The hyperparameters used for that model if passed in as hparams (Argparse.Namespace)
|
|
|
|
Automatic saving
|
|
================
|
|
|
|
Lightning automatically saves a checkpoint for you in your current working directory, with the state of your last training epoch. This makes sure you can resume training in case it was interrupted.
|
|
|
|
To change the checkpoint path pass in:
|
|
|
|
.. code-block:: python
|
|
|
|
# saves checkpoints to '/your/path/to/save/checkpoints' at every epoch end
|
|
trainer = Trainer(default_root_dir="/your/path/to/save/checkpoints")
|
|
|
|
You can customize the checkpointing behavior to monitor any quantity of your training or validation steps. For example, if you want to update your checkpoints based on your validation loss:
|
|
|
|
1. Calculate any metric or other quantity you wish to monitor, such as validation loss.
|
|
2. Log the quantity using :func:`~~pytorch_lightning.core.lightning.LightningModule.log` method, with a key such as `val_loss`.
|
|
3. Initializing the :class:`~pytorch_lightning.callbacks.ModelCheckpoint` callback, and set `monitor` to be the key of your quantity.
|
|
4. Pass the callback to the `callbacks` :class:`~pytorch_lightning.trainer.Trainer` flag.
|
|
|
|
.. testcode::
|
|
|
|
from pytorch_lightning.callbacks import ModelCheckpoint
|
|
|
|
|
|
class LitAutoEncoder(LightningModule):
|
|
def validation_step(self, batch, batch_idx):
|
|
x, y = batch
|
|
y_hat = self.backbone(x)
|
|
|
|
# 1. calculate loss
|
|
loss = F.cross_entropy(y_hat, y)
|
|
|
|
# 2. log `val_loss`
|
|
self.log("val_loss", loss)
|
|
|
|
|
|
# 3. Init ModelCheckpoint callback, monitoring 'val_loss'
|
|
checkpoint_callback = ModelCheckpoint(monitor="val_loss")
|
|
|
|
# 4. Add your callback to the callbacks list
|
|
trainer = Trainer(callbacks=[checkpoint_callback])
|
|
|
|
You can also control more advanced options, like `save_top_k`, to save the best k models and the `mode` of the monitored quantity (min/max), `save_weights_only` or `period` to set the interval of epochs between checkpoints, to avoid slowdowns.
|
|
|
|
.. testcode::
|
|
|
|
from pytorch_lightning.callbacks import ModelCheckpoint
|
|
|
|
|
|
class LitAutoEncoder(LightningModule):
|
|
def validation_step(self, batch, batch_idx):
|
|
x, y = batch
|
|
y_hat = self.backbone(x)
|
|
loss = F.cross_entropy(y_hat, y)
|
|
self.log("val_loss", loss)
|
|
|
|
|
|
# saves a file like: my/path/sample-mnist-epoch=02-val_loss=0.32.ckpt
|
|
checkpoint_callback = ModelCheckpoint(
|
|
monitor="val_loss",
|
|
dirpath="my/path/",
|
|
filename="sample-mnist-{epoch:02d}-{val_loss:.2f}",
|
|
save_top_k=3,
|
|
mode="min",
|
|
)
|
|
|
|
trainer = Trainer(callbacks=[checkpoint_callback])
|
|
|
|
You can retrieve the checkpoint after training by calling
|
|
|
|
.. code-block:: python
|
|
|
|
checkpoint_callback = ModelCheckpoint(dirpath="my/path/")
|
|
trainer = Trainer(callbacks=[checkpoint_callback])
|
|
trainer.fit(model)
|
|
checkpoint_callback.best_model_path
|
|
|
|
Disabling checkpoints
|
|
---------------------
|
|
|
|
You can disable checkpointing by passing
|
|
|
|
.. testcode::
|
|
|
|
trainer = Trainer(checkpoint_callback=False)
|
|
|
|
|
|
The Lightning checkpoint also saves the arguments passed into the LightningModule init
|
|
under the `hyper_parameters` key in the checkpoint.
|
|
|
|
.. code-block:: python
|
|
|
|
class MyLightningModule(LightningModule):
|
|
def __init__(self, learning_rate, *args, **kwargs):
|
|
super().__init__()
|
|
self.save_hyperparameters()
|
|
|
|
|
|
# all init args were saved to the checkpoint
|
|
checkpoint = torch.load(CKPT_PATH)
|
|
print(checkpoint["hyper_parameters"])
|
|
# {'learning_rate': the_value}
|
|
|
|
Manual saving
|
|
=============
|
|
You can manually save checkpoints and restore your model from the checkpointed state.
|
|
|
|
.. code-block:: python
|
|
|
|
model = MyLightningModule(hparams)
|
|
trainer.fit(model)
|
|
trainer.save_checkpoint("example.ckpt")
|
|
new_model = MyModel.load_from_checkpoint(checkpoint_path="example.ckpt")
|
|
|
|
Manual saving with accelerators
|
|
===============================
|
|
|
|
Lightning also handles accelerators where multiple processes are running, such as DDP. For example, when using the DDP accelerator our training script is running across multiple devices at the same time.
|
|
Lightning automatically ensures that the model is saved only on the main process, whilst other processes do not interfere with saving checkpoints. This requires no code changes as seen below.
|
|
|
|
.. code-block:: python
|
|
|
|
trainer = Trainer(accelerator="ddp")
|
|
model = MyLightningModule(hparams)
|
|
trainer.fit(model)
|
|
# Saves only on the main process
|
|
trainer.save_checkpoint("example.ckpt")
|
|
|
|
Not using `trainer.save_checkpoint` can lead to unexpected behaviour and potential deadlock. Using other saving functions will result in all devices attempting to save the checkpoint. As a result, we highly recommend using the trainer's save functionality.
|
|
If using custom saving functions cannot be avoided, we recommend using :func:`~pytorch_lightning.loggers.base.rank_zero_only` to ensure saving occurs only on the main process.
|
|
|
|
******************
|
|
Checkpoint loading
|
|
******************
|
|
|
|
To load a model along with its weights, biases and hyperparameters use the following method:
|
|
|
|
.. code-block:: python
|
|
|
|
model = MyLightingModule.load_from_checkpoint(PATH)
|
|
|
|
print(model.learning_rate)
|
|
# prints the learning_rate you used in this checkpoint
|
|
|
|
model.eval()
|
|
y_hat = model(x)
|
|
|
|
But if you don't want to use the values saved in the checkpoint, pass in your own here
|
|
|
|
.. testcode::
|
|
|
|
class LitModel(LightningModule):
|
|
def __init__(self, in_dim, out_dim):
|
|
super().__init__()
|
|
self.save_hyperparameters()
|
|
self.l1 = nn.Linear(self.hparams.in_dim, self.hparams.out_dim)
|
|
|
|
you can restore the model like this
|
|
|
|
.. code-block:: python
|
|
|
|
# if you train and save the model like this it will use these values when loading
|
|
# the weights. But you can overwrite this
|
|
LitModel(in_dim=32, out_dim=10)
|
|
|
|
# uses in_dim=32, out_dim=10
|
|
model = LitModel.load_from_checkpoint(PATH)
|
|
|
|
# uses in_dim=128, out_dim=10
|
|
model = LitModel.load_from_checkpoint(PATH, in_dim=128, out_dim=10)
|
|
|
|
.. automethod:: pytorch_lightning.core.lightning.LightningModule.load_from_checkpoint
|
|
:noindex:
|
|
|
|
Restoring Training State
|
|
========================
|
|
|
|
If you don't just want to load weights, but instead restore the full training,
|
|
do the following:
|
|
|
|
.. code-block:: python
|
|
|
|
model = LitModel()
|
|
trainer = Trainer(resume_from_checkpoint="some/path/to/my_checkpoint.ckpt")
|
|
|
|
# automatically restores model, epoch, step, LR schedulers, apex, etc...
|
|
trainer.fit(model)
|