Debugging
=========
The following are flags that make debugging much easier.

Fast dev run
------------
This flag runs a "unit test" by running 1 training batch and 1 validation batch.
The point is to detect any bugs in the training/validation loop without having to wait for
a full epoch to crash.

(See: :paramref:`~pytorch_lightning.trainer.trainer.Trainer.fast_dev_run`
argument of :class:`~pytorch_lightning.trainer.trainer.Trainer`)

.. code-block:: python

    trainer = pl.Trainer(fast_dev_run=True)
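
For context, a minimal sketch of how this flag is typically used during development
(``MyLightningModule`` is a hypothetical stand-in for your own LightningModule):

.. code-block:: python

    import pytorch_lightning as pl

    model = MyLightningModule()  # hypothetical model

    # runs 1 training batch and 1 validation batch, then exits
    trainer = pl.Trainer(fast_dev_run=True)
    trainer.fit(model)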

Inspect gradient norms
----------------------
Logs (to a logger) the gradient norm of each weight matrix.

(See: :paramref:`~pytorch_lightning.trainer.trainer.Trainer.track_grad_norm`
argument of :class:`~pytorch_lightning.trainer.trainer.Trainer`)

.. code-block:: python

    # track the 2-norm
    trainer = pl.Trainer(track_grad_norm=2)
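
For intuition, the tracked quantity corresponds to taking the p-norm of each parameter's
gradient after a backward pass. A hand-rolled sketch (not Lightning's internal implementation):

.. code-block:: python

    import torch

    def grad_norms(model: torch.nn.Module, p: float = 2.0) -> dict:
        """Return the p-norm of each parameter's gradient (call after loss.backward())."""
        return {
            name: param.grad.norm(p).item()
            for name, param in model.named_parameters()
            if param.grad is not None
        }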

Log GPU usage
-------------
Logs (to a logger) the GPU memory usage for each GPU on the master machine.

(See: :paramref:`~pytorch_lightning.trainer.trainer.Trainer.log_gpu_memory`
argument of :class:`~pytorch_lightning.trainer.trainer.Trainer`)

.. code-block:: python

    trainer = pl.Trainer(log_gpu_memory=True)
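
For a quick manual check outside the logger, PyTorch exposes per-device memory counters;
a small sketch:

.. code-block:: python

    import torch

    if torch.cuda.is_available():
        for i in range(torch.cuda.device_count()):
            used = torch.cuda.memory_allocated(i) / 1024 ** 2
            peak = torch.cuda.max_memory_allocated(i) / 1024 ** 2
            print(f"GPU {i}: {used:.1f} MiB currently allocated, {peak:.1f} MiB peak")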

Make model overfit on subset of data
------------------------------------
A good debugging technique is to take a tiny portion of your data (say 2 samples per class),
and try to get your model to overfit. If it can't, it's a sign it won't work with large datasets.

(See: :paramref:`~pytorch_lightning.trainer.trainer.Trainer.overfit_pct`
argument of :class:`~pytorch_lightning.trainer.trainer.Trainer`)

.. code-block:: python

    trainer = pl.Trainer(overfit_pct=0.01)
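
The same idea can be applied by hand, independent of the flag, by training on a tiny
``Subset`` of your dataset until the loss approaches zero (a sketch; ``full_dataset``
is a hypothetical dataset object):

.. code-block:: python

    from torch.utils.data import DataLoader, Subset

    # keep only a handful of samples and check the model can drive the loss towards 0
    tiny_dataset = Subset(full_dataset, indices=list(range(16)))  # full_dataset is hypothetical
    tiny_loader = DataLoader(tiny_dataset, batch_size=4, shuffle=True)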

Print the parameter count by layer
----------------------------------
Whenever the .fit() function gets called, the Trainer will print the weights summary for the LightningModule.
To disable this behavior, set this argument to None:

(See: :paramref:`~pytorch_lightning.trainer.trainer.Trainer.weights_summary`
argument of :class:`~pytorch_lightning.trainer.trainer.Trainer`)

.. code-block:: python

    trainer = pl.Trainer(weights_summary=None)
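
Conversely, a more detailed summary can be requested. Depending on your version,
``weights_summary`` may also accept ``'full'`` (every submodule) or ``'top'`` (top-level
modules only); treat the exact accepted values as an assumption and check your Trainer docs:

.. code-block:: python

    # list every submodule instead of only the top-level ones
    # (accepted values may vary by version)
    trainer = pl.Trainer(weights_summary='full')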

Set the number of validation sanity steps
-----------------------------------------
Lightning runs a few steps of validation at the beginning of training.
This avoids crashing in the validation loop somewhere deep into a lengthy training run.

(See: :paramref:`~pytorch_lightning.trainer.trainer.Trainer.num_sanity_val_steps`
argument of :class:`~pytorch_lightning.trainer.trainer.Trainer`)

.. code-block:: python

    # DEFAULT
    trainer = pl.Trainer(num_sanity_val_steps=5)
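
Setting the argument to ``0`` skips the sanity check entirely:

.. code-block:: python

    # turn the sanity check off
    trainer = pl.Trainer(num_sanity_val_steps=0)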