Add video tutorials to docs (#3977)

* videos in trainer api

* videos in docs

* videos in docs

* videos in trainer api

* videos in docs

* videos in docs

* videos in docs

* videos in docs

* Update new-project.rst

* docs

* Update new-project.rst
edenlightning 2020-10-08 05:49:56 -04:00 committed by GitHub
parent 1d3c7dc8d6
commit 6dfa748ce3
9 changed files with 471 additions and 2 deletions

@ -8,6 +8,15 @@
=================
Lightning offers 16-bit training for CPUs, GPUs and TPUs.
.. raw:: html
<video width="100%" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/yt_thumbs/thumb_precision.png"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/yt/Trainer+flags+9+-+precision_1.mp4"></video>
|
----------
GPU 16-bit

@ -10,6 +10,15 @@
Callback
========
.. raw:: html
<video width="100%" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/callbacks.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/callbacks.mp4"></video>
|
A callback is a self-contained program that can be reused across projects.
Lightning has a callback system to execute callbacks when needed. Callbacks should capture NON-ESSENTIAL

@ -6,6 +6,15 @@
Debugging
=========
.. raw:: html
<video width="100%" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/yt_thumbs/thumb_debugging.png"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/yt/Trainer+flags+7-+debugging_1.mp4"></video>
|
The following are flags that make debugging much easier.
----------------

@ -9,6 +9,14 @@
Early stopping
**************
.. raw:: html
<video width="100%" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/yt_thumbs/thumb_earlystop.png"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/yt/Trainer+flags+19-+early+stopping_1.mp4"></video>
|
Stopping an epoch early
=======================
You can stop an epoch early by overriding :meth:`~pytorch_lightning.core.hooks.ModelHooks.on_train_batch_start` to return ``-1`` when some condition is met.
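A minimal sketch of that hook (the metric and threshold attributes here are hypothetical placeholders for whatever condition you track):

.. code-block:: python

    from pytorch_lightning import LightningModule

    class LitModel(LightningModule):

        def on_train_batch_start(self, batch, batch_idx, dataloader_idx):
            # hypothetical criterion: end the epoch once a tracked metric diverges
            if self.some_metric > self.divergence_threshold:
                return -1  # returning -1 stops the current epoch early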

@ -8,6 +8,14 @@
Learning Rate Finder
--------------------
.. raw:: html
<video width="100%" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/auto_lr_find.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/auto_lr_find.mp4"></video>
|
For training deep neural networks, selecting a good learning rate is essential
for both better performance and faster convergence. Even optimizers such as
`Adam` that self-adjust the learning rate can benefit from more optimal

@ -10,6 +10,14 @@ Multi-GPU training
==================
Lightning supports multiple ways of doing distributed training.
.. raw:: html
<video width="100%" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/yt_thumbs/thumb_multi_gpus.png"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/yt/Trainer+flags+4-+multi+node+training_3.mp4"></video>
|
----------
Preparing your code

@ -29,11 +29,12 @@ Organizing your code with PyTorch Lightning makes your code:
----------
Here's a 2 minute conversion guide for PyTorch projects:
Here's a 3 minute conversion guide for PyTorch projects:
.. raw:: html
<video width="100%" controls autoplay muted playsinline src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/pl_quick_start_full.m4v"></video>
<video width="100%" controls autoplay muted playsinline
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/pl_docs_animation_final.m4v"></video>
----------

@ -3,6 +3,14 @@
TPU support
===========
.. raw:: html
<video width="100%" controls autoplay
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/yt_thumbs/thumb_tpus.png"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/tpu_cores.mp4"></video>
|
Lightning supports running on TPUs. At this moment, TPUs are available
on Google Cloud (GCP), Google Colab and Kaggle Environments. For more information on TPUs
`watch this video <https://www.youtube.com/watch?v=kPMpmcl_Pyw>`_.

@ -167,6 +167,15 @@ Trainer flags
accumulate_grad_batches
^^^^^^^^^^^^^^^^^^^^^^^
.. raw:: html
<video width="100%" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/accumulate_grad_batches.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/accumulate_grad_batches.mp4"></video>
|
Accumulates gradients every k batches, or per the schedule set up in the dict.
The Trainer also calls ``optimizer.step()`` for the last indivisible step number.
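A short sketch of both forms (the values are illustrative):

.. code-block:: python

    from pytorch_lightning import Trainer

    # accumulate gradients over 4 batches for the whole run
    trainer = Trainer(accumulate_grad_batches=4)

    # or schedule it with a dict: from epoch 5 accumulate 3 batches,
    # and from epoch 10 accumulate 20 batches
    trainer = Trainer(accumulate_grad_batches={5: 3, 10: 20})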
@ -186,6 +195,14 @@ Example::
amp_backend
^^^^^^^^^^^
.. raw:: html
<video width="100%" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/amp_backend.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/amp_backend.mp4"></video>
|
Use PyTorch AMP ('native'), available in PyTorch 1.6+, or NVIDIA apex ('apex').
.. testcode::
@ -198,6 +215,15 @@ Use PyTorch AMP ('native') (available PyTorch 1.6+), or NVIDIA apex ('apex').
amp_level
^^^^^^^^^
.. raw:: html
<video width="100%" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/amp_level.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/amp_level.mp4"></video>
|
The optimization level to use (O1, O2, etc.)
for 16-bit GPU precision (using NVIDIA apex under the hood).
@ -210,6 +236,15 @@ Example::
auto_scale_batch_size
^^^^^^^^^^^^^^^^^^^^^
.. raw:: html
<video width="100%" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/auto_scale%E2%80%A8_batch_size.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/auto_scale_batch_size.mp4"></video>
|
Automatically tries to find the largest batch size that fits into memory,
before any training.
@ -227,6 +262,14 @@ before any training.
auto_select_gpus
^^^^^^^^^^^^^^^^
.. raw:: html
<video width="100%" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/auto_select+_gpus.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/auto_select_gpus.mp4"></video>
|
If enabled and `gpus` is an integer, pick available GPUs automatically.
This is especially useful when GPUs are configured to be in "exclusive mode",
such that only one process at a time can access them.
@ -241,6 +284,15 @@ Example::
auto_lr_find
^^^^^^^^^^^^
.. raw:: html
<video width="100%" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/auto_lr_find.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/auto_lr_find.mp4"></video>
|
Runs a learning rate finder algorithm (see this `paper <https://arxiv.org/abs/1506.01186>`_)
when calling trainer.tune(), to find an optimal initial learning rate.
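A minimal sketch of the flow (``MyModel`` is a hypothetical module assumed to expose a ``self.lr`` attribute):

.. code-block:: python

    from pytorch_lightning import Trainer

    model = MyModel()  # hypothetical LightningModule exposing self.lr
    trainer = Trainer(auto_lr_find=True)

    # tune() runs the LR finder and stores the suggested rate on the model
    trainer.tune(model)
    trainer.fit(model)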
@ -271,6 +323,14 @@ Example::
benchmark
^^^^^^^^^
.. raw:: html
<video width="100%" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/benchmark.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/benchmark.mp4"></video>
|
If true, enables cudnn.benchmark.
This flag is likely to increase the speed of your system if your
input sizes don't change. However, if they do, then it will likely
@ -288,6 +348,14 @@ Example::
deterministic
^^^^^^^^^^^^^
.. raw:: html
<video width="100%" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/deterministic.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/deterministic.mp4"></video>
|
If true, enables cudnn.deterministic.
Might make your system slower, but ensures reproducibility.
Also sets ``$HOROVOD_FUSION_THRESHOLD=0``.
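For full reproducibility you typically also seed the RNGs; a minimal sketch:

.. code-block:: python

    from pytorch_lightning import Trainer, seed_everything

    seed_everything(42)  # seed the python, numpy and torch RNGs
    trainer = Trainer(deterministic=True)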
@ -303,6 +371,14 @@ Example::
callbacks
^^^^^^^^^
.. raw:: html
<video width="100%" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/callbacks.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/callbacks.mp4"></video>
|
Add a list of user defined callbacks. These callbacks DO NOT replace the explicit callbacks
(loggers or ModelCheckpoint).
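For instance, a minimal sketch with a hypothetical callback:

.. code-block:: python

    from pytorch_lightning import Trainer
    from pytorch_lightning.callbacks import Callback

    class PrintingCallback(Callback):
        # hypothetical example callback
        def on_train_start(self, trainer, pl_module):
            print("Training is starting")

    trainer = Trainer(callbacks=[PrintingCallback()])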
@ -327,6 +403,14 @@ Example::
check_val_every_n_epoch
^^^^^^^^^^^^^^^^^^^^^^^
.. raw:: html
<video width="100%" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/check_val_every_n_epoch.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/check_val_every_n_epoch.mp4"></video>
|
Check validation every n training epochs.
Example::
@ -339,6 +423,15 @@ Example::
checkpoint_callback
^^^^^^^^^^^^^^^^^^^
.. raw:: html
<video width="100%" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/checkpoint_callback.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/checkpoint_callback.mp4"></video>
|
Callback for checkpointing.
.. code-block:: python
@ -362,6 +455,15 @@ Example::
cluster_environment
^^^^^^^^^^^^^^^^^^^
.. raw:: html
<video width="100%" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/cluster_environment.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/cluster_environment.mp4"></video>
|
Environment to connect arbitrary cluster backends. Lightning automatically handles:
- SLURM
@ -389,6 +491,14 @@ For any other non-supported cluster environment, define your own class and pass
default_root_dir
^^^^^^^^^^^^^^^^
.. raw:: html
<video width="100%" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/default%E2%80%A8_root_dir.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/default_root_dir.mp4"></video>
|
Default path for logs and weights when no logger or
:class:`pytorch_lightning.callbacks.ModelCheckpoint` callback passed. On
certain clusters you might want to separate where logs and checkpoints are
@ -403,6 +513,15 @@ Example::
distributed_backend
^^^^^^^^^^^^^^^^^^^
.. raw:: html
<video width="100%" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/distributed_backend.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/distributed_backend.mp4"></video>
|
The distributed backend to use.
- (``dp``) is DataParallel (split batch among GPUs of same machine)
@ -447,6 +566,22 @@ early_stop_callback
fast_dev_run
^^^^^^^^^^^^
.. raw:: html
<video width="100%" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/fast_dev_run.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/fast_dev_run.mp4"></video>
|
Runs 1 batch of train, test and val to find any bugs (i.e. a sort of unit test).
Under the hood the pseudocode looks like this:
@ -477,6 +612,14 @@ Under the hood the pseudocode looks like this:
gpus
^^^^
.. raw:: html
<video width="100%" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/gpus.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/gpus.mp4"></video>
|
- Number of GPUs to train on (int)
- or which GPUs to train on (list)
- can handle strings
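A sketch of the accepted forms:

.. code-block:: python

    from pytorch_lightning import Trainer

    Trainer(gpus=2)        # int: train on 2 GPUs
    Trainer(gpus=[0, 2])   # list: train on GPUs 0 and 2
    Trainer(gpus="0, 2")   # str: same selection as the list above
    Trainer(gpus=-1)       # train on all available GPUs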
@ -514,6 +657,15 @@ See Also:
gradient_clip_val
^^^^^^^^^^^^^^^^^
.. raw:: html
<video width="100%" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/gradient+_clip_val.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/gradient_clip_val.mp4"></video>
|
Gradient clipping value
- 0 means don't clip.
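For example, a short sketch with an illustrative clipping value:

.. code-block:: python

    from pytorch_lightning import Trainer

    # default: no clipping
    trainer = Trainer(gradient_clip_val=0)

    # clip gradients whose norm exceeds 0.5
    trainer = Trainer(gradient_clip_val=0.5)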
@ -527,6 +679,14 @@ Gradient clipping value
limit_test_batches
^^^^^^^^^^^^^^^^^^
.. raw:: html
<video width="100%" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/limit_test_batches.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/limit_batches.mp4"></video>
|
How much of the test dataset to check.
.. testcode::
@ -545,6 +705,14 @@ In the case of multiple test dataloaders, the limit applies to each dataloader i
limit_val_batches
^^^^^^^^^^^^^^^^^
.. raw:: html
<video width="100%" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/limit_val_batches.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/limit_batches.mp4"></video>
|
How much of the validation dataset to check.
Useful when debugging or testing something that happens at the end of an epoch.
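The flag accepts a fraction or an absolute number of batches; a quick sketch:

.. code-block:: python

    from pytorch_lightning import Trainer

    # run through only 25% of the validation set each epoch
    trainer = Trainer(limit_val_batches=0.25)

    # run through only 100 validation batches
    trainer = Trainer(limit_val_batches=100)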
@ -563,6 +731,15 @@ In the case of multiple validation dataloaders, the limit applies to each datalo
log_gpu_memory
^^^^^^^^^^^^^^
.. raw:: html
<video width="100%" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/log_gpu_memory.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/log_gpu_memory.mp4"></video>
|
Options:
- None
@ -585,6 +762,14 @@ Options:
flush_logs_every_n_steps
^^^^^^^^^^^^^^^^^^^^^^^^
.. raw:: html
<video width="100%" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/flush_logs%E2%80%A8_every_n_steps.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/flush_logs_every_n_steps.mp4"></video>
|
Writes logs to disk this often.
.. testcode::
@ -598,6 +783,14 @@ See Also:
logger
^^^^^^
.. raw:: html
<video width="100%" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/logger.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/logger.mp4"></video>
|
:ref:`Logger <loggers>` (or iterable collection of loggers) for experiment tracking.
.. testcode::
@ -614,6 +807,15 @@ logger
max_epochs
^^^^^^^^^^
.. raw:: html
<video width="100%" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/max_epochs.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/min_max_epochs.mp4"></video>
|
Stop training once this number of epochs is reached.
.. testcode::
@ -623,6 +825,15 @@ Stop training once this number of epochs is reached
min_epochs
^^^^^^^^^^
.. raw:: html
<video width="100%" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/min_epochs.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/min_max_epochs.mp4"></video>
|
Force training for at least this many epochs.
.. testcode::
@ -632,6 +843,15 @@ Force training for at least these many epochs
max_steps
^^^^^^^^^
.. raw:: html
<video width="100%" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/max_steps.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/min_max_steps.mp4"></video>
|
Stop training after this number of steps.
Training will stop when either max_steps or max_epochs is reached, whichever comes first.
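A sketch combining both limits (values are illustrative):

.. code-block:: python

    from pytorch_lightning import Trainer

    # stop after 100 optimizer steps or 5 epochs, whichever comes first
    trainer = Trainer(max_steps=100, max_epochs=5)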
@ -646,6 +866,14 @@ Training will stop if max_steps or max_epochs have reached (earliest).
min_steps
^^^^^^^^^
.. raw:: html
<video width="100%" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/min_steps.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/min_max_steps.mp4"></video>
|
Force training for at least this number of steps.
The Trainer will train the model until both min_steps and min_epochs are satisfied.
@ -660,6 +888,14 @@ Trainer will train model for at least min_steps or min_epochs (latest).
num_nodes
^^^^^^^^^
.. raw:: html
<video width="100%" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/num_nodes.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/num_nodes.mp4"></video>
|
Number of GPU nodes for distributed training.
.. testcode::
@ -673,6 +909,14 @@ Number of GPU nodes for distributed training.
num_processes
^^^^^^^^^^^^^
.. raw:: html
<video width="100%" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/num_processes.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/num_processes.mp4"></video>
|
Number of processes to train with. Automatically set to the number of GPUs
when using ``distributed_backend="ddp"``. Set to a number greater than 1 when
using ``distributed_backend="ddp_cpu"`` to mimic distributed training on a
@ -688,6 +932,14 @@ CPUs.
num_sanity_val_steps
^^^^^^^^^^^^^^^^^^^^
.. raw:: html
<video width="100%" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/num_sanity%E2%80%A8_val_steps.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/num_sanity_val_steps.mp4"></video>
|
The sanity check runs n validation batches before starting the training routine.
This catches any bugs in your validation without having to wait for the first validation check.
The Trainer uses 2 steps by default. Turn it off or modify it here.
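A sketch of typical values (the ``-1`` form is assumed here to run every validation batch):

.. code-block:: python

    from pytorch_lightning import Trainer

    trainer = Trainer(num_sanity_val_steps=2)   # default
    trainer = Trainer(num_sanity_val_steps=0)   # turn the check off
    trainer = Trainer(num_sanity_val_steps=-1)  # check all validation batches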
@ -713,6 +965,15 @@ Example::
prepare_data_per_node
^^^^^^^^^^^^^^^^^^^^^
.. raw:: html
<video width="100%" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/prepare_data_per_node.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/prepare_data_per_node.mp4"></video>
|
If True, `prepare_data()` is called on LOCAL_RANK=0 for every node.
If False, it is called only on NODE_RANK=0, LOCAL_RANK=0.
@ -726,6 +987,15 @@ If False will only call from NODE_RANK=0, LOCAL_RANK=0
tpu_cores
^^^^^^^^^
.. raw:: html
<video width="100%" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/tpu_cores.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/tpu_cores.mp4"></video>
|
- How many TPU cores to train on (1 or 8).
- Which TPU core to train on [1-8]
@ -773,6 +1043,15 @@ Example::
overfit_batches
^^^^^^^^^^^^^^^
.. raw:: html
<video width="100%" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/overfit_batches.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/overfit_batches.mp4"></video>
|
Uses this much of the training set. If nonzero, will use the same training set for validation and testing.
If the training dataloaders have `shuffle=True`, Lightning will automatically disable it.
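For example, a short sketch of both forms:

.. code-block:: python

    from pytorch_lightning import Trainer

    # use only 1% of the training set (reused for val and test)
    trainer = Trainer(overfit_batches=0.01)

    # overfit on 10 training batches
    trainer = Trainer(overfit_batches=10)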
@ -791,6 +1070,15 @@ Useful for quickly debugging or trying to overfit on purpose.
precision
^^^^^^^^^
.. raw:: html
<video width="100%" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/precision.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/precision.mp4"></video>
|
Full precision (32), half precision (16).
Can be used on CPU, GPU or TPUs.
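A minimal sketch:

.. code-block:: python

    from pytorch_lightning import Trainer

    # default: full 32-bit precision
    trainer = Trainer(precision=32)

    # 16-bit precision
    trainer = Trainer(precision=16, gpus=1)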
@ -813,6 +1101,15 @@ Example::
process_position
^^^^^^^^^^^^^^^^
.. raw:: html
<video width="100%" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/process_position.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/process_position.mp4"></video>
|
Orders the progress bar. Useful when running multiple trainers on the same node.
.. testcode::
@ -825,6 +1122,15 @@ Note:
profiler
^^^^^^^^
.. raw:: html
<video width="100%" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/profiler.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/profiler.mp4"></video>
|
Profile individual steps during training to help identify bottlenecks.
See the :ref:`profiler documentation <profiler>` for more details.
@ -847,6 +1153,15 @@ See the :ref:`profiler documentation <profiler>`. for more details.
progress_bar_refresh_rate
^^^^^^^^^^^^^^^^^^^^^^^^^
.. raw:: html
<video width="100%" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/progress_bar%E2%80%A8_refresh_rate.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/progress_bar_refresh_rate.mp4"></video>
|
How often to refresh the progress bar (in steps).
In notebooks, faster refresh rates (lower numbers) are known to crash them
because of their screen refresh rates, so raise it to 50 or more.
@ -864,6 +1179,15 @@ Note:
reload_dataloaders_every_epoch
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. raw:: html
<video width="100%" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/reload_%E2%80%A8dataloaders_%E2%80%A8every_epoch.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/reload_dataloaders_every_epoch.mp4"></video>
|
Set to True to reload dataloaders every epoch.
.. code-block:: python
@ -881,6 +1205,15 @@ Set to True to reload dataloaders every epoch.
replace_sampler_ddp
^^^^^^^^^^^^^^^^^^^
.. raw:: html
<video width="100%" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/replace_sampler_ddp.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/replace_sampler_ddp.mp4"></video>
|
Enables automatic addition of a distributed sampler. By default it will add ``shuffle=True``
for train sampler and ``shuffle=False`` for val/test sampler. If you want to customize
it, you can set ``replace_sampler_ddp=False`` and add your own distributed sampler.
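A sketch of that opt-out case (``dataset`` is assumed to be defined elsewhere):

.. code-block:: python

    from torch.utils.data import DataLoader
    from torch.utils.data.distributed import DistributedSampler
    from pytorch_lightning import Trainer

    # assumes `dataset` is defined elsewhere
    sampler = DistributedSampler(dataset, shuffle=True)
    loader = DataLoader(dataset, sampler=sampler)

    trainer = Trainer(replace_sampler_ddp=False)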
@ -900,6 +1233,15 @@ By setting to False, you have to add your own distributed sampler:
resume_from_checkpoint
^^^^^^^^^^^^^^^^^^^^^^
.. raw:: html
<video width="100%" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/resume_from_checkpoint.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/resume_from_checkpoint.mp4"></video>
|
To resume training from a specific checkpoint, pass its path here.
.. testcode::
@ -913,6 +1255,15 @@ To resume training from a specific checkpoint pass in the path here.
log_every_n_steps
^^^^^^^^^^^^^^^^^
.. raw:: html
<video width="100%" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/log_every_n_steps.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/log_every_n_steps.mp4"></video>
|
How often to add logging rows (does not write to disk).
.. testcode::
@ -927,6 +1278,14 @@ See Also:
sync_batchnorm
^^^^^^^^^^^^^^
.. raw:: html
<video width="100%" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/sync_batchnorm.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/sync_batchnorm.mp4"></video>
|
Enable synchronization between batchnorm layers across all GPUs.
.. testcode::
@ -936,6 +1295,14 @@ Enable synchronization between batchnorm layers across all GPUs.
track_grad_norm
^^^^^^^^^^^^^^^
.. raw:: html
<video width="100%" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/track_grad_norm.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/track_grad_norm.mp4"></video>
|
- No tracking (-1)
- Otherwise tracks that norm (e.g. 2 for the 2-norm)
@ -950,6 +1317,14 @@ track_grad_norm
limit_train_batches
^^^^^^^^^^^^^^^^^^^
.. raw:: html
<video width="100%" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/limit_train_batches.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/limit_batches.mp4"></video>
|
How much of the training dataset to check.
Useful when debugging or testing something that happens at the end of an epoch.
@ -972,6 +1347,14 @@ Example::
truncated_bptt_steps
^^^^^^^^^^^^^^^^^^^^
.. raw:: html
<video width="100%" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/truncated_bptt_steps.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/truncated_bptt_steps.mp4"></video>
|
Truncated backpropagation through time performs backprop every k steps of
a much longer sequence.
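A minimal sketch (the step count is illustrative):

.. code-block:: python

    from pytorch_lightning import Trainer

    # default: no truncation
    trainer = Trainer(truncated_bptt_steps=None)

    # backprop every 5 time steps of the sequence
    trainer = Trainer(truncated_bptt_steps=5)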
@ -1029,6 +1412,14 @@ override :meth:`pytorch_lightning.core.LightningModule.tbptt_split_batch`:
val_check_interval
^^^^^^^^^^^^^^^^^^
.. raw:: html
<video width="100%" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/val_check_interval.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/val_check_interval.mp4"></video>
|
How often within one training epoch to check the validation set.
Can specify as float or int.
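For example, both forms sketched:

.. code-block:: python

    from pytorch_lightning import Trainer

    # run the validation loop every quarter of a training epoch
    trainer = Trainer(val_check_interval=0.25)

    # run the validation loop every 1000 training batches
    trainer = Trainer(val_check_interval=1000)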
@ -1051,6 +1442,15 @@ Can specify as float or int.
weights_save_path
^^^^^^^^^^^^^^^^^
.. raw:: html
<video width="100%" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/weights_save_path.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/weights_save_path.mp4"></video>
|
Directory where weights are saved, if specified.
.. testcode::
@ -1073,6 +1473,15 @@ Example::
weights_summary
^^^^^^^^^^^^^^^
.. raw:: html
<video width="100%" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/weights_summary.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/weights_summary.mp4"></video>
|
Prints a summary of the weights when training begins.
Options: 'full', 'top', None.