Reordered sections for intuitive browsing. (e.g. limit_train_batches was at the end of the page, far from limit_test/val_batches) (#5283)

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Sejin Kim 2021-01-04 09:05:24 -05:00 committed by GitHub
parent 17a0784c5e
commit 0e593fb6a8
1 changed file with 141 additions and 144 deletions


@@ -670,6 +670,27 @@ Under the hood the pseudocode looks like this when running *fast_dev_run* with a
used only for debugging purposes. ``limit_train/val/test_batches`` only limits the number of batches and won't
disable anything.
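To make the distinction concrete, here is a minimal sketch; the flag values below are illustrative, not defaults.

Example::

    # fast_dev_run: run a single batch of train/val/test once and disable
    # checkpointing, loggers and similar side effects, purely for debugging
    trainer = Trainer(fast_dev_run=True)

    # limit_train_batches: only caps how many training batches run per epoch;
    # checkpointing, logging, etc. stay enabled
    trainer = Trainer(limit_train_batches=5)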
flush_logs_every_n_steps
^^^^^^^^^^^^^^^^^^^^^^^^
.. raw:: html
<video width="50%" max-width="400px" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/flush_logs%E2%80%A8_every_n_steps.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/flush_logs_every_n_steps.mp4"></video>
|
Writes logs to disk this often.
.. testcode::
# default used by the Trainer
trainer = Trainer(flush_logs_every_n_steps=100)
See Also:
- :ref:`logging`
gpus
^^^^
@@ -736,6 +757,35 @@ Gradient clipping value
# default used by the Trainer
trainer = Trainer(gradient_clip_val=0.0)
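A nonzero value enables clipping; the 0.5 below is just an illustrative threshold, not a recommendation.

Example::

    # rescale gradients so their norm does not exceed 0.5
    trainer = Trainer(gradient_clip_val=0.5)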
limit_train_batches
^^^^^^^^^^^^^^^^^^^
.. raw:: html
<video width="50%" max-width="400px" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/limit_train_batches.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/limit_batches.mp4"></video>
|
How much of the training dataset to check.
Useful when debugging or testing something that happens at the end of an epoch.
.. testcode::
# default used by the Trainer
trainer = Trainer(limit_train_batches=1.0)
Example::
# default used by the Trainer
trainer = Trainer(limit_train_batches=1.0)
# run through only 25% of the training set each epoch
trainer = Trainer(limit_train_batches=0.25)
# run through only 10 batches of the training set each epoch
trainer = Trainer(limit_train_batches=10)
limit_test_batches
^^^^^^^^^^^^^^^^^^
@@ -790,6 +840,28 @@ Useful when debugging or testing something that happens at the end of an epoch.
In the case of multiple validation dataloaders, the limit applies to each dataloader individually.
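As a sketch (assuming the ``limit_val_batches`` flag discussed above; ``MyModel`` and its two validation dataloaders are hypothetical), the cap applies to each loader separately:

Example::

    model = MyModel()  # hypothetical LightningModule returning two val dataloaders

    # at most 10 batches are drawn from each validation dataloader,
    # i.e. up to 20 validation batches per epoch in total
    trainer = Trainer(limit_val_batches=10)
    trainer.fit(model)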
log_every_n_steps
^^^^^^^^^^^^^^^^^
.. raw:: html
<video width="50%" max-width="400px" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/log_every_n_steps.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/log_every_n_steps.mp4"></video>
|
How often to add logging rows (does not write to disk).
.. testcode::
# default used by the Trainer
trainer = Trainer(log_every_n_steps=50)
See Also:
- :ref:`logging`
log_gpu_memory
^^^^^^^^^^^^^^
@@ -820,27 +892,6 @@ Options:
.. note:: Might slow performance because it uses the output of ``nvidia-smi``.
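For reference, a small sketch of the flag in use; the option strings follow the Lightning 1.x API and should be treated as assumptions if your version differs.

Example::

    # default used by the Trainer (no GPU memory logging)
    trainer = Trainer(log_gpu_memory=None)

    # log only the min/max memory utilization across all GPUs
    trainer = Trainer(log_gpu_memory='min_max')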
flush_logs_every_n_steps
^^^^^^^^^^^^^^^^^^^^^^^^
.. raw:: html
<video width="50%" max-width="400px" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/flush_logs%E2%80%A8_every_n_steps.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/flush_logs_every_n_steps.mp4"></video>
|
Writes logs to disk this often.
.. testcode::
# default used by the Trainer
trainer = Trainer(flush_logs_every_n_steps=100)
See Also:
- :ref:`logging`
logger
^^^^^^
@@ -1019,6 +1070,32 @@ The Trainer uses 2 steps by default. Turn it off or modify it here.
This option will reset the validation dataloader unless ``num_sanity_val_steps=0``.
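A short sketch of the flag this note refers to, ``num_sanity_val_steps``, with the values documented for Lightning 1.x:

Example::

    # default used by the Trainer: run 2 validation batches before training
    trainer = Trainer(num_sanity_val_steps=2)

    # turn the sanity check off
    trainer = Trainer(num_sanity_val_steps=0)

    # run a full validation pass as the sanity check
    trainer = Trainer(num_sanity_val_steps=-1)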
overfit_batches
^^^^^^^^^^^^^^^
.. raw:: html
<video width="50%" max-width="400px" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/overfit_batches.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/overfit_batches.mp4"></video>
|
Uses this much of the training set. If nonzero, the same training set will be used for validation and testing.
If the training dataloaders have ``shuffle=True``, Lightning will automatically disable it.
Useful for quickly debugging or trying to overfit on purpose.
.. testcode::
# default used by the Trainer
trainer = Trainer(overfit_batches=0.0)
# use only 1% of the train set (and use the train set for val and test)
trainer = Trainer(overfit_batches=0.01)
# overfit on 10 of the same batches
trainer = Trainer(overfit_batches=10)
plugins
^^^^^^^
@@ -1079,91 +1156,6 @@ If False will only call from NODE_RANK=0, LOCAL_RANK=0
# use only NODE_RANK=0, LOCAL_RANK=0
Trainer(prepare_data_per_node=False)
tpu_cores
^^^^^^^^^
.. raw:: html
<video width="50%" max-width="400px" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/tpu_cores.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/tpu_cores.mp4"></video>
|
- How many TPU cores to train on (1 or 8).
- Which TPU core to train on [1-8]
A single TPU v2 or v3 has 8 cores. A TPU pod has
up to 2048 cores. A slice of a POD means you get as many cores
as you request.
Your effective batch size is batch_size * total tpu cores.
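For example, with ``batch_size=32`` on 8 TPU cores the effective batch size is 32 * 8 = 256.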
.. note::
No need to add a :class:`~torch.utils.data.distributed.DistributedSampler`,
Lightning automatically does it for you.
This parameter can be either 1 or 8.
Example::
# your_trainer_file.py
# default used by the Trainer (i.e. train on CPU)
trainer = Trainer(tpu_cores=None)
# int: train on a single core
trainer = Trainer(tpu_cores=1)
# list: train on a single selected core
trainer = Trainer(tpu_cores=[2])
# int: train on all 8 cores
trainer = Trainer(tpu_cores=8)
# for 8+ cores must submit via xla script with
# a max of 8 cores specified. The XLA script
# will duplicate script onto each TPU in the POD
trainer = Trainer(tpu_cores=8)
To train on more than 8 cores (i.e. a POD),
submit this script using the xla_dist script.
Example::
python -m torch_xla.distributed.xla_dist
--tpu=$TPU_POD_NAME
--conda-env=torch-xla-nightly
--env=XLA_USE_BF16=1
-- python your_trainer_file.py
overfit_batches
^^^^^^^^^^^^^^^
.. raw:: html
<video width="50%" max-width="400px" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/overfit_batches.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/overfit_batches.mp4"></video>
|
Uses this much of the training set. If nonzero, the same training set will be used for validation and testing.
If the training dataloaders have ``shuffle=True``, Lightning will automatically disable it.
Useful for quickly debugging or trying to overfit on purpose.
.. testcode::
# default used by the Trainer
trainer = Trainer(overfit_batches=0.0)
# use only 1% of the train set (and use the train set for val and test)
trainer = Trainer(overfit_batches=0.01)
# overfit on 10 of the same batches
trainer = Trainer(overfit_batches=10)
precision
^^^^^^^^^
@@ -1346,29 +1338,6 @@ To resume training from a specific checkpoint, pass in the path here.
# resume from a specific checkpoint
trainer = Trainer(resume_from_checkpoint='some/path/to/my_checkpoint.ckpt')
log_every_n_steps
^^^^^^^^^^^^^^^^^
.. raw:: html
<video width="50%" max-width="400px" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/log_every_n_steps.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/log_every_n_steps.mp4"></video>
|
How often to add logging rows (does not write to disk).
.. testcode::
# default used by the Trainer
trainer = Trainer(log_every_n_steps=50)
See Also:
- :ref:`logging`
sync_batchnorm
^^^^^^^^^^^^^^
@@ -1408,35 +1377,63 @@ track_grad_norm
# track the 2-norm
trainer = Trainer(track_grad_norm=2)
limit_train_batches
^^^^^^^^^^^^^^^^^^^
.. raw:: html
<video width="50%" max-width="400px" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/limit_train_batches.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/limit_batches.mp4"></video>
|
How much of the training dataset to check.
Useful when debugging or testing something that happens at the end of an epoch.
.. testcode::
# default used by the Trainer
trainer = Trainer(limit_train_batches=1.0)
Example::
# default used by the Trainer
trainer = Trainer(limit_train_batches=1.0)
# run through only 25% of the training set each epoch
trainer = Trainer(limit_train_batches=0.25)
# run through only 10 batches of the training set each epoch
trainer = Trainer(limit_train_batches=10)
tpu_cores
^^^^^^^^^
.. raw:: html
<video width="50%" max-width="400px" controls
poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/tpu_cores.jpg"
src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/tpu_cores.mp4"></video>
|
- How many TPU cores to train on (1 or 8).
- Which TPU core to train on [1-8]
A single TPU v2 or v3 has 8 cores. A TPU pod has
up to 2048 cores. A slice of a POD means you get as many cores
as you request.
Your effective batch size is batch_size * total tpu cores.
.. note::
No need to add a :class:`~torch.utils.data.distributed.DistributedSampler`,
Lightning automatically does it for you.
This parameter can be either 1 or 8.
Example::
# your_trainer_file.py
# default used by the Trainer (i.e. train on CPU)
trainer = Trainer(tpu_cores=None)
# int: train on a single core
trainer = Trainer(tpu_cores=1)
# list: train on a single selected core
trainer = Trainer(tpu_cores=[2])
# int: train on all 8 cores
trainer = Trainer(tpu_cores=8)
# for 8+ cores must submit via xla script with
# a max of 8 cores specified. The XLA script
# will duplicate script onto each TPU in the POD
trainer = Trainer(tpu_cores=8)
To train on more than 8 cores (i.e. a POD),
submit this script using the xla_dist script.
Example::
python -m torch_xla.distributed.xla_dist
--tpu=$TPU_POD_NAME
--conda-env=torch-xla-nightly
--env=XLA_USE_BF16=1
-- python your_trainer_file.py
truncated_bptt_steps
^^^^^^^^^^^^^^^^^^^^