Reordered sections for intuitive browsing. (e.g. limit_train_batches was at the end of the page, far from limit_test/val_batches) (#5283)
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
parent 17a0784c5e
commit 0e593fb6a8
@@ -670,6 +670,27 @@ Under the hood the pseudocode looks like this when running *fast_dev_run* with a
 used only for debugging purposes. ``limit_train/val/test_batches`` only limits the number of batches and won't
 disable anything.

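For illustration, a minimal sketch of combining the three limit flags (assuming the standard ``pytorch_lightning.Trainer`` API; the values are arbitrary):

.. code-block:: python

    from pytorch_lightning import Trainer

    # cap every loop without disabling checkpointing, logging, callbacks, etc.
    trainer = Trainer(
        limit_train_batches=0.1,  # use 10% of the training set per epoch
        limit_val_batches=5,      # use 5 batches per validation dataloader
        limit_test_batches=5,     # use 5 batches per test dataloader
    )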
+flush_logs_every_n_steps
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. raw:: html
+
+    <video width="50%" max-width="400px" controls
+    poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/flush_logs%E2%80%A8_every_n_steps.jpg"
+    src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/flush_logs_every_n_steps.mp4"></video>
+
+|
+
+Writes logs to disk this often.
+
+.. testcode::
+
+    # default used by the Trainer
+    trainer = Trainer(flush_logs_every_n_steps=100)
+
+See Also:
+    - :ref:`logging`
+
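Note that ``flush_logs_every_n_steps`` controls only how often buffered logs are written to disk; how often rows are recorded is governed by ``log_every_n_steps`` (documented below). A minimal sketch combining the two flags, with illustrative values:

.. code-block:: python

    from pytorch_lightning import Trainer

    # record a logging row every 50 steps, flush the buffer to disk every 500
    trainer = Trainer(log_every_n_steps=50, flush_logs_every_n_steps=500)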
 gpus
 ^^^^

@@ -736,6 +757,35 @@ Gradient clipping value
     # default used by the Trainer
     trainer = Trainer(gradient_clip_val=0.0)

+limit_train_batches
+^^^^^^^^^^^^^^^^^^^
+
+.. raw:: html
+
+    <video width="50%" max-width="400px" controls
+    poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/limit_train_batches.jpg"
+    src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/limit_batches.mp4"></video>
+
+|
+
+How much of the training dataset to check.
+Useful when debugging or testing something that happens at the end of an epoch.
+
+.. testcode::
+
+    # default used by the Trainer
+    trainer = Trainer(limit_train_batches=1.0)
+
+Example::
+
+    # default used by the Trainer
+    trainer = Trainer(limit_train_batches=1.0)
+
+    # run through only 25% of the training set each epoch
+    trainer = Trainer(limit_train_batches=0.25)
+
+    # run through only 10 batches of the training set each epoch
+    trainer = Trainer(limit_train_batches=10)
+
 limit_test_batches
 ^^^^^^^^^^^^^^^^^^

@@ -790,6 +840,28 @@ Useful when debugging or testing something that happens at the end of an epoch.

 In the case of multiple validation dataloaders, the limit applies to each dataloader individually.

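A minimal sketch of that case (``MyModel`` is a hypothetical ``LightningModule`` whose ``val_dataloader()`` returns a list of loaders):

.. code-block:: python

    from pytorch_lightning import Trainer

    # MyModel.val_dataloader() returns [loader_a, loader_b]
    trainer = Trainer(limit_val_batches=10)  # checks 10 batches of each loader
    trainer.fit(MyModel())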
+log_every_n_steps
+^^^^^^^^^^^^^^^^^
+
+.. raw:: html
+
+    <video width="50%" max-width="400px" controls
+    poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/log_every_n_steps.jpg"
+    src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/log_every_n_steps.mp4"></video>
+
+|
+
+How often to add logging rows (does not write to disk).
+
+.. testcode::
+
+    # default used by the Trainer
+    trainer = Trainer(log_every_n_steps=50)
+
+See Also:
+    - :ref:`logging`
+
 log_gpu_memory
 ^^^^^^^^^^^^^^

@@ -820,27 +892,6 @@ Options:

 .. note:: Might slow performance because it uses the output of ``nvidia-smi``.

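For context, a minimal sketch of the flag this note refers to, using one of its documented options (the hunk header above lists "Options:"):

.. code-block:: python

    from pytorch_lightning import Trainer

    # log only the min/max GPU memory utilization across devices
    trainer = Trainer(log_gpu_memory='min_max')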
-flush_logs_every_n_steps
-^^^^^^^^^^^^^^^^^^^^^^^^
-
-.. raw:: html
-
-    <video width="50%" max-width="400px" controls
-    poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/flush_logs%E2%80%A8_every_n_steps.jpg"
-    src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/flush_logs_every_n_steps.mp4"></video>
-
-|
-
-Writes logs to disk this often.
-
-.. testcode::
-
-    # default used by the Trainer
-    trainer = Trainer(flush_logs_every_n_steps=100)
-
-See Also:
-    - :ref:`logging`
-
 logger
 ^^^^^^

@@ -1019,6 +1070,32 @@ The Trainer uses 2 steps by default. Turn it off or modify it here.

 This option will reset the validation dataloader unless ``num_sanity_val_steps=0``.

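A minimal sketch of the flag mentioned above (assuming the standard ``Trainer`` semantics, where ``-1`` checks all validation batches):

.. code-block:: python

    from pytorch_lightning import Trainer

    # turn off sanity checking (also avoids the dataloader reset noted above)
    trainer = Trainer(num_sanity_val_steps=0)

    # run the full validation set as a sanity check before training
    trainer = Trainer(num_sanity_val_steps=-1)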
+overfit_batches
+^^^^^^^^^^^^^^^
+
+.. raw:: html
+
+    <video width="50%" max-width="400px" controls
+    poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/overfit_batches.jpg"
+    src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/overfit_batches.mp4"></video>
+
+|
+
+Uses this much data of the training set. If nonzero, will use the same training set for validation and testing.
+If the training dataloaders have ``shuffle=True``, Lightning will automatically disable it.
+
+Useful for quickly debugging or trying to overfit on purpose.
+
+.. testcode::
+
+    # default used by the Trainer
+    trainer = Trainer(overfit_batches=0.0)
+
+    # use only 1% of the train set (and use the train set for val and test)
+    trainer = Trainer(overfit_batches=0.01)
+
+    # overfit on 10 of the same batches
+    trainer = Trainer(overfit_batches=10)
+
 plugins
 ^^^^^^^

@@ -1079,91 +1156,6 @@ If False will only call from NODE_RANK=0, LOCAL_RANK=0
     # use only NODE_RANK=0, LOCAL_RANK=0
     Trainer(prepare_data_per_node=False)

-tpu_cores
-^^^^^^^^^
-
-.. raw:: html
-
-    <video width="50%" max-width="400px" controls
-    poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/tpu_cores.jpg"
-    src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/tpu_cores.mp4"></video>
-
-|
-
-- How many TPU cores to train on (1 or 8).
-- Which TPU core to train on [1-8]
-
-A single TPU v2 or v3 has 8 cores. A TPU pod has
-up to 2048 cores. A slice of a pod means you get as many cores
-as you request.
-
-Your effective batch size is batch_size * total TPU cores.
-
-.. note::
-    No need to add a :class:`~torch.utils.data.distributed.DistributedSampler`,
-    Lightning automatically does it for you.
-
-This parameter can be either 1 or 8.
-
-Example::
-
-    # your_trainer_file.py
-
-    # default used by the Trainer (i.e. train on CPU)
-    trainer = Trainer(tpu_cores=None)
-
-    # int: train on a single core
-    trainer = Trainer(tpu_cores=1)
-
-    # list: train on a single selected core
-    trainer = Trainer(tpu_cores=[2])
-
-    # int: train on all 8 cores
-    trainer = Trainer(tpu_cores=8)
-
-    # for 8+ cores, you must submit via the xla_dist script with
-    # a max of 8 cores specified. The xla_dist script
-    # will duplicate the script onto each TPU in the pod
-    trainer = Trainer(tpu_cores=8)
-
-To train on more than 8 cores (i.e. a pod),
-submit this script using the xla_dist script.
-
-Example::
-
-    python -m torch_xla.distributed.xla_dist
-    --tpu=$TPU_POD_NAME
-    --conda-env=torch-xla-nightly
-    --env=XLA_USE_BF16=1
-    -- python your_trainer_file.py
-
-overfit_batches
-^^^^^^^^^^^^^^^
-
-.. raw:: html
-
-    <video width="50%" max-width="400px" controls
-    poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/overfit_batches.jpg"
-    src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/overfit_batches.mp4"></video>
-
-|
-
-Uses this much data of the training set. If nonzero, will use the same training set for validation and testing.
-If the training dataloaders have ``shuffle=True``, Lightning will automatically disable it.
-
-Useful for quickly debugging or trying to overfit on purpose.
-
-.. testcode::
-
-    # default used by the Trainer
-    trainer = Trainer(overfit_batches=0.0)
-
-    # use only 1% of the train set (and use the train set for val and test)
-    trainer = Trainer(overfit_batches=0.01)
-
-    # overfit on 10 of the same batches
-    trainer = Trainer(overfit_batches=10)
-
 precision
 ^^^^^^^^^

@@ -1346,29 +1338,6 @@ To resume training from a specific checkpoint pass in the path here.
     # resume from a specific checkpoint
     trainer = Trainer(resume_from_checkpoint='some/path/to/my_checkpoint.ckpt')

-log_every_n_steps
-^^^^^^^^^^^^^^^^^
-
-.. raw:: html
-
-    <video width="50%" max-width="400px" controls
-    poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/log_every_n_steps.jpg"
-    src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/log_every_n_steps.mp4"></video>
-
-|
-
-How often to add logging rows (does not write to disk).
-
-.. testcode::
-
-    # default used by the Trainer
-    trainer = Trainer(log_every_n_steps=50)
-
-See Also:
-    - :ref:`logging`
-
-
 sync_batchnorm
 ^^^^^^^^^^^^^^

@@ -1408,35 +1377,63 @@ track_grad_norm
     # track the 2-norm
     trainer = Trainer(track_grad_norm=2)

-limit_train_batches
-^^^^^^^^^^^^^^^^^^^
-
-.. raw:: html
-
-    <video width="50%" max-width="400px" controls
-    poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/limit_train_batches.jpg"
-    src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/limit_batches.mp4"></video>
-
-|
-
-How much of the training dataset to check.
-Useful when debugging or testing something that happens at the end of an epoch.
-
-.. testcode::
-
-    # default used by the Trainer
-    trainer = Trainer(limit_train_batches=1.0)
-
-Example::
-
-    # default used by the Trainer
-    trainer = Trainer(limit_train_batches=1.0)
-
-    # run through only 25% of the training set each epoch
-    trainer = Trainer(limit_train_batches=0.25)
-
-    # run through only 10 batches of the training set each epoch
-    trainer = Trainer(limit_train_batches=10)
-
+tpu_cores
+^^^^^^^^^
+
+.. raw:: html
+
+    <video width="50%" max-width="400px" controls
+    poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/tpu_cores.jpg"
+    src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/tpu_cores.mp4"></video>
+
+|
+
+- How many TPU cores to train on (1 or 8).
+- Which TPU core to train on [1-8]
+
+A single TPU v2 or v3 has 8 cores. A TPU pod has
+up to 2048 cores. A slice of a pod means you get as many cores
+as you request.
+
+Your effective batch size is batch_size * total TPU cores.
+
+.. note::
+    No need to add a :class:`~torch.utils.data.distributed.DistributedSampler`,
+    Lightning automatically does it for you.
+
+This parameter can be either 1 or 8.
+
+Example::
+
+    # your_trainer_file.py
+
+    # default used by the Trainer (i.e. train on CPU)
+    trainer = Trainer(tpu_cores=None)
+
+    # int: train on a single core
+    trainer = Trainer(tpu_cores=1)
+
+    # list: train on a single selected core
+    trainer = Trainer(tpu_cores=[2])
+
+    # int: train on all 8 cores
+    trainer = Trainer(tpu_cores=8)
+
+    # for 8+ cores, you must submit via the xla_dist script with
+    # a max of 8 cores specified. The xla_dist script
+    # will duplicate the script onto each TPU in the pod
+    trainer = Trainer(tpu_cores=8)
+
+To train on more than 8 cores (i.e. a pod),
+submit this script using the xla_dist script.
+
+Example::
+
+    python -m torch_xla.distributed.xla_dist
+    --tpu=$TPU_POD_NAME
+    --conda-env=torch-xla-nightly
+    --env=XLA_USE_BF16=1
+    -- python your_trainer_file.py
+
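To make the effective batch size concrete, a small worked example (the numbers are illustrative only):

.. code-block:: python

    from pytorch_lightning import Trainer

    # a DataLoader with batch_size=32 on 8 TPU cores gives an
    # effective batch size of 32 * 8 = 256 samples per optimizer step
    trainer = Trainer(tpu_cores=8)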
 truncated_bptt_steps
 ^^^^^^^^^^^^^^^^^^^^
