Reordered sections for intuitive browsing. (e.g. limit_train_batches was at the end of the page, far from limit_test/val_batches) (#5283)
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
parent 17a0784c5e
commit 0e593fb6a8
@@ -670,6 +670,27 @@ Under the hood the pseudocode looks like this when running *fast_dev_run* with a
 used only for debugging purposes. ``limit_train/val/test_batches`` only limits the number of batches and won't
 disable anything.

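For illustration, a minimal sketch of combining the three limit flags (assuming the standard ``pytorch_lightning.Trainer`` API; the values are arbitrary):

.. code-block:: python

    from pytorch_lightning import Trainer

    # cap every loop without disabling checkpointing, logging, callbacks, etc.
    trainer = Trainer(
        limit_train_batches=0.1,  # use 10% of the training set per epoch
        limit_val_batches=5,      # use 5 batches per validation dataloader
        limit_test_batches=5,     # use 5 batches per test dataloader
    )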
+flush_logs_every_n_steps
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. raw:: html
+
+    <video width="50%" max-width="400px" controls
+    poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/flush_logs%E2%80%A8_every_n_steps.jpg"
+    src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/flush_logs_every_n_steps.mp4"></video>
+
+|
+
+Writes logs to disk this often.
+
+.. testcode::
+
+    # default used by the Trainer
+    trainer = Trainer(flush_logs_every_n_steps=100)
+
+See Also:
+    - :ref:`logging`
+
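Note that ``flush_logs_every_n_steps`` controls only how often buffered logs are written to disk; how often rows are recorded is governed by ``log_every_n_steps`` (documented below). A minimal sketch combining the two flags, with illustrative values:

.. code-block:: python

    from pytorch_lightning import Trainer

    # record a logging row every 50 steps, flush the buffer to disk every 500
    trainer = Trainer(log_every_n_steps=50, flush_logs_every_n_steps=500)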
 gpus
 ^^^^

@@ -736,6 +757,35 @@ Gradient clipping value
     # default used by the Trainer
     trainer = Trainer(gradient_clip_val=0.0)

+limit_train_batches
+^^^^^^^^^^^^^^^^^^^
+
+.. raw:: html
+
+    <video width="50%" max-width="400px" controls
+    poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/limit_train_batches.jpg"
+    src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/limit_batches.mp4"></video>
+
+|
+
+How much of the training dataset to check.
+Useful when debugging or testing something that happens at the end of an epoch.
+
+.. testcode::
+
+    # default used by the Trainer
+    trainer = Trainer(limit_train_batches=1.0)
+
+Example::
+
+    # default used by the Trainer
+    trainer = Trainer(limit_train_batches=1.0)
+
+    # run through only 25% of the training set each epoch
+    trainer = Trainer(limit_train_batches=0.25)
+
+    # run through only 10 batches of the training set each epoch
+    trainer = Trainer(limit_train_batches=10)
+
 limit_test_batches
 ^^^^^^^^^^^^^^^^^^

@@ -790,6 +840,28 @@ Useful when debugging or testing something that happens at the end of an epoch.

 In the case of multiple validation dataloaders, the limit applies to each dataloader individually.

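A minimal sketch of that case (``MyModel`` is a hypothetical ``LightningModule`` whose ``val_dataloader()`` returns a list of loaders):

.. code-block:: python

    from pytorch_lightning import Trainer

    # MyModel.val_dataloader() returns [loader_a, loader_b]
    trainer = Trainer(limit_val_batches=10)  # checks 10 batches of each loader
    trainer.fit(MyModel())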
+log_every_n_steps
+^^^^^^^^^^^^^^^^^
+
+.. raw:: html
+
+    <video width="50%" max-width="400px" controls
+    poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/log_every_n_steps.jpg"
+    src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/log_every_n_steps.mp4"></video>
+
+|
+
+How often to add logging rows (does not write to disk).
+
+.. testcode::
+
+    # default used by the Trainer
+    trainer = Trainer(log_every_n_steps=50)
+
+See Also:
+    - :ref:`logging`
+
 log_gpu_memory
 ^^^^^^^^^^^^^^

@@ -820,27 +892,6 @@ Options:

 .. note:: Might slow performance because it uses the output of ``nvidia-smi``.

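For context, a minimal sketch of the flag this note refers to, using one of its documented options (the hunk header above lists "Options:"):

.. code-block:: python

    from pytorch_lightning import Trainer

    # log only the min/max GPU memory utilization across devices
    trainer = Trainer(log_gpu_memory='min_max')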
-flush_logs_every_n_steps
-^^^^^^^^^^^^^^^^^^^^^^^^
-
-.. raw:: html
-
-    <video width="50%" max-width="400px" controls
-    poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/flush_logs%E2%80%A8_every_n_steps.jpg"
-    src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/flush_logs_every_n_steps.mp4"></video>
-
-|
-
-Writes logs to disk this often.
-
-.. testcode::
-
-    # default used by the Trainer
-    trainer = Trainer(flush_logs_every_n_steps=100)
-
-See Also:
-    - :ref:`logging`
-
 logger
 ^^^^^^

@@ -1019,6 +1070,32 @@ The Trainer uses 2 steps by default. Turn it off or modify it here.

 This option will reset the validation dataloader unless ``num_sanity_val_steps=0``.

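A minimal sketch of the flag mentioned above (assuming the standard ``Trainer`` semantics, where ``-1`` checks all validation batches):

.. code-block:: python

    from pytorch_lightning import Trainer

    # turn off sanity checking (also avoids the dataloader reset noted above)
    trainer = Trainer(num_sanity_val_steps=0)

    # run the full validation set as a sanity check before training
    trainer = Trainer(num_sanity_val_steps=-1)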
+overfit_batches
+^^^^^^^^^^^^^^^
+
+.. raw:: html
+
+    <video width="50%" max-width="400px" controls
+    poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/overfit_batches.jpg"
+    src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/overfit_batches.mp4"></video>
+
+|
+
+Uses this much data of the training set. If nonzero, will use the same training set for validation and testing.
+If the training dataloaders have ``shuffle=True``, Lightning will automatically disable it.
+
+Useful for quickly debugging or trying to overfit on purpose.
+
+.. testcode::
+
+    # default used by the Trainer
+    trainer = Trainer(overfit_batches=0.0)
+
+    # use only 1% of the train set (and use the train set for val and test)
+    trainer = Trainer(overfit_batches=0.01)
+
+    # overfit on 10 of the same batches
+    trainer = Trainer(overfit_batches=10)
+
 plugins
 ^^^^^^^

@@ -1079,91 +1156,6 @@ If False will only call from NODE_RANK=0, LOCAL_RANK=0
     # use only NODE_RANK=0, LOCAL_RANK=0
     Trainer(prepare_data_per_node=False)

-tpu_cores
-^^^^^^^^^
-
-.. raw:: html
-
-    <video width="50%" max-width="400px" controls
-    poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/tpu_cores.jpg"
-    src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/tpu_cores.mp4"></video>
-
-|
-
-- How many TPU cores to train on (1 or 8).
-- Which TPU core to train on [1-8]
-
-A single TPU v2 or v3 has 8 cores. A TPU pod has
-up to 2048 cores. A slice of a pod means you get as many cores
-as you request.
-
-Your effective batch size is batch_size * total TPU cores.
-
-.. note::
-    No need to add a :class:`~torch.utils.data.distributed.DistributedSampler`,
-    Lightning automatically does it for you.
-
-This parameter can be either 1 or 8.
-
-Example::
-
-    # your_trainer_file.py
-
-    # default used by the Trainer (i.e. train on CPU)
-    trainer = Trainer(tpu_cores=None)
-
-    # int: train on a single core
-    trainer = Trainer(tpu_cores=1)
-
-    # list: train on a single selected core
-    trainer = Trainer(tpu_cores=[2])
-
-    # int: train on all 8 cores
-    trainer = Trainer(tpu_cores=8)
-
-    # for 8+ cores, you must submit via the xla_dist script with
-    # a max of 8 cores specified. The xla_dist script
-    # will duplicate the script onto each TPU in the pod
-    trainer = Trainer(tpu_cores=8)
-
-To train on more than 8 cores (i.e. a pod),
-submit this script using the xla_dist script.
-
-Example::
-
-    python -m torch_xla.distributed.xla_dist
-    --tpu=$TPU_POD_NAME
-    --conda-env=torch-xla-nightly
-    --env=XLA_USE_BF16=1
-    -- python your_trainer_file.py
-
-overfit_batches
-^^^^^^^^^^^^^^^
-
-.. raw:: html
-
-    <video width="50%" max-width="400px" controls
-    poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/overfit_batches.jpg"
-    src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/overfit_batches.mp4"></video>
-
-|
-
-Uses this much data of the training set. If nonzero, will use the same training set for validation and testing.
-If the training dataloaders have ``shuffle=True``, Lightning will automatically disable it.
-
-Useful for quickly debugging or trying to overfit on purpose.
-
-.. testcode::
-
-    # default used by the Trainer
-    trainer = Trainer(overfit_batches=0.0)
-
-    # use only 1% of the train set (and use the train set for val and test)
-    trainer = Trainer(overfit_batches=0.01)
-
-    # overfit on 10 of the same batches
-    trainer = Trainer(overfit_batches=10)
-
 precision
 ^^^^^^^^^

@@ -1346,29 +1338,6 @@ To resume training from a specific checkpoint pass in the path here.
     # resume from a specific checkpoint
     trainer = Trainer(resume_from_checkpoint='some/path/to/my_checkpoint.ckpt')

-log_every_n_steps
-^^^^^^^^^^^^^^^^^
-
-.. raw:: html
-
-    <video width="50%" max-width="400px" controls
-    poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/log_every_n_steps.jpg"
-    src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/log_every_n_steps.mp4"></video>
-
-|
-
-How often to add logging rows (does not write to disk).
-
-.. testcode::
-
-    # default used by the Trainer
-    trainer = Trainer(log_every_n_steps=50)
-
-See Also:
-    - :ref:`logging`
-
-
 sync_batchnorm
 ^^^^^^^^^^^^^^

@@ -1408,35 +1377,63 @@ track_grad_norm
     # track the 2-norm
     trainer = Trainer(track_grad_norm=2)

-limit_train_batches
-^^^^^^^^^^^^^^^^^^^
-
-.. raw:: html
-
-    <video width="50%" max-width="400px" controls
-    poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/limit_train_batches.jpg"
-    src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/limit_batches.mp4"></video>
-
-|
-
-How much of the training dataset to check.
-Useful when debugging or testing something that happens at the end of an epoch.
-
-.. testcode::
-
-    # default used by the Trainer
-    trainer = Trainer(limit_train_batches=1.0)
-
-Example::
-
-    # default used by the Trainer
-    trainer = Trainer(limit_train_batches=1.0)
-
-    # run through only 25% of the training set each epoch
-    trainer = Trainer(limit_train_batches=0.25)
-
-    # run through only 10 batches of the training set each epoch
-    trainer = Trainer(limit_train_batches=10)
-
+tpu_cores
+^^^^^^^^^
+
+.. raw:: html
+
+    <video width="50%" max-width="400px" controls
+    poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/tpu_cores.jpg"
+    src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/tpu_cores.mp4"></video>
+
+|
+
+- How many TPU cores to train on (1 or 8).
+- Which TPU core to train on [1-8]
+
+A single TPU v2 or v3 has 8 cores. A TPU pod has
+up to 2048 cores. A slice of a pod means you get as many cores
+as you request.
+
+Your effective batch size is batch_size * total TPU cores.
+
+.. note::
+    No need to add a :class:`~torch.utils.data.distributed.DistributedSampler`,
+    Lightning automatically does it for you.
+
+This parameter can be either 1 or 8.
+
+Example::
+
+    # your_trainer_file.py
+
+    # default used by the Trainer (i.e. train on CPU)
+    trainer = Trainer(tpu_cores=None)
+
+    # int: train on a single core
+    trainer = Trainer(tpu_cores=1)
+
+    # list: train on a single selected core
+    trainer = Trainer(tpu_cores=[2])
+
+    # int: train on all 8 cores
+    trainer = Trainer(tpu_cores=8)
+
+    # for 8+ cores, you must submit via the xla_dist script with
+    # a max of 8 cores specified. The xla_dist script
+    # will duplicate the script onto each TPU in the pod
+    trainer = Trainer(tpu_cores=8)
+
+To train on more than 8 cores (i.e. a pod),
+submit this script using the xla_dist script.
+
+Example::
+
+    python -m torch_xla.distributed.xla_dist
+    --tpu=$TPU_POD_NAME
+    --conda-env=torch-xla-nightly
+    --env=XLA_USE_BF16=1
+    -- python your_trainer_file.py
+
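To make the effective batch size concrete, a small worked example (the numbers are illustrative only):

.. code-block:: python

    from pytorch_lightning import Trainer

    # a DataLoader with batch_size=32 on 8 TPU cores gives an
    # effective batch size of 32 * 8 = 256 samples per optimizer step
    trainer = Trainer(tpu_cores=8)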
 truncated_bptt_steps
 ^^^^^^^^^^^^^^^^^^^^
