From 62320632d49d3739a574d901d6151816bc1699b2 Mon Sep 17 00:00:00 2001
From: Jeff Yang
Date: Sat, 3 Oct 2020 18:45:07 +0630
Subject: [PATCH] Some docs update (#3794)

* docs update

* docs update

* suggestions

* Update docs/source/introduction_guide.rst

Co-authored-by: William Falcon
---
 docs/source/converting.rst         | 17 ++++++-----
 docs/source/introduction_guide.rst | 48 ++++++++++++++++--------------
 docs/source/new-project.rst        | 18 ++++++++---
 docs/source/performance.rst        | 44 ++++++++++++++-------------
 4 files changed, 72 insertions(+), 55 deletions(-)

diff --git a/docs/source/converting.rst b/docs/source/converting.rst
index a131166411..89b3a3987d 100644
--- a/docs/source/converting.rst
+++ b/docs/source/converting.rst
@@ -34,7 +34,7 @@ Move the model architecture and forward pass to your :class:`~pytorch_lightning.
 2. Move the optimizer(s) and schedulers
 =======================================
-Move your optimizers to :func:`pytorch_lightning.core.LightningModule.configure_optimizers` hook. Make sure to use the hook parameters (self in this case).
+Move your optimizers to the :func:`~pytorch_lightning.core.LightningModule.configure_optimizers` hook.

 .. testcode::

@@ -46,7 +46,8 @@ Move your optimizers to :func:`pytorch_lightning.core.LightningModule.configure_
 3. Find the train loop "meat"
 =============================
-Lightning automates most of the trining for you, the epoch and batch iterations, all you need to keep is the training step logic. This should go into :func:`pytorch_lightning.core.LightningModule.training_step` hook (make sure to use the hook parameters, self in this case):
+Lightning automates most of the training for you: the epoch and batch iterations. All you need to keep is the training step logic.
+This should go into the :func:`~pytorch_lightning.core.LightningModule.training_step` hook (make sure to use the hook parameters, ``batch`` and ``batch_idx`` in this case):

 .. testcode::

@@ -60,7 +61,8 @@ Lightning automates most of the trining for you, the epoch and batch iterations,
 4. Find the val loop "meat"
 ===========================
-To add an (optional) validation loop add logic to :func:`pytorch_lightning.core.LightningModule.validation_step` hook (make sure to use the hook parameters, self in this case).
+To add an (optional) validation loop, add logic to the
+:func:`~pytorch_lightning.core.LightningModule.validation_step` hook (make sure to use the hook parameters, ``batch`` and ``batch_idx`` in this case).

 .. testcode::

@@ -72,11 +74,12 @@ To add an (optional) validation loop add logic to :func:`pytorch_lightning.core.
         val_loss = F.cross_entropy(y_hat, y)
         return val_loss

-.. note:: model.eval() and torch.no_grad() are called automatically for validation
+.. note:: ``model.eval()`` and ``torch.no_grad()`` are called automatically for validation

 5. Find the test loop "meat"
 ============================
-To add an (optional) test loop add logic to :func:`pytorch_lightning.core.LightningModule.test_step` hook (make sure to use the hook parameters, self in this case).
+To add an (optional) test loop, add logic to the
+:func:`~pytorch_lightning.core.LightningModule.test_step` hook (make sure to use the hook parameters, ``batch`` and ``batch_idx`` in this case).

 .. testcode::

@@ -88,7 +91,7 @@ To add an (optional) test loop add logic to :func:`pytorch_lightning.core.Lightn
         loss = F.cross_entropy(y_hat, y)
         return loss

-.. note:: model.eval() and torch.no_grad() are called automatically for testing.
+.. note:: ``model.eval()`` and ``torch.no_grad()`` are called automatically for testing.

 The test loop will not be used until you call.

@@ -96,7 +99,7 @@ The test loop will not be used until you call.

     trainer.test()

-.. note:: .test() loads the best checkpoint automatically
+.. tip:: ``.test()`` loads the best checkpoint automatically

 6. Remove any .cuda() or to.device() calls
 ==========================================
diff --git a/docs/source/introduction_guide.rst b/docs/source/introduction_guide.rst
index c0954f20e3..5d46288904 100644
--- a/docs/source/introduction_guide.rst
+++ b/docs/source/introduction_guide.rst
@@ -98,8 +98,8 @@ Let's first start with the model. In this case we'll design a 3-layer neural net
         x = F.log_softmax(x, dim=1)
         return x

-Notice this is a :class:`~pytorch_lightning.core.LightningModule` instead of a `torch.nn.Module`. A LightningModule is
-equivalent to a pure PyTorch Module except it has added functionality. However, you can use it EXACTLY the same as you would a PyTorch Module.
+Notice this is a :class:`~pytorch_lightning.core.LightningModule` instead of a ``torch.nn.Module``. A LightningModule is
+equivalent to a pure PyTorch Module except it has added functionality. However, you can use it **EXACTLY** the same as you would a PyTorch Module.

 .. testcode::

@@ -274,8 +274,8 @@ Using DataModules allows easier sharing of full dataset definitions.
     model = LitModel(num_classes=imagenet_dm.num_classes)
     trainer.fit(model, imagenet_dm)

-.. note:: `prepare_data` is called only one 1 GPU in distributed training (automatically)
-.. note:: `setup` is called on every GPU (automatically)
+.. note:: ``prepare_data()`` is called on only one GPU in distributed training (automatically)
+.. note:: ``setup()`` is called on every GPU (automatically)

 Models defined by data
 ^^^^^^^^^^^^^^^^^^^^^^
@@ -292,10 +292,12 @@ When your models need to know about the data, it's best to process the data befo
     trainer.fit(model, dm)

-1. use `prepare_data` to download and process the dataset.
-2. use `setup` to do splits, and build your model internals
+1. use ``prepare_data()`` to download and process the dataset.
+2. use ``setup()`` to do splits, and build your model internals

-An alternative to using a DataModule is to defer initialization of the models modules to the `setup` method of your LightningModule as follows:
+|
+
+An alternative to using a DataModule is to defer initialization of the model's modules to the ``setup`` method of your LightningModule as follows:

 .. testcode::

@@ -326,7 +328,7 @@ In PyTorch we do it as follows:
     optimizer = Adam(LitMNIST().parameters(), lr=1e-3)

-In Lightning we do the same but organize it under the configure_optimizers method.
+In Lightning we do the same but organize it under the :func:`~pytorch_lightning.core.LightningModule.configure_optimizers` method.

 .. testcode::

@@ -379,8 +381,8 @@ In the case of MNIST we do the following
         optimizer.step()
         optimizer.zero_grad()

-In Lightning, everything that is in the training step gets organized under the `training_step` function
-in the LightningModule
+In Lightning, everything that is in the training step gets organized under the
+:func:`~pytorch_lightning.core.LightningModule.training_step` function in the LightningModule.

 .. testcode::

@@ -546,7 +548,7 @@ Or multiple nodes
 Refer to the :ref:`distributed computing guide for more details `.

-train on TPUs
+Train on TPUs
 ^^^^^^^^^^^^^

 Did you know you can use PyTorch on TPUs?
 It's very hard to do, but we've worked with the xla team to use their awesome library to get this to work

@@ -578,11 +580,11 @@ In distributed training (multiple GPUs and multiple TPU cores) each GPU or TPU c
 of this program. This means that without taking any care you will download the dataset N times which
 will cause all sorts of issues.

-To solve this problem, make sure your download code is in the `prepare_data` method in the DataModule.
+To solve this problem, make sure your download code is in the ``prepare_data`` method in the DataModule.
 In this method we do all the preparation we need to do once (instead of on every gpu).

-`prepare_data` can be called in two ways, once per node or only on the root node
-(`Trainer(prepare_data_per_node=False)`).
+``prepare_data`` can be called in two ways: once per node or only on the root node
+(``Trainer(prepare_data_per_node=False)``).

 .. code-block:: python

@@ -619,7 +621,7 @@ In this method we do all the preparation we need to do once (instead of on every
     def test_dataloader(self):
         return DataLoader(self.test_dataset, batch_size=self.batch_size)

-The `prepare_data` method is also a good place to do any data processing that needs to be done only
+The ``prepare_data`` method is also a good place to do any data processing that needs to be done only
 once (ie: download or tokenize, etc...).

 .. note:: Lightning inserts the correct DistributedSampler for distributed training. No need to add yourself!

@@ -657,7 +659,7 @@ Validating
 For most cases, we stop training the model when the performance on a validation split of the
 data reaches a minimum.

-Just like the `training_step`, we can define a `validation_step` to check whatever
+Just like the ``training_step``, we can define a ``validation_step`` to check whatever
 metrics we care about, generate samples or add more to our logs.

 .. code-block:: python

@@ -676,7 +678,7 @@ Now we can train with a validation loop as well.
     trainer = Trainer(tpu_cores=8)
     trainer.fit(model, train_loader, val_loader)

-You may have noticed the words `Validation sanity check` logged. This is because Lightning runs 2 batches
+You may have noticed the words **Validation sanity check** logged. This is because Lightning runs 2 batches
 of validation before starting to train. This is a kind of unit test to make sure that if you have a bug
 in the validation loop, you won't need to potentially wait a full epoch to find out.

@@ -744,7 +746,7 @@ Just like the validation loop, we define a test loop
 However, to make sure the test set isn't used inadvertently, Lightning has a separate API to run tests.

-Once you train your model simply call `.test()`.
+Once you train your model, simply call ``.test()``.

 .. code-block:: python

@@ -794,8 +796,8 @@ and use it for prediction.
     x = torch.randn(1, 1, 28, 28)
     out = model(x)

-On the surface, it looks like `forward` and `training_step` are similar. Generally, we want to make sure that
-what we want the model to do is what happens in the `forward`. whereas the `training_step` likely calls forward from
+On the surface, it looks like ``forward`` and ``training_step`` are similar. Generally, we want to make sure that
+what we want the model to do is what happens in the ``forward``, whereas the ``training_step`` likely calls forward from
 within it.

 .. testcode::

@@ -879,7 +881,7 @@ Or maybe we have a model that we use to do generation
     z = sample_noise()
     generated_imgs = model(z)

-How you split up what goes in `forward` vs `training_step` depends on how you want to use this model for
+How you split up what goes in ``forward`` vs ``training_step`` depends on how you want to use this model for
 prediction.

 ----------------

@@ -977,7 +979,7 @@ And pass the callbacks into the trainer
     Starting to init trainer!
     Trainer is init now

-.. note::
+.. tip::
     See full list of 12+ hooks in the :ref:`callbacks`.

 ----------------

@@ -1142,4 +1144,4 @@ the data to build your models.

 In Lightning this code is organized inside a :ref:`datamodules`.

-.. note:: DataModules are optional but encouraged, otherwise you can use standard DataModules
+.. tip:: DataModules are optional but encouraged; otherwise you can use standard DataLoaders
diff --git a/docs/source/new-project.rst b/docs/source/new-project.rst
index 6e2e8fa251..c617de9cdf 100644
--- a/docs/source/new-project.rst
+++ b/docs/source/new-project.rst
@@ -286,7 +286,7 @@ a forward method or trace only the sub-models you need.
 ********************
 Using CPUs/GPUs/TPUs
 ********************
-It's trivial to use CPUs, GPUs or TPUs in Lightning. There's NO NEED to change your code, simply change the :class:`~pytorch_lightning.trainer.Trainer` options.
+It's trivial to use CPUs, GPUs or TPUs in Lightning. There's **NO NEED** to change your code; simply change the :class:`~pytorch_lightning.trainer.Trainer` options.

 .. code-block:: python

@@ -377,6 +377,7 @@ If you prefer to do it manually, here's the equivalent
 Data flow
 *********
 Each loop (training, validation, test) has three hooks you can implement:
+
 - x_step
 - x_step_end
 - x_epoch_end
@@ -434,7 +435,7 @@ The lightning equivalent is:
         gpu_1_loss = losses[1]
         return (gpu_0_loss + gpu_1_loss) * 1/2

-The validation and test loops have the same structure.
+.. tip:: The validation and test loops have the same structure.

 -----------------

@@ -467,6 +468,10 @@ you can override the default behavior by manually setting the flags
     def training_step(self, batch, batch_idx):
         self.log('my_loss', loss, on_step=True, on_epoch=True, prog_bar=True, logger=True)

+.. note::
+    The loss value shown in the progress bar is smoothed (averaged) over the last values,
+    so it differs from the actual loss returned in the training/validation step.
+
 You can also use any method of your logger directly:

 .. code-block:: python

@@ -481,6 +486,10 @@ Once your training starts, you can view the logs by using your favorite logger o

    tensorboard --logdir ./lightning_logs

+.. note::
+    Lightning automatically shows the loss value returned from ``training_step`` in the progress bar,
+    so there is no need to log it explicitly with ``self.log('loss', loss, prog_bar=True)``.
+
 Read more about :ref:`loggers`.

 ----------------

@@ -668,8 +677,9 @@ Or read our :ref:`introduction_guide` to learn more!
 **********
 Community
 **********
-Out community of core maintainers and thousands of expert researchers is active on our Slack and Forum. Drop by to
-hang out, ask Lightning questions or even discuss research!
+Our community of core maintainers and thousands of expert researchers is active on our
+`Slack `_
+and `Forum `_. Drop by to hang out, ask Lightning questions or even discuss research!
 Masterclass
 ===========
diff --git a/docs/source/performance.rst b/docs/source/performance.rst
index 277414aebd..2edcfea80f 100644
--- a/docs/source/performance.rst
+++ b/docs/source/performance.rst
@@ -8,7 +8,7 @@ Here are some best practices to increase your performance.

 Dataloaders
 -----------
-When building your Dataloader set `num_workers` > 0 and `pin_memory=True` (only for GPUs).
+When building your DataLoader, set ``num_workers > 0`` and ``pin_memory=True`` (only for GPUs).

 .. code-block:: python

@@ -16,23 +16,23 @@ When building your Dataloader set `num_workers` > 0 and `pin_memory=True` (only

 num_workers
 ^^^^^^^^^^^
-The question of how many `num_workers` is tricky. Here's a summary of
+The question of how many ``num_workers`` to use is tricky. Here's a summary of
 some references, [`1 `_], and our suggestions.

-1. num_workers=0 means ONLY the main process will load batches (that can be a bottleneck).
-2. num_workers=1 means ONLY one worker (just not the main process) will load data but it will still be slow.
-3. The num_workers depends on the batch size and your machine.
-4. A general place to start is to set `num_workers` equal to the number of CPUs on that machine.
+1. ``num_workers=0`` means ONLY the main process will load batches (that can be a bottleneck).
+2. ``num_workers=1`` means ONLY one worker (just not the main process) will load data but it will still be slow.
+3. The ideal ``num_workers`` depends on the batch size and your machine.
+4. A general place to start is to set ``num_workers`` equal to the number of CPUs on that machine.

-.. warning:: Increasing num_workers will ALSO increase your CPU memory consumption.
+.. warning:: Increasing ``num_workers`` will ALSO increase your CPU memory consumption.

-The best thing to do is to increase the `num_workers` slowly and stop once you see no more improvement in your training speed.
+The best thing to do is to increase the ``num_workers`` slowly and stop once you see no more improvement in your training speed.

 Spawn
 ^^^^^
-When using `distributed_backend=ddp_spawn` (the ddp default) or TPU training, the way multiple GPUs/TPU cores are used is by calling `.spawn()` under the hood.
-The problem is that PyTorch has issues with `num_workers` > 0 when using .spawn(). For this reason we recommend you
-use `distributed_backend=ddp` so you can increase the `num_workers`, however your script has to be callable like so:
+When using ``distributed_backend=ddp_spawn`` (the ddp default) or TPU training, the way multiple GPUs/TPU cores are used is by calling ``.spawn()`` under the hood.
+The problem is that PyTorch has issues with ``num_workers > 0`` when using ``.spawn()``. For this reason we recommend you
+use ``distributed_backend=ddp`` so you can increase the ``num_workers``; however, your script has to be callable like so:

 .. code-block:: bash

@@ -42,7 +42,7 @@ use `distributed_backend=ddp` so you can increase the `num_workers`, however you

 .item(), .numpy(), .cpu()
 -------------------------
-Don't call .item() anywhere on your code. Use `.detach()` instead to remove the connected graph calls. Lightning
+Don't call ``.item()`` anywhere in your code. Use ``.detach()`` instead to remove the connected graph calls. Lightning
 takes a great deal of care to be optimized for this.

 ----------

@@ -67,7 +67,7 @@ LightningModules know what device they are on! Construct tensors on the device d

 For tensors that need to be model attributes, it is best practice to register them as buffers in the modules's
-`__init__` method:
+``__init__`` method:

 .. code-block:: python

@@ -87,25 +87,27 @@ DP performs three GPU transfers for EVERY batch:
 2. Copy data to device.
 3. Copy outputs of each device back to master.

+|
+
 Whereas DDP only performs 1 transfer to sync gradients. Because of this, DDP is MUCH faster than DP.

 ----------

 16-bit precision
 ----------------
-Use 16-bit to decrease the memory (and thus increase your batch size). On certain GPUs (V100s, 2080tis), 16-bit calculations are also faster.
+Use 16-bit to decrease the memory consumption (and thus increase your batch size). On certain GPUs (V100s, 2080tis), 16-bit calculations are also faster.
 However, know that 16-bit and multi-processing (any DDP) can have issues. Here are some common problems.

 1. `CUDA error: an illegal memory access was encountered `_. The solution is likely setting a specific CUDA, CUDNN, PyTorch version combination.
-2. `CUDA error: device-side assert triggered`. This is a general catch-all error. To see the actual error run your script like so:
+2. ``CUDA error: device-side assert triggered``. This is a general catch-all error. To see the actual error, run your script like so:

-  .. code-block:: bash
+.. code-block:: bash

-     # won't see what the error is
-     python main.py
+    # won't see what the error is
+    python main.py

-     # will see what the error is
-     CUDA_LAUNCH_BLOCKING=1 python main.py
+    # will see what the error is
+    CUDA_LAUNCH_BLOCKING=1 python main.py

-We also recommend using 16-bit native found in PyTorch 1.6. Just install this version and Lightning will automatically use it.
+.. tip:: We also recommend using the native 16-bit support found in PyTorch 1.6. Just install this version and Lightning will automatically use it.
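
To tie the performance tips above together, here is a minimal sketch of the pattern they describe, combining a multi-worker, pinned-memory DataLoader with 16-bit precision on a single GPU. It assumes a ``LitModel`` LightningModule and a ``train_dataset`` defined elsewhere; both names are placeholders.

.. code-block:: python

    import pytorch_lightning as pl
    from torch.utils.data import DataLoader

    # num_workers > 0 moves batch loading off the main process;
    # pin_memory=True speeds up host-to-GPU copies (only useful on GPU machines).
    train_loader = DataLoader(
        train_dataset,      # placeholder dataset defined elsewhere
        batch_size=64,
        num_workers=4,      # start near the number of CPUs and tune from there
        pin_memory=True,
    )

    # precision=16 enables 16-bit training; with PyTorch 1.6+ Lightning uses the native AMP backend.
    trainer = pl.Trainer(gpus=1, precision=16)
    trainer.fit(LitModel(), train_loader)  # LitModel is a placeholder LightningModule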