Add a performance section to TPU docs to address FAQ (#5445)

* header

* update docs

* punctuation

* adding another note

* some more notes

* Update docs/source/tpu.rst

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>

* punctuation

Co-authored-by: Lezwon Castelino <lezwon@gmail.com>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
Co-authored-by: chaton <thomas@grid.ai>
Adrian Wälchli 2021-01-11 14:12:38 +01:00 committed by GitHub
parent 93de5c8a40
commit 0192f0ce40
1 changed file with 25 additions and 4 deletions


@@ -40,7 +40,7 @@ To access TPUs, there are three main ways.
----------------
Colab TPUs
----------
Colab is like a jupyter notebook with a free GPU or TPU
hosted on GCP.
@@ -129,8 +129,7 @@ That's it! Your model will train on all 8 TPU cores.
----------------
TPU core training
-----------------
Lightning supports training on a single TPU core or 8 TPU cores.
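For illustration, a minimal sketch of both modes using the ``tpu_cores`` Trainer argument (the surrounding model and data setup is omitted):

.. code-block:: python

    import pytorch_lightning as pl

    # train on all 8 TPU cores
    trainer = pl.Trainer(tpu_cores=8)

    # or pin training to a single, specific core (here: core 1)
    trainer = pl.Trainer(tpu_cores=[1])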
@@ -177,7 +176,7 @@ on how to set up the instance groups and VMs needed to run TPU Pods.
----------------
16 bit precision
----------------
Lightning also supports training in 16-bit precision with TPUs.
By default, TPU training will use 32-bit precision. To enable 16-bit,
set the 16-bit flag.
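As a minimal sketch of what setting the 16-bit flag looks like (combined here with 8 TPU cores; the rest of the Trainer setup is omitted):

.. code-block:: python

    import pytorch_lightning as pl

    # with TPUs, precision=16 uses the bfloat16 type under the hood
    trainer = pl.Trainer(tpu_cores=8, precision=16)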
@@ -194,6 +193,28 @@ Under the hood the xla library will use the `bfloat16 type <https://en.wikipedia
----------------
Performance considerations
--------------------------

The TPU was designed for specific workloads and operations to carry out large volumes of matrix multiplication,
convolution operations and other commonly used ops in applied deep learning.
The specialization makes it a strong choice for NLP tasks, sequential convolutional networks, and low-precision operation.

There are cases in which training on TPUs is slower than on GPUs, for the possible reasons listed below:

- The batch size is too small.
- Explicit evaluation of tensors during training, e.g. ``tensor.item()`` (see the sketch after this list).
- Tensor shapes (e.g. model inputs) change often during training.
- Limited resources when using TPUs with PyTorch (`link <https://github.com/pytorch/xla/issues/2054#issuecomment-627367729>`_).
- XLA graph compilation during the initial steps (`reference <https://github.com/pytorch/xla/issues/2383#issuecomment-666519998>`_).
- Some tensor ops are not fully supported on TPU, or not supported at all. These operations will be performed on the CPU (context switch).
- The PyTorch integration is still experimental. Some performance bottlenecks may simply be the result of unfinished implementation.
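To make the ``tensor.item()`` point above concrete, here is a minimal sketch (``compute_loss`` is a hypothetical helper returning a scalar tensor): explicitly evaluating a tensor forces the lazily built XLA graph to be executed and the value copied back to the host.

.. code-block:: python

    def training_step(self, batch, batch_idx):
        loss = self.compute_loss(batch)  # hypothetical helper returning a scalar tensor

        # slower on TPU: ``.item()`` triggers graph evaluation and a device-to-host copy
        # self.log("train_loss", loss.item())

        # preferred: pass the tensor itself and let Lightning handle the reduction
        self.log("train_loss", loss)
        return loss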
The official PyTorch XLA `performance guide <https://github.com/pytorch/xla/blob/master/TROUBLESHOOTING.md#known-performance-caveats>`_
has more detailed information on how PyTorch code can be optimized for TPU. In particular, the
`metrics report <https://github.com/pytorch/xla/blob/master/TROUBLESHOOTING.md#get-a-metrics-report>`_ allows
one to identify operations that lead to context switching.
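As a small pointer for using that report (a sketch, assuming the ``torch_xla`` package is installed on the TPU machine):

.. code-block:: python

    import torch_xla.debug.metrics as met

    # counters such as ``aten::*`` entries reveal ops that fell back to the CPU,
    # and the compile/execute timings hint at frequent re-compilation
    print(met.metrics_report())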
About XLA
----------
XLA is the library that interfaces PyTorch with the TPUs.