Add a performance section to TPU docs to address FAQ (#5445)
* header
* update docs
* punctuation
* adding another note
* some more notes
* Update docs/source/tpu.rst

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>

* punctuation

Co-authored-by: Lezwon Castelino <lezwon@gmail.com>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
Co-authored-by: chaton <thomas@grid.ai>
commit 0192f0ce40 (parent 93de5c8a40)
@@ -40,7 +40,7 @@ To access TPUs, there are three main ways.
 ----------------
 
 Colab TPUs
------------
+----------
 Colab is like a jupyter notebook with a free GPU or TPU
 hosted on GCP.
 
@@ -129,8 +129,7 @@ That's it! Your model will train on all 8 TPU cores.
 ----------------
 
 TPU core training
-
-------------------------
+-----------------
 
 Lightning supports training on a single TPU core or 8 TPU cores.
 
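The "TPU core training" section this hunk retitles describes choosing between a single core and all 8 cores. A minimal sketch of what that looks like with the ``Trainer`` flag from this version of the docs; ``MyModel`` is a placeholder for any ``LightningModule``, not a name taken from the page:

.. code-block:: python

    import pytorch_lightning as pl

    model = MyModel()  # any LightningModule; the name is illustrative

    # train on all 8 TPU cores
    trainer = pl.Trainer(tpu_cores=8)
    trainer.fit(model)

    # or pin training to one specific core by passing its index in a list
    trainer = pl.Trainer(tpu_cores=[1])
    trainer.fit(model)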
@@ -177,7 +176,7 @@ on how to set up the instance groups and VMs needed to run TPU Pods.
 ----------------
 
 16 bit precision
------------------
+----------------
 Lightning also supports training in 16-bit precision with TPUs.
 By default, TPU training will use 32-bit precision. To enable 16-bit,
 set the 16-bit flag.
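The "set the 16-bit flag" sentence maps to the ``precision`` argument of the ``Trainer``. A hedged sketch, reusing the same placeholder model as above:

.. code-block:: python

    import pytorch_lightning as pl

    model = MyModel()  # placeholder LightningModule

    # default: 32-bit precision on TPU
    trainer = pl.Trainer(tpu_cores=8)

    # enable 16-bit; on TPU this is backed by the bfloat16 type mentioned in the next hunk
    trainer = pl.Trainer(tpu_cores=8, precision=16)
    trainer.fit(model)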
@@ -194,6 +193,28 @@ Under the hood the xla library will use the `bfloat16 type <https://en.wikipedia
 
 ----------------
 
+Performance considerations
+--------------------------
+
+The TPU was designed for specific workloads and operations to carry out large volumes of matrix multiplication,
+convolution operations and other commonly used ops in applied deep learning.
+The specialization makes it a strong choice for NLP tasks, sequential convolutional networks, and under low precision operation.
+There are cases in which training on TPUs is slower when compared with GPUs, for possible reasons listed:
+
+- Too small batch size.
+- Explicit evaluation of tensors during training, e.g. ``tensor.item()``
+- Tensor shapes (e.g. model inputs) change often during training.
+- Limited resources when using TPU's with PyTorch `Link <https://github.com/pytorch/xla/issues/2054#issuecomment-627367729>`_
+- XLA Graph compilation during the initial steps `Reference <https://github.com/pytorch/xla/issues/2383#issuecomment-666519998>`_
+- Some tensor ops are not fully supported on TPU, or not supported at all. These operations will be performed on CPU (context switch).
+- PyTorch integration is still experimental. Some performance bottlenecks may simply be the result of unfinished implementation.
+
+The official PyTorch XLA `performance guide <https://github.com/pytorch/xla/blob/master/TROUBLESHOOTING.md#known-performance-caveats>`_
+has more detailed information on how PyTorch code can be optimized for TPU. In particular, the
+`metrics report <https://github.com/pytorch/xla/blob/master/TROUBLESHOOTING.md#get-a-metrics-report>`_ allows
+one to identify operations that lead to context switching.
+
+
 About XLA
 ----------
 XLA is the library that interfaces PyTorch with the TPUs.
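To make the ``tensor.item()`` and changing-shape bullets of the new section concrete, here is an illustrative sketch; the module, the padding helper, and the sizes are assumptions and not part of the docs, while the metrics-report call at the end is the one the linked troubleshooting guide describes:

.. code-block:: python

    import torch
    import torch.nn.functional as F
    import pytorch_lightning as pl
    import torch_xla.debug.metrics as met

    class TPUFriendlyModel(pl.LightningModule):  # illustrative module, not from the docs
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(128, 10)

        def forward(self, x):
            return self.layer(x)

        def training_step(self, batch, batch_idx):
            x, y = batch
            loss = F.cross_entropy(self(x), y)
            # avoid `loss.item()` here: it forces the lazy XLA graph to execute every step
            self.log("train_loss", loss)  # log the tensor and let Lightning reduce it
            return loss

        def configure_optimizers(self):
            return torch.optim.SGD(self.parameters(), lr=0.01)

    # variable-length inputs cause XLA to recompile; padding batches to a fixed
    # length in the dataloader keeps tensor shapes static across steps
    def pad_to_fixed_length(seq, max_len=128):
        return F.pad(seq, (0, max_len - seq.size(-1)))

    # after a few steps, the metrics report from the linked troubleshooting guide
    # shows compile counts and ops that fell back to the CPU (context switches)
    print(met.metrics_report())

On a TPU machine this would typically be launched with ``Trainer(tpu_cores=8)`` as above; the point of the sketch is only to show where graph evaluation and recompilation creep in.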