From 0192f0ce403f1d62414c15c54b91392da5b7f0b2 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Adrian=20W=C3=A4lchli?=
Date: Mon, 11 Jan 2021 14:12:38 +0100
Subject: [PATCH] Add a performance section to TPU docs to address FAQ (#5445)

* header

* update docs

* punctuation

* adding another note

* some more notes

* Update docs/source/tpu.rst

Co-authored-by: Rohit Gupta

* punctuation

Co-authored-by: Lezwon Castelino
Co-authored-by: Rohit Gupta
Co-authored-by: chaton
---
 docs/source/tpu.rst | 29 +++++++++++++++++++++++++----
 1 file changed, 25 insertions(+), 4 deletions(-)

diff --git a/docs/source/tpu.rst b/docs/source/tpu.rst
index 5f4c48076d..549a3a1cd2 100644
--- a/docs/source/tpu.rst
+++ b/docs/source/tpu.rst
@@ -40,7 +40,7 @@ To access TPUs, there are three main ways.
 ----------------
 
 Colab TPUs
------------
+----------
 Colab is like a jupyter notebook with a free GPU or TPU
 hosted on GCP.
 
@@ -129,8 +129,7 @@ That's it! Your model will train on all 8 TPU cores.
 ----------------
 
 TPU core training
-
-------------------------
+-----------------
 
 Lightning supports training on a single TPU core or 8 TPU cores.
 
@@ -177,7 +176,7 @@ on how to set up the instance groups and VMs needed to run TPU Pods.
 ----------------
 
 16 bit precision
------------------
+----------------
 Lightning also supports training in 16-bit precision with TPUs.
 By default, TPU training will use 32-bit precision. To enable 16-bit,
 set the 16-bit flag.
@@ -194,6 +193,28 @@ Under the hood the xla library will use the `bfloat16 type
 `_
+- XLA Graph compilation during the initial steps `Reference `_
+- Some tensor ops are not fully supported on TPU, or not supported at all. These operations will be performed on CPU (context switch).
+- PyTorch integration is still experimental. Some performance bottlenecks may simply be the result of unfinished implementation.
+
+The official PyTorch XLA `performance guide `_
+has more detailed information on how PyTorch code can be optimized for TPU. In particular, the
+`metrics report `_ allows
+one to identify operations that lead to context switching.
+
+
 About XLA
 ----------
 XLA is the library that interfaces PyTorch with the TPUs.
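
For readers landing on this patch from the FAQ, the two knobs the new text refers to are the Trainer's ``precision=16`` flag and the XLA metrics report from ``torch_xla.debug.metrics``. Below is a minimal sketch of how they fit together; it assumes a TPU runtime with ``torch_xla`` installed and a Lightning release from roughly the same era as this patch, and the ``LitModel`` module plus the random dataset are illustrative placeholders rather than anything taken from the docs.

.. code-block:: python

    # Sketch only: requires a TPU runtime (e.g. Colab or GCP) with torch_xla
    # installed and pytorch-lightning ~1.1. ``LitModel`` and the random
    # dataset below are illustrative placeholders, not part of the patch.
    import torch
    from torch.utils.data import DataLoader, TensorDataset
    import pytorch_lightning as pl


    class LitModel(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(32, 2)

        def forward(self, x):
            return self.layer(x)

        def training_step(self, batch, batch_idx):
            x, y = batch
            return torch.nn.functional.cross_entropy(self(x), y)

        def configure_optimizers(self):
            return torch.optim.SGD(self.parameters(), lr=0.1)

        def on_train_end(self):
            # Runs inside the TPU process: print the XLA metrics report to spot
            # ops that fell back to CPU (the "context switches" the docs mention).
            import torch_xla.debug.metrics as met
            print(met.metrics_report())


    if __name__ == "__main__":
        data = TensorDataset(torch.randn(256, 32), torch.randint(0, 2, (256,)))
        loader = DataLoader(data, batch_size=32)

        # tpu_cores=8 trains on all 8 cores; precision=16 selects the 16-bit
        # (bfloat16 under XLA) path described in the patched section.
        trainer = pl.Trainer(tpu_cores=8, precision=16, max_epochs=1)
        trainer.fit(LitModel(), loader)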