Add a performance section to TPU docs to address FAQ (#5445)

* header

* update docs

* punctuation

* adding another note

* some more notes

* Update docs/source/tpu.rst

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>

* punctuation

Co-authored-by: Lezwon Castelino <lezwon@gmail.com>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
Co-authored-by: chaton <thomas@grid.ai>
Adrian Wälchli 2021-01-11 14:12:38 +01:00 committed by GitHub
parent 93de5c8a40
commit 0192f0ce40
1 changed file with 25 additions and 4 deletions


@@ -40,7 +40,7 @@ To access TPUs, there are three main ways.
----------------
Colab TPUs
----------
Colab is like a jupyter notebook with a free GPU or TPU
hosted on GCP.
@@ -129,8 +129,7 @@ That's it! Your model will train on all 8 TPU cores.
----------------
TPU core training
-----------------
Lightning supports training on a single TPU core or 8 TPU cores.
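For illustration, a minimal sketch of both modes using the ``tpu_cores`` Trainer argument (the surrounding model and data setup is omitted):

.. code-block:: python

    import pytorch_lightning as pl

    # train on all 8 TPU cores
    trainer = pl.Trainer(tpu_cores=8)

    # or pin training to a single, specific core (here: core 1)
    trainer = pl.Trainer(tpu_cores=[1])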
@@ -177,7 +176,7 @@ on how to set up the instance groups and VMs needed to run TPU Pods.
----------------
16 bit precision
----------------
Lightning also supports training in 16-bit precision with TPUs.
By default, TPU training will use 32-bit precision. To enable 16-bit,
set the 16-bit flag.
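As a minimal sketch of what setting the 16-bit flag looks like (combined here with 8 TPU cores; the rest of the Trainer setup is omitted):

.. code-block:: python

    import pytorch_lightning as pl

    # with TPUs, precision=16 uses the bfloat16 type under the hood
    trainer = pl.Trainer(tpu_cores=8, precision=16)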
@@ -194,6 +193,28 @@ Under the hood the xla library will use the `bfloat16 type <https://en.wikipedia
----------------
Performance considerations
--------------------------

The TPU was designed for specific workloads and operations to carry out large volumes of matrix multiplication,
convolution operations and other commonly used ops in applied deep learning.
The specialization makes it a strong choice for NLP tasks, sequential convolutional networks, and low-precision operation.

There are cases in which training on TPUs is slower than on GPUs, for the possible reasons listed below:

- The batch size is too small.
- Explicit evaluation of tensors during training, e.g. ``tensor.item()`` (see the sketch after this list).
- Tensor shapes (e.g. model inputs) change often during training.
- Limited resources when using TPUs with PyTorch (`link <https://github.com/pytorch/xla/issues/2054#issuecomment-627367729>`_).
- XLA graph compilation during the initial steps (`reference <https://github.com/pytorch/xla/issues/2383#issuecomment-666519998>`_).
- Some tensor ops are not fully supported on TPU, or not supported at all. These operations will be performed on the CPU (context switch).
- The PyTorch integration is still experimental. Some performance bottlenecks may simply be the result of unfinished implementation.
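To make the ``tensor.item()`` point above concrete, here is a minimal sketch (``compute_loss`` is a hypothetical helper returning a scalar tensor): explicitly evaluating a tensor forces the lazily built XLA graph to be executed and the value copied back to the host.

.. code-block:: python

    def training_step(self, batch, batch_idx):
        loss = self.compute_loss(batch)  # hypothetical helper returning a scalar tensor

        # slower on TPU: ``.item()`` triggers graph evaluation and a device-to-host copy
        # self.log("train_loss", loss.item())

        # preferred: pass the tensor itself and let Lightning handle the reduction
        self.log("train_loss", loss)
        return loss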
The official PyTorch XLA `performance guide <https://github.com/pytorch/xla/blob/master/TROUBLESHOOTING.md#known-performance-caveats>`_
has more detailed information on how PyTorch code can be optimized for TPU. In particular, the
`metrics report <https://github.com/pytorch/xla/blob/master/TROUBLESHOOTING.md#get-a-metrics-report>`_ allows
one to identify operations that lead to context switching.
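As a small pointer for using that report (a sketch, assuming the ``torch_xla`` package is installed on the TPU machine):

.. code-block:: python

    import torch_xla.debug.metrics as met

    # counters such as ``aten::*`` entries reveal ops that fell back to the CPU,
    # and the compile/execute timings hint at frequent re-compilation
    print(met.metrics_report())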
About XLA
----------
XLA is the library that interfaces PyTorch with the TPUs.