Add a performance section to TPU docs to address FAQ (#5445)
* header
* update docs
* punctuation
* adding another note
* some more notes
* Update docs/source/tpu.rst

  Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>

* punctuation

Co-authored-by: Lezwon Castelino <lezwon@gmail.com>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
Co-authored-by: chaton <thomas@grid.ai>
commit 0192f0ce40 (parent 93de5c8a40)
--- a/docs/source/tpu.rst
+++ b/docs/source/tpu.rst
@@ -40,7 +40,7 @@ To access TPUs, there are three main ways.
 ----------------

 Colab TPUs
------------
+----------
 Colab is like a jupyter notebook with a free GPU or TPU
 hosted on GCP.

@@ -129,8 +129,7 @@ That's it! Your model will train on all 8 TPU cores.
 ----------------

 TPU core training
-
-------------------------
+-----------------

 Lightning supports training on a single TPU core or 8 TPU cores.

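For reference, the single-core vs. 8-core choice mentioned in the hunk above is made through the Trainer's ``tpu_cores`` argument. A minimal sketch, assuming the Lightning 1.x Trainer API and a hypothetical ``MyModel`` LightningModule defined elsewhere:

.. code-block:: python

    import pytorch_lightning as pl

    model = MyModel()  # hypothetical LightningModule, not defined here

    # train on all 8 TPU cores
    trainer = pl.Trainer(tpu_cores=8)
    # or on a single core:
    # trainer = pl.Trainer(tpu_cores=1)
    # or on one specific core (e.g. core 5):
    # trainer = pl.Trainer(tpu_cores=[5])
    trainer.fit(model)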
@@ -177,7 +176,7 @@ on how to set up the instance groups and VMs needed to run TPU Pods.
 ----------------

 16 bit precision
------------------
+----------------
 Lightning also supports training in 16-bit precision with TPUs.
 By default, TPU training will use 32-bit precision. To enable 16-bit,
 set the 16-bit flag.
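A minimal sketch of the flag referred to above, assuming the Lightning 1.x Trainer API; on TPU, 16-bit maps to bfloat16 under the hood, as the surrounding docs note:

.. code-block:: python

    import pytorch_lightning as pl

    # 16-bit instead of the default 32-bit precision
    trainer = pl.Trainer(tpu_cores=8, precision=16)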
@@ -194,6 +193,28 @@ Under the hood the xla library will use the `bfloat16 type <https://en.wikipedia

 ----------------

+Performance considerations
+--------------------------
+
+The TPU was designed for specific workloads and operations to carry out large volumes of matrix multiplication,
+convolution operations, and other commonly used ops in applied deep learning.
+The specialization makes it a strong choice for NLP tasks, sequential convolutional networks, and low-precision operation.
+There are cases in which training on TPUs is slower than on GPUs, for the possible reasons listed below:
+
+- Too small a batch size.
+- Explicit evaluation of tensors during training, e.g. ``tensor.item()``
+- Tensor shapes (e.g. model inputs) change often during training.
+- Limited resources when using TPUs with PyTorch `Link <https://github.com/pytorch/xla/issues/2054#issuecomment-627367729>`_
+- XLA Graph compilation during the initial steps `Reference <https://github.com/pytorch/xla/issues/2383#issuecomment-666519998>`_
+- Some tensor ops are not fully supported on TPU, or not supported at all. These operations will be performed on CPU (context switch).
+- PyTorch integration is still experimental. Some performance bottlenecks may simply be the result of unfinished implementation.
+
+The official PyTorch XLA `performance guide <https://github.com/pytorch/xla/blob/master/TROUBLESHOOTING.md#known-performance-caveats>`_
+has more detailed information on how PyTorch code can be optimized for TPU. In particular, the
+`metrics report <https://github.com/pytorch/xla/blob/master/TROUBLESHOOTING.md#get-a-metrics-report>`_ allows
+one to identify operations that lead to context switching.
+
+
 About XLA
 ----------
 XLA is the library that interfaces PyTorch with the TPUs.
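For illustration, a hypothetical sketch of two of the caveats listed in the section added above: explicit tensor evaluation via ``tensor.item()`` and frequently changing tensor shapes. The function names here are illustrative only, not part of the Lightning docs:

.. code-block:: python

    import torch
    import torch.nn.functional as F

    def pad_to_fixed_length(tokens: torch.Tensor, max_len: int = 128) -> torch.Tensor:
        # Changing sequence lengths force XLA to recompile the graph for every
        # new shape; truncating/padding to one fixed length keeps a single graph.
        tokens = tokens[:max_len]
        return F.pad(tokens, (0, max_len - tokens.size(0)))

    def training_step(model, batch, running_loss):
        x, y = batch
        loss = F.cross_entropy(model(x), y)

        # Anti-pattern on TPU: ``.item()`` forces the lazily built XLA graph to
        # be evaluated (and the value copied to host) on every step.
        # running_loss += loss.item()

        # Preferred: accumulate as a tensor and read it back rarely, e.g. at
        # epoch end.
        running_loss = running_loss + loss.detach()
        return loss, running_loss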
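And a minimal sketch of pulling the metrics report mentioned above, assuming ``torch_xla`` is installed:

.. code-block:: python

    import torch_xla.debug.metrics as met

    # Print after a few training steps; ``aten::*`` counters flag ops that fell
    # back to CPU, i.e. the context switches described above.
    print(met.metrics_report())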