From d1f8b0f7669aa808df5fcc2934455be3c83b4bae Mon Sep 17 00:00:00 2001
From: Carlos Mocholí
Date: Sat, 30 Sep 2023 16:19:11 +0200
Subject: [PATCH] Bitsandbytes docs improvements (#18681)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Co-authored-by: Adrian Wälchli
---
 docs/source-fabric/fundamentals/precision.rst | 16 ++++++++--------
 .../common/precision_intermediate.rst         | 15 ++++++++-------
 2 files changed, 16 insertions(+), 15 deletions(-)

diff --git a/docs/source-fabric/fundamentals/precision.rst b/docs/source-fabric/fundamentals/precision.rst
index 2ec2ff5d73..5c57eee835 100644
--- a/docs/source-fabric/fundamentals/precision.rst
+++ b/docs/source-fabric/fundamentals/precision.rst
@@ -218,20 +218,20 @@ Quantization via Bitsandbytes
 Both 4-bit (`paper reference `__) and 8-bit (`paper reference `__) quantization is supported.
 Specifically, we support the following modes:
 
-* nf4: Uses the normalized float 4-bit data type. This is recommended over "fp4" based on the paper's experimental results and theoretical analysis.
-* nf4-dq: "dq" stands for "Double Quantization" which reduces the average memory footprint by quantizing the quantization constants. In average, this amounts to about 0.37 bits per parameter (approximately 3 GB for a 65B model).
-* fp4: Uses regular float 4-bit data type.
-* fp4-dq: "dq" stands for "Double Quantization" which reduces the average memory footprint by quantizing the quantization constants. In average, this amounts to about 0.37 bits per parameter (approximately 3 GB for a 65B model).
-* int8: Uses unsigned int8 data type.
-* int8-training: Meant for int8 activations with fp16 precision weights.
+
+* **nf4**: Uses the normalized float 4-bit data type. This is recommended over "fp4" based on the paper's experimental results and theoretical analysis.
+* **nf4-dq**: "dq" stands for "Double Quantization" which reduces the average memory footprint by quantizing the quantization constants. On average, this amounts to about 0.37 bits per parameter (approximately 3 GB for a 65B model).
+* **fp4**: Uses regular float 4-bit data type.
+* **fp4-dq**: "dq" stands for "Double Quantization" which reduces the average memory footprint by quantizing the quantization constants. On average, this amounts to about 0.37 bits per parameter (approximately 3 GB for a 65B model).
+* **int8**: Uses unsigned int8 data type.
+* **int8-training**: Meant for int8 activations with fp16 precision weights.
 
 While these techniques store weights in 4 or 8 bit, the computation still happens in 16 or 32-bit (float16, bfloat16, float32). This is configurable via the dtype argument in the plugin.
 
 Quantizing the model will dramatically reduce the weight's memory requirements but may have a negative impact on the model's performance or runtime.
 
-Fabric automatically replaces the :class:`torch.nn.Linear` layers in your model with their BNB alternatives.
-
+The :class:`~lightning.fabric.plugins.precision.bitsandbytes.BitsandbytesPrecision` automatically replaces the :class:`torch.nn.Linear` layers in your model with their BNB alternatives.
 
 .. code-block:: python
 
     from lightning.fabric.plugins import BitsandbytesPrecision
diff --git a/docs/source-pytorch/common/precision_intermediate.rst b/docs/source-pytorch/common/precision_intermediate.rst
index 3c190f1b04..bfa957b498 100644
--- a/docs/source-pytorch/common/precision_intermediate.rst
+++ b/docs/source-pytorch/common/precision_intermediate.rst
@@ -169,19 +169,20 @@ Quantization via Bitsandbytes
 Both 4-bit (`paper reference `__) and 8-bit (`paper reference `__) quantization is supported.
 Specifically, we support the following modes:
 
-* nf4: Uses the normalized float 4-bit data type. This is recommended over "fp4" based on the paper's experimental results and theoretical analysis.
-* nf4-dq: "dq" stands for "Double Quantization" which reduces the average memory footprint by quantizing the quantization constants. In average, this amounts to about 0.37 bits per parameter (approximately 3 GB for a 65B model).
-* fp4: Uses regular float 4-bit data type.
-* fp4-dq: "dq" stands for "Double Quantization" which reduces the average memory footprint by quantizing the quantization constants. In average, this amounts to about 0.37 bits per parameter (approximately 3 GB for a 65B model).
-* int8: Uses unsigned int8 data type.
-* int8-training: Meant for int8 activations with fp16 precision weights.
+
+* **nf4**: Uses the normalized float 4-bit data type. This is recommended over "fp4" based on the paper's experimental results and theoretical analysis.
+* **nf4-dq**: "dq" stands for "Double Quantization" which reduces the average memory footprint by quantizing the quantization constants. On average, this amounts to about 0.37 bits per parameter (approximately 3 GB for a 65B model).
+* **fp4**: Uses regular float 4-bit data type.
+* **fp4-dq**: "dq" stands for "Double Quantization" which reduces the average memory footprint by quantizing the quantization constants. On average, this amounts to about 0.37 bits per parameter (approximately 3 GB for a 65B model).
+* **int8**: Uses unsigned int8 data type.
+* **int8-training**: Meant for int8 activations with fp16 precision weights.
 
 While these techniques store weights in 4 or 8 bit, the computation still happens in 16 or 32-bit (float16, bfloat16, float32). This is configurable via the dtype argument in the plugin.
 
 Quantizing the model will dramatically reduce the weight's memory requirements but may have a negative impact on the model's performance or runtime.
 
-The Trainer automatically replaces the :class:`torch.nn.Linear` layers in your model with their BNB alternatives.
+The :class:`~lightning.pytorch.plugins.precision.bitsandbytes.BitsandbytesPrecisionPlugin` automatically replaces the :class:`torch.nn.Linear` layers in your model with their BNB alternatives.
 
 .. code-block:: python
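
For reference, here is a minimal usage sketch for the Fabric plugin documented in the first hunk. It assumes the constructor accepts the quantization mode strings listed above via a ``mode`` keyword and the compute precision via the ``dtype`` argument mentioned in the docs; the exact signature should be checked against the rendered API reference.

.. code-block:: python

    import torch

    from lightning.fabric import Fabric
    from lightning.fabric.plugins import BitsandbytesPrecision

    # Store nn.Linear weights as NF4 with double quantization ("nf4-dq");
    # computation runs in bfloat16. Keyword names `mode` and `dtype` are
    # assumptions based on the modes and dtype argument described above.
    precision = BitsandbytesPrecision(mode="nf4-dq", dtype=torch.bfloat16)

    # Fabric picks up the precision plugin through its `plugins` argument.
    fabric = Fabric(plugins=precision)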
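A corresponding sketch for the Trainer-side plugin referenced in the second hunk, under the same assumptions about the ``mode`` and ``dtype`` keywords; the import path follows the class reference given in the diff.

.. code-block:: python

    import torch

    from lightning.pytorch import Trainer
    from lightning.pytorch.plugins.precision.bitsandbytes import BitsandbytesPrecisionPlugin

    # Same nf4-dq quantization, wired into the Trainer via its `plugins` argument.
    precision = BitsandbytesPrecisionPlugin(mode="nf4-dq", dtype=torch.bfloat16)
    trainer = Trainer(plugins=precision)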