Bitsandbytes docs improvements (#18681)

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Carlos Mocholí 2023-09-30 16:19:11 +02:00 committed by GitHub
parent df959aeb4f
commit d1f8b0f766
2 changed files with 16 additions and 15 deletions


@@ -218,20 +218,20 @@ Quantization via Bitsandbytes
Both 4-bit (`paper reference <https://arxiv.org/abs/2305.14314v1>`__) and 8-bit (`paper reference <https://arxiv.org/abs/2110.02861>`__) quantization is supported.
Specifically, we support the following modes:
* nf4: Uses the normalized float 4-bit data type. This is recommended over "fp4" based on the paper's experimental results and theoretical analysis.
* nf4-dq: "dq" stands for "Double Quantization", which reduces the average memory footprint by quantizing the quantization constants. On average, this amounts to about 0.37 bits per parameter (approximately 3 GB for a 65B model).
* fp4: Uses regular float 4-bit data type.
* fp4-dq: "dq" stands for "Double Quantization", which reduces the average memory footprint by quantizing the quantization constants. On average, this amounts to about 0.37 bits per parameter (approximately 3 GB for a 65B model).
* int8: Uses unsigned int8 data type.
* int8-training: Meant for int8 activations with fp16 precision weights.
* **nf4**: Uses the normalized float 4-bit data type. This is recommended over "fp4" based on the paper's experimental results and theoretical analysis.
* **nf4-dq**: "dq" stands for "Double Quantization", which reduces the average memory footprint by quantizing the quantization constants. On average, this amounts to about 0.37 bits per parameter (approximately 3 GB for a 65B model).
* **fp4**: Uses regular float 4-bit data type.
* **fp4-dq**: "dq" stands for "Double Quantization", which reduces the average memory footprint by quantizing the quantization constants. On average, this amounts to about 0.37 bits per parameter (approximately 3 GB for a 65B model).
* **int8**: Uses unsigned int8 data type.
* **int8-training**: Meant for int8 activations with fp16 precision weights.
While these techniques store weights in 4 or 8 bits, the computation still happens in 16 or 32 bits (float16, bfloat16, float32).
This is configurable via the dtype argument in the plugin.
Quantizing the model will dramatically reduce the weights' memory requirements but may have a negative impact on the model's performance or runtime.
Fabric automatically replaces the :class:`torch.nn.Linear` layers in your model with their BNB alternatives.
The :class:`~lightning.fabric.plugins.precision.bitsandbytes.BitsandbytesPrecision` automatically replaces the :class:`torch.nn.Linear` layers in your model with their BNB alternatives.
.. code-block:: python
from lightning.fabric.plugins import BitsandbytesPrecision
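
# --- Illustrative usage sketch; not part of the original commit. ---
# Assumes BitsandbytesPrecision takes a quantization mode string (one of the
# modes listed above) plus an optional compute dtype, and that the resulting
# plugin is passed to Fabric through the ``plugins`` argument.
import torch

from lightning.fabric import Fabric

# 4-bit NormalFloat with double quantization; the plugin picks a compatible compute dtype
precision = BitsandbytesPrecision(mode="nf4-dq")
fabric = Fabric(devices=1, plugins=precision)

# 8-bit weights with an explicit float16 compute dtype
precision = BitsandbytesPrecision(mode="int8-training", dtype=torch.float16)
fabric = Fabric(devices=1, plugins=precision)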


@@ -169,19 +169,20 @@ Quantization via Bitsandbytes
Both 4-bit (`paper reference <https://arxiv.org/abs/2305.14314v1>`__) and 8-bit (`paper reference <https://arxiv.org/abs/2110.02861>`__) quantization is supported.
Specifically, we support the following modes:
* nf4: Uses the normalized float 4-bit data type. This is recommended over "fp4" based on the paper's experimental results and theoretical analysis.
* nf4-dq: "dq" stands for "Double Quantization", which reduces the average memory footprint by quantizing the quantization constants. On average, this amounts to about 0.37 bits per parameter (approximately 3 GB for a 65B model).
* fp4: Uses regular float 4-bit data type.
* fp4-dq: "dq" stands for "Double Quantization", which reduces the average memory footprint by quantizing the quantization constants. On average, this amounts to about 0.37 bits per parameter (approximately 3 GB for a 65B model).
* int8: Uses unsigned int8 data type.
* int8-training: Meant for int8 activations with fp16 precision weights.
* **nf4**: Uses the normalized float 4-bit data type. This is recommended over "fp4" based on the paper's experimental results and theoretical analysis.
* **nf4-dq**: "dq" stands for "Double Quantization", which reduces the average memory footprint by quantizing the quantization constants. On average, this amounts to about 0.37 bits per parameter (approximately 3 GB for a 65B model).
* **fp4**: Uses regular float 4-bit data type.
* **fp4-dq**: "dq" stands for "Double Quantization", which reduces the average memory footprint by quantizing the quantization constants. On average, this amounts to about 0.37 bits per parameter (approximately 3 GB for a 65B model).
* **int8**: Uses unsigned int8 data type.
* **int8-training**: Meant for int8 activations with fp16 precision weights.
While these techniques store weights in 4 or 8 bits, the computation still happens in 16 or 32 bits (float16, bfloat16, float32).
This is configurable via the dtype argument in the plugin.
Quantizing the model will dramatically reduce the weights' memory requirements but may have a negative impact on the model's performance or runtime.
The Trainer automatically replaces the :class:`torch.nn.Linear` layers in your model with their BNB alternatives.
The :class:`~lightning.pytorch.plugins.precision.bitsandbytes.BitsandbytesPrecisionPlugin` automatically replaces the :class:`torch.nn.Linear` layers in your model with their BNB alternatives.
.. code-block:: python
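
# --- Illustrative usage sketch; not part of the original commit. ---
# Assumes the Trainer-side plugin is the BitsandbytesPrecisionPlugin class
# referenced above, takes the same mode/dtype arguments as the Fabric plugin,
# and is passed to the Trainer through the ``plugins`` argument.
import torch

from lightning.pytorch import Trainer
from lightning.pytorch.plugins.precision.bitsandbytes import BitsandbytesPrecisionPlugin

# 4-bit NormalFloat with double quantization; the plugin picks a compatible compute dtype
precision = BitsandbytesPrecisionPlugin(mode="nf4-dq")
trainer = Trainer(devices=1, plugins=precision)

# 8-bit weights with an explicit float16 compute dtype
precision = BitsandbytesPrecisionPlugin(mode="int8-training", dtype=torch.float16)
trainer = Trainer(devices=1, plugins=precision)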