omitted mention of QuantizationAwareTraining callback (#17646)
Co-authored-by: bas <bas.krahmer@talentflyxpert.com>
This commit is contained in:
parent
deb7046496
commit
a37f5a546c
|
@ -20,7 +20,7 @@ Model Quantization
|
||||||
|
|
||||||
Model quantization is an efficient model optimization tool that can accelerate the model inference speed and decrease the memory load while still maintaining the model accuracy.
|
Model quantization is an efficient model optimization tool that can accelerate the model inference speed and decrease the memory load while still maintaining the model accuracy.
|
||||||
|
|
||||||
Different from the inherent model quantization callback "QuantizationAwareTraining" in PyTorch Lightning, Intel® Neural Compressor provides a convenient model quantization API to quantize the already-trained Lightning module with Post-training Quantization and Quantization Aware Training. This extension API exhibits the merits of an ease-of-use coding environment and multi-functional quantization options. The user can easily quantize their fine-tuned model by adding a few clauses to their original code. We only introduce post-training quantization in this document.
|
Intel® Neural Compressor provides a convenient model quantization API to quantize the already-trained Lightning module with Post-training Quantization and Quantization Aware Training. This extension API exhibits the merits of an ease-of-use coding environment and multi-functional quantization options. The user can easily quantize their fine-tuned model by adding a few clauses to their original code. We only introduce post-training quantization in this document.
|
||||||
|
|
||||||
There are two post-training quantization types in Intel® Neural Compressor, post-training static quantization and post-training dynamic quantization. Post-training dynamic quantization is a recommended starting point because it provides reduced memory usage and faster computation without additional calibration datasets. This type of quantization statically quantizes only the weights from floating point to integer at conversion time. This optimization provides latencies close to post-training static quantization. But the outputs of ops are still stored with the floating point, so the increased speed of dynamic-quantized ops is less than a static-quantized computation.
|
There are two post-training quantization types in Intel® Neural Compressor, post-training static quantization and post-training dynamic quantization. Post-training dynamic quantization is a recommended starting point because it provides reduced memory usage and faster computation without additional calibration datasets. This type of quantization statically quantizes only the weights from floating point to integer at conversion time. This optimization provides latencies close to post-training static quantization. But the outputs of ops are still stored with the floating point, so the increased speed of dynamic-quantized ops is less than a static-quantized computation.
|
||||||
|
|
||||||
|
|
Loading…
Reference in New Issue