Recent advances in low-bit quantization for LLMs, such as AQLM and AutoRound, now show acceptable levels of degradation on downstream tasks, especially for large models. That said, 2-bit quantization still introduces a noticeable accuracy loss in most cases.
One promising algorithm for low-bit quantization is VPTQ (Vector Post-Training Quantization; MIT license), proposed by Microsoft. It was released in October 2024 and has since shown excellent performance and efficiency in quantizing large models.
In this article, we will:
- Review the VPTQ quantization algorithm.
- Demonstrate how to use VPTQ models, many of which are already available. For instance, we can easily find low-bit variants of Llama 3.3 70B, Llama 3.1 405B, and Qwen2.5 72B (a minimal loading sketch follows this list).
- Evaluate these models and discuss the results to understand when VPTQ models can be a good choice for LLMs in production.
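Many of these models are published by the VPTQ-community organization on Hugging Face and can be loaded with the `vptq` package. Below is a minimal sketch based on the usage shown in the microsoft/VPTQ README; the model ID is one example repository, and the exact name of the variant you want may differ:

```python
# Minimal sketch: load and run a VPTQ-quantized model with the `vptq`
# package, following the usage shown in the microsoft/VPTQ README.
# The model ID below is an example from the VPTQ-community Hugging Face
# organization; check the hub for the exact variant you want.
import vptq
from transformers import AutoTokenizer

model_id = "VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-0-woft"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = vptq.AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Generate a short completion to sanity-check the quantized model.
inputs = tokenizer(
    "Explain vector quantization in one sentence.", return_tensors="pt"
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```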
Remarkably, 2-bit quantization with VPTQ nearly matches the performance of the original 16-bit model on tasks such as MMLU. Moreover, it makes it possible to run Llama 3.1 405B on a single GPU, while using less memory than a 70B model!
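A quick back-of-envelope check makes the memory claim plausible, assuming the comparison is against a 70B model in 16-bit and ignoring VPTQ's codebook/index overhead, the KV cache, and activations (real footprints are somewhat higher):

```python
# Rough weight-only memory estimates in GB, ignoring codebook/index
# overhead, the KV cache, and activation memory.
params_405b = 405e9
params_70b = 70e9

mem_405b_2bit = params_405b * 2 / 8 / 1e9   # 2 bits per weight -> ~101 GB
mem_70b_fp16 = params_70b * 16 / 8 / 1e9    # 16 bits per weight -> ~140 GB

print(f"Llama 3.1 405B @ 2-bit : ~{mem_405b_2bit:.0f} GB")
print(f"70B model @ 16-bit     : ~{mem_70b_fp16:.0f} GB")
```

In other words, the 2-bit 405B weights occupy roughly 101 GB, well under the ~140 GB needed just for the weights of a 70B model in 16-bit.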