2-bit VPTQ: 6.5x Smaller LLMs while Preserving 95% Accuracy

Very accurate 2-bit quantization for running 70B LLMs on a 24 GB GPU

Recent developments in low-bit quantization for LLMs, like AQLM and AutoRound, are now showing acceptable levels of degradation in downstream tasks, especially for large models. That said, 2-bit quantization still introduces noticeable accuracy loss in most cases.

One promising algorithm for low-bit quantization is VPTQ (MIT license), proposed by Microsoft. It was introduced in October 2024 and has since shown excellent performance and efficiency in quantizing large models.

In this article, we will:

Review the VPTQ quantization algorithm.
Demonstrate how to use VPTQ models, many of which are already available. For instance, we can easily find low-bit variants of Llama 3.3 70B, Llama 3.1 405B, and Qwen2.5 72B.
Evaluate these models and discuss the results to understand when VPTQ models can be a good choice for LLMs in production.

Remarkably, 2-bit quantization with VPTQ almost achieves performance comparable to the original 16-bit model on tasks such as MMLU. Moreover, it enables running Llama 3.1 405B on a single GPU, while using less memory than a 70B model!
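To make these memory figures concrete, here is a rough back-of-the-envelope sketch. It is only an estimate under stated assumptions: it counts weights only (no KV cache or activations), uses nominal parameter counts (70e9 and 405e9), treats 1 GB as 1e9 bytes, and reads the 70B comparison as being against a 16-bit model. The helper names are hypothetical and not part of VPTQ.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Weights-only footprint in GB (assumption: 1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def effective_bits(baseline_bits: float, compression_ratio: float) -> float:
    """Average bits per weight implied by a given compression ratio."""
    return baseline_bits / compression_ratio


# Nominal parameter counts (assumed round numbers, not exact model sizes).
llama_70b_fp16 = weight_memory_gb(70e9, 16)    # 16-bit baseline
llama_70b_2bit = weight_memory_gb(70e9, 2)     # ideal 2-bit, no overhead
llama_405b_2bit = weight_memory_gb(405e9, 2)   # ideal 2-bit, no overhead

# The headline 6.5x ratio is below the ideal 16/2 = 8x, which implies
# roughly 2.46 bits per weight on average once overhead is included.
bits_at_6_5x = effective_bits(16, 6.5)

print(f"70B  @ 16-bit: {llama_70b_fp16:.2f} GB")
print(f"70B  @ 2-bit:  {llama_70b_2bit:.2f} GB")
print(f"405B @ 2-bit:  {llama_405b_2bit:.2f} GB")
print(f"Effective bits at 6.5x: {bits_at_6_5x:.2f}")
```

Under these assumptions, an ideal 2-bit 70B model needs about 17.5 GB for its weights, which is why a 24 GB GPU becomes plausible. A 2-bit 405B model needs about 101 GB, which is below the 140 GB of a 16-bit 70B model. The gap between the ideal 8x and the reported 6.5x presumably comes from overhead such as codebooks and parts of the model kept in higher precision, but that attribution is an assumption here.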
