Recent advances in low-bit quantization for LLMs, such as AQLM and AutoRound, now show acceptable levels of degradation on downstream tasks, especially for large models. That said, 2-bit quantization still introduces a noticeable accuracy loss in most cases.
One promising algorithm for low-bit quantization is VPTQ (Vector Post-Training Quantization; MIT license), proposed by Microsoft. It was released in October 2024 and has since shown excellent performance and efficiency in quantizing large models.
In this article, we will:
- Review the VPTQ quantization algorithm.
- Demonstrate how to use VPTQ models, many of which are already available. For instance, we can easily find low-bit variants of Llama 3.3 70B, Llama 3.1 405B, and Qwen2.5 72B (a minimal loading sketch follows this list).
- Evaluate these models and discuss the results to understand when VPTQ models can be a good choice for LLMs in production.
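Many of these models are published by the VPTQ-community organization on Hugging Face and can be loaded with the `vptq` package. Below is a minimal sketch based on the usage shown in the microsoft/VPTQ README; the model ID is one example repository, and the exact name of the variant you want may differ:

```python
# Minimal sketch: load and run a VPTQ-quantized model with the `vptq`
# package, following the usage shown in the microsoft/VPTQ README.
# The model ID below is an example from the VPTQ-community Hugging Face
# organization; check the hub for the exact variant you want.
import vptq
from transformers import AutoTokenizer

model_id = "VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-0-woft"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = vptq.AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Generate a short completion to sanity-check the quantized model.
inputs = tokenizer(
    "Explain vector quantization in one sentence.", return_tensors="pt"
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```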
Remarkably, 2-bit quantization with VPTQ nearly matches the performance of the original 16-bit model on tasks such as MMLU. Moreover, it makes it possible to run Llama 3.1 405B on a single GPU, while using less memory than a 70B model!
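A quick back-of-envelope check makes the memory claim plausible, assuming the comparison is against a 70B model in 16-bit and ignoring VPTQ's codebook/index overhead, the KV cache, and activations (real footprints are somewhat higher):

```python
# Rough weight-only memory estimates in GB, ignoring codebook/index
# overhead, the KV cache, and activation memory.
params_405b = 405e9
params_70b = 70e9

mem_405b_2bit = params_405b * 2 / 8 / 1e9   # 2 bits per weight -> ~101 GB
mem_70b_fp16 = params_70b * 16 / 8 / 1e9    # 16 bits per weight -> ~140 GB

print(f"Llama 3.1 405B @ 2-bit : ~{mem_405b_2bit:.0f} GB")
print(f"70B model @ 16-bit     : ~{mem_70b_fp16:.0f} GB")
```

In other words, the 2-bit 405B weights occupy roughly 101 GB, well under the ~140 GB needed just for the weights of a 70B model in 16-bit.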