    AI Technology News

    Did Google’s TurboQuant Actually Solve AI Memory Crunch?

    By Editor Times Featured · April 2, 2026 · 11 Mins Read





    On March 25, 2026, Google Research published a blog post about a compression algorithm called TurboQuant.

    Within 48 hours, SK Hynix had lost 7.3% of its market value. Micron dropped 3%. Western Digital fell 4.7%. SanDisk gave up 5.7%. Kioxia, the Japanese flash memory company, dropped nearly 6%. The selloff spread across two continents, wiping out tens of billions in market cap.

    Cloudflare CEO Matthew Prince called it "Google's DeepSeek moment." Half the internet compared it to Pied Piper, the fictional startup from HBO's Silicon Valley. The memes moved faster than the actual research.

    So what actually happened? And does this algorithm change anything about the memory situation the AI industry has been panicking about for the past 18 months?

    Let’s decode.


    Why Modern AI Is So Hungry for Memory

    When an LLM generates text, it doesn't recompute everything from the beginning with every new word. Instead, it stores all its prior calculations in a fast-access buffer called the key-value cache, or KV cache. Every token the model has seen in a conversation gets saved there, so when the model processes the next token, it can look back at what came before without redoing all the math.
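    A minimal sketch of that mechanism, using NumPy and a toy single-head attention loop. All names and dimensions here are illustrative, not taken from any real model:

```python
import numpy as np

def attend(q, K, V):
    # Single-head attention: softmax similarity of the query against
    # every cached key, used to weight the cached values.
    scores = K @ q / np.sqrt(len(q))
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

d = 64
K_cache = np.empty((0, d))  # keys for every token seen so far
V_cache = np.empty((0, d))  # values for every token seen so far

rng = np.random.default_rng(0)
for _ in range(5):  # toy decode loop: one new token per step
    k, v, q = rng.normal(size=(3, d))
    # Append this token's key/value; history is never recomputed.
    K_cache = np.vstack([K_cache, k])
    V_cache = np.vstack([V_cache, v])
    out = attend(q, K_cache, V_cache)

print(K_cache.shape)  # (5, 64): the cache grows one row per token
```

    The point of the sketch is the shape of the loop: each step appends to the cache and attends over everything stored so far, which is exactly why the cache only ever grows.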

    The problem is that the cache grows continuously. A model working through a 100,000-token document is holding a massive amount of live data in GPU memory just to maintain context. And this got significantly worse when reasoning models went mainstream. Reasoning means long context, long context means a large KV cache, and a large KV cache means you need a lot of memory. By 2024, anyone paying attention to the trajectory of AI models could see where this was heading, but the market mostly didn't catch up until prices started reflecting it.
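    A back-of-envelope calculation makes the growth concrete. The model shape below is an assumed Llama-3-8B-like configuration (32 layers, 8 KV heads, head dimension 128, fp16), not a vendor-published spec:

```python
layers, kv_heads, head_dim, bytes_per_val = 32, 8, 128, 2
tokens = 100_000

per_token = layers * kv_heads * head_dim * bytes_per_val * 2  # keys AND values
total_gb = tokens * per_token / 1e9

print(f"{per_token // 1024} KiB per token -> {total_gb:.1f} GB of cache")
```

    At roughly 128 KB per token under these assumptions, the 100,000-token document alone occupies about 13 GB of GPU memory before the model weights are even counted.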

    [Figure: How the KV cache fills a GPU: short conversation vs. 100,000-token document]

    The industry has been fighting this problem for years, with real ingenuity, and TurboQuant is the latest step in that arc.


    What TurboQuant Is and How It Works

    TurboQuant compresses that KV cache down to 3 bits per value, from the standard 16. The claimed reduction is 6x in memory footprint, with an 8x speedup in attention computation on Nvidia H100 GPUs, and no measurable accuracy loss on benchmarks.

    The math works in two stages.

    The first stage, PolarQuant, converts data vectors from Cartesian coordinates into polar coordinates. In Cartesian form, a point is described by how far it sits along the X axis and Y axis: a grid of (x, y). In polar form, the same point is described by its distance from the origin (r) and the angle it makes with a reference direction (θ). The conversion is r = √(x² + y²) and θ = arctan(y/x); going back, x = r·cos(θ) and y = r·sin(θ). The same principle extends to higher dimensions.
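    The conversion is simple enough to verify directly (math.atan2 is used in place of a raw arctan(y/x) so every quadrant maps correctly):

```python
import math

def to_polar(x, y):
    r = math.hypot(x, y)      # r = sqrt(x^2 + y^2)
    theta = math.atan2(y, x)  # angle from the reference direction
    return r, theta

def to_cartesian(r, theta):
    return r * math.cos(theta), r * math.sin(theta)

# Round-trip the classic 3-4-5 point.
r, theta = to_polar(3.0, 4.0)
x, y = to_cartesian(r, theta)
print(r, round(x, 6), round(y, 6))  # 5.0 3.0 4.0
```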

    This matters for compression because in polar space, the angular distribution of AI attention data clusters in predictable, concentrated patterns. Traditional quantization methods have to store extra normalization constants alongside the compressed data so the system can decompress accurately later. Those constants add one or two bits per value right back in, partially undoing the savings. PolarQuant eliminates that overhead because the structure of the data in polar space makes the constants unnecessary.

    [Figure: How Cartesian data clusters in polar space to enable KV cache compression]

    The second stage handles the residual error left over from stage one. Each leftover error value gets reduced to a single sign bit, positive or negative. That sign bit acts as a statistical zero-bias corrector, meaning the compressed cache stays equivalent to the full-precision original when the model computes attention scores. The model doesn't notice the difference.
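    A toy sketch of the two-stage idea. The coarse rounding step below is a stand-in for PolarQuant (whose details the post doesn't spell out), so treat this only as an illustration of how storing one sign bit per residual shrinks the error:

```python
import numpy as np

rng = np.random.default_rng(1)
v = rng.normal(size=10_000)  # pretend these are full-precision cache values

# Stage one stand-in: a coarse uniform quantizer.
step = 0.5
coarse = np.round(v / step) * step

# Stage two: store only the SIGN of each residual, then correct by the
# residuals' average magnitude at decode time.
resid = v - coarse
recon = coarse + np.sign(resid) * np.abs(resid).mean()

mae_coarse = np.abs(v - coarse).mean()
mae_recon = np.abs(v - recon).mean()
print(mae_recon < mae_coarse)  # True: one extra bit per value cuts the error
```

    The correction is unbiased on average, which is the property the article describes: attention scores computed from the corrected cache match the full-precision ones in expectation.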

    Google tested TurboQuant on five standard benchmarks for long-context models, including LongBench and Needle in a Haystack, using Gemma, Mistral, and Llama. At 3 bits, it matched or beat KIVI, the standard baseline for KV cache quantization. On needle-in-a-haystack tasks, where the model has to locate a specific fact buried in a long document, it hit perfect scores at 6x compression.


    The Crunch That Was Years in the Making

    The reason a compression paper could move the memory chip market by 6% in two days is that the memory situation going into 2026 was already extreme. To understand it, you have to go back to 2023.

    In 2023, memory manufacturers were losing money. DRAM prices had collapsed after the pandemic oversupply, and Samsung, SK Hynix, and Micron all pulled back on capital expenditure. They weren't building new fabs because there was no margin to justify it. But that pullback coincided precisely with the start of the reasoning-model era, which was about to create a demand curve nobody in this industry had seen before.

    Let's look at why AI is so hard on memory. A GPU needs data to move at extreme speeds to keep its processors fed. An HBM4 stack, the type of memory used in Nvidia's latest chips, transfers data at roughly 2.5 terabytes per second. A comparable amount of standard DDR5, the memory in your laptop, manages somewhere around 64 to 128 gigabytes per second. Consumer memory is built for a completely different job.
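    That gap, as a plain ratio of the figures quoted above:

```python
hbm4_gbps = 2500  # ~2.5 TB/s per HBM4 stack, per the figure above
ddr5_gbps = 128   # upper end of the DDR5 range quoted above
print(f"~{hbm4_gbps / ddr5_gbps:.0f}x")  # ~20x even against DDR5's best case
```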

    [Figure: HBM4 vs. DDR5 memory bandwidth: why AI GPUs need 2.5 TB/s while laptops get 128 GB/s]

    HBM is built differently: stacked in multiple layers, connected by thousands of micro-connections called through-silicon vias, and extraordinarily expensive to produce. Producing one gigabyte of HBM consumes four times the wafer capacity of standard DRAM. To put that in GPU terms: a single Nvidia H100 currently costs between $25,000 and $30,000 per chip, and memory accounts for roughly 30% of the cost of deploying AI at scale. When Meta built its initial H100 training cluster with 24,000 of those chips, the GPU hardware bill alone crossed $800 million, before a single power cable was run or a server rack assembled. That is one cluster; hyperscalers are building dozens. Of the $600 billion in combined Big Tech capital spending this year, roughly $180 billion goes to memory alone.

    People often make the "just make more memory" argument. Global silicon wafer production capacity is growing, but only at around 6 to 7% per year, while AI infrastructure spending is growing at rates many times that. The fabs that will eventually close the gap started construction after the demand signal hit, which means meaningful new capacity doesn't come online until 2027-2028, and the crunch could potentially last until 2030.
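    A rough compounding sketch shows why the gap can't close quickly. The 40% demand growth rate is an assumed figure for illustration only; the text just says demand grows at "many times" the supply rate:

```python
supply = demand = 1.0  # normalize both to 1.0 at the start of 2026
for year in range(2026, 2031):
    supply *= 1.07  # ~7%/yr wafer capacity growth
    demand *= 1.40  # assumed AI infrastructure spend growth
    print(year, f"demand/supply = {demand / supply:.2f}")
```

    Under these assumptions the shortfall nearly quadruples by 2030, which is why the timeline depends on when new fabs land rather than on incremental capacity growth.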


    The Compression Arms Race That Was Already Happening

    The industry has been chipping away at the KV cache memory problem for years. GPT-2 XL, the largest 2019 variant, used the simplest possible design: every attention head kept its own independent set of keys and values. Cost: around 300 kilobytes per token. By 2024, Llama 3 8B had introduced grouped-query attention, where multiple heads share the same stored representations instead of maintaining separate copies. The cost dropped to 128 kilobytes per token, less than half, with almost no quality loss on benchmarks. Then DeepSeek V3 went further with multi-head latent attention, compressing the key-value pairs into a lower-dimensional form before storing them and decompressing at inference time. Cost: 68.6 kilobytes per token, on a model with 671 billion total parameters, though only 37 billion are active at any moment.
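    Those three per-token figures fall straight out of the architectures' shapes. The configurations below are approximately the published ones (the MLA latent dimensions are taken from DeepSeek's public V3 config), and they land on the same 300 / 128 / ~69 KB progression:

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_val=2):
    # Keys AND values stored per token, at 2 bytes per value (fp16).
    return layers * kv_heads * head_dim * bytes_per_val * 2

gpt2_xl = kv_bytes_per_token(48, 25, 64)    # full multi-head attention
llama3_8b = kv_bytes_per_token(32, 8, 128)  # grouped-query attention
# DeepSeek V3's MLA stores one compressed latent (512 dims) plus a
# small positional component (64 dims) per LAYER, not per head:
deepseek_v3 = 61 * (512 + 64) * 2

print(gpt2_xl // 1024, llama3_8b // 1024, deepseek_v3 / 1024)  # 300 128 68.625
```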

    [Figure: KV cache per token, from GPT-2 XL to Llama 3 to DeepSeek V3, and the Shannon limit TurboQuant is approaching]

    That progression, 300 to 128 to 68 kilobytes per token, is the compression arc that existed before TurboQuant showed up. Each step traded something, usually some architectural complexity or slight recall degradation, for meaningful memory savings. Each step also captured the easier gains first. What remained got harder.

    So by the time TurboQuant arrived, the low-hanging fruit was gone. TurboQuant matters less because it saves extra memory and more because it marks the point where KV cache compression approaches the information-theoretic limit. You are close to the Shannon ceiling. Every additional bit squeezed out from here costs more engineering effort and risks more quality degradation than the last.

    There is also a problem no compression algorithm touches. When the KV cache grows too large for available GPU memory, models often summarize their own context into a shorter form and continue from the summary. That compression is lossy in ways the model can't detect. A specific budget figure becomes "roughly that amount." A nuanced instruction becomes "something about guidelines." The model keeps going, confident in information that no longer fully exists. Compression makes the cache smaller. It doesn't solve the problem of deciding what is actually worth keeping.


    So Why the Market Reaction Was Wrong

    The stocks fell for the same reason markets often overreact to technical announcements: most investors read the headline, not the paper.

    TurboQuant only addresses inference memory, specifically the KV cache during inference. Training a model, the months-long, multi-billion-dollar process of teaching the model in the first place, requires fundamentally different memory, driven by activations, gradients, and optimizer states. TurboQuant has zero effect on any of that. The massive HBM buildout that hyperscalers are funding exists primarily to train and retrain ever-larger models. That demand curve is untouched by a KV cache compression algorithm.

    Beyond training, TurboQuant is a research result with no production deployment. The paper was originally published in 2025 and got re-featured on the blog ahead of ICLR. Google itself hasn't deployed it widely in the year since the math was first documented.

    The 6x headline also deserves scrutiny. It is benchmarked against 16-bit full precision, but commercial inference already runs at 4 or 8 bits as standard practice. So the real marginal gain over deployed systems is smaller than the number suggests.
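    The arithmetic of that marginal gain, measuring the same 3-bit cache against three different baselines:

```python
for baseline_bits in (16, 8, 4):
    print(f"{baseline_bits}-bit -> 3-bit: {baseline_bits / 3:.1f}x smaller")
```

    Against fp16 the ratio is about 5.3x; against a 4-bit cache already in production, it shrinks to about 1.3x.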

    Jevons Paradox is another thing worth talking about. When DeepSeek introduced dramatically more efficient inference in early 2025, the same fear spread: HBM demand would collapse. It didn't, because cheaper inference expanded the set of organizations that could economically deploy AI, which drove more total demand for infrastructure. When inference costs fall, more applications become viable, more models stay active, and memory companies end up as the long-run beneficiary.

    [Figure: Jevons Paradox in AI memory: how DeepSeek and TurboQuant both drove higher HBM demand despite efficiency gains]

    The market has now seen this exact movie twice, and panicked both times. Weird, right?



    So What TurboQuant Actually Changes

    The algorithm does have real implications. They are just different from what the market priced in.

    The most immediate is inference economics. TurboQuant compresses the KV cache, which determines how many concurrent users a single GPU can serve and how long a context window is practical at scale. If it gets deployed across production inference stacks, throughput per GPU increases. That matters for AI products running millions of queries per day, where inference cost is the recurring expense that determines profitability. Anything that changes the memory-to-compute ratio per query shifts the cost structure of running AI products.
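    A sketch of how that throughput math works. Every number here is assumed for illustration (an 80 GB card with 16 GB reserved for weights, 128 KB of fp16 cache per token, 32,000-token sessions); none come from the paper:

```python
free_bytes = (80 - 16) * 10**9   # cache budget after model weights
per_token_fp16 = 128 * 1024      # assumed fp16 KV cache per token
session_tokens = 32_000          # assumed context length per user

for label, ratio in [("fp16 cache", 1), ("~6x compressed", 6)]:
    per_session = session_tokens * per_token_fp16 / ratio
    print(label, int(free_bytes // per_session), "concurrent sessions")
```

    Under these assumptions, the same card goes from serving about 15 concurrent long-context sessions to about 91, which is the kind of shift that changes per-query cost.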

    The longer-term implication is on-device AI. Right now, running a capable language model locally on a phone or laptop requires either compromising on quality or buying expensive hardware. If TurboQuant's approach gets implemented in local inference runtimes at scale, the hardware floor for running a meaningful AI model drops. Models that currently require cloud infrastructure could run locally. But that plays out over years, not quarters, and it has more to do with software ecosystem adoption than with whether memory chip stocks are correctly priced today.

    TurboQuant is real math that compresses one specific kind of memory usage during one phase of AI operation. But it doesn't build fabs and it doesn't change training economics. Memory gets built in clean rooms in South Korea and Idaho, by people operating tools that cost hundreds of millions of dollars each. That part of the supply chain moves on a completely different clock than an algorithm (or, for now, just a research paper).

    So the crunch only ends when the fabs are done.


