Prefill Is Compute-Bound. Decode Is Memory-Bound. Why Your GPU Shouldn’t Do Both.

a big enterprise measurement a Kubernetes cluster for real-time inference on their customer-facing LLM product. We began with 64 H100 SXM GPUs throughout 8 nodes, all working vLLM in monolithic mode. The outcomes had been not the place we’d like them to be. Throughout prefill bursts the tensor cores hit 92% utilization. Ten milliseconds later, throughout decode, the identical GPUs cratered to twenty-eight%. We had been paying for 64 H100s and getting significant work out of perhaps 20 of them for 90% of every request’s lifetime. The finance group wished to know why the GPU invoice seemed like we had been coaching a basis mannequin after we had been simply serving one.

A Llama 70B mannequin working inference on an H100 GPU hits 92% compute utilization throughout prefill. Thirty milliseconds later, throughout decode, that very same GPU drops to 30%. The {hardware} hasn’t modified. The mannequin weights are an identical. The arithmetic depth of the workload fell by 5x between one section and the following.

That mismatch sits beneath each inference value downside, and most serving architectures fake it doesn’t exist.

Each time a big language mannequin processes a request, it does two various things. First, it reads your whole immediate in parallel, filling a key-value cache with the eye state of each enter token. That’s the prefill section. Then it generates output tokens separately, studying from that cache on every step. That’s decode. Prefill is a matrix multiplication downside. Decode is a reminiscence bandwidth downside. They’ve reverse {hardware} necessities, and the usual follow is to run each on the identical pool of GPUs.

That commonplace follow is dear. The GPU that’s completely sized in your prefill burst is massively overprovisioned for decode. The GPU that’s cost-efficient for decode can’t sustain with prefill. You pay for the worst case in each instructions.

Disaggregated inference splits these two phases onto separate {hardware} swimming pools, every sized for what it truly does. The concept surfaced in a 2024 OSDI paper called DistServe, out of UC San Diego’s Hao AI Lab. Eighteen months later, Perplexity runs it in manufacturing. Meta, LinkedIn, and Mistral serve visitors by means of it. NVIDIA constructed a complete framework (Dynamo) round it. vLLM, SGLang, and TensorRT-LLM all help it natively.

Right here’s the way it works, the place the trade-offs are, and whenever you shouldn’t trouble.

The 2 phases of inference aren’t the identical workload

In the event you’ve run inference on any transformer mannequin, you’ve already encountered each phases. However most monitoring dashboards flatten them right into a single “inference” metric, which hides the issue.

Prefill processes all enter tokens concurrently. For a 4,096-token immediate, the eye computation includes giant matrix multiplications throughout the total sequence size. That is compute-bound work. The GPU’s tensor cores are the bottleneck. On an H100 SXM, prefill achieves 200-400 arithmetic operations per byte of reminiscence accessed. Utilization sits between 90% and 95%. The reminiscence bandwidth, at 3.35 TB/s, is barely taxed.

Decode generates one token at a time. Every step reads your entire KV-cache from HBM to compute a single consideration output. The tensor cores end in microseconds after which watch for the following reminiscence learn. Arithmetic depth drops to 60-80 ops/byte. GPU utilization falls to 20-40%. The tensor cores sit idle whereas the reminiscence bus saturates.

These aren’t tough estimates. The InfoQ technical evaluation from September 2025 measured prefill at 90-95% utilization and decode at 20-40%, with 3-4x higher vitality effectivity per operation throughout prefill. The SPAD paper from UT Austin simulated an H100 and located that lowering reminiscence bandwidth by 40% solely elevated prefill latency by 17%, as a result of prefill doesn’t use the bandwidth. Decreasing compute capability by 50% solely elevated decode latency by 22%, as a result of decode doesn’t use the compute.

Two workloads. Reverse bottlenecks. Identical GPU.

Determine 1. LLM inference knowledge circulate with consideration matrix element. Prefill computes a full S x S consideration matrix per head per layer (16.7M multiply-adds at S=4,096), saturating tensor cores at 200-400 ops/byte. Decode computes a single 1 x (S+t) dot product per head, studying your entire KV-cache from HBM every step. The arithmetic drops by orders of magnitude whereas reminiscence reads keep the identical.
[Source: Author (SVG created using Inkscape)]

What monolithic serving truly prices you

In a normal vLLM or TensorRT-LLM deployment, a single GPU pool handles each phases. The scheduler interleaves prefill and decode requests throughout the identical batch.

The rapid downside is interference. When a brand new prefill request enters the batch, energetic decode requests have to attend. Prefill is compute-heavy and takes longer per step. These decode requests, which must be returning tokens at a gradual cadence, expertise a latency spike. That is the inter-token latency (ITL) jitter that manufacturing programs obsess over. A person watching a streaming response sees the textual content pause mid-sentence whereas the GPU processes another person’s immediate.

The slower-burning downside is utilization waste. The GPU is provisioned to deal with prefill peaks, so it’s overprovisioned for the decode section that dominates the request lifecycle. A typical era produces 200-500 output tokens. Every decode step takes 10-30ms. A 300-token response spends roughly 3-9 seconds in decode and perhaps 200ms in prefill. The GPU runs decode for 90%+ of wall-clock time, and through that 90%, it’s utilizing 30% of its compute capability. You’re paying for an H100 and getting H100-level utilization for one-tenth of the request period.

Determine 2. Inner GPU state throughout prefill vs. decode on the identical H100 SXM. Throughout prefill (a), all 132 SMs hearth repeatedly on matrix multiplications whereas the HBM bus sits at ~30% utilization. Throughout decode (b), most SMs idle ready on reminiscence reads whereas the HBM bus saturates at 85-95%. Identical GPU, identical hourly value, reverse useful resource utilization.
[Source: Author (SVG created using Inkscape)]

The vLLM group addressed a part of this with chunked prefill, which breaks an extended prefill into smaller items and interleaves them with decode steps. This smooths out the worst ITL spikes, but it surely doesn’t remedy the utilization mismatch. The GPU remains to be doing each jobs.

Splitting the inference path in two

Disaggregated inference runs prefill and decode on separate GPU swimming pools related by a quick community. A request arrives, will get routed to a prefill employee, which processes the total immediate and generates the KV-cache. That cache is then transferred over the community to a decode employee, which handles the autoregressive token era.

Three items make it work.

1. A KV-aware router sits in entrance of each swimming pools. It assigns incoming requests to out there prefill staff and, as soon as prefill completes, routes the KV-cache to a decode employee with out there capability. The router tracks which decode staff maintain which KV-caches, enabling options like prefix caching throughout requests that share widespread immediate prefixes (system prompts, few-shot examples, software descriptions).

2. A prefill pool incorporates GPUs optimized for high-throughput matrix multiplication. These staff course of prompts, construct KV-caches, and hand them off. They don’t generate output tokens. As a result of prefill is compute-bound, these GPUs can use much less HBM bandwidth with out penalty.

3. A decode pool incorporates GPUs optimized for reminiscence bandwidth. These staff obtain KV-caches and generate tokens autoregressively. As a result of decode is memory-bound, these GPUs can use much less compute with out penalty. With giant batch sizes (a whole lot of concurrent decode requests), decode utilization rises as a result of the reminiscence reads are amortized throughout extra work.

Determine 3. The interference downside in monolithic serving and the way disaggregation eliminates it. In (a), request B’s prefill steals GPU cycles from request A’s energetic decode, creating a visual pause in token output. In (b), prefill and decode run on separate swimming pools with KV-cache switch through RDMA. Decode requests stream tokens at a gradual cadence with no prefill-induced stalls.
[Source: Author (SVG created using Inkscape)]

The 2 swimming pools scale independently. In case your workload is prompt-heavy (lengthy system prompts, agentic workflows with software descriptions), you add prefill staff. In the event you’re serving many concurrent customers producing lengthy responses, you add decode staff. This decoupling is the first financial profit: you cease paying for compute capability throughout decode, and also you cease paying for reminiscence bandwidth throughout prefill.

The tax you pay: transferring the KV-cache

Disaggregation isn’t free. The KV-cache produced throughout prefill has to maneuver from the prefill GPU to the decode GPU over the community, and these caches aren’t small.

For a 70B parameter mannequin utilizing grouped-query consideration with 80 layers, 8 KV heads per layer, 128 dimensions per head, saved in FP16: every token’s KV state is 327,680 bytes. A 4,096-token immediate produces 1.34 GB of KV-cache. That whole block has to switch earlier than the decode employee can start producing.

Right here’s the calculation you may run for any mannequin:

# KV-cache measurement calculation for any transformer mannequin
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, dtype_bytes=2):
    """Returns KV-cache measurement in bytes for a single request."""
    per_token = n_layers * n_kv_heads * head_dim * 2 * dtype_bytes  # 2 for Okay and V
    return per_token * seq_len

# Llama 70B with GQA
cache = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=4096)
print(f"Per token: {80 * 8 * 128 * 2 * 2:,} bytes")
print(f"Full cache: {cache / 1e9:.2f} GB")  # 1.34 GB

# Llama 8B (smaller mannequin, identical train)
cache_8b = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=4096)
print(f"Llama 8B cache: {cache_8b / 1e9:.2f} GB")  # 0.54 GB

Determine 4. KV-cache switch: naive vs. pipelined. Naive switch sends all 80 layers (1.34 GB) earlier than decode can start. Pipelined switch streams layers individually (16.7 MB every), so the decode employee begins processing layer 0 whereas later layers are nonetheless in transit. This reduces efficient TTFT from the total switch time to a single layer’s latency. Perplexity implements this with RDMA through libfabric and cuMem.
[Source: Author (SVG created using Inkscape)]

At 100 Gbps (a normal RDMA hyperlink with EFA or ConnectX NICs), 1.34 GB takes roughly 107 milliseconds. At 400 Gbps, it takes 27ms. These numbers matter as a result of they set the ground on time-to-first-token: the decode employee can’t begin till the cache arrives.

Perplexity’s engineering group constructed their disaggregated serving stack on prime of libfabric and RDMA, utilizing cuMem for cross-process GPU reminiscence sharing. Their KV messenger coordinates transfers layer-by-layer, so the decode employee can start processing early layers whereas later layers are nonetheless in transit. This pipelining reduces the efficient switch latency beneath the uncooked bandwidth calculation.

The query is whether or not this switch overhead is price it. In a monolithic system, there’s no KV switch, however there’s queuing delay. When a brand new prefill burst lands, energetic decode requests stall. Measurements from the DistServe paper present P99 latency enhancements of 2x or extra with disaggregation, as a result of eliminating the prefill-decode interference removes the tail latency spikes that queuing causes. The 27ms switch value replaces 200-500ms of queuing delay.

There’s a crossover level, although. For brief prompts with excessive prefix cache hit charges, native prefill on the decode employee will be quicker than transferring the cache over the community. The BentoML inference handbook reviews 20-30% efficiency degradation when disaggregation is utilized to workloads that don’t really need it. This isn’t an structure you apply blindly.

What the manufacturing stack appears like right now

Eighteen months after DistServe, disaggregated inference has gone from a analysis prototype to the default serving structure at a number of firms working LLMs at scale.

Determine 5. Lifecycle of a single request by means of disaggregated serving for a Llama 70B mannequin. A 4K-token immediate enters the router, will get processed on a prefill employee in 150-400ms (compute-bound), transfers 1.34 GB of KV-cache over RDMA in 27-107ms, then generates 300 tokens on a decode employee over 3-9 seconds (memory-bound). The timeline bar exhibits decode dominating wall time.
[Source: Author (SVG created using Inkscape)]

NVIDIA’s Dynamo, introduced at GTC 2025, is a datacenter-scale orchestration layer that treats prefill and decode staff as first-class entities. It helps vLLM, SGLang, and TensorRT-LLM as inference backends, and features a KV-aware router that handles cache placement and request scheduling throughout swimming pools.

On the Kubernetes-native facet, llm-d (an open-source challenge from Pink Hat and IBM Analysis) implements disaggregated serving as a set of Kubernetes customized assets. Prefill and decode swimming pools are separate deployments that autoscale independently based mostly on queue depth and GPU utilization. KV-cache routing is dealt with by means of a gateway that tracks cache places. For groups already working inference on EKS, GKE, or OpenShift, this maps cleanly onto current cluster administration workflows.

In the event you’re utilizing vLLM straight, a minimal disaggregated setup appears like this:

# Begin the prefill occasion
python -m vllm.entrypoints.openai.api_server 
    --model meta-llama/Llama-3.1-70B-Instruct 
    --tensor-parallel-size 4 
    --kv-transfer-config '{
        "kv_connector": "PyNcclConnector",
        "kv_role": "kv_producer",
        "kv_parallel_size": 2
    }'

# Begin the decode occasion (separate node or GPU set)
python -m vllm.entrypoints.openai.api_server 
    --model meta-llama/Llama-3.1-70B-Instruct 
    --tensor-parallel-size 4 
    --kv-transfer-config '{
        "kv_connector": "PyNcclConnector",
        "kv_role": "kv_consumer",
        "kv_parallel_size": 2
    }'

The kv_connector specifies the switch protocol (NCCL on this case, RDMA through MooncakeConnector for larger throughput). The kv_role tells every occasion whether or not it produces or consumes KV-caches. vLLM handles the routing and cache handoff between the 2.

SGLang examined DeepSeek-R1 with disaggregated serving on 96 H100 GPUs: 3 nodes (24 GPUs) for prefill, 9 nodes (72 GPUs) for decode. They measured 52,300 enter tokens/second and 22,300 output tokens/second per node. A follow-up on GB200 NVL72 confirmed 3.8x prefill and 4.8x decode throughput beneficial properties in comparison with the identical workload on H100.

vLLM’s implementation (llm-d 0.3, October 2025) reached 2,200 tokens/second per H200 GPU with 32-way professional parallelism. Perplexity’s manufacturing system makes use of RDMA-based KV switch with layer-pipelined transmission, supporting speculative decoding and structured output era throughout the prefill-decode boundary.

When disaggregation makes issues worse

Not each workload advantages. The BentoML inference handbook reviews 20-30% efficiency degradation when disaggregation is utilized to workloads that don’t want it, and that’s an actual quantity measured on actual {hardware}.

Quick prompts damage probably the most. In case your median immediate is underneath 512 tokens and era size stays beneath 100 tokens, the KV switch overhead eats into no matter latency financial savings you’d get from section separation. Native prefill on the decode employee takes a couple of milliseconds for a brief immediate. Transport the cache throughout RDMA takes longer than simply computing it once more. For chatbots dealing with fast Q&A, monolithic serving is less complicated and quicker.

Prefix cache reuse modifications the equation too. Multi-turn conversations and agentic workflows are inclined to share giant system prompts throughout requests. If 80%+ of the KV-cache already lives on the decode employee from a earlier flip, the incremental prefill is tiny. Sending a contemporary cache from a separate prefill employee wastes community bandwidth on knowledge that’s already native.

Scale issues. A single node with 2-4 GPUs doesn’t generate sufficient concurrent requests to maintain two separate swimming pools busy. The scheduling overhead alone can tank throughput at that measurement. Disaggregation begins paying off when you will have dozens of GPUs and sufficient visitors to maintain each swimming pools saturated.

After which there’s the community. With out RDMA or a minimum of 100 Gbps hyperlinks, KV switch turns into the brand new bottleneck. TCP works however provides latency. Perplexity constructed their whole stack on RDMA with libfabric for a cause.

The Hao AI Lab put it bluntly of their November 2025 retrospective: “disaggregated prefill doesn’t enhance throughput.” The beneficial properties are latency management and price effectivity by means of higher utilization. In the event you’re assembly latency targets with monolithic serving, including a KV switch hop between phases simply provides you extra transferring elements to debug at 2 AM.

The price arithmetic

The financial case for disaggregation rests on one commentary: you need to use cheaper {hardware} for every section once they’re separated.

Prefill staff want excessive FLOPS however don’t want huge HBM capability. An H100 SXM with 80GB HBM is well-suited. An H200 with 141GB HBM is overkill for prefill; you’re paying for reminiscence bandwidth you received’t use.

Decode staff want excessive reminiscence bandwidth and enormous HBM to carry KV-caches for a lot of concurrent requests, however don’t want peak FLOPS. An H200, with its bigger HBM3e capability and better bandwidth, is definitely higher suited to decode than an H100. The additional reminiscence permits you to batch extra decode requests collectively, amortizing the reminiscence reads and pushing utilization larger.

The SPAD paper takes this logic to its excessive: they suggest “right-sizing” GPU designs into separate prefill and decode chips. Their simulation exhibits {that a} prefill chip with 40% much less reminiscence bandwidth than an H100 solely loses 17% of prefill efficiency. A decode chip with 50% much less compute solely loses 22% of decode efficiency. The price financial savings from eradicating unused silicon in every chip are substantial.

On the cluster stage, the InfoQ evaluation reviews 15-40% complete infrastructure value discount from disaggregation. The financial savings come from not overprovisioning {hardware}, from eliminating idle tensor cores throughout decode, and from with the ability to add decode staff with out shopping for prefill capability you don’t want.

Meta, LinkedIn, Mistral, and HuggingFace are all working vLLM with disaggregated serving in manufacturing. The throughput beneficial properties reported vary from 2x to six.4x relying on the workload and {hardware}. SGLang’s DeepSeek-R1 deployment on GB200 measured 4.8x decode throughput enchancment over monolithic H100 serving.

Do you have to disaggregate? A choice framework

Earlier than you refactor your serving stack, run by means of these 5 checks. They take ten minutes and can prevent from both over-engineering a easy deployment or under-investing in one which’s bleeding cash.

1. Measure your prefill-to-decode time ratio. Run /standing or equal monitoring in your present deployment and document what proportion of wall-clock GPU time is spent in every section. If decode accounts for lower than 70% of request period, your workload is prefill-heavy and disaggregation has a smaller utilization payoff. If decode is 85%+ of wall time, you’re paying for idle tensor cores many of the day.

2. Calculate your KV-cache switch measurement utilizing the system above. If it exceeds 500 MB per request and your community is underneath 100 Gbps, the switch latency will eat into your TTFT funds. Run the numbers in your precise mannequin and median immediate size, not for a theoretical worst case.

3. Test your prefix cache hit price. In the event you’re working agentic or multi-turn workloads with shared system prompts, measure how usually the decode employee already holds a usable KV-cache from a earlier flip. Hit charges above 80% cut back the worth of a separate prefill pool as a result of most prefill work is already skipped.

4. Rely your GPUs. Beneath ~16 GPUs, the scheduling overhead of sustaining two swimming pools sometimes exceeds the utilization achieve. Above 32 GPUs with sustained visitors, the fee financial savings from right-sized {hardware} begin to compound.

5. Audit your community. Test whether or not your nodes have RDMA-capable NICs (EFA on AWS, ConnectX on naked steel) and what bandwidth they help. In the event you’re restricted to TCP, disaggregation can nonetheless work for long-context workloads, however the switch latency shall be larger than the numbers on this article assume.

If checks 1, 4, and 5 all come again favorable, disaggregation will nearly actually cut back your per-token serving value. Begin with vLLM’s built-in disaggregated prefilling mode, measure the impression on TTFT and ITL, and scale from there.

What comes subsequent

Most disaggregated deployments right now nonetheless use the identical GPU mannequin for each swimming pools. The SPAD analysis from UT Austin and NVIDIA’s personal roadmap counsel that purpose-built prefill and decode silicon is coming. Within the meantime, the sensible model is utilizing completely different cloud occasion varieties: compute-optimized for prefill, memory-optimized for decode. Moonshot AI’s Mooncake structure goes additional, utilizing CPU DRAM and SSDs as overflow tiers for KV-cache storage, retaining GPU HBM free for energetic decoding.

The Hao AI Lab referred to as disaggregation “the defining precept of the trendy inference stack” of their November 2025 retrospective. Eighteen months from analysis paper to manufacturing default at Perplexity, Meta, and NVIDIA is quick, even by ML infrastructure requirements. The groups who plan for it now, with correct capability modeling and community design, will spend much less when visitors scales than those who undertake it reactively.

Source link

Prefill Is Compute-Bound. Decode Is Memory-Bound. Why Your GPU Shouldn’t Do Both.

Is an Online Master’s Degree in AI a Good Idea?

I Built a C++ Backend So My GPU Would Stop Eating Air

I Spent May Evaluating Different Engines for OCR

Why AI Is NOT Stealing Your Job

What AI Agents Should Never Do on Their Own

Exploring Income Patterns with Python Pandas, Matplotlib, and Seaborn

I Took 200 Photos With the Motorola Razr Ultra and Here’s What I Learned

Is an Online Master’s Degree in AI a Good Idea?

How courts are coping with a flood of AI-generated lawsuits

Foregen aims to reverse circumcision with bio-engineered tissue

Featured Picks

Google takes aim at sweepstakes games, removing ‘social’ classification

Today’s NYT Wordle Hints, Answer and Help for Aug. 3 #1506

AI demand means data centres are worsening drought in Mexico

Prefill Is Compute-Bound. Decode Is Memory-Bound. Why Your GPU Shouldn’t Do Both.

The 2 phases of inference aren’t the identical workload

What monolithic serving truly prices you

Splitting the inference path in two

The tax you pay: transferring the KV-cache

What the manufacturing stack appears like right now

When disaggregation makes issues worse

The price arithmetic

Do you have to disaggregate? A choice framework

What comes subsequent

Related Posts