I Built a C++ Backend So My GPU Would Stop Eating Air

This is a humorous-but-real tour of the project, covering VRAM-aware bin packing, pinned-memory transfers, and how to make your LLM up to 5.89× faster by being mildly rude to PyTorch.

Repo: github.com/AnubhabBanerjee/WarpGroup-backend

TL;DR: Standard LLM batching pads short sequences with zeros so that they match the longest one. Your GPU then dutifully performs billions of multiplications on those zeros, which is the computational equivalent of paying a chef to cook an empty plate. WarpGroup-Backend replaces this with a small C++ engine that crams variable-length sequences together like a very anxious Tetris champion. Result: 2.08× throughput on an H100, 5.89× on a GTX 1080, and zero OOM crashes. This article tells the whole story with code, jokes, and only a reasonable amount of yelling at NVIDIA.

(Quick confession before we begin: I came at this from a 5G/6G RAN engineering background. As it turns out, GPU bin packing is shockingly close to what the MAC scheduler in your cellphone has been doing for twenty years. There’s a whole section on that below, section 7, but it’s also why I’m writing this in the first place.)

1. A confession: most of your GPU’s “work” is fake

If you’ve ever batched variable-length text through a transformer, here is what actually happens, dramatized:

You: “Please summarize these 8 documents, GPU.”

PyTorch: “Absolutely. Let me just make them all the same shape first.”

You: “Wait, they’re 80, 90, 110, 130, 95, 1850, 2000, and 60 tokens. Don’t...”

PyTorch: “Padded everything to 2000. Have fun.” 🫡

GPU: cheerfully burns ~half its compute and memory bandwidth on padded zeros

Your AWS bill: *develops a sense of humor*

That’s the joke. That’s the whole industry’s dirty secret. Variable-length data + rectangular tensors = your GPU getting paid by the hour to do fake work.

WarpGroup-Backend is what happens once you decide enough is enough and you’d rather write 30% C++ than keep paying for that nonsense.

2. Why does padding exist at all? (a one-minute crash course)

Skip this if you already know. For everyone else:

A GPU loves rectangles. Specifically, it loves matrices where every row is the same length, because that lets it run thousands of identical math operations in parallel. That is what makes a GPU a GPU instead of an expensive paperweight.

But text is not a rectangle. Text is a ragged mess. One tweet is 12 tokens. One legal contract is 4,000. If you want to feed 8 of them at once through a transformer, you have two options:

Option 1: Pad them all to the longest one with a pad token. The math runs on rectangles, your code is simple, and your GPU now spends most of its time multiplying zero by another zero. (HuggingFace’s default.)
Option 2: Concatenate them into one long 1-D ribbon and tell the attention kernel “hey, please don’t let token 12 talk to token 4001, they’re from different documents.” This is variable-length attention, and it’s what FlashAttention-2’s flash_attn_varlen_func exists to do.
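
To put a number on the two options, here is a small sketch in plain Python (no GPU needed) that counts tensor slots for the eight lengths from the dialogue in section 1. The helper names are made up for illustration. For that particular batch, the padded rectangle turns out to be mostly padding.

```python
# Token counts from the dialogue in section 1.
lengths = [80, 90, 110, 130, 95, 1850, 2000, 60]

def padded_slots(lengths):
    """Slots in a rectangular batch: every row is as long as the longest one."""
    return len(lengths) * max(lengths)

def ribbon_slots(lengths):
    """Slots in a 1-D concatenated ribbon: just the real tokens."""
    return sum(lengths)

real = ribbon_slots(lengths)
padded = padded_slots(lengths)
waste = 1 - real / padded

print(f"real tokens:   {real}")        # 4415
print(f"padded slots:  {padded}")      # 16000
print(f"padding share: {waste:.1%}")   # 72.4%
```

The ribbon holds exactly the 4,415 real tokens, while the rectangle allocates 16,000 slots to carry them.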

Option 2 is clearly better, and a growing number of production inference stacks (vLLM, TensorRT-LLM, SGLang, FlashInfer, TGI internals) already use variants of it.

But calling flash_attn_varlen_func is the easy part. The hard part, the part everyone keeps re-implementing in slightly different shapes, is organizing the ribbon: deciding which documents go in which batch, in what order, up to what total length, while keeping the GPU saturated and the host-side overhead invisible.

WarpGroup-Backend is the annoying-organizer half of that equation, written in C++ so it can be fast about it.

3. The “just pack them” lightbulb (and why it’s harder than it sounds)

The pitch is simple: take your queue of variable-length sequences and pack them into bins like a very intense moving-day grandma. Each bin holds up to N tokens total. You stuff as many small sequences as you can fit alongside each big one, then ship the whole bin to the GPU as one flat ribbon.

This is a classic bin-packing problem, and the textbook solution is First-Fit Decreasing (FFD):

Sort sequences from longest to shortest.
For each sequence, walk through the open bins. Drop it in the first one that has room.
If none have room, open a new bin.

This is the same algorithm your brain uses when packing a suitcase: big rocks first, then pebbles fill the cracks. It’s not optimal, but it gets within ~22% of optimal in the worst case while being O(n log n) instead of taking until the heat death of the universe.
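
The three steps above can be sketched in a few lines of Python. This is a minimal illustration that works on sequence lengths only, not the repo’s C++ packer; the function name and the 2,048-token capacity are assumptions picked for the example.

```python
def first_fit_decreasing(lengths, capacity):
    """Pack sequence lengths into bins of at most `capacity` tokens (FFD).

    Returns a list of bins, each a list of the lengths placed in it.
    """
    bins = []   # contents of each open bin
    used = []   # tokens already used in each bin
    for n in sorted(lengths, reverse=True):          # longest first
        if n > capacity:
            raise ValueError(f"sequence of {n} tokens exceeds bin capacity {capacity}")
        for i in range(len(bins)):                   # first bin with room wins
            if used[i] + n <= capacity:
                bins[i].append(n)
                used[i] += n
                break
        else:                                        # no bin had room: open a new one
            bins.append([n])
            used.append(n)
    return bins

lengths = [80, 90, 110, 130, 95, 1850, 2000, 60]
for b in first_fit_decreasing(lengths, capacity=2048):
    print(sum(b), b)
# 2000 [2000]
# 2040 [1850, 130, 60]
# 375 [110, 95, 90, 80]
```

Note that this sketch scans the open bins linearly, so its cost grows with the number of bins; reaching O(n log n) needs a smarter index over the bins’ free space.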

Now here are the three things that make this not a 10-line Python script:

Problem A: How big should the bin be?

Easy answer: “as big as VRAM lets me.” Wrong answer: that’s not a number you can look up. The actual usable VRAM depends on:

Model size (a 7B in bf16 eats ~14 GB before you’ve done anything).
The activation/KV-cache memory for this specific sequence length (which scales weirdly).
PyTorch’s allocator fragmentation (an art, not a science).
Whether or not you sneezed on the driver.

You can’t compute this. You have to measure it. Phase 0 of WarpGroup is literally “the GPU’s job interview”: keep asking it to swallow larger and larger sequences until it OOMs, then back off 10% for safety.

Problem B: GPUs are picky eaters (the 16-token thing)

NVIDIA Tensor Cores execute their matrix multiply-accumulate (MMA) instructions on fixed tile shapes (for example m16n8k16 on Hopper, with the exact dimensions varying by datatype and architecture). The practical heuristic that falls out of all that complexity is delightfully simple: GPU kernels (including FlashAttention-2) tend to hit best efficiency when sequence-related dimensions are multiples of 16 (sometimes 8, depending on dtype and arch). So we round every sequence length up: 137 → 144, 200 → 208. It’s not that the GPU literally can’t process the last 9 tokens; it’s that letting them dangle ragged costs you on memory coalescing, warp tiling, and the GEMM shapes the downstream kernels really want to see.
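
The rounding itself is plain integer arithmetic. A quick Python sketch (the helper name is an assumption, not the repo’s function) reproduces the two examples above:

```python
def align_up(n, multiple=16):
    """Round n up to the next multiple of `multiple` (n itself if already aligned)."""
    return (n + multiple - 1) // multiple * multiple

for raw in (137, 200, 64):
    aligned = align_up(raw)
    print(f"{raw} -> {aligned} (+{aligned - raw} pad tokens)")
# 137 -> 144 (+7 pad tokens)
# 200 -> 208 (+8 pad tokens)
# 64 -> 64 (+0 pad tokens)
```

The cost is at most 15 pad tokens per sequence, no matter how long the longest sequence in the bin is.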

This is the silliest part of the whole exercise, and yet ignoring it leaves measurable throughput on the floor.

Problem C: Python is too slow to do this in the hot loop

I love Python. I’ve tattooed import this on my soul. But Python has the GIL, the global interpreter lock, which is basically a bouncer that only lets one thread into the Python nightclub at a time. While Python is busy tokenizing PDFs, it cannot also be busy packing bins. While it’s packing bins, it can’t be busy feeding the GPU. The producer-consumer pipeline collapses into a slow, sad single-file line.

Solution: do the packing in C++, in a background thread, and release the GIL at the PyBind11 boundary so the Python side can keep tokenizing while C++ keeps packing. Now you have actual parallelism. The bouncer is letting two friends in at once. Enjoy!

4. The five-phase pipeline (the actually-cool part)

Here’s the high-level architecture, with jokes:

Phase 0: Empirically measure how much VRAM the model has left. (Python)
Phase 1: Tokenize text and yeet integers across the C++ border. (Python → C++)
Phase 2: Catch yeets in a thread-safe queue with no GIL drama. (C++ async dispatcher)
Phase 3: Sort, align, and Tetris them into 1-D bins. (C++ bin packer)
Phase 4: Hand bins to the GPU via pinned-memory async DMA. (C++ memory pool)
Phase 5: FlashAttention-2 enjoys its zero-padding lunch. (PyTorch)

Let’s walk through each with the actual code. I’ll keep snippets short; the full files are linked.

Phase 0: The GPU job interview

We need to know exactly how many tokens fit in VRAM. So we ask the GPU, politely but firmly, until it cries.

This lives in streaming_dataloader.py:

import gc
import sys

import torch


def determine_vram_capacity(model, device, start_tokens=0, step_size=5000, vocab_size=32000):
    """
    Phase 0: move the model onto the target device (GPU), then measure how many tokens fit
    into the VRAM that is left. Starting at start_tokens, the probe length grows by step_size
    after every successful forward pass until CUDA runs out of memory. Returns the largest
    probe length that fit while the model was loaded, scaled by 0.9 for headroom.
    """
    # Load the model into VRAM. If even that fails, there is nothing to probe, so exit.
    try:
        model.to(device)
    except Exception:
        sys.exit("error loading model into VRAM")
    print("\nStarting Phase 0: model loaded successfully in VRAM\n\n")

    # Check how much space is left after the model is loaded, in terms of number of tokens.
    # Probe remaining capacity by running synthetic forwards until CUDA OOM; then scale down by 0.9 for fragmentation.
    model.eval()  # inference-only (dropout off, batchnorm stats fixed), matching real packing inference

    max_tokens = start_tokens  # current sequence length to try; grows by step_size after each success
    safe_limit = 0  # last max_tokens that completed a forward without OOM (hardware redline before the failed step)

    with torch.no_grad():  # no autograd graph, so activations are not retained for a backward pass
        while True:
            try:
                # Synthetic token IDs: distribution does not matter for memory; size drives activations/attention workspace
                dummy_tokens = torch.randint(0, vocab_size, (max_tokens,), device=device)

                # Single packed sequence [0, max_tokens): worst-case one long sequence for max seqlen in the pass
                dummy_cu_seqlens = torch.tensor([0, max_tokens], dtype=torch.int32, device=device)

                # Real forward on target device: stresses the same kernels/allocations as production (unlike padding-only tests)
                # Signature must match your model (e.g. FlashAttention varlen). Adjust kwargs if your forward differs.
                _ = model(dummy_tokens, cu_seqlens=dummy_cu_seqlens, max_seqlen=max_tokens)

                safe_limit = max_tokens  # this length fit; treat as the best known redline so far
                max_tokens += step_size  # try a longer stretch next (coarse search; reduces how many forwards you run)

                del _, dummy_tokens, dummy_cu_seqlens  # drop references so the next trial's peak memory is not inflated by leftovers

            except RuntimeError as e:
                if "out of memory" in str(e).lower():
                    # Forward at max_tokens exceeded free VRAM: stop; safe_limit is the last successful length
                    torch.cuda.empty_cache()  # release cached blocks (no-op on CPU-only builds in many versions)
                    gc.collect()  # free any Python objects still holding device tensor views
                    break
                # Shape/keyword errors mean the dummy call does not match the model API, not "out of VRAM"
                raise

    if safe_limit == 0:
        # If the very first try at start_tokens OOMs, there is no successful redline; lower start_tokens (no automatic halving here)
        sys.exit("Error: Model cannot process the starting token count without OOM. Decrease start_tokens.")

    # 0.9x: headroom for allocator fragmentation and small spikes on real batches vs this idealized probe
    optimal_capacity = int(safe_limit * 0.9)
    print(f"\nOptimal bin capacity locked at: {optimal_capacity} tokens.\n")

    return optimal_capacity

That’s it. That’s the whole “autotune.” Throw bigger and bigger fake sequences at the model until CUDA flips a table, write down the last one that survived, multiply by 0.9 for safety. No vendor whitepaper. No theoretical formula. Just bullying the GPU until it tells you its actual limit.

On an H100 with Qwen2.5-7B-Instruct, this comes out to around 76,500 tokens per bin. Try fitting that into a batch-of-8 mindset.
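
If you want to poke at the search logic without a GPU, here is a stripped-down sketch of the same loop in plain Python. Everything in it is hypothetical: `fake_forward` stands in for the model, and its 87,000-token ceiling is invented so that the result lands on the figure quoted above. With 5,000-token steps, a capacity of 76,500 would correspond to a last successful probe at 85,000 tokens (85,000 × 0.9).

```python
def probe_capacity(try_forward, start_tokens=5000, step_size=5000, safety=0.9):
    """Grow the probe length until try_forward raises MemoryError, then back off."""
    n = start_tokens
    safe_limit = 0                     # last length that succeeded
    while True:
        try:
            try_forward(n)
        except MemoryError:
            break                      # first failure ends the search
        safe_limit = n
        n += step_size
    if safe_limit == 0:
        raise RuntimeError("even the first probe failed; lower start_tokens")
    return int(safe_limit * safety)

def fake_forward(n_tokens, ceiling=87_000):
    """Hypothetical stand-in for a model forward: 'OOMs' above an invented ceiling."""
    if n_tokens > ceiling:
        raise MemoryError(f"simulated OOM at {n_tokens} tokens")

print(probe_capacity(fake_forward))   # 76500
```

The search is deliberately coarse: the answer can undershoot the true limit by up to one step, which is the price of running only a handful of probe forwards.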

Phase 1: Tokenize and yeet

Python reads documents (PDFs, JSONL, whatever), tokenizes them with the HuggingFace tokenizer, and submits each as a list of integers across the PyBind11 boundary. This is in reader_and_tokennizer.py:

for text in _iter_text_units(data_source, json_key):
    tokens = tokenizer.encode(text, add_special_tokens=True)

    if len(tokens) > effective_cap:
        if max_length is not None and effective_cap == max_length:
            # The cap is the caller's own max_length: truncate silently.
            tokens = tokens[:effective_cap]
        else:
            # The cap is the measured VRAM capacity: truncate, but warn on stderr.
            print(
                f"Warning: Sequence length ({len(tokens)}) exceeds VRAM capacity "
                f"({effective_cap}). Truncating sequence...",
                file=sys.stderr,
            )
            tokens = tokens[:effective_cap]

    if cxx_backend is not None:
        cxx_backend.submit_sequence(tokens)

Notice how boring this is. Good. Boring Python = fast C++. The Python side does I/O and tokenization and that’s it. No batching logic, no packing logic, no GPU choreography. Just one single responsibility, baby.

Phase 2: The C++ catcher’s mitt

On the C++ side, async_dispatcher.cpp catches every submit_sequence call and drops it into a thread-safe std::deque behind a mutex:

void AsyncDispatcher::submit_sequence(std::vector<int> tokens) {
    {
        std::lock_guard<std::mutex> lk(queue_mutex);
        if (!engine_started.load()) {
            // Quiet no-op before initialize_engine(): avoids crashing notebooks
            // that accidentally submit early; throw here instead if you prefer fail-fast.
            return;
        }
        pending_queue.push_back(std::move(tokens));
    }
    cv_data.notify_one();
}

Meanwhile, a background worker thread sits in a wait() loop. When tokens show up, it wakes, grabs a batch, and runs the packer. The clever bit: it doesn’t wake up immediately, because that would mean packing one sequence at a time, which defeats the entire point of bin-packing.

Instead, it waits for either 16 sequences to accumulate or 5 ms to pass:

            // Accumulation window: if there are few pending sequences and the
            // producer is still feeding (not input_done, not shutting down),
            // wait a brief moment for more before swapping. This lets the
            // FFD packer see a real batch instead of one sequence at a time.
            // Wakes early on: reaching kPackMinBatch, shutdown, or input_done.
            if (pending_queue.size() < kPackMinBatch &&
                !stop_flag.load() && !input_done_flag.load()) {
                cv_data.wait_for(lk, kPackWaitWindow, [&] {
                    return pending_queue.size() >= kPackMinBatch ||
                           stop_flag.load() || input_done_flag.load();
                });
            }

This is the difference between “bin packing” and “bin… eh, a single-sequence bin is technically a bin.” 5 ms is shorter than a single forward pass on this hardware, so the latency cost is invisible. The density gain is up to 8×.
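
The “wait for N items or T milliseconds” pattern is easy to try out in pure Python with threading.Condition. The class below is an illustrative stand-in, not a port of the C++ dispatcher: the name AccumulatingQueue is made up, and the shutdown and input-done flags are left out to keep it short.

```python
import threading
from collections import deque

class AccumulatingQueue:
    """Hand out batches once `min_batch` items are pending or `window_s` has passed."""

    def __init__(self, min_batch=16, window_s=0.005):
        self.min_batch = min_batch
        self.window_s = window_s
        self._pending = deque()
        self._cv = threading.Condition()

    def submit(self, item):
        with self._cv:
            self._pending.append(item)
            self._cv.notify()

    def take_batch(self):
        with self._cv:
            # Block until at least one item is pending.
            self._cv.wait_for(lambda: len(self._pending) > 0)
            # Accumulation window: give the producer a brief chance to add more.
            self._cv.wait_for(lambda: len(self._pending) >= self.min_batch,
                              timeout=self.window_s)
            batch = list(self._pending)
            self._pending.clear()
            return batch

q = AccumulatingQueue()
for i in range(3):
    q.submit(i)
print(len(q.take_batch()))   # 3: the window expired before 16 items arrived

for i in range(20):
    q.submit(i)
print(len(q.take_batch()))   # 20: threshold already met, no waiting
```

The timeout bounds the extra latency at one window per batch, while the threshold lets a busy producer skip the wait entirely.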

The PyBind11 layer in bindings.cpp wraps everything with py::gil_scoped_release so that any C++ work that blocks or sleeps doesn’t hold the Python GIL:

m.def(
        "get_next_bin",
        []() -> std::tuple<torch::Tensor, torch::Tensor> {
            PackedBin bin;
            {
                py::gil_scoped_release release;
                bin = engine().get_next_ready_bin();
            }
            // Re-acquired the GIL here -- creating torch::Tensor objects and
            // returning them to Python touches CPython refcounts.
            if (bin.flat_tokens.empty()) {
                auto opts = torch::TensorOptions().dtype(torch::kInt32).device(torch::kCUDA);
                return std::make_tuple(torch::empty({0}, opts), torch::empty({0}, opts));
            }
            return engine().get_memory_pool().create_zero_copy_tensors(bin);
        },
        "Block until a packed bin exists; returns (token_ids, cu_seqlens) on CUDA.");

If you have ever debugged a Python-C++ deadlock, the comments in this file will give you flashbacks. It’s definitely worth a read, especially when you’ve had one too many coffees and are having trouble falling asleep!

Phase 3: Tetris with rules (FFD + 16-token alignment)

This is the heart of the whole project, and it’s beautifully short. From bin_packer.cpp:

int BinPacker::align_to_tensor_core(int raw_length) {
    /*
     * NVIDIA Tensor Cores execute matrix math on fixed tile shapes.
     * If a sequence length is not a multiple of 16, downstream kernels
     * tend to lose efficiency on the ragged edge.
     * We calculate the remainder and round up to the nearest multiple of 16.
     */
    int remainder = raw_length % 16;
    if (remainder == 0) {
        return raw_length;
    }
    return raw_length + (16 - remainder);
}

Three lines. That’s the “16-token Tensor Core alignment” that NVIDIA blog posts make sound like a PhD thesis. It’s roundup(n, 16). That’s it. That’s the thing.

The packing itself:

std::vector<PackedBin> BinPacker::pack_queue(std::deque<std::vector<int>>& pending_queue) {
    std::vector<PackedBin> ready_bins;

    if (pending_queue.empty()) {
        return ready_bins;
    }

    // Step 1: Drain the queue into a standard vector so we can sort it.
    // We use std::move to transfer memory ownership directly without copying data.
    std::vector<std::vector<int>> sequences;
    while (!pending_queue.empty()) {
        sequences.push_back(std::move(pending_queue.front()));
        pending_queue.pop_front();
    }

    // Step 2: Sort Decreasing (Longest sequences first)
    // Packing large rocks first, then filling gaps with pebbles yields the highest density.
    std::sort(sequences.begin(), sequences.end(),
              [](const std::vector<int>& a, const std::vector<int>& b) {
                  return a.size() > b.size();
              });

    // Step 3: First-Fit Packing
    for (auto& seq : sequences) {
        int raw_len = static_cast<int>(seq.size());
        int aligned_len = align_to_tensor_core(raw_len);

        // Edge case safety: If alignment pushes it slightly over the VRAM limit, clamp it.
        if (aligned_len > max_vram_capacity) {
            aligned_len = max_vram_capacity;
            seq.resize(aligned_len, 0);
        } else if (aligned_len > raw_len) {
            // Physically inject invisible '0' pad tokens to reach the 16-boundary
            seq.insert(seq.end(), aligned_len - raw_len, 0);
        }

        bool placed = false;

        // Try to fit the sequence into an existing open bin (First-Fit)
        for (auto& bin : ready_bins) {
            if (bin.current_token_count + aligned_len <= max_vram_capacity) {
                // It fits! Record the start boundary in cu_seqlens
                bin.cu_seqlens.push_back(bin.current_token_count);

                // Append the tokens to the flat 1D array
                bin.flat_tokens.insert(bin.flat_tokens.end(), seq.begin(), seq.end());
                bin.current_token_count += aligned_len;

                placed = true;
                break;
            }
        }

        // If it did not fit in ANY existing bin, we must allocate a new bin
        if (!placed) {
            PackedBin new_bin;
            new_bin.current_token_count = aligned_len;

            // The first sequence in a new bin always starts at index 0
            new_bin.cu_seqlens = {0};

            // Move the tokens into the flat array
            new_bin.flat_tokens = std::move(seq);

            ready_bins.push_back(std::move(new_bin));
        }
    }
    // ... (excerpt ends here; the rest of the function is in bin_packer.cpp)

Read that twice. Internalize it. This is the algorithm. Everything else in this repo (the C++ threads, the pinned memory, the GIL gymnastics) exists to feed this 30-line loop and to ship its output to the GPU in a single async DMA with no extra host-side copies.

cu_seqlens is the magic decoder ring. It’s a small integer array of cumulative offsets. If your bin contains three documents of length 208, 144, and 64 (already aligned), then cu_seqlens = [0, 208, 352, 416]. FlashAttention-2 reads this array and goes, “ah, I see, three sub-sequences,” and runs an exact, padding-free attention over them. No giant dense mask tensor materialized in memory. No cross-document contamination. No wasted FLOPs on padded regions.
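
For a feel of how little information cu_seqlens carries, here is a plain-Python sketch that builds the array from a list of aligned lengths and maps a ribbon position back to its document. The helper names and the three lengths (208, 144, 64) are illustrative assumptions.

```python
from itertools import accumulate
from bisect import bisect_right

def build_cu_seqlens(aligned_lengths):
    """Cumulative offsets: entry i is where sequence i starts, the last entry is the total."""
    return [0] + list(accumulate(aligned_lengths))

def owner_of(token_index, cu_seqlens):
    """Index of the sub-sequence that contains a given position in the flat ribbon."""
    return bisect_right(cu_seqlens, token_index) - 1

cu = build_cu_seqlens([208, 144, 64])
print(cu)                                    # [0, 208, 352, 416]
segments = list(zip(cu[:-1], cu[1:]))        # (start, end) of each document
print(segments)                              # [(0, 208), (208, 352), (352, 416)]
print(owner_of(207, cu), owner_of(208, cu))  # 0 1
```

Position 207 is the last token of the first document and 208 is the first token of the second, which is exactly the boundary the attention kernel must not cross.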

Phase 4: The “zero-copy” magic trick (one DMA, no extra copies)

OK, so we now have a packed bin in a std::vector on the CPU. We need it on the GPU. The naive way: memcpy into a PyTorch tensor, then .to('cuda'). This works but has two costs paid on every single bin: (1) an extra host-side copy (your vector → some intermediate tensor → GPU), and (2) on x86_64 the OS can yank your memory pages around at any time, which forces CUDA to stage transfers through a bounce buffer instead of letting the DMA engine touch your bytes directly.

The trick is pinned (page-locked) host memory, allocated via cudaHostAlloc. Pinned memory is RAM that the OS has been forbidden from swapping or moving. Because the address is stable, the GPU’s DMA engine can pull the bytes across the PCIe bus in a single asynchronous transfer, without needing the CPU to stage an intermediate copy first. The transfer itself still happens (this isn’t really “zero-copy” in the strict UVA/UM sense; the host→device DMA is real), but it’s one copy instead of two, it runs async to the CPU, and it lands directly in device memory. Everyone in the inference world calls this “zero-copy” anyway, and the function in the repo is named accordingly. Pedants, please direct complaints to the comment section; we’ll address them after lunch.

From memory_pool.cpp:

// ---------------------------------------------------------
// 1. ALLOCATE PINNED MEMORY (Happens once during Phase 0)
// ---------------------------------------------------------
MemoryPool::MemoryPool(size_t max_vram_tokens) : max_capacity(max_vram_tokens) {

    // Allocate the token buffer. cudaHostAlloc locks this memory into physical RAM.
    cudaError_t err1 = cudaHostAlloc((void**)&pinned_token_buffer,
                                     max_capacity * sizeof(int),
                                     cudaHostAllocDefault);

    // Allocate the sequence length boundaries buffer.
    // +1 because cu_seqlens always has one more element than the number of sequences.
    cudaError_t err2 = cudaHostAlloc((void**)&pinned_seqlens_buffer,
                                     (max_capacity + 1) * sizeof(int),
                                     cudaHostAllocDefault);

    if (err1 != cudaSuccess || err2 != cudaSuccess) {
        throw std::runtime_error("[MemoryPool] Fatal: Failed to allocate pinned memory. "
                                 "Host system may be out of RAM.");
    }

    std::cout << "[MemoryPool] Successfully locked "
              << (max_capacity * sizeof(int)) / 1024
              << " KB of DMA-ready pinned memory." << std::endl;
}
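
For a sense of scale, the arithmetic in that constructor can be replayed in Python. This sketch assumes a 4-byte int (consistent with the int32 tensors used elsewhere) and plugs in the roughly 76,500-token capacity quoted for the H100 run; the function name is made up.

```python
INT_BYTES = 4  # assumes a 4-byte int, matching the int32 tensors used elsewhere

def pinned_bytes(max_capacity):
    """Bytes requested by the two cudaHostAlloc calls in the constructor."""
    token_buffer = max_capacity * INT_BYTES
    seqlens_buffer = (max_capacity + 1) * INT_BYTES
    return token_buffer, seqlens_buffer

tokens_b, seqlens_b = pinned_bytes(76_500)
print(tokens_b, seqlens_b)     # 306000 306004
print(tokens_b // 1024)        # 298 (the KB figure the log line would print)
```

Both buffers together stay well under 1 MB, so page-locking them up front costs almost nothing in host RAM.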

Then, when we want to ship a bin to the GPU, we copy our packed bin into the pinned buffer (one fast std::copy) and wrap that pinned buffer as a PyTorch tensor without allocating new tensor storage:

std::tuple<torch::Tensor, torch::Tensor> MemoryPool::create_zero_copy_tensors(const PackedBin& bin) {

    // Step 1: Fast C++ copy from our packing algorithm into the pinned memory block.
    // std::copy over contiguous ints typically compiles down to a memmove.
    std::copy(bin.flat_tokens.begin(), bin.flat_tokens.end(), pinned_token_buffer);
    std::copy(bin.cu_seqlens.begin(), bin.cu_seqlens.end(), pinned_seqlens_buffer);

    // Step 2: The PyTorch Metadata Shell
    // torch::from_blob does NOT allocate new memory. It merely wraps our existing
    // pinned_token_buffer pointer in a PyTorch Tensor object so Python can interact with it.

    auto token_opts = torch::TensorOptions().dtype(torch::kInt32).device(torch::kCPU);

    torch::Tensor token_tensor = torch::from_blob(
        pinned_token_buffer,                              // The raw pinned pointer
        {static_cast<int64_t>(bin.current_token_count)},  // The exact size of this specific batch
        token_opts
    );

    torch::Tensor seqlens_tensor = torch::from_blob(
        pinned_seqlens_buffer,
        {static_cast<int64_t>(bin.cu_seqlens.size())},
        token_opts
    );

    // Step 3: Trigger the PCIe DMA transfer.
    // Because the underlying memory is pinned, the .to(cuda) call triggers an
    // asynchronous DMA transfer. The CPU immediately moves on to packing the next bin
    // while the GPU hardware silently pulls the data over the bus.
    // Note: the pinned buffers are reused by the next call, so this copy has to have
    // completed before the next bin is written into them.

    torch::Tensor gpu_tokens = token_tensor.to(torch::kCUDA, /*non_blocking=*/true);
    torch::Tensor gpu_seqlens = seqlens_tensor.to(torch::kCUDA, /*non_blocking=*/true);

    return std::make_tuple(gpu_tokens, gpu_seqlens);
}

torch::from_blob is one of the most beautifully evil functions in PyTorch. It says, “I’m not going to allocate any new tensor storage. I’m going to wrap this raw pointer as a tensor view, and you are responsible for keeping the underlying memory alive.” It’s the C++ equivalent of taking a sticky note that says “TENSOR” and slapping it on an existing pile of memory. PyTorch believes it. Everyone goes home happy. (Yes, a small tensor metadata struct still gets allocated. The storage doesn’t, which is the part that costs you per-bin. HPC pedants, please re-holster the pitchforks; we’re nearly through the section.)

The non_blocking=True on .to(torch::kCUDA) then kicks off an asynchronous PCIe DMA transfer and immediately returns. The CPU goes off to pack the next bin while the GPU silently slurps the previous bin across the bus. The packer and the GPU are now running in parallel. This is the part where you start hearing the GPU fan spin up like it finally has something real to do.

Phase 5: FlashAttention-2 enjoys its zero-padding lunch

The Python side, in main_working_file.py, is now hilariously short:

    # Phase 0, Step C: Initialize the C++ Background Engine
    # Locks in hardware limits and spawns background worker thread
    warpgroup_backend.initialize_engine(vram_capacity)

    # Phase 1: Ingest and Tokenize
    # Streams tokens into the C++ background queue
    ingest_and_tokenize(
        file_path,
        tokenizer,
        vram_capacity,
        cxx_backend=warpgroup_backend
    )

    # Phase 4 & 5: Inference Execution
    print("\nStarting Inference Phase...")

    try:
        # Phase 4: Orchestration & Queue Management
        while not warpgroup_backend.is_queue_empty():
            # Retrieve the hardware-optimized bin tensors
            # This triggers the zero-copy DMA handoff from pinned memory
            bin_tensors = warpgroup_backend.get_next_bin()

            # Phase 5: GPU Execution (FlashAttention-2)
            with torch.no_grad():
                # The wrapper translates the 1D bin to 2D for the HF model
                output = model(bin_tensors[0], cu_seqlens=bin_tensors[1])

            print(f"Executed FlashAttention-2 for bin size: {bin_tensors[0].shape[0]} tokens.")
    # (the except/finally clause that closes this try block is not shown in this excerpt)

Small but important disclaimer for the careful reader: a stock HuggingFace forward() doesn’t accept cu_seqlens directly. The model here is a thin VarlenModelWrapper from streaming_dataloader.py that unsqueezes the 1-D packed stream into the (1, N) shape HF expects and, crucially, rebuilds position_ids so every sub-sequence starts at position 0 (otherwise document B’s first token sees position len(A) and rotary embeddings go sideways). For bitwise-correct cross-document masking on a multi-sequence bin you also need the underlying HF model loaded with attn_implementation='flash_attention_2' and the attention layers wired to call flash_attn_varlen_func with the same cu_seqlens; the wrapper’s own docstring spells out exactly why and how. FA-2 is what actually performs the variable-length attention; WarpGroup’s contribution is feeding it densely and on time.
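
The position_ids rebuild is simple enough to sketch without any tensors. The function below is an illustration of the idea in plain Python, not the wrapper’s actual code; the name and the tiny three-document example are assumptions.

```python
def rebuild_position_ids(cu_seqlens):
    """Position ids for a packed ribbon: each sub-sequence restarts at 0."""
    position_ids = []
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        position_ids.extend(range(end - start))
    return position_ids

cu_seqlens = [0, 4, 7, 9]          # three tiny documents of 4, 3 and 2 tokens
print(rebuild_position_ids(cu_seqlens))
# [0, 1, 2, 3, 0, 1, 2, 0, 1]
```

Without the reset, the second document’s first token would carry position 4 instead of 0, which is the rotary-embedding problem described above.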

A few lines of business logic. That’s all that’s left. Everything else is happening in the C++ engine while Python sips coffee.

5. The receipts (i.e., the numbers)

Time to humiliate the baseline. All numbers from the repo’s README.

Quick note on benchmarking methodology, before anyone reaches for the rocks: every comparison below runs the same model checkpoint, the same tokenizer, the same input corpus, and the same dtype (bf16) on the same GPU at default clocks. The “baseline (HuggingFace)” path is HF’s stock padded-batch pipeline with attn_implementation="flash_attention_2", i.e., it’s already using FA-2, not a deliberately handicapped naïve loop. The optimized path uses the same FA-2 kernel. The only axis of difference is how sequences are batched (FFD packing into a VRAM-aware 1-D bin vs. padding to a rectangular batch_size × longest_sequence tensor). Workload type is prefill-style document evaluation, not autoregressive streaming decode; that distinction matters for the next subsection. Repro scripts are in example_runs/ if you’d like to argue with the numbers.

Stress test: H100, Qwen2.5-7B, 400 mixed-length PDFs

The dataset deliberately interleaves tiny documents (45–130 words) with huge ones (1820–2000 words), basically the worst case for padded batching.

Metric	Baseline (HF)	WarpGroup	Improvement
Padding overhead	48.41%	0.55%	47.9 pp absolute reduction
Throughput	14,713 tok/s	30,672 tok/s	2.08× higher
Peak VRAM	19.88 GB	16.50 GB	17% lower (3.38 GB saved)
Dynamic VRAM (est.)*	~5.38 GB	~2.00 GB	~62% lower dynamic memory
Wall clock	28.69 s	13.76 s	2.08× faster

Translation: the baseline spent half its tokens padding zeros. Half. Imagine ordering a pizza and 4 of the 8 slices are just cardboard.
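
If you’d like to check the table’s “Improvement” column yourself, the arithmetic fits in a few lines of Python. One assumption is labeled in the code: the “dynamic” VRAM row is treated as peak VRAM minus a fixed 14.50 GB footprint, a figure inferred from the two columns (both rows differ from their peak by exactly that much) and not stated in the table itself.

```python
baseline = {"tok_s": 14_713, "peak_gb": 19.88, "wall_s": 28.69, "pad_pct": 48.41}
warpgroup = {"tok_s": 30_672, "peak_gb": 16.50, "wall_s": 13.76, "pad_pct": 0.55}

speedup = warpgroup["tok_s"] / baseline["tok_s"]
wall_ratio = baseline["wall_s"] / warpgroup["wall_s"]
vram_saved = baseline["peak_gb"] - warpgroup["peak_gb"]
vram_drop = vram_saved / baseline["peak_gb"]
pad_drop_pp = baseline["pad_pct"] - warpgroup["pad_pct"]

# Assumption: "dynamic" VRAM = peak minus a fixed footprint shared by both runs.
static_gb = 14.50
dyn_base = baseline["peak_gb"] - static_gb
dyn_wg = warpgroup["peak_gb"] - static_gb

print(f"throughput ratio: {speedup:.3f}, wall-clock ratio: {wall_ratio:.3f}")
print(f"peak VRAM: {vram_saved:.2f} GB saved ({vram_drop:.0%})")
print(f"padding: {pad_drop_pp:.1f} pp")
print(f"dynamic VRAM: {dyn_base:.2f} GB -> {dyn_wg:.2f} GB")
```

Both ratios come out at about 2.085, which the table reports as 2.08×, and the other rows reproduce to the precision shown.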

Production scaling: same hardware, uniform 50–1900 word docs

Metric	Baseline (HF)	WarpGroup	Improvement
Padding overhead	36.20%	0.67%	35.5 pp absolute reduction
Throughput	18,047 tok/s	30,700 tok/s	1.70× higher
Wall clock	17.86 s	10.50 s	1.70× faster

Even on a “nice,” uniformly-distributed dataset (i.e., the only kind that benchmark blog posts ever use), WarpGroup still wins by 70%, because padding overhead exists even when your distribution is well-behaved.

Entry-level hardware: GTX 1080 (8 GB), SmolLM2-360M

Metric	Baseline (HF)	WarpGroup	Improvement
Padding overhead	41.13%	0.00%	Baseline padding eliminated
Throughput	405 tok/s	2,387 tok/s	5.89× higher
Peak VRAM	2.85 GB	1.86 GB	35% lower

This one is my favorite. Why? Because the smaller your hardware, the worse padding hurts. A GTX 1080 going 5.89× faster means small startups, hobbyists, university labs, and anyone running on a single consumer card just got a free hardware upgrade. The same 1080 that was “barely enough” is now “actually pretty good.”

Bonus round: not crashing

Remove the MAX_LEN cap entirely, and the baseline does this:

torch.OutOfMemoryError: Tried to allocate 30.00 GiB

Because it tried to make a (batch_size × longest_sequence) rectangle and the longest sequence was, uh, large.

WarpGroup: completes successfully, peak VRAM 3.60 GB. That is because the Phase-0 autotune locks a strict hardware-aligned input budget, and the packer refuses to admit any bin that exceeds it. Allocator fragmentation and kernel scratch workspaces can still surprise you in theory; in practice, the pathological padding-driven OOMs that fixed-shape batching trips on simply stop happening. No 3 AM Slack messages from your on-call cousin. Just sequences, packed, executed, done.
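The admission rule is easier to see in code. Below is a minimal Python sketch of First-Fit Decreasing with 16-token alignment and a hard token budget. The repo does this in C++, so treat the function names, the 2048-token budget, and the choice to raise on an oversized sequence as assumptions of this sketch; the lengths are the eight documents from the dialogue in section 1.

```python
def align_up(n, quantum=16):
    """Round n up to the next multiple of quantum."""
    return -(-n // quantum) * quantum

def ffd_pack(lengths, budget, quantum=16):
    """First-Fit Decreasing: put each sequence in the first bin with room.

    Returns (bins, used): bins holds original sequence indices, used holds
    the aligned token count of each bin. No bin ever exceeds the budget.
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins, used = [], []
    for i in order:
        need = align_up(lengths[i], quantum)
        if need > budget:
            raise ValueError(f"sequence {i} needs {need} tokens, budget is {budget}")
        for b, filled in enumerate(used):
            if filled + need <= budget:
                bins[b].append(i)
                used[b] += need
                break
        else:  # no existing bin had room, so open a new one
            bins.append([i])
            used.append(need)
    return bins, used

lengths = [80, 90, 110, 130, 95, 1850, 2000, 60]
bins, used = ffd_pack(lengths, budget=2048)   # 2048 is a hypothetical budget
print(bins, used)
```

Three bins totalling 4,448 aligned slots carry the same 4,415 real tokens that the padded rectangle spreads over 8 × 2000 = 16,000 slots, and the peak allocation is bounded by the budget instead of by whatever the longest document happens to be.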

“OK, but how is this different from vLLM / paged attention / continuous batching?”

Reasonable question, and worth answering directly, because the inference-infra world has a lot of overlapping primitives and an HPC reader will ask this in the first comment.

vLLM / continuous batching is optimized for decode-time serving: many concurrent requests at different generation steps, schedule the next token across them, keep the GPU saturated under streaming load. Its headline primitive is paged attention, a KV-cache memory manager that pages physically non-contiguous blocks like an OS page table.
TensorRT-LLM, SGLang, and FlashInfer all support variants of varlen attention. Their packing logic typically lives inside a serving runtime tuned for live, latency-sensitive request streams.
WarpGroup-Backend targets the other half of the workload spectrum: offline / high-throughput, prefill-style jobs. Document evaluation, RAG indexing, batched embedding extraction, batched OAM-log summarization, bulk classification, eval harnesses. The unit of work is a finite corpus of variable-length sequences, not a streaming firehose of decode requests. The focus is host-side packing density, GIL-free async dispatch, and a tight pinned-memory handoff, not KV-cache paging.

Think of it this way: vLLM is a restaurant manager seating arriving diners across tables in real time. WarpGroup is a catering operation packing the day’s box lunches into delivery vans before the vans leave the depot. Different problems, complementary primitives, frequently co-deployable in the same building.

6. So… how do I actually try it?

The repo ships with a one-shot reproducer:

# 1. Clone
git clone https://github.com/AnubhabBanerjee/WarpGroup-backend.git
cd WarpGroup-backend
# 2. Python env + deps
python3.12 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
pip install -e .
# 3. Compile the C++ backend
mkdir build && cd build
cmake ..
make -j4
cp warpgroup_backend*.so ..
cd ..
# 4. Smoke test
python3 main_working_file.py

If you want the proper side-by-side benchmark (baseline HF vs. WarpGroup), the example_runs/ folder has a scripted end-to-end runner that generates a synthetic PDF corpus, runs both stacks, and writes JSON results + bar charts.

You’ll need:

Linux, CUDA toolkit, an NVIDIA GPU (consumer or datacenter, both work).
A PyTorch build with CUDA support (don’t ship the CPU-only one and then act surprised when nothing accelerates).
An LLM that supports flash_attention_2 (Qwen2.5, Llama-3, Mistral, SmolLM2, etc.).
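Before burning ten minutes on a CMake build, it can save some grief to check those prerequisites from the interpreter you plan to use. The script below is a hypothetical preflight helper, not something shipped in the repo; it only checks what Python can see (OS, whether torch and flash_attn are importable, and whether PyTorch reports a CUDA device).

```python
import importlib.util
import platform

def preflight():
    """Report which prerequisites are visible from this interpreter."""
    report = {
        "linux": platform.system() == "Linux",
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "flash_attn_installed": importlib.util.find_spec("flash_attn") is not None,
        "cuda_available": False,
    }
    if report["torch_installed"]:
        import torch  # imported lazily so the script still runs without it
        report["cuda_available"] = bool(torch.cuda.is_available())
    return report

for name, ok in preflight().items():
    print(f"{name}: {'ok' if ok else 'MISSING'}")
```

A CPU-only PyTorch wheel shows up here as torch_installed: ok but cuda_available: MISSING, which is exactly the "nothing accelerates" case from the list above.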

7. Plot twist: this is just MAC scheduling in a CUDA costume

I should probably confess at this point: I’m not a “GPU person” by training. I came up through telecom (5G NR, with a foot creeping firmly into 6G research), and I got into LLM inference infrastructure because every problem in this codebase felt unsettlingly familiar.

Look at this side-by-side and tell me with a straight face these are different problems:

5G NR MAC scheduler (on the gNB)	WarpGroup-Backend (on the GPU)
Variable-size MAC SDUs from each UE / logical channel	Variable-size token sequences from each document
Pack into a Transport Block every TTI	Pack into a VRAM bin every dispatch cycle
TB size bounded by available PRBs × MCS bits	Bin size bounded by empirical VRAM budget
Must align to LDPC code block segmentation (TS 38.212 §5.2.2)	Must align to 16-token Tensor Core tiles
Logical Channel Prioritization (LCP) picks what goes in	First-Fit Decreasing picks what goes in
Hard deadline: one slot (0.5 ms at numerology μ=1)	Soft deadline: keep the GPU fed
Skip a TB → PDSCH throughput craters	Skip a bin → GPU sits idle, throughput craters

If you have ever read 3GPP TS 38.321, you’re staring at the same algorithm. The MAC scheduler at the base station has been packing variable-size SDUs into Transport Blocks (sized to fit a fixed PRB grid, aligned to LDPC code-block thresholds, prioritized across logical channels) since LTE-Advanced. The only things that change in WarpGroup are the units (tokens, not bits), the budget (VRAM, not PRBs), and the alignment quantum (16-token tiles, not LDPC code-block sizes).

Even Phase 0 has a telecom doppelgänger. The repo probes the GPU with synthetic sequences until it OOMs, then backs off 10%. The RAN does the equivalent every TTI: it watches CQI / SINR reports, picks an MCS, watches BLER, then backs off. Both are saying the same thing: the spec gives you a theoretical maximum, but the only honest number is the one you measure under live conditions.
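The probe-then-back-off loop fits in a few lines. This is a sketch under stated assumptions: probe is a caller-supplied callable standing in for "run a synthetic batch of n tokens and report whether it survived", the doubling schedule and the max_tokens guard are my own choices, and the 10% back-off is applied to the last size that succeeded (the text above does not say whether the repo backs off from the failing size or the last good one).

```python
def autotune_budget(probe, start=1024, growth=2.0, backoff=0.10,
                    quantum=16, max_tokens=1 << 24):
    """Grow the token budget until probe() fails, then back off.

    probe(n) returns True if a synthetic batch of n tokens ran without
    running out of memory. In a real run it would wrap a forward pass and
    catch the framework's out-of-memory error.
    """
    last_ok = None
    n = start
    while n <= max_tokens and probe(n):
        last_ok = n
        n = int(n * growth)
    if last_ok is None:
        raise RuntimeError("even the starting budget failed")
    safe = int(last_ok * (1.0 - backoff))
    return safe - safe % quantum   # keep the budget aligned to 16-token tiles

# Stand-in for a real GPU probe: pretend anything above 40,000 tokens OOMs.
def fake_probe(n):
    return n <= 40_000

budget = autotune_budget(fake_probe)
print(budget)
```

With the fake probe, the last successful size is 32,768 tokens, so the tuned budget comes out as 29,488: 10% below that, rounded down to a multiple of 16.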

A quick aside to two very different audiences

To my HPC and CUDA-first friends reading this: I know. You’ve been doing exactly this since the first GPGPU papers landed in 2003. Bin packing is a freshman algorithms class, cudaHostAlloc is in every CUDA tutorial, and pinned memory is, for you, basically a personality trait. None of this is news. Please put the pitchforks down.

But it is news for telecom engineers, and that’s half the reason this article exists. For twenty years our world was FPGAs, ASICs, and PRBs. We optimized spectrum, not silicon. Ask the average RAN engineer to explain Tensor Core tile alignment and you’ll get a polite stare. Then AI-RAN, NWDAF, NVIDIA Aerial, SoftBank AITRAS, the AI-RAN Alliance, and the 3GPP Rel-20 study items all happened within roughly the same eighteen months, and the next decade of telecom careers now demands being bilingual between spectrum-world and GPU-world, with a lot of us starting from roughly zero on the GPU half. If the term “pinned memory” sounded like a foreign language until ten minutes ago: welcome, you’re not behind, you’re early. The intuition translates cleanly anyway. You already know how to pack variable-size payloads into a fixed-budget window under hardware-alignment constraints. You just used to call it MAC scheduling. Same animal, new zoo.

Consider this article a half-step on that road.

Why a working telecom engineer should care right now

This isn’t an abstract analogy. Four concrete reasons it lands in 2026:

6G is officially AI-native. ITU-R IMT-2030, 3GPP Rel-20 study items, the AI-RAN Alliance, O-RAN’s AI/ML working groups: all of them assume LLMs and large ML models live inside the network, not bolted onto the OSS/BSS later. Beam management, RIC xApps/rApps, NWDAF analytics, intent-based configuration, agentic OAM: each of these is a candidate workload for an inference stack that does exactly what WarpGroup does.
Voice is already a token stream. Neural audio codecs (EnCodec, SoundStream, Mimi) tokenize speech at 25–75 Hz. Voice-LLMs like Moshi, AudioPaLM, and Spirit-LM consume these tokens directly. A call center handling 10,000 concurrent calls is, computationally, a thundering herd of variable-length token streams arriving asynchronously from heterogeneous sources, which is exactly the workload this repo’s stress test simulates. Replace “400 PDFs” with “400 RTP streams” and the math is identical: same skew, same padding tax, same OOM cliff.
MEC has tiny GPUs. Multi-access Edge Compute nodes at the gNB / UPF tier don’t get to play with 8× H100 racks. They get one L4, maybe an A10, maybe a single H100 if procurement was in a good mood. Squeezing 2–6× more throughput out of a single edge GPU is the difference between “AI features available at the edge” and “AI features only in the core DC, plus 30 ms of extra round-trip.” That gap is exactly where URLLC-class applications live or die.
OAM telemetry is the boring killer app. PM counters, syslog events, CDRs, NetFlow / IPFIX records, NF traces: these are variable-length streams of structured-ish text that benefit massively from LLM-based summarization, anomaly detection, and intent translation. They also have brutal length variance: a normal call trace is a few hundred tokens; a single 5G handover-failure trace can run past 50,000. Padded batching on that distribution will make your inference cluster cry, then your operations director cry, then your CFO cry.

So when I see a codebase that does VRAM-aware FFD packing with hardware-tile alignment and a single-DMA pinned-memory handoff, I don’t see “GPU optimization.” I see the inference-side analog of a MAC scheduler. The reason I’m spending evenings on this isn’t a career pivot; it’s the same job, on different silicon, for the next generation of telecom workloads that will live half in the spectrum and half in the GPU.

Also, frankly, after a decade of reading 3GPP specs, a codebase whose entire scheduler you can git clone in 30 seconds is a vacation.

8. The moral, if you came here for one

There are three takeaways that I think generalize beyond this specific repo:

1. Default batching is a polite lie. “batch_size = 8” tells you nothing about how full your GPU is. The right unit is tokens in VRAM, and you have to measure it because no library will tell you the truth. The day you start thinking in tokens-per-bin instead of items-per-batch is the day your throughput graph stops embarrassing you.

2. The interesting performance work is at the boundaries. The expensive parts of an LLM pipeline are not the matrix multiplies; those have been hand-optimized by NVIDIA engineers with mortgages riding on it. The expensive parts are the transitions: tokens-to-tensors, host-to-device, Python-to-C++, scheduler-to-GPU. Almost every “Wait, I made it 2× faster” story in modern ML is a boundary story.

3. Sometimes the right answer is “write the C++.” Not all of it. Not even most of it. Python can absolutely coordinate high-performance inference systems; vLLM proves it every day. But moving the hot-path packing loop into a background C++ thread sheds two specific Python-side costs that bite under load: GIL contention with the ingest thread, and interpreter + allocator overhead in a tight scheduling loop where every microsecond is part of the budget. The right Python / C++ ratio for high-throughput inference infra isn’t 100/0 or 0/100; it’s a thin, well-defined PyBind11 boundary with the latency-critical scheduler on the C++ side and all the interesting stuff (model code, business logic, glue) on the Python side. WarpGroup is ~64% Python, ~30% C++, ~5% build glue. That ratio is not an accident.
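To show the shape of that boundary, here is a pure-Python stand-in: a producer submits (id, length) pairs, a background worker drains them into token-budget bins, and a dispatcher reads finished bins from an outbox. Everything here (the PackerWorker class, its method names, the 2048-token budget, the simple arrival-order packing) is a hypothetical illustration. A Python thread does not escape the GIL, which is precisely why the repo puts this loop in C++ behind PyBind11.

```python
import queue
import threading

class PackerWorker:
    """Background thread that drains submitted sequences into token-budget bins."""
    _STOP = object()

    def __init__(self, budget):
        self.budget = budget
        self.inbox = queue.Queue()    # producer -> packer
        self.outbox = queue.Queue()   # packer -> dispatcher
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, seq_id, length):
        self.inbox.put((seq_id, length))

    def close(self):
        """Signal end of input and wait for the worker to flush its last bin."""
        self.inbox.put(self._STOP)
        self._thread.join()

    def _run(self):
        current, used = [], 0
        while True:
            item = self.inbox.get()
            if item is self._STOP:
                break
            seq_id, length = item
            if current and used + length > self.budget:
                self.outbox.put(current)      # bin is full: hand it off
                current, used = [], 0
            current.append(seq_id)
            used += length
        if current:
            self.outbox.put(current)

worker = PackerWorker(budget=2048)
for seq_id, length in enumerate([900, 900, 500, 1500, 300]):
    worker.submit(seq_id, length)
worker.close()
bins = [worker.outbox.get() for _ in range(worker.outbox.qsize())]
print(bins)
```

The Python-facing surface is just submit, close, and an outbox of finished bins; that narrow interface is the part that carries over when the worker is replaced by a native thread.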

9. Where this goes next

The repo’s roadmap hints at the obvious next move: multi-GPU sharding. Extend the dispatcher to manage multiple C++ queues and distribute dynamically sized bins across local interconnects (NVLink / PCIe). The hard part isn’t the C++; it’s deciding how to balance bin sizes across devices when sequence-length distributions are skewed. (Pull requests welcome, etc.)
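One plausible starting point for that balancing problem is the classic greedy "largest first, least-loaded device next" heuristic. The sketch below is my own illustration, not the roadmap's design: assign_bins and the bin fill levels are hypothetical, and it uses token count as a stand-in for execution time, which ignores interconnect cost and the fact that attention cost depends on the sequence lengths inside each bin.

```python
import heapq

def assign_bins(bin_tokens, num_devices):
    """Greedy balancing: give the next-largest bin to the least-loaded device."""
    heap = [(0, d) for d in range(num_devices)]   # (load in tokens, device id)
    plan = {d: [] for d in range(num_devices)}
    largest_first = sorted(range(len(bin_tokens)),
                           key=lambda i: bin_tokens[i], reverse=True)
    for b in largest_first:
        load, d = heapq.heappop(heap)             # least-loaded device
        plan[d].append(b)
        heapq.heappush(heap, (load + bin_tokens[b], d))
    loads = {d: sum(bin_tokens[b] for b in bs) for d, bs in plan.items()}
    return plan, loads

bin_tokens = [2000, 2000, 448, 1200, 1600, 800]   # hypothetical bin fill levels
plan, loads = assign_bins(bin_tokens, num_devices=2)
print(plan, loads)
```

On this toy input the two devices end up within 48 tokens of each other. Skewed distributions are where it gets harder, since one nearly full bin of long sequences and one nearly full bin of short ones can have the same token count and quite different run times.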

If you want to nerd out further, the parts I’d love to see explored:

Adaptive kPackWaitWindow: that 5 ms accumulation window is a hand-tuned constant. It probably wants to be a function of observed producer rate.
Speculative bin reservation: pre-pin a second bin’s worth of host memory so we can pack the next batch while the current one is still on the wire.
Continuous batching for generation: right now this is a prefill/eval pipeline. Hooking it up to streaming decode for a chat server would be the natural extension.
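As a conversation starter for the first item, here is one way a rate-driven window could look: track an exponentially weighted producer rate and wait just long enough to collect roughly one bin's worth of tokens, clamped to sane bounds. The class name, the clamp limits, the smoothing factor, and the 2048-token target are all assumptions of this sketch; only the idea of replacing the fixed 5 ms constant comes from the list above.

```python
class AdaptiveWindow:
    """Pick a wait window from an exponentially weighted producer rate."""

    def __init__(self, target_tokens, min_ms=1.0, max_ms=20.0, alpha=0.2):
        self.target_tokens = target_tokens   # tokens we would like per bin
        self.min_ms, self.max_ms = min_ms, max_ms
        self.alpha = alpha                   # EWMA smoothing factor
        self.rate = None                     # tokens per millisecond

    def observe(self, tokens, elapsed_ms):
        """Fold one measurement (tokens seen over elapsed_ms > 0) into the rate."""
        sample = tokens / elapsed_ms
        if self.rate is None:
            self.rate = sample
        else:
            self.rate = self.alpha * sample + (1 - self.alpha) * self.rate

    def window_ms(self):
        if not self.rate:
            return self.max_ms               # nothing seen yet: wait the longest
        ideal = self.target_tokens / self.rate
        return min(self.max_ms, max(self.min_ms, ideal))

fast_w = AdaptiveWindow(target_tokens=2048)
fast_w.observe(tokens=4096, elapsed_ms=5.0)   # busy producer
fast = fast_w.window_ms()

slow_w = AdaptiveWindow(target_tokens=2048)
slow_w.observe(tokens=100, elapsed_ms=5.0)    # trickle of input
slow = slow_w.window_ms()
print(fast, slow)
```

A busy producer shrinks the window (about 2.5 ms here) so full bins ship sooner, while a trickle hits the upper clamp instead of waiting forever for a bin that will never fill.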

10. Wrap

Padding is the silent tax on every LLM workload that touches variable-length text. WarpGroup-Backend’s contribution isn’t a new algorithm (FFD bin packing has existed since the 1970s); it’s the engineering integration: empirical VRAM autotuning, GIL-free async dispatch, 16-token alignment for the downstream kernels, and a single-DMA pinned-memory handoff into FlashAttention-2’s varlen kernel, all glued together so that a single python3 main_working_file.py produces 2× to 6× throughput on real hardware.

If you build LLM inference infrastructure for a living, clone the repo, read the C++ files (they have generous comments and the occasional dry joke), and consider how many of your own pipelines are currently paying the padding tax.

If you build telecom systems for a living and you suspect the next decade of your job is going to involve a lot more inference servers than you originally signed up for: same advice. The MAC scheduler in your gNB and the bin packer in this repo are reading from the same playbook.

If you’re a beginner who just wanted to understand why GPUs hate variable-length text: congratulations, you now know more than 80% of people building these things for a living. Go forth and stop padding things.

About the repo

If you enjoyed this, the kindest things you can do are: ⭐ the repo, share this post, and tell one PyTorch user in your life that their batch_size is lying to them.

Now go yell at your GPU. Lovingly.
