    Optimizing Token Generation in PyTorch Decoder Models

    By Editor Times Featured, February 24, 2026 (18 min read)


    The generative AI systems that have pervaded nearly every aspect of our daily lives are, for the most part, autoregressive decoder models. These models apply compute-heavy kernel operations to churn out tokens one at a time in a manner that, at first glance, seems extremely inefficient. Given the enormous demand for generative AI, it’s no surprise that extraordinary engineering effort is being invested in its optimization. Whether through custom CUDA kernels, CUDA Graphs, dedicated AI accelerators, or speculative sampling, any technique that reduces latency and/or cost by even a fraction of a percent is a win.

    In this post, we demonstrate a technique for optimizing token generation in PyTorch using CUDA stream interleaving. While simple to implement, the method addresses a specific, often overlooked bottleneck and can lead to meaningful performance boosts. While pipelining model execution using CUDA streams is common in AI systems engineering, we did not find any tutorial documenting the specific PyTorch-level application we describe here. If you find the technique useful, please be so kind as to reference this post.

    To facilitate our discussion, we’ll use a simple GPT-2 PyTorch decoder model from HuggingFace’s transformers (v5.1.0) library. We’ll run our experiments on an NVIDIA L40S GPU with PyTorch (2.10.0).

    Disclaimer: The code we’ll share is intended for demonstrative purposes. Please do not rely on its accuracy or optimality. Please do not interpret our mentions of any library, platform, or service as an endorsement of its use.

    Importantly, the value of the CUDA stream-based method we’ll discuss can vary greatly based on the details of your model and runtime environment. Please be sure to run your own benchmarks before integrating its use.

    Our focus in this post is on PyTorch-native inference workloads, which remain extremely prevalent in development and test settings. However, it is important to note that for production environments, dedicated LLM inference libraries such as vLLM or NVIDIA TensorRT-LLM tend to deliver better performance and should be used whenever relevant.

    A Toy GPT-2 Model

    To simplify our discussion, we’ll use a GPT-2 decoder model from the HuggingFace transformers library and have it run autoregressively on a batch of empty prompts.

    In the following code block, we initialize the model and define a naive token generation function that creates a batch of random token sequences up to a given length.

    import torch
    from transformers import GPT2LMHeadModel, GPT2Config

    torch.set_float32_matmul_precision('high')

    DEVICE = "cuda"

    # define the decoder model (built from the config, so the weights
    # are randomly initialized rather than pretrained)
    config = GPT2Config.from_pretrained("gpt2")
    model = GPT2LMHeadModel(config).to(DEVICE).eval()


    @torch.inference_mode()
    def generate_sequence(model, max_seqlen, batch_size):
        # Initialize prompts with BOS token
        all_tokens = torch.full(
            (batch_size, 1),
            config.bos_token_id,
            device=DEVICE,
            dtype=torch.long
        )
        finished = torch.zeros(batch_size, device=DEVICE, dtype=torch.bool)

        for i in range(max_seqlen):
            # no cache: the model re-processes the whole sequence each step
            outputs = model(all_tokens)
            # extract new token
            logits = outputs.logits[:, -1, :]
            new_tokens = torch.argmax(logits, dim=-1)
            # append new token to sequence
            all_tokens = torch.cat(
                [all_tokens, new_tokens.unsqueeze(-1)],
                dim=-1
            )
            finished |= (new_tokens == config.eos_token_id)
            stop_gpu = torch.all(finished)

            # checking stop condition
            if stop_gpu.item():
                print(f"All sequences finished at step {i+1}")
                break

        return all_tokens

    Next, we define a simple benchmarking function which we use to measure the runtime performance and memory usage of our token generator in different scenarios.

    import time, statistics


    def benchmark(func, num_runs=10):
        # Warmup
        func()
        torch.cuda.synchronize()

        runtimes = []

        for _ in range(num_runs):
            # reset memory stats before each run
            torch.cuda.empty_cache()
            torch.cuda.reset_peak_memory_stats()
            torch.cuda.synchronize()

            start = time.perf_counter()
            _ = func()
            torch.cuda.synchronize()
            end = time.perf_counter()

            runtimes.append(end - start)

        # Get memory allocator stats from last run
        mem_stats = torch.cuda.memory_stats()
        allocated_peak = mem_stats.get('allocated_bytes.all.peak', 0)
        reserved_peak = mem_stats.get('reserved_bytes.all.peak', 0)
        # memory reserved by the allocator but not backing any tensor
        f_peak = reserved_peak - allocated_peak
        f_pct = (
            100 * f_peak / reserved_peak
            if reserved_peak > 0 else 0
        )

        print(f"\n{'='*60}")
        print(f"Runtime Results:")
        print(f" Mean:               {statistics.mean(runtimes):.4f}s")
        print(f" Std:                {statistics.stdev(runtimes):.4f}s")
        print(f" Min:                {min(runtimes):.4f}s")
        print(f" Max:                {max(runtimes):.4f}s")

        print(f"\nMemory Stats:")
        print(f" Allocated bytes (peak): {allocated_peak / 1e9:.3f} GB")
        print(f" Reserved bytes (peak):  {reserved_peak / 1e9:.3f} GB")
        print(f" Fragmentation (peak):   {f_peak / 1e9:.3f} GB ({f_pct:.1f}%)")
        print(f"{'='*60}\n")


    batch_size = 32
    for max_seqlen in [100, 200, 400]:
        print(
            f"Benchmarking generation with batch size {batch_size} "
            f"and max sequence length {max_seqlen}..."
        )
        benchmark(
            lambda: generate_sequence(
                model, max_seqlen=max_seqlen, batch_size=batch_size
            )
        )

    In the table below we capture the results for a batch size of 32 and several different sequence lengths:

    Baseline Results (By Author)

    As the sequence length doubles, the runtime quadruples, appearing to follow a classic O(N²) scaling pattern. Additionally, high memory fragmentation points to severe strain on the CUDA memory allocator, which can result in frequent memory faults and degrade runtime performance. The fragmentation results from each step asking for slightly larger tensor allocations, a pattern which ends up leaving multiple pockets of unusable memory.
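
    To make the quadratic growth concrete, the sketch below counts how many token positions the naive generator pushes through the model over a full run. It is a back-of-the-envelope cost model rather than a measurement: the function name positions_processed_without_cache is ours, and it assumes that cost is proportional to the number of positions processed, ignoring attention’s own dependence on sequence length and any fixed per-step overhead.

```python
def positions_processed_without_cache(max_seqlen: int) -> int:
    """Total token positions fed to the model by the naive generator.

    Step i (0-based) re-runs the model on the BOS token plus the i
    tokens generated so far, i.e. on i + 1 positions.
    """
    return sum(i + 1 for i in range(max_seqlen))


if __name__ == "__main__":
    previous = None
    for n in (100, 200, 400):
        total = positions_processed_without_cache(n)
        growth = "" if previous is None else f" ({total / previous:.2f}x the previous row)"
        print(f"max_seqlen={n:>3}: {total:>6} positions{growth}")
        previous = total
```

    Under this simplified model, each doubling of the sequence length roughly quadruples the work, which is consistent with the scaling observed in the benchmark.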

    Our first optimization, KV caching, addresses the runtime complexity of our decoder model.

    KV Caching

    Our naive generator is extremely inefficient: rather than storing and reusing the intermediate tensors from previous tokens, it recalculates the entire sequence at every step.

    We address the computation inefficiency by using KV caching: we store and reuse the intermediate Key and Value tensors for previous tokens. KV caching reduces the runtime complexity of token generation from O(N²) to O(N).
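
    The idea can be illustrated with a toy single-head attention written in plain Python. This is a minimal sketch and not the GPT-2 or transformers implementation: the weights are random, the dimension DIM is arbitrary, and all function names are hypothetical. It checks that attending with cached keys and values gives the same output for the newest token as re-projecting the whole sequence, while counting how many tokens have to be projected in each case.

```python
import math
import random

DIM = 4  # toy embedding size, chosen arbitrarily for the illustration
rng = random.Random(0)


def rand_matrix():
    return [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(DIM)]


W_Q, W_K, W_V = rand_matrix(), rand_matrix(), rand_matrix()


def project(matrix, vec):
    """Matrix-vector product on plain Python lists."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def attend(query, keys, values):
    """Softmax attention of a single query over all keys/values."""
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(DIM) for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    norm = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / norm for d in range(DIM)]


def step_without_cache(embeddings):
    """Re-project K and V for every token seen so far."""
    keys = [project(W_K, e) for e in embeddings]
    values = [project(W_V, e) for e in embeddings]
    return attend(project(W_Q, embeddings[-1]), keys, values), len(embeddings)


def step_with_cache(new_embedding, cache):
    """Project K and V for the new token only and reuse the cached rest."""
    cache["keys"].append(project(W_K, new_embedding))
    cache["values"].append(project(W_V, new_embedding))
    return attend(project(W_Q, new_embedding), cache["keys"], cache["values"]), 1


def compare(num_tokens=8):
    """Run both variants side by side and count projected tokens."""
    embeddings = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(num_tokens)]
    cache = {"keys": [], "values": []}
    projected = {"without_cache": 0, "with_cache": 0}
    for t in range(1, num_tokens + 1):
        full, n_full = step_without_cache(embeddings[:t])
        fast, n_fast = step_with_cache(embeddings[t - 1], cache)
        # both variants must agree on the output for the newest token
        assert all(math.isclose(a, b) for a, b in zip(full, fast))
        projected["without_cache"] += n_full
        projected["with_cache"] += n_fast
    return projected


if __name__ == "__main__":
    print(compare())
```

    The cached variant projects one token per step, whereas the uncached variant projects every token seen so far at each step.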

    In the following code block, we use the transformers library’s built-in support for KV caching to reprogram our token generation function to compute a single batch of tokens in each step.

    @torch.inference_mode()
    def generate_sequence(model, max_seqlen, batch_size, use_cache=False):
        # Initialize prompts with BOS token
        all_tokens = torch.full(
            (batch_size, 1),
            config.bos_token_id,
            device=DEVICE,
            dtype=torch.long
        )
        finished = torch.zeros(batch_size, device=DEVICE, dtype=torch.bool)

        # past_key_values is used to store the cached key/values for each layer
        past_key_values = None

        for i in range(max_seqlen):
            # with a populated cache, only the newest token is fed to the model
            current_input = (
                all_tokens if past_key_values is None
                else all_tokens[:, -1:]
            )
            outputs = model(
                current_input,
                past_key_values=past_key_values,
                use_cache=use_cache
            )
            # update cache for next step
            past_key_values = outputs.past_key_values
            logits = outputs.logits[:, -1, :]
            new_tokens = torch.argmax(logits, dim=-1)
            # append new token to sequence
            all_tokens = torch.cat(
                [all_tokens, new_tokens.unsqueeze(-1)],
                dim=-1
            )
            finished |= (new_tokens == config.eos_token_id)
            stop_gpu = torch.all(finished)

            # checking stop condition
            if stop_gpu.item():
                print(f"All sequences finished at step {i+1}")
                break

        return all_tokens

    The resulting performance numbers are captured in the following table:

    Token Generation With KV Caching (By Author)

    The performance improvement is profound and, as expected, increases as a function of the sequence length.

    Although significantly better than in our baseline experiment, the degree of memory fragmentation remains a concern. To address this we explore two techniques: expandable memory allocations and static KV caching.

    Expandable CUDA Memory Allocations

    To reduce CUDA memory fragmentation, we program PyTorch to use expandable memory segments. As of the time of this writing, this memory optimization is an experimental feature and should be used with caution. Please see the PyTorch documentation for details. To use the feature we set the following environment variable:

    export PYTORCH_ALLOC_CONF="expandable_segments:True"
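
    When exporting the variable in the shell is inconvenient, one option is to set it from Python. The sketch below assumes that the allocator reads the setting when CUDA is first initialized, so it should run before any CUDA work (safest is before importing torch). The helper name enable_expandable_segments is hypothetical, and the option string format (comma-separated key:value pairs) follows the PyTorch allocator configuration convention.

```python
import os

ALLOC_CONF_VAR = "PYTORCH_ALLOC_CONF"  # the variable name used in this post


def enable_expandable_segments(env=os.environ):
    """Add expandable_segments:True to the allocator config, keeping other options."""
    options = [opt for opt in env.get(ALLOC_CONF_VAR, "").split(",") if opt]
    # drop any previous expandable_segments setting so ours is the only one
    options = [opt for opt in options if not opt.startswith("expandable_segments:")]
    options.append("expandable_segments:True")
    env[ALLOC_CONF_VAR] = ",".join(options)
    return env[ALLOC_CONF_VAR]


if __name__ == "__main__":
    # call this before importing torch or touching CUDA
    print(enable_expandable_segments())
```

    The helper preserves any other allocator options that are already set, which matters if the environment also configures, for example, a split size limit.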

    Rerunning our benchmark results in the following table:

    KV Caching With Expandable Memory Segments (By Author)

    Not only do we see a marked improvement in fragmentation, but we also get an additional (marginal) improvement in runtime performance.

    KV Caching With StaticCache

    The default cache in HuggingFace is dynamic: it grows as the number of keys and values increases during generation. HuggingFace also supports a fixed-size cache, StaticCache, which pre-allocates a maximum cache size for the KV pairs and reduces strain on the CUDA memory allocator. The disadvantage of using StaticCache is that the full length of the cache participates in the attention computation at every token generation step, with irrelevant positions masked out. This results in a waste of computation that grows with the sequence length. For example, when generating a sequence of 400 tokens, the attention computation at every step spans all 400 cache positions, even when only a few of them hold valid tokens.
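
    A rough count shows how much attention work the fixed-size cache adds. The sketch below is a simplified model under stated assumptions: the function name attended_positions is ours, masked positions are assumed to cost as much as valid ones, and constant factors (layers, heads, head dimension) are dropped.

```python
def attended_positions(max_seqlen: int) -> dict:
    """Key/value positions entering attention, summed over all decode steps."""
    # dynamic cache: at step i (0-based) the cache holds i + 1 entries
    dynamic = sum(step + 1 for step in range(max_seqlen))
    # static cache: every step attends over the full pre-allocated length
    static = max_seqlen * max_seqlen
    return {"dynamic": dynamic, "static": static, "wasted": static - dynamic}


if __name__ == "__main__":
    for n in (100, 200, 400):
        counts = attended_positions(n)
        share = 100 * counts["wasted"] / counts["static"]
        print(f"max_seqlen={n}: {counts} ({share:.1f}% of static work is masked)")
```

    In this model roughly half of the attended positions are masked out over a full run, which is why the static cache can lose to the dynamic cache despite its friendlier memory behavior.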

    In the code block below we enhance our sequence generator to support the use of a StaticCache:

    from transformers import StaticCache

    @torch.inference_mode()
    def generate_sequence(
        model, max_seqlen, batch_size, use_cache=False, use_static_cache=False
    ):
        # Initialize prompts with BOS token
        all_tokens = torch.full(
            (batch_size, 1),
            config.bos_token_id,
            device=DEVICE,
            dtype=torch.long
        )
        finished = torch.zeros(batch_size, device=DEVICE, dtype=torch.bool)

        # Initialize static cache if requested
        if use_cache and use_static_cache:
            past_key_values = StaticCache(
                config=config,
                max_batch_size=batch_size,
                max_cache_len=max_seqlen,
                device=DEVICE,
                dtype=model.dtype
            )
        else:
            past_key_values = None

        # Initialize cache position tracking for static cache
        cache_positions = torch.arange(max_seqlen, device=DEVICE)

        for i in range(max_seqlen):
            current_input = (
                all_tokens if past_key_values is None
                else all_tokens[:, -1:]
            )
            # slot of the pre-allocated cache written in this step
            cache_position = (
                cache_positions[i:i+1] if use_static_cache else None
            )
            outputs = model(
                current_input,
                past_key_values=past_key_values,
                cache_position=cache_position,
                use_cache=use_cache
            )
            # update cache for next step
            past_key_values = outputs.past_key_values
            logits = outputs.logits[:, -1, :]
            new_tokens = torch.argmax(logits, dim=-1)
            # append new token to sequence
            all_tokens = torch.cat(
                [all_tokens, new_tokens.unsqueeze(-1)],
                dim=-1
            )
            finished |= (new_tokens == config.eos_token_id)
            stop_gpu = torch.all(finished)

            # checking stop condition
            if stop_gpu.item():
                print(f"All sequences finished at step {i+1}")
                break

        return all_tokens

    The updated results are captured below:

    Token Generation With Static KV Cache (By Author)

    Using a fixed-size cache greatly improves memory utilization, as indicated by the decrease in memory fragmentation. However, its impact on runtime performance is mixed: for 100 tokens it reduces performance compared to a dynamic cache, while for 200 and 400 tokens it boosts performance by 9% and 10%, respectively.

    There are more advanced methods of implementing attention that optimize for memory utilization without the cost of wasted computation. In a previous post, Optimizing Transformer Models for Variable-Length Input Sequences, we covered some PyTorch techniques for computing attention sparsely to reduce computation waste. For production settings, libraries such as vLLM use PagedAttention to maximize memory utilization. These methods are outside the scope of this post.

    For more details on caching in HuggingFace, please see the caching strategies overview.

    Model Compilation

    One of the documented advantages of using a fixed-size cache is that it allows for taking advantage of many just-in-time (JIT) optimizations.

    In the following code block we apply our benchmark to a PyTorch-compiled version of our decoder model:

    batch_size = 32
    max_seqlen = 100

    model = torch.compile(model)

    benchmark(
        lambda: generate_sequence(
            model,
            max_seqlen=max_seqlen,
            batch_size=batch_size,
            use_cache=True,
            use_static_cache=True
        )
    )

    Model compilation results in an additional boost to runtime performance as shown in the table below:

    Token Generation With torch.compile (By Author)

    Note that we can apply model compilation when using dynamic caching, as well. However, torch.compile provides the best results when the computation graph consists of fixed-size tensors (e.g., see here for more details).

    The Performance Penalty of Early Stopping

    An integral part of common token generators is checking for the end-of-sequence (EOS) token at the end of each step. Without this test, token generators would always run for max_seqlen steps, even if all the sequences in the batch have ended. This could result in considerable computation waste and unnecessary latency, especially when common sequence lengths are much shorter than the maximum length. In the case of our toy experiment, we wait for all the sequences in the batch to complete and then discontinue token generation. Production-grade implementations will commonly perform continuous batching, replacing completed sequences with new prompts from the input queue.

            finished |= (new_tokens == config.eos_token_id)
            stop_gpu = torch.all(finished)

            # checking stop condition
            if stop_gpu.item():
                print(f"All sequences finished at step {i+1}")
                break
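
    The stopping logic above can be mirrored in plain Python to see how many steps a batch actually runs. This is an illustrative sketch only: the function name steps_until_all_finished and the EOS positions in the usage example are made up, and the real loop operates on GPU tensors rather than Python lists.

```python
def steps_until_all_finished(eos_steps, max_seqlen):
    """Number of decode steps run by a batch that stops once every sequence hit EOS.

    eos_steps[b] is the (1-based) step at which sequence b emits EOS,
    or None if it never does within max_seqlen.
    """
    finished = [False] * len(eos_steps)
    for step in range(1, max_seqlen + 1):
        for b, eos_step in enumerate(eos_steps):
            # mirrors: finished |= (new_tokens == eos_token_id)
            finished[b] = finished[b] or (eos_step == step)
        # mirrors: torch.all(finished)
        if all(finished):
            return step
    return max_seqlen


if __name__ == "__main__":
    # hypothetical batch: three sequences ending at different steps
    print(steps_until_all_finished([12, 40, 7], max_seqlen=100))
```

    The batch runs until its slowest sequence ends, so a single sequence that never emits EOS forces the full max_seqlen steps.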

    Importantly, the .item() call on the stop_gpu tensor triggers a blocking host-device synchronization event. More specifically, in order to evaluate the conditional if statement, the CPU must wait for the GPU to complete its computation and copy the contents of the tensor to host memory. While the CPU waits, it is blocked from executing the next step of the token generation loop, or more accurately, it is blocked from loading the next computation kernels onto the GPU.

    To measure the impact of the stopping condition on runtime performance, we add instrumentation for performance profiling with NVIDIA Nsight™ Systems (nsys) using the torch.cuda.profiler and nvtx (v0.2.14) APIs. (See our recent post for more details on performance profiling with nsys.)

    import nvtx
    from torch.cuda import profiler

    @torch.inference_mode()
    def generate_sequence(
        model, max_seqlen, batch_size, use_cache=False, use_static_cache=False
    ):
        # Initialize prompts with BOS token
        all_tokens = torch.full(
            (batch_size, 1),
            config.bos_token_id,
            device=DEVICE,
            dtype=torch.long
        )
        finished = torch.zeros(batch_size, device=DEVICE, dtype=torch.bool)

        # Initialize static cache if requested
        if use_cache and use_static_cache:
            past_key_values = StaticCache(
                config=config,
                max_batch_size=batch_size,
                max_cache_len=max_seqlen,
                device=DEVICE,
                dtype=model.dtype
            )
        else:
            past_key_values = None

        # Initialize cache position tracking for static cache
        cache_positions = torch.arange(max_seqlen, device=DEVICE)

        for i in range(max_seqlen):
            if i == 30:
                # start nsys profiler
                torch.cuda.synchronize()
                profiler.start()
            elif i == 50:
                # stop nsys profiler
                torch.cuda.synchronize()
                profiler.stop()
            with nvtx.annotate(f"Step {i+1}", color="blue"):
                with nvtx.annotate("Model Forward", color="green"):
                    current_input = (
                        all_tokens if past_key_values is None
                        else all_tokens[:, -1:]
                    )
                    cache_position = (
                        cache_positions[i:i+1] if use_static_cache else None
                    )
                    outputs = model(
                        current_input,
                        past_key_values=past_key_values,
                        cache_position=cache_position,
                        use_cache=use_cache
                    )
                    past_key_values = outputs.past_key_values
                    logits = outputs.logits[:, -1, :]
                    new_tokens = torch.argmax(logits, dim=-1)
                    all_tokens = torch.cat(
                        [all_tokens, new_tokens.unsqueeze(-1)],
                        dim=-1
                    )
                    finished |= (new_tokens == config.eos_token_id)
                    stop_gpu = torch.all(finished)
                with nvtx.annotate("Check Stop Condition", color="purple"):
                    # checking stop condition (blocks until the GPU catches up)
                    if stop_gpu.item():
                        print(f"All sequences finished at step {i+1}")
                        break

        return all_tokens

    We run our script using the cudaProfilerApi option to start and stop the profiler programmatically. Please see the official documentation for full details on profiling from the nsys CLI.

    nsys profile \
      --capture-range=cudaProfilerApi \
      --trace=cuda,nvtx,osrt \
      --output=baseline \
      python train.py

    The following trace, captured for a batch size of 16 and sequence length of 100, shows the GPU idling for about 110 microseconds in between steps, an eternity in the context of high-performance GPU workloads. This is a direct result of the synchronization event triggered by the EOS test.

    GPU Utilization Drops Between Each Step (By Author)

    In production-grade implementations such synchronization points are avoided by some combination of 1) use of lower-level (e.g., C/C++) code that avoids the limitations of the Python interpreter, 2) using CUDA graphs to reduce the overhead of kernel loading, 3) moving conditional checks onto the GPU using conditional nodes, and 4) continuously and asynchronously preparing subsequent requests while the EOS check is in progress.

    In the next section, we demonstrate a technique for hiding the overhead of the host-device synchronization in PyTorch using CUDA streams.

    A CUDA Stream Optimization

    A CUDA stream is a linear sequence of operations (kernels, memory copies, etc.) that execute in order on the GPU. While operations within a single stream are guaranteed to execute sequentially, operations in different streams can execute concurrently or overlap.

    In previous posts (e.g., here and here) we demonstrated the use of CUDA streams in pipelining common AI/ML workloads, e.g., executing a model on batch N while preparing batch N+1. In this post we’ll use CUDA streams to enable the CPU to load the GPU kernels of step N+1 before checking the stopping criteria of step N. Contrary to our previous demonstrations of CUDA streams, our current example will not necessarily involve concurrent GPU kernel execution.

    We implement an alternative token generation function that interleaves two CUDA streams, running the following operations iteratively:

    1. Program stream i%2 to: (A) wait for stream (i-1)%2 to complete its generation of token i-1, (B) use the updated tensors to calculate token i, (C) run the EOS test for token i on the GPU, and (D) perform a (non-blocking) copy of the EOS test result to pinned memory on the CPU.

    2. On the default CUDA stream, wait for stream (i-1)%2 to complete its generation of token i-1.

    3. On the default CUDA stream, check whether the stopping criteria for token i-1 were met. If so, halt the generator and return. Otherwise, increment i and return to step 1.

    Whereas previously the launch of token i generation was blocked by the EOS test on token i-1, the use of CUDA streams allows us to program the generation of token i before we check the result of the EOS test on token i-1. In practice, the EOS test for token i-1 on the CPU runs while the GPU is computing token i.
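
    The effect of this reordering can be illustrated with a toy timeline model in plain Python. This is a sketch under strong simplifying assumptions, not a model of real GPU scheduling: each step has a fixed host-side launch time, a fixed GPU compute time, and a fixed delay for the stop flag to reach the host; the GPU is assumed to start a step only once it has been fully enqueued. The function names and all timing values are hypothetical.

```python
def blocking_loop(steps, launch_us, compute_us, sync_us):
    """Host checks the stop flag of step i before enqueuing step i + 1."""
    cpu = gpu_end = 0.0
    for _ in range(steps):
        cpu += launch_us                           # host enqueues the step's kernels
        gpu_end = max(gpu_end, cpu) + compute_us   # GPU runs the step
        cpu = gpu_end + sync_us                    # .item() blocks until the flag arrives
    return cpu


def interleaved_loop(steps, launch_us, compute_us, sync_us):
    """Host enqueues step i first, then checks the stop flag of step i - 1."""
    cpu = gpu_end = flag_ready = 0.0
    for _ in range(steps):
        cpu += launch_us                           # enqueue step i on stream i % 2
        gpu_end = max(gpu_end, cpu) + compute_us   # GPU runs step i right after step i - 1
        cpu = max(cpu, flag_ready)                 # only now read the flag of step i - 1
        flag_ready = gpu_end + sync_us             # when the flag of step i reaches the host
    return max(cpu, flag_ready)


if __name__ == "__main__":
    # made-up timings in microseconds, for illustration only
    params = dict(steps=100, launch_us=50, compute_us=400, sync_us=100)
    print("blocking   :", blocking_loop(**params))
    print("interleaved:", interleaved_loop(**params))
```

    In the interleaved variant the GPU receives the next step while the previous flag is still in flight, so as long as launching is faster than computing, the per-step cost in this model approaches the GPU compute time alone.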

    @torch.inference_mode()
    def generate_sequence_pipelined(
        model,
        max_seqlen,
        batch_size,
        use_cache=False,
        use_static_cache=False
    ):
        # Initialize prompts with BOS token
        all_tokens = torch.full(
            (batch_size, 1),
            config.bos_token_id,
            device=DEVICE,
            dtype=torch.long
        )
        finished = torch.zeros(batch_size, device=DEVICE, dtype=torch.bool)
        past_key_values = None

        # Initialize static cache if requested
        if use_cache and use_static_cache:
            past_key_values = StaticCache(
                config=config,
                max_batch_size=batch_size,
                max_cache_len=max_seqlen,
                device=DEVICE,
                dtype=model.dtype
            )

        # Initialize cache position tracking for static cache
        cache_positions = torch.arange(max_seqlen, device=DEVICE)

        # Dual streams for pipelining
        streams = [torch.cuda.Stream(), torch.cuda.Stream()]
        # added for safety: make the side streams wait for the tensors
        # initialized above on the current stream
        for s in streams:
            s.wait_stream(torch.cuda.current_stream())
        # pinned host buffers that receive the stop flag of each stream
        stop_host = [
            torch.tensor(False, pin_memory=True),
            torch.tensor(False, pin_memory=True)
        ]

        for i in range(max_seqlen):
            curr_idx, prev_idx = i % 2, (i+1) % 2
            curr_s, prev_s = streams[curr_idx], streams[prev_idx]

            # Launch iteration i in current stream
            with torch.cuda.stream(curr_s):
                # program stream to wait for previous stream to complete
                curr_s.wait_stream(prev_s)
                current_input = (
                    all_tokens if past_key_values is None
                    else all_tokens[:, -1:]
                )
                cache_position = (
                    cache_positions[i:i+1] if use_static_cache else None
                )
                outputs = model(
                    current_input,
                    past_key_values=past_key_values,
                    cache_position=cache_position,
                    use_cache=use_cache
                )
                past_key_values = outputs.past_key_values
                logits = outputs.logits[:, -1, :]
                new_tokens = torch.argmax(logits, dim=-1)
                all_tokens = torch.cat(
                    [all_tokens, new_tokens.unsqueeze(-1)],
                    dim=-1
                )

                finished |= (new_tokens == config.eos_token_id)
                stop_gpu = torch.all(finished)
                # asynchronous copy of the stop flag to pinned host memory
                stop_host[curr_idx].copy_(stop_gpu, non_blocking=True)

            # Check previous iteration's stop signal
            torch.cuda.current_stream().wait_stream(prev_s)
            if stop_host[prev_idx].item():
                print(f"All sequences finished at step {i}")
                break

        # added for safety: let the caller's stream observe the last launched step
        for s in streams:
            torch.cuda.current_stream().wait_stream(s)
        return all_tokens

    The image below captures the nsys trace for our new token generator:

    Constant GPU Activity When Applying CUDA Streams (By Author)

    In the CUDA section of the trace we can see the use of two CUDA streams, with token generation being passed back and forth in a kind of ping-pong effect: one stream generates all of the odd tokens and the second all of the even tokens. The CPU is about half a step ahead of the GPU, allowing it to program step i while the GPU is computing step i-1. The CPU-side EOS stop-check of step i-1 (in purple) occurs after step i is fully programmed (and has started running). Most importantly, we now find the GPU utilization to be consistent: the idling we saw before is gone.

    The CUDA stream interleaving results in an additional performance boost, as shown in the table below:

    Token Generation With CUDA Streams (By Author)

    We would expect the benefit of the ping-pong solution we have implemented to be impacted by the ratio between the GPU idle time (i.e., the overhead of kernel loading) and the kernel computation time. To test this, we fix the sequence length at 100 and rerun the benchmark for a number of batch sizes:

    Impact of Pipelining for Varying Batch Size (By Author)

    As expected, the highest performance gain, 11.6%, occurs when the batch size is smallest and the kernel computation load is at its lowest. As the kernel compute increases, the ratio of kernel loading to kernel compute time decreases, as does the impact of CUDA stream interleaving.
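
    The dependence on this ratio can be written as a one-line estimate. The sketch below is an idealized upper bound with made-up numbers: the function name ideal_gain is ours, the idle gap is assumed to be fully hidden, and the overhead of the streams themselves is ignored, so measured gains should be expected to fall below it.

```python
def ideal_gain(idle_gap_us: float, compute_us: float) -> float:
    """Fraction of per-step time saved if the idle gap is fully hidden."""
    return idle_gap_us / (idle_gap_us + compute_us)


if __name__ == "__main__":
    idle_gap_us = 100.0  # hypothetical per-step idle time
    for compute_us in (400.0, 900.0, 1900.0, 3900.0):  # hypothetical kernel times
        gain = ideal_gain(idle_gap_us, compute_us)
        print(f"compute={compute_us:>6.0f}us -> at most {100 * gain:.1f}% faster per step")
```

    As the per-step compute time grows (for example, with larger batch sizes), the same idle gap accounts for a shrinking share of each step.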

    Note that there is some overhead to the use of CUDA streams. This can be demonstrated by comparing our interleaving solution to a token generator that skips the EOS test altogether:

    Overhead of CUDA Stream Interleaving (By Author)

    The Potential Performance Pitfalls of Using CUDA Streams

    CUDA streams should be used with extreme caution. When using the default stream we can rely on PyTorch to perform any necessary synchronization when data is moved around. However, when using CUDA streams, we must ensure appropriate synchronization explicitly. In particular, we must ensure appropriate data transfer between the streams. Otherwise, we may experience CUDA errors (e.g., “device-side assert triggered”) if we are lucky. If we are less lucky, we may experience data corruption without even knowing it. See the PyTorch CUDA stream documentation for more details on appropriate use.

    For AI/ML workloads with large CUDA memory utilization, such as LLMs, another consideration is memory utilization. The PyTorch caching allocator manages memory on a per-stream basis; using multiple streams can lead to increased memory reservation and fragmentation. These could result in increased memory faults which could overshadow the potential gains from the use of streams.

    Results

    In the table below we summarize the runtime results of applying static caching, compilation, and pipelining on a batch of 32 sequences and a maximum sequence length of 100. The results are sorted in increasing order of performance:

    Token Generation Optimization Results (By Author)

    In the case of our toy GPT-2 model, the best results, nearly five times the baseline performance, are achieved when applying PyTorch compilation and the CUDA stream interleaving method discussed in this post. However, as we have seen, the impact of CUDA interleaving could vary greatly based on the properties of the workload and runtime environment, particularly on the ratio between the kernel loading time and the kernel compute time. Please be sure to run your own benchmarks before adopting this method.

    Summary

    In high-performance AI engineering, any hint of GPU under-utilization presents an opportunity for optimization. One of the primary optimization tools on NVIDIA GPUs is CUDA streams. In this post, we demonstrated their use in fixing the idle GPU time that results from the host-device synchronization associated with early stopping in PyTorch-native autoregressive token generation. By interleaving CUDA streams in a “ping-pong” pattern, we successfully hid the latency imposed by the EOS check, which resulted in a meaningful increase in the workload’s throughput. By combining this technique with the well-known methods of model compilation and static caching, we can maximize the performance of PyTorch-native inference.


