    Overcoming the Hidden Performance Traps of Variable-Shaped Tensors: Efficient Data Sampling in PyTorch

    By Editor Times Featured | December 3, 2025


    This post is part of a series on analyzing and optimizing PyTorch models. Throughout the series, we have advocated for using the PyTorch Profiler in AI model development and demonstrated the potential impact of performance optimization on the speed and cost of running AI/ML workloads. One common phenomenon we have seen is how seemingly innocent code can hamper runtime performance. In this post, we explore some of the consequences associated with the naive use of variable-shaped tensors, i.e., tensors whose shape depends on preceding computations and/or inputs. While not applicable to all situations, there are times when the use of variable-shaped tensors can be avoided, although this may come at the expense of additional compute and/or memory. We will demonstrate the tradeoffs of these alternatives on a toy implementation of data sampling in PyTorch.

    Three Downsides of Variable-Shaped Tensors

    We motivate the discussion by presenting three disadvantages of using variable-shaped tensors:

    Host-Device Sync Events

    In an ideal scenario, the CPU and GPU run in parallel in an asynchronous manner, with the CPU continuously feeding the GPU with input samples, allocating required GPU memory, and loading GPU compute kernels, and the GPU executing the loaded kernels on the provided inputs using the allocated memory. The presence of dynamic-shaped tensors throws a wrench into this parallelism. In order to allocate the appropriate amount of memory, the CPU must wait for the GPU to report the tensor's shape, and then the GPU must wait for the CPU to allocate the memory and proceed with the kernel loading. The overhead of this sync event can cause a drop in GPU utilization and slow runtime performance.
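
    The minimal sketch below (our own illustration, not taken from the cross-entropy study referenced next) makes this behavior visible: an elementwise kernel returns control to the CPU almost immediately, while torch.nonzero blocks the CPU until the GPU has determined the output shape:

    import time
    import torch

    x = torch.randn(10_000_000, device='cuda')
    torch.cuda.synchronize()

    t0 = time.perf_counter()
    y = x * 2  # asynchronous: returns as soon as the kernel is queued
    t1 = time.perf_counter()
    z = torch.nonzero(x > 0)  # synchronous: CPU waits for the output shape
    t2 = time.perf_counter()
    print(f"mul queued in {(t1 - t0) * 1e3:.3f} ms, "
          f"nonzero blocked for {(t2 - t1) * 1e3:.3f} ms")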

    We saw an example of this in part three of this series, when we studied a naive implementation of the common cross-entropy loss that included calls to torch.nonzero and torch.unique. Both APIs return tensors whose shapes are dynamic and dependent on the contents of the input. When these functions are run on the GPU, a host-device synchronization event occurs. In the case of the cross-entropy loss, we discovered the inefficiency through the use of the PyTorch Profiler and were able to easily overcome it with an alternative implementation that avoided the use of variable-shaped tensors and demonstrated considerably better runtime performance.

    Graph Compilation

    In a recent post we explored the performance benefits of applying just-in-time (JIT) compilation using the torch.compile operator. One of our observations was that graph compilation delivered considerably better results when the graph was static. The presence of dynamic shapes in the graph limits the extent of optimization through compilation: In some cases, it fails completely; in others it results in lower performance gains. The same implications also apply to other forms of graph compilation, such as XLA, ONNX, OpenVINO, and TensorRT.
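
    As a minimal illustration (our own sketch; exact behavior varies across PyTorch versions and compiler settings), requesting a single static graph with fullgraph=True on a function containing a data-dependent op such as torch.nonzero will typically fail, since the output shape cannot be captured statically:

    import torch

    def dynamic_fn(x):
        # output shape depends on the data in x
        return torch.nonzero(x > 0)

    compiled = torch.compile(dynamic_fn, fullgraph=True)
    try:
        compiled(torch.randn(1000))
    except Exception as e:
        # with default settings, Dynamo refuses to fold the
        # data-dependent output shape into a single static graph
        print(f"compilation failed: {type(e).__name__}")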

    Data Batching

    Another optimization we have encountered in several of our posts (e.g., here) is sample batching. Batching improves performance in two main ways:

    1. Reducing kernel-loading overhead: Rather than loading the GPU kernels required for the computation pipeline once per input sample, the CPU can load the kernels once per batch.
    2. Maximizing parallelization across compute units: GPUs are highly parallel compute engines. The more we are able to parallelize computation, the more we can saturate the GPU and increase its utilization. By batching we can potentially increase the degree of parallelization by a factor of the batch size, as illustrated in the sketch below.
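
    The toy sketch below (our own illustration, separate from the sampler example) demonstrates the principle: a single batched matrix multiplication replaces a Python loop of per-sample kernel launches while computing the same result:

    import torch

    x = torch.randn(32, 1024, 1024, device='cuda')
    w = torch.randn(1024, 1024, device='cuda')

    # per-sample processing: one kernel launch per batch element
    out_loop = torch.stack([sample @ w for sample in x])

    # batched processing: a single launch over the entire batch
    out_batched = x @ w

    # results match up to floating-point accumulation order
    torch.testing.assert_close(out_loop, out_batched, rtol=1e-4, atol=1e-3)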

    Despite their downsides, the use of variable-shaped tensors is often unavoidable. But sometimes we can modify our model implementation to bypass them. Sometimes these changes are straightforward (as in the cross-entropy loss example). Other times they may require some creativity in coming up with a different sequence of fixed-shape PyTorch APIs that produce the same numerical result. Often, this effort can deliver meaningful rewards in runtime and cost.

    In the next sections, we will study the use of variable-shaped tensors in the context of the data sampling operation. We will start with a trivial implementation and analyze its performance. We will then propose a GPU-friendly alternative that avoids the use of variable-shaped tensors.

    To test our implementations, we will use an Amazon EC2 g6e.xlarge instance with an NVIDIA L40S GPU, running an AWS Deep Learning AMI (DLAMI) with PyTorch (2.8). The code we will share is intended for demonstration purposes. Please do not rely on it for accuracy or optimality. Please do not interpret our mention of any framework, library, or platform as an endorsement of its use.

    Sampling in AI Model Workloads

    In the context of this post, sampling refers to the selection of a subset of items from a large set of candidates for the purposes of computational efficiency, balancing of datatypes, or regularization. Sampling is common in many AI/ML models, such as detection, ranking, and contrastive learning systems.

    We define a simple variation of the sampling problem: Given a list of N tensors, each with a binary label, we are asked to return a subset of K tensors containing both positive and negative examples, in random order. If the input list contains enough samples of each label (K/2), the returned subset should be evenly split. If it is lacking samples of one type, the subset should be filled out with random samples of the other type.

    The code block below contains a PyTorch implementation of our sampling function. The implementation is inspired by the popular Detectron2 library (e.g., see here and here). For the experiments in this post, we will fix the sampling ratio to 1:10.

    import torch
    
    INPUT_SAMPLES = 10000
    SUB_SAMPLE = INPUT_SAMPLES // 10
    FEATURE_DIM = 16
    
    def sample_data(input_array, labels):
        device = labels.device
        positive = torch.nonzero(labels == 1, as_tuple=True)[0]
        negative = torch.nonzero(labels == 0, as_tuple=True)[0]
        num_pos = min(positive.numel(), SUB_SAMPLE // 2)
        num_neg = min(negative.numel(), SUB_SAMPLE // 2)
        if num_neg < SUB_SAMPLE // 2:
            num_pos = SUB_SAMPLE - num_neg
        elif num_pos < SUB_SAMPLE // 2:
            num_neg = SUB_SAMPLE - num_pos
    
        # randomly select positive and negative examples
        perm1 = torch.randperm(positive.numel(), device=device)[:num_pos]
        perm2 = torch.randperm(negative.numel(), device=device)[:num_neg]
    
        pos_idxs = positive[perm1]
        neg_idxs = negative[perm2]
    
        sampled_idxs = torch.cat([pos_idxs, neg_idxs], dim=0)
        rand_perm = torch.randperm(SUB_SAMPLE, device=device)
        sampled_idxs = sampled_idxs[rand_perm]
        return input_array[sampled_idxs], labels[sampled_idxs]

    Performance Analysis With PyTorch Profiler

    Even when not immediately apparent, the use of dynamic shapes is easily identifiable in the PyTorch Profiler Trace view. We use the following function to enable the PyTorch Profiler:

    def profile(fn, input_array, labels):
    
        def export_trace(p):
            p.export_chrome_trace(f"{fn.__name__}.json")
    
        with torch.profiler.profile(
                activities=[torch.profiler.ProfilerActivity.CPU,
                            torch.profiler.ProfilerActivity.CUDA],
                with_stack=True,
                schedule=torch.profiler.schedule(wait=0, warmup=10, active=5),
                on_trace_ready=export_trace
        ) as prof:
            for _ in range(20):
                fn(input_array, labels)
                torch.cuda.synchronize()  # explicit sync for trace clarity
                prof.step()
    
    # create random input
    input_samples = torch.randn((INPUT_SAMPLES, FEATURE_DIM), device='cuda')
    labels = torch.randint(0, 2, (INPUT_SAMPLES,),
                           device='cuda', dtype=torch.int64)
    
    # run with profiler
    profile(sample_data, input_samples, labels)

    The image below was captured for an input size of ten million samples. It clearly shows the presence of sync events coming from the torch.nonzero call, as well as the corresponding drops in GPU utilization:

    Profiler Trace of Sampler (by Author)

    The use of torch.nonzero in our implementation is not ideal, but can it be avoided?

    A GPU-Friendly Data Sampler

    We propose an alternative implementation of our sampling function that replaces the dynamic torch.nonzero function with a creative combination of the static torch.count_nonzero, torch.topk, and other fixed-shape APIs:

    def opt_sample_data(input_array, labels):
        pos_mask = labels == 1
        neg_mask = labels == 0
        num_pos_idxs = torch.count_nonzero(pos_mask, dim=-1)
        num_neg_idxs = torch.count_nonzero(neg_mask, dim=-1)
        half_samples = labels.new_full((), SUB_SAMPLE // 2)
        num_pos = torch.minimum(num_pos_idxs, half_samples)
        num_neg = torch.minimum(num_neg_idxs, half_samples)
        num_pos = torch.where(
            num_neg < SUB_SAMPLE // 2,
            SUB_SAMPLE - num_neg,
            num_pos
        )
        num_neg = SUB_SAMPLE - num_pos
    
        # create random ordering on pos and neg entries
        rand = torch.rand_like(labels, dtype=torch.float32)
        pos_rand = torch.where(pos_mask, rand, -1)
        neg_rand = torch.where(neg_mask, rand, -1)
    
        # select top pos entries and invalidate the others;
        # since the CPU does not know num_pos, we assume the maximum to avoid a sync
        top_pos_rand, top_pos_idx = torch.topk(pos_rand, k=SUB_SAMPLE)
        arange = torch.arange(SUB_SAMPLE, device=labels.device)
        if num_pos.numel() > 1:
            # unsqueeze to support batched input
            arange = arange.unsqueeze(0)
            num_pos = num_pos.unsqueeze(-1)
            num_neg = num_neg.unsqueeze(-1)
        top_pos_rand = torch.where(arange >= num_pos, -1, top_pos_rand)
    
        # repeat for neg entries
        top_neg_rand, top_neg_idx = torch.topk(neg_rand, k=SUB_SAMPLE)
        top_neg_rand = torch.where(arange >= num_neg, -1, top_neg_rand)
    
        # combine and mix together positive and negative idxs
        cat_rand = torch.cat([top_pos_rand, top_neg_rand], dim=-1)
        cat_idx = torch.cat([top_pos_idx, top_neg_idx], dim=-1)
        topk_rand_idx = torch.topk(cat_rand, k=SUB_SAMPLE)[1]
        sampled_idxs = torch.gather(cat_idx, dim=-1, index=topk_rand_idx)
        # expand the index across the feature dim so gather returns full rows
        sampled_input = torch.gather(
            input_array, dim=-2,
            index=sampled_idxs.unsqueeze(-1).expand(*sampled_idxs.shape,
                                                    input_array.shape[-1]))
        sampled_labels = torch.gather(labels, dim=-1, index=sampled_idxs)
        return sampled_input, sampled_labels
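
    As a quick sanity check (a usage sketch of our own, assuming the input_samples and labels tensors created in the profiling section above), the function returns fixed-shape outputs with the expected label balance:

    sampled_input, sampled_labels = opt_sample_data(input_samples, labels)
    print(sampled_input.shape)          # torch.Size([1000, 16]): always SUB_SAMPLE rows
    print(sampled_labels.shape)         # torch.Size([1000])
    print(sampled_labels.sum().item())  # 500 when both classes are abundant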

    Clearly, this function requires more memory and more operations than our first implementation. The question is: Do the performance benefits of a static, synchronization-free implementation outweigh the extra cost in memory and compute?

    To assess the tradeoffs between the two implementations, we introduce the following benchmarking utility:

    def benchmark(fn, input_array, labels):
        # warm-up
        for _ in range(20):
            _ = fn(input_array, labels)
    
        iters = 100
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        torch.cuda.synchronize()
        start.record()
        for _ in range(iters):
            _ = fn(input_array, labels)
        end.record()
        torch.cuda.synchronize()
        avg_time = start.elapsed_time(end) / iters
    
        print(f"{fn.__name__} average step time: {avg_time:.4f} ms")
    
    benchmark(sample_data, input_samples, labels)
    benchmark(opt_sample_data, input_samples, labels)

    The following table compares the average runtime of each implementation for a variety of input sample sizes:

    Comparative Step Time Performance (Lower is Better, by Author)

    For most of the input sample sizes, the overhead of the host-device sync event is either comparable to or lower than the additional compute cost of the static implementation. Disappointingly, we only see a meaningful benefit from the sync-free alternative when the input sample size reaches ten million. Sample sizes that large are uncommon in AI/ML settings. But it is not our tendency to give up so easily. As noted above, the static implementation enables other optimizations, like graph compilation and input batching.

    Graph Compilation

    Contrary to the original function, which fails to compile, our static implementation is fully compatible with torch.compile:

    benchmark(torch.compile(opt_sample_data), input_samples, labels)

    The following table includes the runtimes of our compiled function:

    Comparative Step Time Performance (Lower is Better, by Author)

    The results are significantly better, providing a 70–75% boost over the original sampler implementation in the 1–10 thousand sample range. But we still have one more optimization up our sleeve.

    Maximizing Performance with Batched Input

    Because the original implementation contains variable-shaped operations, it cannot handle batched input directly. To process a batch, we have no choice but to apply it to each input individually, in a Python loop:

    BATCH_SIZE = 32
    
    def batched_sample_data(inputs, labels):
        sampled_inputs = []
        sampled_labels = []
        for i in range(inputs.size(0)):
            inp, lab = sample_data(inputs[i], labels[i])
            sampled_inputs.append(inp)
            sampled_labels.append(lab)
        return torch.stack(sampled_inputs), torch.stack(sampled_labels)

    In contrast, our optimized function supports batched inputs as is; no changes are necessary.

    input_batch = torch.randn((BATCH_SIZE, INPUT_SAMPLES, FEATURE_DIM),
                              device='cuda')
    labels = torch.randint(0, 2, (BATCH_SIZE, INPUT_SAMPLES),
                           device='cuda', dtype=torch.int64)
    
    benchmark(batched_sample_data, input_batch, labels)
    benchmark(opt_sample_data, input_batch, labels)
    benchmark(torch.compile(opt_sample_data), input_batch, labels)

    The table below compares the step times of our sampling functions on a batch of size 32:

    Step Time Performance on Batched Input (Lower is Better, by Author)

    Now the results are definitive: By using a static implementation of the data sampler, we are able to boost performance by 2X to 52X(!!) over the variable-shaped option, depending on the input sample size.

    Note that although our experiments were run on a GPU instance, the model compilation and input batching optimizations also apply to a CPU environment. Thus, avoiding variable shapes can have implications for AI/ML model performance on CPU as well.

    Summary

    The optimization process we demonstrated in this post generalizes beyond the specific case of data sampling:

    • Discovery via Performance Profiling: Using the PyTorch Profiler, we were able to identify drops in GPU utilization and uncover their source: the presence of variable-shaped tensors resulting from the torch.nonzero operation.
    • An Alternative Implementation: Our profiling findings allowed us to develop an alternative implementation that accomplished the same goal while avoiding the use of variable-shaped tensors. However, this step came at the cost of additional compute and memory overhead. As seen in our initial benchmarks, the sync-free alternative demonstrated worse performance on common input sizes.
    • Unlocking Further Potential for Optimization: The real breakthrough came because the static-shaped implementation was compilation-friendly and supported batching. These optimizations delivered performance gains that dwarfed the initial overhead, leading to a 2X to 52X speedup over the original implementation.

    Naturally, not all stories will end as happily as ours. In many cases, we may come across PyTorch code that performs poorly on the GPU but does not have an alternative implementation, or has one that requires significantly more compute resources. However, given the potential for meaningful gains in performance and reductions in cost, the practice of identifying runtime inefficiencies and exploring alternative implementations is an essential part of AI/ML development.


