    Pipelining AI/ML Training Workloads with CUDA Streams

    By Editor Times Featured | June 26, 2025 | 13 Mins Read


    This is the ninth post in our series on performance profiling and optimization in PyTorch, aimed at emphasizing the critical role of performance analysis and optimization in machine learning development. Throughout the series we have reviewed a wide variety of practical tools and techniques for analyzing and boosting the runtime performance of PyTorch-based AI/ML models. Our goal has been twofold:

    1. To emphasize the importance of routine evaluation and optimization of AI/ML workloads.
    2. To demonstrate the accessibility of a wide variety of tools and techniques for analyzing and optimizing AI/ML runtime performance. You don’t need to be a CUDA expert to meaningfully improve your model performance and reduce compute costs.

    In this post, we will explore the use of CUDA streams, a powerful feature of NVIDIA’s CUDA programming model that offers a sophisticated method of overlapping GPU operations and running them concurrently. Although we typically associate our AI/ML model training workload with a single monolithic (a.k.a. “unbreakable”) computation graph G running on the GPU, there are some scenarios where the graph can be decomposed into two distinct subgraphs G1 and G2, where G=G2*G1 (that is, G2 is applied to the output of G1). In such cases CUDA streams enable “pipelining” the computation graph, i.e., programming our training step to run G1 (on batch input n+1) in parallel with G2 (on the nth output of G1). This technique is especially useful when:

    • Neither subgraph fully utilizes the GPU when run alone, and
    • The two subgraphs are of similar computational cost (i.e., neither dominates runtime).
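
    The second condition can be made concrete with a back-of-the-envelope model. The sketch below is an idealization, not a measurement: it assumes the two subgraphs overlap perfectly and do not compete for GPU resources, which real workloads will not fully achieve. The function name and the example timings are hypothetical.

```python
def ideal_pipelined_speedup(t1: float, t2: float) -> float:
    """Upper bound on the speedup from overlapping two stages.

    Serialized, one step costs t1 + t2. With perfect overlap, the step
    time is bounded below by the slower stage, max(t1, t2).
    """
    if t1 <= 0 or t2 <= 0:
        raise ValueError("stage times must be positive")
    return (t1 + t2) / max(t1, t2)


# balanced stages: the bound is 2x
print(ideal_pipelined_speedup(5.0, 5.0))
# one stage dominates: the bound drops to roughly 1.11x
print(ideal_pipelined_speedup(9.0, 1.0))
```

    Under this model the bound approaches 1x as one stage dominates, which is why pipelining pays off most when the two subgraphs have comparable cost.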

    We will explore two common scenarios where “pipelining” is feasible:

    1. Partial-model training or finetuning:
      It’s common to freeze a pre-trained model backbone (e.g., feature extractor or encoder) and train only a model head (e.g., decoder). Since the frozen backbone doesn’t depend on gradients from the head, the two can be executed concurrently.
    2. Offloading data preprocessing to the GPU:
      A common method for addressing bottlenecks in the input pipeline (also known as GPU starvation) is to move data preprocessing to the GPU. While prepending preprocessing operations to the model graph improves performance, additional gains can be achieved by running preprocessing on a separate CUDA stream in parallel with model execution, assuming preprocessing isn’t trivial compared to model compute.
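
    The independence claimed in the first scenario can be checked directly. The following is a minimal sketch that uses two tiny linear layers as hypothetical stand-ins for a backbone and a head (they are not the models used later in the post). It shows that a frozen backbone run under torch.no_grad() produces features that carry no autograd history, so the backward pass of the head never reaches the backbone.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# hypothetical stand-ins for a pre-trained backbone and a trainable head
backbone = nn.Linear(8, 4)
head = nn.Linear(4, 2)

# freeze the backbone
backbone.requires_grad_(False)
backbone.eval()

x = torch.randn(16, 8)
with torch.no_grad():
    features = backbone(x)

loss = head(features).sum()
loss.backward()

print(features.requires_grad)        # False: no graph links back to the backbone
print(backbone.weight.grad is None)  # True: the backbone received no gradients
print(head.weight.grad is not None)  # True: only the head is trained
```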

    To facilitate our discussion, we will define two toy training scripts and measure the training performance under different scenarios. The experiments were run on an Amazon EC2 g5.2xlarge instance (containing an NVIDIA A10G GPU and 8 vCPUs) running a PyTorch (2.6) Deep Learning AMI (DLAMI).

    Please note: the code snippets that we share are for demonstration purposes only; please do not rely on their correctness or optimality. The impact of using CUDA streams will vary depending on model architecture and system configuration. We encourage you to conduct your own profiling and experimentation before integrating CUDA streams (or any other tool or technique we refer to) into your workflow.

    Part 1: Pipelining an Encoder-Decoder Model

    The first use-case we explore involves a CNN-based image segmentation model consisting of a fixed (pre-trained) encoder and a trainable decoder. In this scenario, since the encoder weights are frozen and unaffected by backpropagation, the encoder can be executed independently of the decoder’s training. In this section, we assess the impact of pipelining the training process using CUDA streams.

    A Toy Image Segmentation Training Experiment

    We begin by defining a simple CNN-based image encoder along with its corresponding decoder.

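
    The encoder and decoder definitions are not reproduced at this point in the text. As a stand-in so that the snippets that follow have something to run against, here is a minimal sketch. Everything in it is an assumption rather than the architecture behind the reported numbers: the layer counts, channel widths, and the values of img_size and num_classes are all hypothetical.

```python
import torch
import torch.nn as nn

# assumed values; the snippets that follow reference these names
img_size = 256
num_classes = 10

# hypothetical encoder: two strided convolutions (4x downsample)
encoder = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),   # 256 -> 128
    nn.ReLU(inplace=True),
    nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),  # 128 -> 64
    nn.ReLU(inplace=True),
)

# hypothetical decoder: two transposed convolutions (4x upsample)
# that output one logit per class per pixel
decoder = nn.Sequential(
    nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2),           # 64 -> 128
    nn.ReLU(inplace=True),
    nn.ConvTranspose2d(16, num_classes, kernel_size=2, stride=2),  # 128 -> 256
)

# sanity check on the CPU: per-pixel logits at the input resolution
with torch.no_grad():
    logits = decoder(encoder(torch.rand(1, 3, img_size, img_size)))
print(logits.shape)  # torch.Size([1, 10, 256, 256])
```

    Any encoder-decoder pair with matching feature shapes would serve the same purpose; the relative cost of the two halves is what determines how much pipelining can help.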

    Next, we construct a synthetic dataset of random images and segmentation maps.

    import torch
    from torch.utils.data import DataLoader
    from torchvision.datasets.vision import VisionDataset

    # A dataset with random images and per-pixel labels
    class FakeDataset(VisionDataset):
        def __init__(self):
            super().__init__(root=None)
            self.size = 1000000

        def __getitem__(self, index):
            # create a random image
            img = torch.randint(0, 256, (3, img_size, img_size),
                                dtype=torch.uint8)

            # create a random label map
            target = torch.randint(0, num_classes, (img_size, img_size))

            return img, target

        def __len__(self):
            return self.size

    train_set = FakeDataset()

    train_loader = DataLoader(
        dataset=train_set,
        batch_size=8,
        num_workers=8
    )

    Finally, we define the loss function, optimizer, and training loop. Note that we freeze the encoder’s weights and train only the decoder.

    import time

    device = torch.device("cuda")
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(decoder.parameters())

    # Freeze the encoder weights
    encoder.requires_grad_(False)
    encoder.eval().to(device)

    decoder.train().to(device)

    warmup = 10
    active_batches = 100
    total_iters = warmup + active_batches

    for idx, data in enumerate(train_loader):
        inputs = data[0].to(device=device, non_blocking=True).float()
        labels = data[1].to(device=device, non_blocking=True)
        optimizer.zero_grad()
        with torch.no_grad():
            features = encoder(inputs)
        output = decoder(features)
        loss = criterion(output, labels)
        loss.backward()
        optimizer.step()

        if idx == warmup:
            # sync the GPU and start the timer
            torch.cuda.synchronize()
            t0 = time.perf_counter()

        if idx == total_iters:
            break

    # wait for the GPU to finish and then stop the timer
    torch.cuda.synchronize()
    total_time = time.perf_counter() - t0
    print(f'throughput: {active_batches / total_time}')

    Our baseline training script achieves an average throughput of 83 steps per second, with an average GPU utilization of 85%.

    Pipelining the Model Execution With CUDA Streams

    In the revised version of the training loop shown below, we introduce two CUDA streams: one for executing the encoder and one for training the decoder. In each iteration, we perform two operations concurrently:

    1. Train the decoder using the image features and labels from batch N.
    2. Execute the encoder on input batch N+1 to generate its image features.

    encoder_stream = torch.cuda.Stream()
    decoder_stream = torch.cuda.Stream()

    # initialize the features to None
    features = None

    for idx, data in enumerate(train_loader):
        inputs = data[0].to(device, non_blocking=True).float()
        labels_next = data[1].to(device, non_blocking=True)

        if features is not None:
            with torch.cuda.stream(decoder_stream):
                # wait for the encoder to finish producing the features
                decoder_stream.wait_stream(encoder_stream)

                optimizer.zero_grad()
                output = decoder(features)
                loss = criterion(output, labels)
                loss.backward()
                optimizer.step()

        with torch.cuda.stream(encoder_stream):
            with torch.no_grad():
                features = encoder(inputs)
            # features is allocated on encoder_stream but consumed on
            # decoder_stream; record that use so the caching allocator
            # does not recycle its memory too early
            features.record_stream(decoder_stream)

        labels = labels_next

        if idx == warmup:
            # sync the GPU and start the timer
            torch.cuda.synchronize()
            t0 = time.perf_counter()
        if idx == total_iters:
            break

    # wait for the GPU to finish and then stop the timer
    torch.cuda.synchronize()
    total_time = time.perf_counter() - t0
    print(f'throughput: {active_batches / total_time}')

    This modification yields an average throughput of 91 steps per second, representing a 9.6% speedup. This is a significant improvement, especially considering that our baseline already had high GPU utilization (85%).
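
    The two synchronization primitives that the pipelined loop relies on, wait_stream and record_stream, are easier to see in isolation. The sketch below is a toy producer-consumer pair, not part of the training script: the function name and the operations are hypothetical, and the code falls back to plain sequential execution when the input is not on a GPU. Note that streams created with torch.cuda.Stream() do not synchronize implicitly with the current stream, so each cross-stream dependency is stated explicitly.

```python
import torch


def double_then_sum(x: torch.Tensor) -> torch.Tensor:
    """Run a producer op and a consumer op on separate CUDA streams."""
    if not x.is_cuda:
        # no GPU: nothing to overlap, run sequentially
        return (x * 2).sum()

    current = torch.cuda.current_stream()
    producer = torch.cuda.Stream()
    consumer = torch.cuda.Stream()

    # x was created on the current stream; the producer must wait for it
    producer.wait_stream(current)
    with torch.cuda.stream(producer):
        y = x * 2  # y is allocated on the producer stream
    # x is read on the producer stream: tell the caching allocator
    x.record_stream(producer)

    # the consumer must not start before the producer's queued work is done
    consumer.wait_stream(producer)
    with torch.cuda.stream(consumer):
        total = y.sum()  # total is allocated on the consumer stream
    # y is read on the consumer stream: tell the caching allocator
    y.record_stream(consumer)

    # make the result safe to use on the current stream
    current.wait_stream(consumer)
    total.record_stream(current)
    return total


device = "cuda" if torch.cuda.is_available() else "cpu"
result = double_then_sum(torch.ones(1024, device=device))
print(result.item())  # 2048.0
```

    wait_stream orders work between streams, while record_stream only informs the memory allocator that a tensor is in use on a stream other than the one it was allocated on; neither blocks the CPU.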

    Sensitivity of Pipelining to Workload Properties

    The effectiveness of pipelining with CUDA streams is highly dependent on the specifics of the training workload and runtime environment. If the encoder is significantly larger than the decoder (or vice versa), pipelining may offer little benefit or even hinder performance. Conversely, when the GPU is underutilized, pipelining tends to yield more substantial gains.

    To illustrate this dependency, we reran the experiment with varying batch sizes. The results are summarized below:

    Impact of Pipelining With CUDA Streams on Throughput (by Author)

    As the batch size increases, the benefit of pipelining diminishes. This is likely because larger batch sizes naturally lead to higher (and more efficient) GPU utilization, leaving less room for improvement through concurrent execution.

    Part 2: Offloading Augmentations onto the GPU

    In this section, we will apply CUDA streams to the acceleration of data augmentation. In previous blog posts (e.g., here and here), we have studied the problem of bottlenecks in the data input pipeline from different perspectives and reviewed several techniques for diagnosing and addressing them. A common cause of these bottlenecks is CPU resource exhaustion, where the CPU cannot meet the computational demands of the preprocessing pipeline. The result is GPU starvation, a situation in which the expensive GPU sits idle, waiting for data to arrive.

    One effective solution is to offload heavy data preprocessing to the GPU. We will demonstrate this technique and take it a step further by executing the augmentations on a dedicated CUDA stream, enabling concurrent execution with the model training.

    A Toy Image Classification Training Experiment

    We begin by defining a simple CNN-based image classification model:

    import torch
    import torch.nn as nn

    img_size = 256
    num_classes = 10
    model = nn.Sequential(
        # Start with 256x256 image
        nn.Conv2d(3, 16, kernel_size=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(16, 32, kernel_size=2, stride=2),  # 2x downsample
        nn.ReLU(inplace=True),
        nn.Conv2d(32, 64, kernel_size=2, stride=2),  # 4x downsample
        nn.ReLU(inplace=True),
        nn.Conv2d(64, 128, kernel_size=2, stride=2),  # 8x downsample
        nn.ReLU(inplace=True),
        nn.Conv2d(128, 256, kernel_size=2, stride=2),  # 16x downsample
        nn.ReLU(inplace=True),
        nn.Conv2d(256, 512, kernel_size=2, stride=2),  # 32x downsample
        nn.ReLU(inplace=True),
        nn.Conv2d(512, 1024, kernel_size=2, stride=2),  # 64x downsample
        nn.ReLU(inplace=True),
        nn.Conv2d(1024, 2048, kernel_size=2, stride=2),  # 128x downsample
        nn.ReLU(inplace=True),
        nn.Conv2d(2048, 4096, kernel_size=2, stride=2),  # 256x downsample
        nn.Flatten(),
        nn.Linear(4096, num_classes)
    )

    Next, we create a synthetic dataset with an augmentation pipeline intentionally designed to cause a severe performance bottleneck:

    import random
    from torch.utils.data import DataLoader
    import torchvision.transforms.v2 as T
    from torchvision.datasets.vision import VisionDataset
    import torchvision.transforms.v2.functional as F
    import torchvision.ops as ops

    # A dataset with random images and labels
    class FakeDataset(VisionDataset):
        def __init__(self, transform=None):
            super().__init__(root=None, transform=transform)
            self.size = 1000000

        def __getitem__(self, index):
            # create a random image
            img = torch.randint(0, 256, (3, img_size, img_size),
                                dtype=torch.uint8)
            # create a random label
            target = torch.randint(0, num_classes, (1, ))

            if self.transform:
                # apply transformations
                img = self.transform(img)

            return img, target

        def __len__(self):
            return self.size

    augmentations = T.Compose([
        T.ToDtype(torch.float32),
        T.RandomCrop(img_size//2),
        T.Resize(img_size),
        T.RandomRotation(degrees=45.0),
        T.GaussianBlur(kernel_size=7),
        T.Normalize(mean=[0, 0, 0], std=[1, 1, 1])
    ])

    train_set = FakeDataset(transform=augmentations)

    train_loader = DataLoader(
        dataset=train_set,
        batch_size=32,
        num_workers=8
    )

    Finally, we define the loss function, optimizer, and training loop:

    import time

    device = torch.device("cuda")
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters())

    model.train().to(device)

    warmup = 10
    active_batches = 100
    total_iters = warmup + active_batches

    for idx, data in enumerate(train_loader):
        inputs = data[0].to(device=device, non_blocking=True)
        labels = data[1].to(device=device, non_blocking=True).squeeze()
        optimizer.zero_grad()
        output = model(inputs)
        loss = criterion(output, labels)
        loss.backward()
        optimizer.step()

        if idx == warmup:
            # sync the GPU and start the timer
            torch.cuda.synchronize()
            t0 = time.perf_counter()

        if idx == total_iters:
            break

    # wait for the GPU to finish and then stop the timer
    torch.cuda.synchronize()
    total_time = time.perf_counter() - t0
    print(f'throughput: {active_batches / total_time}')

    Running this baseline script results in an average throughput of 20.41 steps per second and a GPU utilization of only 42%. The heavy data augmentations are choking the CPU, leading to GPU starvation. See our previous post for more information on detecting bottlenecks in the data input pipeline.

    Offloading Data Augmentations to the GPU

    To address the performance bottleneck in the data input pipeline, we move the augmentations onto the GPU.

    The first step is to define custom data transforms that apply random rotations and crops per sample in a batch. This is important because the built-in torchvision transforms apply the same augmentation across the entire batch, losing the per-sample randomness seen on the CPU.

    We implement the BatchRandomCrop transform using the roi_align operator.

    class BatchRandomCrop(T.Transform):
        def __init__(self, output_size):
            super().__init__()
            self.output_size = output_size

        def transform(self, img: torch.Tensor, params: dict):
            batch_size, _, original_height, original_width = img.shape
            device = img.device
            max_top = original_height - self.output_size
            max_left = original_width - self.output_size

            # Generate random top and left coords for each image in the batch
            random_top = torch.randint(0, max_top + 1, (batch_size,),
                                       device=device, dtype=torch.float32)
            random_left = torch.randint(0, max_left + 1, (batch_size,),
                                        device=device, dtype=torch.float32)

            image_indices = torch.arange(batch_size, device=device,
                                         dtype=torch.float32)

            # one row per crop: (batch_index, x1, y1, x2, y2)
            boxes = torch.stack([
                image_indices,
                random_left,
                random_top,
                random_left + self.output_size,
                random_top + self.output_size
            ], dim=1)

            cropped_batch = ops.roi_align(
                img,
                boxes,
                output_size=self.output_size
            )
            return cropped_batch

    We implement the BatchRandomRotation transform by iterating over all of the images in the batch and applying a random rotation to each one. Note that this version is not vectorized; a fully vectorized implementation would require greater effort.

    class BatchRandomRotation(T.Transform):
        def __init__(self, degrees):
            super().__init__()
            self.degrees = degrees

        def transform(self, inpt: torch.Tensor, params: dict):
            # split the batch into a list of individual images
            images = list(torch.unbind(inpt, dim=0))

            augmented_images = []
            for img_tensor in images:
                # generate a random angle
                angle = random.uniform(-self.degrees, self.degrees)

                # apply the rotation to the single image
                transformed_img = F.rotate(
                    img_tensor,
                    angle=angle
                )
                augmented_images.append(transformed_img)

            # stack the transformed images
            return torch.stack(augmented_images, dim=0)
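
    For readers curious what a vectorized alternative might look like, the following is one possible sketch, not the implementation used in the experiments. It swaps torchvision’s rotate for torch.nn.functional.affine_grid and grid_sample, so its interpolation, border handling, and rotation direction are not guaranteed to match F.rotate. It also assumes square, floating-point images, as in this post; the function names are hypothetical.

```python
import torch
import torch.nn.functional as nnF


def rotate_batch(imgs: torch.Tensor, angles_deg: torch.Tensor) -> torch.Tensor:
    """Rotate each image of a (B, C, H, W) float batch by its own angle."""
    rad = torch.deg2rad(angles_deg.to(device=imgs.device, dtype=imgs.dtype))
    cos, sin = torch.cos(rad), torch.sin(rad)

    # one 2x3 affine matrix per sample: pure rotation, no translation
    theta = torch.zeros(imgs.shape[0], 2, 3,
                        device=imgs.device, dtype=imgs.dtype)
    theta[:, 0, 0] = cos
    theta[:, 0, 1] = -sin
    theta[:, 1, 0] = sin
    theta[:, 1, 1] = cos

    # build per-sample sampling grids and resample the whole batch at once
    grid = nnF.affine_grid(theta, list(imgs.shape), align_corners=False)
    return nnF.grid_sample(imgs, grid, mode="bilinear",
                           padding_mode="zeros", align_corners=False)


def batch_random_rotate(imgs: torch.Tensor, degrees: float) -> torch.Tensor:
    """Draw one angle per sample uniformly from [-degrees, degrees]."""
    angles = (torch.rand(imgs.shape[0], device=imgs.device) * 2 - 1) * degrees
    return rotate_batch(imgs, angles)


imgs = torch.rand(4, 3, 16, 16)
rotated = batch_random_rotate(imgs, degrees=45.0)
print(rotated.shape)  # torch.Size([4, 3, 16, 16])
```

    Whether this is faster than the per-image loop depends on batch size and image resolution, so it should be profiled before being adopted.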

    We now define a batch_transform that mimics the CPU-based augmentation pipeline defined above:

    batch_transform = T.Compose([
        T.ToDtype(torch.float32),
        BatchRandomCrop(img_size//2),
        T.Resize(img_size),
        BatchRandomRotation(degrees=45.0),
        T.GaussianBlur(kernel_size=7),
        T.Normalize(mean=[0, 0, 0], std=[1, 1, 1])
    ])

    Finally, we reset the dataset and update the training loop to apply the new batch_transform:

    train_set = FakeDataset(transform=None)

    train_loader = DataLoader(
        dataset=train_set,
        batch_size=32,
        num_workers=8
    )

    for idx, data in enumerate(train_loader):
        inputs = data[0].to(device=device, non_blocking=True)
        labels = data[1].to(device=device, non_blocking=True).squeeze()

        # apply augmentations on the GPU
        inputs = batch_transform(inputs)

        optimizer.zero_grad()
        output = model(inputs)
        loss = criterion(output, labels)
        loss.backward()
        optimizer.step()

        if idx == warmup:
            # sync the GPU and start the timer
            torch.cuda.synchronize()
            t0 = time.perf_counter()

        if idx == total_iters:
            break

    # wait for the GPU to finish and then stop the timer
    torch.cuda.synchronize()
    total_time = time.perf_counter() - t0
    print(f'throughput: {active_batches / total_time}')

    This updated training script improves throughput to 35.22 steps per second, a 72.57% speedup over the baseline result.

    Pipelining Augmentations With CUDA Streams

    Next, we pipeline the augmentation and training steps using two separate CUDA streams: one for running the data transform and one for training the model. In each iteration of the loop we perform two concurrent operations:

    1. Train the model on the augmented batch N.
    2. Perform GPU-based data augmentations on batch N+1.

    transform_stream = torch.cuda.Stream()
    model_stream = torch.cuda.Stream()

    # initialize the transformed value to None
    transformed = None

    for idx, data in enumerate(train_loader):
        inputs = data[0]
        labels_next = data[1]

        if transformed is not None:
            with torch.cuda.stream(model_stream):
                labels = labels.to(device, non_blocking=True).squeeze()
                # wait for the augmentations of this batch to finish
                model_stream.wait_stream(transform_stream)
                optimizer.zero_grad()
                output = model(transformed)
                loss = criterion(output, labels)
                loss.backward()
                optimizer.step()

        with torch.cuda.stream(transform_stream):
            inputs = inputs.to(device, non_blocking=True)
            transformed = batch_transform(inputs)
            # transformed is allocated on transform_stream but consumed on
            # model_stream; record that use so the caching allocator
            # does not recycle its memory too early
            transformed.record_stream(model_stream)

        labels = labels_next

        if idx == warmup:
            # sync the GPU and start the timer
            torch.cuda.synchronize()
            t0 = time.perf_counter()
        if idx == total_iters:
            break

    # wait for the GPU to finish and then stop the timer
    torch.cuda.synchronize()
    total_time = time.perf_counter() - t0
    print(f'throughput: {active_batches / total_time}')

    This further improves the throughput to 38.82 steps per second, a 10.2% increase over the serialized solution and 90.20% faster than the original baseline.
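
    For reference, the relative improvements quoted here follow from the reported throughputs. The helper below is a trivial sketch (the function and variable names are ours); figures recomputed this way from rounded throughputs can differ in the last digit from figures computed on unrounded measurements.

```python
def speedup_pct(baseline: float, improved: float) -> float:
    """Relative throughput gain of `improved` over `baseline`, in percent."""
    return (improved / baseline - 1.0) * 100.0


# throughputs (steps per second) reported in this section
cpu_baseline = 20.41
gpu_serialized = 35.22
gpu_pipelined = 38.82

print(f"{speedup_pct(gpu_serialized, gpu_pipelined):.1f}%")  # 10.2%
print(f"{speedup_pct(cpu_baseline, gpu_pipelined):.2f}%")    # 90.20%
```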

    Sensitivity of Pipelining to Workload Properties

    As we saw in Part 1, the benefit of pipelining using CUDA streams varies based on the details of the workload. In the table below, we capture the results for several different batch sizes:

    Impact of Pipelining With CUDA Streams on Throughput (by Author)

    As the batch size increases, GPU offloading becomes more effective, significantly boosting performance. At the same time, the gains from pipelining decrease. This is likely due to the fact that larger batch sizes increase GPU efficiency, reducing the opportunities for overlap.

    Summary

    When it comes to running AI/ML workloads, every millisecond counts. In this post we explored the impact of pipelining an AI/ML training step using CUDA streams in two common scenarios: partial model training and offloading data augmentations to the GPU. In both cases, the pipelined solution outperformed the serialized implementation, though the extent of the improvement varied considerably based on the batch size.

    As we’ve emphasized throughout the post, the expected impact of using CUDA streams can vary greatly based on the AI/ML workload. For example, in cases where the GPU is already being efficiently utilized, the overhead of using CUDA streams may actually lead to a degradation in runtime performance. We strongly recommend testing this technique on your own workloads before adopting it.

    We hope you will find the technique described in this post useful. For more tips, tricks, and techniques for profiling and optimizing AI/ML workflows, check out the other posts in this series.


