
    AI in Multiple GPUs: Gradient Accumulation & Data Parallelism

    By Editor Times Featured, February 24, 2026 (10 min read)


    This article is part of a series about distributed AI across multiple GPUs.

    Introduction

    Distributed Data Parallelism (DDP) is the first parallelization method we’ll look at. It’s the baseline approach that’s always used in distributed training settings, and it’s commonly combined with other parallelization techniques.

    A Quick Neural Network Refresher

    Training a neural network means running a forward pass, calculating the loss, backpropagating to get the gradient of the loss with respect to each weight, and finally updating the weights (what we call an optimization step). In PyTorch, it typically looks like this:

    import torch
    
    def training_loop(
        model: torch.nn.Module,
        dataloader: torch.utils.data.DataLoader,
        optimizer: torch.optim.Optimizer,
        loss_fn: callable,
    ):
        for i, batch in enumerate(dataloader):
            inputs, targets = batch
            output = model(inputs)  # Forward pass
            loss = loss_fn(output, targets)  # Compute loss
            loss.backward()  # Backward pass (compute gradients)
            optimizer.step()  # Update weights
            optimizer.zero_grad()  # Clear gradients for the next step

    Performing the optimization step on large amounts of training data typically gives more accurate gradient estimates, leading to smoother training and potentially faster convergence. So ideally we would take each step after computing the gradients on the entire training dataset. In practice, that’s rarely feasible in Deep Learning scenarios, as it would take too long to compute. Instead, we work with smaller chunks: mini-batches and micro-batches.

    • Batch: Refers to the full training set used for one optimization step.
    • Mini-batch: Refers to a small subset of the training data used for one optimization step.
    • Micro-batch: Refers to a subset of the mini-batch; we combine multiple micro-batches for one optimization step.
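
To make the terminology concrete, here is a minimal sketch in plain Python (no PyTorch involved). The `chunk` helper and all sizes are hypothetical, chosen only for illustration: a toy dataset of 12 samples is cut into mini-batches of 6, and each mini-batch into micro-batches of 2.

```python
def chunk(items, size):
    """Split a list into consecutive pieces of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


dataset = list(range(12))          # the full batch: every training sample
mini_batches = chunk(dataset, 6)   # one optimization step per mini-batch

for step, mini_batch in enumerate(mini_batches):
    micro_batches = chunk(mini_batch, 2)  # one forward/backward pass each
    print(f"step {step}: mini-batch {mini_batch} -> micro-batches {micro_batches}")
```

Each mini-batch here yields three micro-batches, so three forward and backward passes would feed a single optimization step.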

    This is where Gradient Accumulation and Data Parallelism come into play. Although we don’t use the entire dataset for each step, we can use these techniques to significantly increase our mini-batch size.

    Gradient Accumulation

    Here’s how it works: pick a large mini-batch that won’t fit in GPU memory, then split it into micro-batches that do fit. For each micro-batch, run the forward and backward passes, adding (accumulating) the computed gradients. Once all micro-batches are processed, perform a single optimization step using the averaged gradients.

    Note that Gradient Accumulation isn’t a parallelization technique and doesn’t require multiple GPUs.
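
The claim that accumulated micro-batch gradients reproduce the large mini-batch gradient can be checked on a toy problem. The sketch below is plain Python and assumes a hypothetical one-weight model `y = w * x` with a mean-squared-error loss; `mse_grad` and the data are made up for illustration.

```python
import math


def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1]

# Reference: gradient computed on the whole mini-batch at once.
full_grad = mse_grad(w, xs, ys)

# Gradient accumulation: three micro-batches of two samples each.
micro_size = 2
num_micro = len(xs) // micro_size
accumulated = 0.0
for i in range(num_micro):
    lo, hi = i * micro_size, (i + 1) * micro_size
    # Scale each micro-batch gradient so the running sum ends up as an average.
    accumulated += mse_grad(w, xs[lo:hi], ys[lo:hi]) / num_micro

print(full_grad, accumulated)
print(math.isclose(full_grad, accumulated))
```

The two values agree up to floating-point rounding. The equivalence relies on equally sized micro-batches and a loss that averages over its samples; with uneven micro-batches a plain average of the per-micro-batch gradients would weight samples unequally.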

    Image by author: Gradient Accumulation animation

    Implementing Gradient Accumulation from scratch is straightforward. Here’s what it looks like in a simple training loop:

    import torch
    
    def training_loop(
        model: torch.nn.Module,
        dataloader: torch.utils.data.DataLoader,
        optimizer: torch.optim.Optimizer,
        loss_fn: callable,
        grad_accum_steps: int,
    ):
        for i, batch in enumerate(dataloader):
            inputs, targets = batch
            output = model(inputs)
            # Scale the loss so the summed gradients average over the micro-batches
            loss = loss_fn(output, targets) / grad_accum_steps
            loss.backward()  # Gradients get accumulated (summed)
    
            # Only update weights after `grad_accum_steps` micro-batches
            if (i+1) % grad_accum_steps == 0:  # i+1 to avoid a step in the first iteration when i=0
                optimizer.step()
                optimizer.zero_grad()

    Notice that we’re sequentially performing multiple forward and backward passes before each optimization step, which lengthens training time. It would be nice if we could speed this up by processing multiple micro-batches in parallel... that’s exactly what DDP does!

    Distributed Data Parallelism (DDP)

    For a fairly small number of GPUs (up to ~8) DDP scales almost linearly, which is optimal. That means that if you double the number of GPUs, you can almost halve the training time (we already discussed Linear Scaling previously).

    With DDP, multiple GPUs work together to process a larger effective mini-batch, handling each micro-batch in parallel. The workflow looks like this:

    1. Split the mini-batch across GPUs.
    2. Each GPU runs its own forward and backward passes to compute gradients for its own data shard (micro-batch).
    3. Use an All-Reduce operation (we previously learned about it in Collective operations) to average gradients across all GPUs.
    4. Each GPU applies the same weight updates, keeping the models in perfect sync.

    This lets us train with much larger effective mini-batch sizes, leading to more stable training and potentially faster convergence.
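
The four workflow steps can be traced without any GPU. The following is a single-process simulation in plain Python, not real distributed code: the "ranks" are list entries, `all_reduce_mean` is a hypothetical stand-in for an All-Reduce followed by a division by the world size, and the one-weight model and data are invented for the example.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def all_reduce_mean(values):
    """Stand-in for All-Reduce: every rank receives the mean of all values."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


world_size = 2
lr = 0.01
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# Every simulated rank starts from the same weight (identical replicas).
weights = [0.0] * world_size

for step in range(3):
    # 1. Split the mini-batch across ranks (rank r takes every world_size-th sample).
    shards = [(xs[r::world_size], ys[r::world_size]) for r in range(world_size)]
    # 2. Each rank computes gradients on its own shard.
    local_grads = [mse_grad(weights[r], *shards[r]) for r in range(world_size)]
    # 3. All-Reduce averages the gradients across ranks.
    synced_grads = all_reduce_mean(local_grads)
    # 4. Each rank applies the same update, so replicas stay identical.
    weights = [w - lr * g for w, g in zip(weights, synced_grads)]
    print(f"step {step}: weights per rank = {weights}")
```

Because the shards have equal size, the averaged gradient matches what a single process would compute on the whole mini-batch, so the replicas follow the same trajectory as non-distributed training.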

    Image by author: Distributed Data Parallel animation

    Implementing DDP from scratch in PyTorch

    Let’s do this step-by-step. In this first iteration, we’re only syncing the gradients.

    import torch
    
    
    class DDPModelWrapper:
        def __init__(self, model: torch.nn.Module):
            self.model = model
    
        def __call__(self, *args, **kwargs):
            return self.model(*args, **kwargs)
    
        def sync_gradients(self):
            # Iterate over parameter matrices in the model
            for param in self.model.parameters():  
                # Some parameters might be frozen and don't have gradients
                if param.grad is not None:
                    # We sum and then divide (ReduceOp.AVG also exists, but only for the NCCL backend)
                    torch.distributed.all_reduce(param.grad.data, op=torch.distributed.ReduceOp.SUM)
                    # Assuming each GPU received an equally sized mini-batch, we can average
                    # the gradients dividing by the number of GPUs (aka world size)
                    # By default the loss function already averages over the mini-batch size
                    param.grad.data /= torch.distributed.get_world_size()

    Before we start training, we need the model to be the same across all GPUs, otherwise we would be training different models! Let’s improve our implementation by checking at instantiation that all weights are identical (if you don’t know what ranks are, check the first blog post of the series).

    import torch
    
    
    class DDPModelWrapper:
        def __init__(self, model: torch.nn.Module):
            self.model = model
            for param in self.model.parameters():
                # We create a new tensor so it can receive the broadcast
                rank_0_param = param.data.clone()
                # Initially rank_0_param contains the values for the current rank
                torch.distributed.broadcast(rank_0_param, src=0)
                # After the broadcast, rank_0_param is overwritten with the parameters from rank 0
                if not torch.equal(param.data, rank_0_param):  # Now we compare rank_x with rank_0
                    raise ValueError("Model parameters are not the same across all processes.")
    
        def __call__(self, *args, **kwargs):
            return self.model(*args, **kwargs)
    
        def sync_gradients(self):
            for param in self.model.parameters():  
                if param.grad is not None:  
                    torch.distributed.all_reduce(param.grad.data, op=torch.distributed.ReduceOp.SUM)
                    param.grad.data /= torch.distributed.get_world_size()

    Combining DDP with GA

    You can combine DDP with GA to achieve even larger effective batch sizes. This is particularly useful when your model is so large that only a few samples fit per GPU.

    The key benefit is reduced communication overhead: instead of syncing gradients after every batch, you only sync once per grad_accum_steps batches. This means:

    • Global effective batch size = num_gpus × micro_batch_size × grad_accum_steps
    • Fewer synchronization points = less time spent on inter-GPU communication
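
Both points reduce to simple arithmetic. As a small sketch in plain Python (the helper names `global_batch_size` and `syncs_per_epoch` and the numbers are hypothetical, for illustration only):

```python
def global_batch_size(num_gpus, micro_batch_size, grad_accum_steps):
    """Samples contributing to a single optimization step."""
    return num_gpus * micro_batch_size * grad_accum_steps


def syncs_per_epoch(micro_batches_per_gpu, grad_accum_steps):
    """Gradient synchronizations when syncing once per accumulation window."""
    return micro_batches_per_gpu // grad_accum_steps


# Hypothetical setup: 8 GPUs, 16 samples per micro-batch, 4 accumulation steps.
print(global_batch_size(num_gpus=8, micro_batch_size=16, grad_accum_steps=4))  # 512
print(syncs_per_epoch(micro_batches_per_gpu=1000, grad_accum_steps=1))  # 1000
print(syncs_per_epoch(micro_batches_per_gpu=1000, grad_accum_steps=4))  # 250
```

In this example, four accumulation steps quadruple the effective batch size while cutting the number of gradient synchronizations to a quarter.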

    A training loop using our DDPModelWrapper with Gradient Accumulation looks like this:

    import torch
    
    def training_loop(
        ddp_model: DDPModelWrapper,
        dataloader: torch.utils.data.DataLoader,
        optimizer: torch.optim.Optimizer,
        loss_fn: callable,
        grad_accum_steps: int,
    ):
        for i, batch in enumerate(dataloader):
            inputs, targets = batch
            output = ddp_model(inputs)
            # Scale the loss so the summed gradients average over the micro-batches
            loss = loss_fn(output, targets) / grad_accum_steps
            loss.backward()
    
            if (i+1) % grad_accum_steps == 0:
                # Must sync gradients across GPUs *BEFORE* the optimization step
                ddp_model.sync_gradients()
                optimizer.step()
                optimizer.zero_grad()

    Pro-tips and advanced usage

    • Use data prefetching. You can speed up training by loading the next batch of data while the current one is being processed. PyTorch’s DataLoader provides a prefetch_factor argument that controls how many batches each worker prefetches in the background (it requires num_workers > 0). Properly leveraging prefetching with CUDA can be a bit tricky, so we’ll leave it for a future post.
    • Don’t max out GPU memory. Counter-intuitively, leaving some free memory can lead to faster training throughput. When you leave at least ~15% of GPU memory free, the GPU can better manage memory by avoiding fragmentation.
    • PyTorch DDP overlaps communication with computation. By default, DDP communicates gradients as they’re computed during backpropagation rather than waiting for the full backward pass to finish. Here’s how:
      • PyTorch organizes model gradients into buckets of bucket_cap_mb megabytes. During the backward pass, PyTorch marks gradients as ready for reduction as they’re computed. Once all gradients in a bucket are ready, DDP kicks off an asynchronous allreduce to average these gradients across all ranks. The loss.backward() call returns only after all allreduce operations have completed, so immediately calling optimizer.step() is safe.
      • The bucket_cap_mb parameter creates a tradeoff: smaller values trigger more frequent allreduce operations, but each communication kernel launch incurs some overhead that can hurt performance. Larger values reduce communication frequency but also reduce overlap; at the extreme, if buckets are too large, you’re waiting for the entire backward pass to finish before communicating. The optimal value depends on your model architecture and hardware, so profile with different values to find what works best.
    Source: PyTorch Tutorial
    • Here’s a complete PyTorch implementation of DDP:
    """
    Launch with:
      torchrun --nproc_per_node=NUM_GPUS ddp.py
    """
    import torch
    import torch.nn as nn
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader, TensorDataset
    from torch.utils.data.distributed import DistributedSampler
    from torch import optim
    
    
    class ToyModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(1024, 1024), nn.ReLU(),
                nn.Linear(1024, 1024), nn.ReLU(),
                nn.Linear(1024, 256),
            )
    
        def forward(self, x):
            return self.net(x)
    
    
    def train():
        dist.init_process_group(backend="nccl")
        rank = dist.get_rank()
        # On a single node (as launched above) the global rank matches the local GPU index
        torch.cuda.set_device(rank)
        device = torch.device(f"cuda:{rank}")
    
        # Create dummy dataset
        x_data = torch.randn(1000, 1024)
        y_data = torch.randn(1000, 256)
        dataset = TensorDataset(x_data, y_data)
    
        # DistributedSampler ensures each rank gets different data
        sampler = DistributedSampler(dataset, shuffle=True)
        dataloader = DataLoader(dataset, batch_size=64, sampler=sampler)
    
        model = ToyModel().to(device)
    
        # gradient_as_bucket_view: avoids an extra grad tensor copy per bucket.
        ddp_model = DDP(
            model,
            device_ids=[rank],
            bucket_cap_mb=25,
            gradient_as_bucket_view=True,
        )
    
        optimizer = optim.AdamW(ddp_model.parameters(), lr=1e-3)
        loss_fn = nn.MSELoss()
    
        for epoch in range(2):
            sampler.set_epoch(epoch)  # Ensures different shuffling each epoch
    
            for batch_idx, (x, y) in enumerate(dataloader):
                x, y = x.to(device), y.to(device)
    
                optimizer.zero_grad()
                output = ddp_model(x)
                loss = loss_fn(output, y)
    
                # Backward automatically overlaps with allreduce per bucket.
                # By the time this returns, all allreduce ops are done.
                loss.backward()
                optimizer.step()
    
                if rank == 0 and batch_idx % 5 == 0:
                    print(f"epoch {epoch}  batch {batch_idx}  loss={loss.item():.4f}")
    
        dist.destroy_process_group()
    
    
    if __name__ == "__main__":
        train()
    • Here’s a complete PyTorch implementation combining DDP with GA:
    """
    Launch with:
      torchrun --nproc_per_node=NUM_GPUS ddp_ga.py
    """
    import torch
    import torch.nn as nn
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader, TensorDataset
    from torch.utils.data.distributed import DistributedSampler
    from torch import optim
    from contextlib import nullcontext
    
    
    class ToyModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(1024, 1024), nn.ReLU(),
                nn.Linear(1024, 1024), nn.ReLU(),
                nn.Linear(1024, 256),
            )
    
        def forward(self, x):
            return self.net(x)
    
    
    def train():
        dist.init_process_group(backend="nccl")
        rank = dist.get_rank()
        # On a single node (as launched above) the global rank matches the local GPU index
        torch.cuda.set_device(rank)
        device = torch.device(f"cuda:{rank}")
    
        # Create dummy dataset
        x_data = torch.randn(1000, 1024)
        y_data = torch.randn(1000, 256)
        dataset = TensorDataset(x_data, y_data)
    
        # DistributedSampler ensures each rank gets different data
        sampler = DistributedSampler(dataset, shuffle=True)
        dataloader = DataLoader(dataset, batch_size=16, sampler=sampler)
    
        model = ToyModel().to(device)
    
        ddp_model = DDP(
            model,
            device_ids=[rank],
            bucket_cap_mb=25,
            gradient_as_bucket_view=True,
        )
    
        optimizer = optim.AdamW(ddp_model.parameters(), lr=1e-3)
        loss_fn = nn.MSELoss()
    
        ACCUM_STEPS = 4
    
        for epoch in range(2):
            sampler.set_epoch(epoch)  # Ensures different shuffling each epoch
    
            optimizer.zero_grad()
            for batch_idx, (x, y) in enumerate(dataloader):
                x, y = x.to(device), y.to(device)
    
                is_last_micro_step = (batch_idx + 1) % ACCUM_STEPS == 0
    
                # no_sync() suppresses allreduce on accumulation steps.
                # On the last micro-step we skip no_sync() so DDP fires
                # the allreduce overlapped with that backward pass.
                ctx = ddp_model.no_sync() if not is_last_micro_step else nullcontext()
    
                with ctx:
                    output = ddp_model(x)
                    loss = loss_fn(output, y) / ACCUM_STEPS
                    loss.backward()
    
                if is_last_micro_step:
                    optimizer.step()
                    optimizer.zero_grad()
    
                    if rank == 0:
                        print(f"epoch {epoch}  batch {batch_idx}  loss={loss.item() * ACCUM_STEPS:.4f}")
    
        dist.destroy_process_group()
    
    
    if __name__ == "__main__":
        train()
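
One detail of the DDP + GA script is worth checking by hand: how many optimization steps actually happen per epoch. The sketch below is plain Python and assumes the defaults used in the script, namely that DistributedSampler (drop_last=False) gives every rank ceil(dataset_size / world_size) samples and that DataLoader (drop_last=False) keeps the final partial batch. The helper name `accumulation_schedule` is hypothetical.

```python
import math


def accumulation_schedule(dataset_size, world_size, batch_size, accum_steps):
    """Return (micro_batches, optimizer_steps, leftover) per rank and epoch."""
    samples_per_rank = math.ceil(dataset_size / world_size)
    micro_batches = math.ceil(samples_per_rank / batch_size)
    optimizer_steps = micro_batches // accum_steps
    leftover = micro_batches % accum_steps
    return micro_batches, optimizer_steps, leftover


# Same numbers as the script: 1000 samples, batch_size=16, ACCUM_STEPS=4.
for world_size in (1, 2, 4, 8):
    print(world_size, accumulation_schedule(1000, world_size, 16, 4))
```

With 2, 4 or 8 processes the micro-batch count divides evenly by the accumulation steps. With a single process there are 63 micro-batches, so the last 3 never reach an optimizer.step() in the loop as written; their gradients are cleared by the optimizer.zero_grad() call at the start of the next epoch.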

    Conclusion

    Follow me on X for more free AI content @l_cesconetto

    Congratulations on making it to the end! In this post you learned about:

    • The importance of large batch sizes
    • How Gradient Accumulation works and its limitations
    • The DDP workflow and its benefits
    • How to implement GA and DDP from scratch in PyTorch
    • How to combine GA and DDP

    In the next article, we’ll explore ZeRO (Zero Redundancy Optimizer), a more advanced technique that builds upon DDP to further optimize VRAM usage.
