    Building a Production-Grade Multi-Node Training Pipeline with PyTorch DDP

    By Editor Times Featured · March 27, 2026 · 16 min read


    1. Introduction

    You have a model. You have a single GPU. Training takes 72 hours. You requisition a second machine with 4 more GPUs — and now you need your code to actually use them. That is the exact moment where most practitioners hit a wall. Not because distributed training is conceptually hard, but because the engineering required to do it correctly — process groups, rank-aware logging, sampler seeding, checkpoint boundaries — is scattered across dozens of tutorials that each cover one piece of the puzzle.

    This article is the guide I wish I had when I first scaled training beyond a single node. We will build a complete, production-grade multi-node training pipeline from scratch using PyTorch’s DistributedDataParallel (DDP). Every file is modular, every value is configurable, and every distributed concept is made explicit. By the end, you will have a codebase you can drop into any cluster and start training immediately.

    What we will cover: the mental model behind DDP, a clean modular project structure, distributed lifecycle management, efficient data loading across ranks, a training loop with mixed precision and gradient accumulation, rank-aware logging and checkpointing, multi-node launch scripts, and the performance pitfalls that trip up even experienced engineers.

    The full codebase is available on GitHub. Every code block in this article is pulled directly from that repository.

    2. How DDP Works — The Mental Model

    Before writing any code, we need a clear mental model. DistributedDataParallel (DDP) is not magic — it is a well-defined communication pattern built on top of collective operations.

    The setup is simple. You launch N processes (one per GPU, potentially across multiple machines). Each process initialises a process group — a communication channel backed by NCCL (NVIDIA Collective Communications Library) for GPU-to-GPU transfers. Every process gets three identity numbers: its global rank (unique across all machines), its local rank (unique within its machine), and the total world size.

    Each process holds an identical copy of the model. Data is partitioned across processes using a DistributedSampler — every rank sees a different slice of the dataset, but the model weights start (and stay) identical.

    The critical mechanism is what happens during backward(). DDP registers hooks on every parameter. When a gradient is computed for a parameter, DDP buckets it with nearby gradients and fires an all-reduce operation across the process group. This all-reduce computes the mean gradient across all ranks. Because every rank now has the same averaged gradient, the next optimizer step produces identical weight updates, keeping all replicas in sync — without any explicit synchronisation code from us.

    This is why DDP is strictly superior to the older DataParallel: there is no single “master” GPU bottleneck, no redundant forward passes, and gradient communication overlaps with backward computation.

    Figure 1: DDP gradient synchronization flow. All-reduce happens automatically via hooks registered during backward().
    Key terminology

    Term            Meaning
    Rank            Globally unique process ID (0 to world_size - 1)
    Local Rank      GPU index within a single machine (0 to nproc_per_node - 1)
    World Size      Total number of processes across all nodes
    Process Group   Communication channel (NCCL) connecting all ranks
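To make the all-reduce averaging concrete, here is a tiny pure-Python simulation of the semantics (the real operation runs on GPU tensors via NCCL, in parallel and bucketed; `allreduce_mean` and `local_grads` are illustrative names, not PyTorch API):

```python
def allreduce_mean(values):
    """Simulate an all-reduce with a mean op: every rank ends up
    holding the same average of all ranks' contributions."""
    mean = sum(values) / len(values)
    return [mean] * len(values)

# Each of 4 ranks computed a different local gradient for one parameter.
local_grads = [1.0, 2.0, 3.0, 4.0]
synced = allreduce_mean(local_grads)
# After the all-reduce, every rank holds the identical averaged gradient,
# so the next optimizer step is identical on all replicas.
```

Because every replica applies the same averaged gradient, the model copies never drift apart.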

    3. Architecture Overview

    A production training pipeline should never be a single monolithic script. Ours is split into six focused modules, each with a single responsibility. The dependency graph below shows how they connect — note that config.py sits at the bottom, acting as the single source of truth for every hyperparameter.

    Figure 2: Module dependency graph. train.py orchestrates all other modules. config.py is imported by everyone.

    Here is the project structure:

    pytorch-multinode-ddp/
    ├── train.py            # Entry point — training loop
    ├── config.py           # Dataclass configuration + argparse
    ├── ddp_utils.py        # Distributed setup, teardown, checkpointing
    ├── model.py            # MiniResNet (lightweight ResNet variant)
    ├── dataset.py          # Synthetic dataset + DistributedSampler loader
    ├── utils/
    │   ├── logger.py       # Rank-aware structured logging
    │   └── metrics.py      # Running averages + distributed all-reduce
    ├── scripts/
    │   └── launch.sh       # Multi-node torchrun wrapper
    └── requirements.txt

    This separation means you can swap in a real dataset by modifying only dataset.py, or replace the model by modifying only model.py. The training loop never needs to change.

    4. Centralized Configuration

    Hard-coded hyperparameters are the enemy of reproducibility. We use a Python dataclass as our single source of configuration. Every other module imports TrainingConfig and reads from it — nothing is hard-coded.

    The dataclass doubles as our CLI parser: the from_args() classmethod introspects the field names and types, automatically building argparse flags with defaults. This means you get --batch_size 128 and --no-use_amp for free, without writing a single parser line by hand.

    import argparse
    from dataclasses import dataclass
    from typing import Optional


    @dataclass
    class TrainingConfig:
        """Immutable bag of every parameter the training pipeline needs."""

        # Model
        num_classes: int = 10
        in_channels: int = 3
        image_size: int = 32

        # Data
        batch_size: int = 64          # per-GPU
        num_workers: int = 4

        # Optimizer / Scheduler
        epochs: int = 10
        lr: float = 0.01
        momentum: float = 0.9
        weight_decay: float = 1e-4

        # Distributed
        backend: str = "nccl"

        # Mixed Precision
        use_amp: bool = True

        # Gradient Accumulation
        grad_accum_steps: int = 1

        # Checkpointing
        checkpoint_dir: str = "./checkpoints"
        save_every: int = 1
        resume_from: Optional[str] = None

        # Logging & Profiling
        log_interval: int = 10
        enable_profiling: bool = False
        seed: int = 42

        @classmethod
        def from_args(cls) -> "TrainingConfig":
            parser = argparse.ArgumentParser(
                formatter_class=argparse.ArgumentDefaultsHelpFormatter)
            defaults = cls()
            for name, val in vars(defaults).items():
                arg_type = type(val) if val is not None else str
                if isinstance(val, bool):
                    parser.add_argument(f"--{name}", default=val,
                                        action=argparse.BooleanOptionalAction)
                else:
                    parser.add_argument(f"--{name}", type=arg_type, default=val)
            return cls(**vars(parser.parse_args()))

    Why a dataclass instead of YAML or JSON? Three reasons: (1) type hints are enforced by the IDE and mypy, (2) there is zero dependency on third-party config libraries, and (3) every parameter has a visible default right next to its declaration. For production systems that need hierarchical configs, you can always layer Hydra or OmegaConf on top of this pattern.
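To see the auto-generated flags in isolation, here is the core argparse mechanism the classmethod relies on (requires Python 3.9+ for BooleanOptionalAction; the flag names mirror the dataclass fields):

```python
import argparse

parser = argparse.ArgumentParser()
# Bool fields become paired --flag / --no-flag switches.
parser.add_argument("--use_amp", default=True,
                    action=argparse.BooleanOptionalAction)
# Other fields keep their declared type and default.
parser.add_argument("--batch_size", type=int, default=64)

args = parser.parse_args(["--no-use_amp", "--batch_size", "128"])
# args.use_amp -> False, args.batch_size -> 128
```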

    5. Distributed Lifecycle Management

    The distributed lifecycle has three phases: initialise, run, and tear down. Getting any of these wrong can produce silent hangs, so we wrap everything in explicit error handling.

    Process Group Initialization

    The setup_distributed() function reads the three environment variables that torchrun sets automatically (RANK, LOCAL_RANK, WORLD_SIZE), pins the correct GPU with torch.cuda.set_device(), and initialises the NCCL process group. It returns a frozen dataclass — DistributedContext — that the rest of the codebase passes around instead of re-reading os.environ.

    import os
    from dataclasses import dataclass

    import torch
    import torch.distributed as dist

    from config import TrainingConfig


    @dataclass(frozen=True)
    class DistributedContext:
        """Immutable snapshot of the current process's distributed identity."""
        rank: int
        local_rank: int
        world_size: int
        device: torch.device


    def setup_distributed(config: TrainingConfig) -> DistributedContext:
        required_vars = ("RANK", "LOCAL_RANK", "WORLD_SIZE")
        missing = [v for v in required_vars if v not in os.environ]
        if missing:
            raise RuntimeError(
                f"Missing environment variables: {missing}. "
                "Launch with torchrun or set them manually.")

        if not torch.cuda.is_available():
            raise RuntimeError("CUDA is required for NCCL distributed training.")

        rank = int(os.environ["RANK"])
        local_rank = int(os.environ["LOCAL_RANK"])
        world_size = int(os.environ["WORLD_SIZE"])

        torch.cuda.set_device(local_rank)
        device = torch.device("cuda", local_rank)
        dist.init_process_group(backend=config.backend)

        return DistributedContext(
            rank=rank, local_rank=local_rank,
            world_size=world_size, device=device)
    Checkpointing with Rank Guards

    The most common distributed checkpointing bug is all ranks writing to the same file concurrently. We guard saving behind is_main_process(), and loading behind dist.barrier() — this ensures rank 0 finishes writing before other ranks attempt to read.

    def save_checkpoint(path, epoch, model, optimizer, scaler=None, rank=0):
        """Persist training state to disk (rank-0 only)."""
        if not is_main_process(rank):
            return
        Path(path).parent.mkdir(parents=True, exist_ok=True)
        state = {
            "epoch": epoch,
            "model_state_dict": model.module.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
        }
        if scaler is not None:
            state["scaler_state_dict"] = scaler.state_dict()
        torch.save(state, path)


    def load_checkpoint(path, model, optimizer=None, scaler=None, device="cpu"):
        """Restore training state. All ranks load after a barrier."""
        dist.barrier()  # wait for rank 0 to finish writing
        ckpt = torch.load(path, map_location=device, weights_only=False)
        model.load_state_dict(ckpt["model_state_dict"])
        if optimizer and "optimizer_state_dict" in ckpt:
            optimizer.load_state_dict(ckpt["optimizer_state_dict"])
        if scaler and "scaler_state_dict" in ckpt:
            scaler.load_state_dict(ckpt["scaler_state_dict"])
        return ckpt.get("epoch", 0)
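The helpers is_main_process() and cleanup_distributed() are used throughout but never shown in the article; based on how they are called, they are presumably one-liners along these lines (a sketch, not the repository's exact code):

```python
def is_main_process(rank: int) -> bool:
    """Rank 0 is conventionally the only rank that logs and saves."""
    return rank == 0


def cleanup_distributed() -> None:
    """Tear down the process group so no rank is left hanging on exit."""
    import torch.distributed as dist
    if dist.is_available() and dist.is_initialized():
        dist.destroy_process_group()
```

The is_initialized() guard makes cleanup safe to call even when setup failed partway through.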

    6. Model Design for DDP

    We use a lightweight ResNet variant called MiniResNet — three residual stages with increasing channels (64, 128, 256), two blocks per stage, global average pooling, and a fully-connected head. It is complex enough to be realistic but light enough to run on any hardware.

    The critical DDP requirement: the model must be moved to the correct GPU before wrapping. DDP does not move models for you.

    def create_model(config: TrainingConfig, device: torch.device) -> nn.Module:
        """Instantiate a MiniResNet and move it to device."""
        model = MiniResNet(
            in_channels=config.in_channels,
            num_classes=config.num_classes,
        )
        return model.to(device)


    def wrap_ddp(model: nn.Module, local_rank: int) -> DDP:
        """Wrap model with DistributedDataParallel."""
        return DDP(model, device_ids=[local_rank])

    Note the two-step pattern: create_model() → wrap_ddp(). This separation is intentional. When loading a checkpoint, you need the unwrapped model (model.module) to load state dicts, then re-wrap. If you fuse creation and wrapping, checkpoint loading becomes awkward.

    7. Distributed Data Loading

    DistributedSampler is what ensures each GPU sees a unique slice of data. It partitions indices across world_size ranks and returns a non-overlapping subset for each. Without it, every GPU would train on identical batches — burning compute for zero benefit.

    There are three details that trip people up:

    First, sampler.set_epoch(epoch) must be called at the start of every epoch. The sampler uses the epoch number as a random seed for shuffling. If you forget this, every epoch will iterate over data in the same order, which degrades generalisation.

    Second, pin_memory=True in the DataLoader pre-allocates page-locked host memory, enabling asynchronous CPU-to-GPU transfers when you call tensor.to(device, non_blocking=True). This overlap is where real throughput gains come from.

    Third, persistent_workers=True avoids respawning worker processes every epoch — a significant overhead reduction when num_workers > 0.

    def create_distributed_dataloader(dataset, config, ctx):
        sampler = DistributedSampler(
            dataset,
            num_replicas=ctx.world_size,
            rank=ctx.rank,
            shuffle=True,
        )
        loader = DataLoader(
            dataset,
            batch_size=config.batch_size,
            sampler=sampler,
            num_workers=config.num_workers,
            pin_memory=True,
            drop_last=True,
            persistent_workers=config.num_workers > 0,
        )
        return loader, sampler
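Conceptually, the partition DistributedSampler produces is a rank-strided slice of an epoch-seeded shuffled index list. A pure-Python sketch of the idea — not the actual PyTorch implementation (`partition_indices` is an illustrative name):

```python
import random

def partition_indices(num_samples, world_size, rank, epoch, seed=0):
    """Shuffle deterministically per epoch, then take every
    world_size-th index starting at this rank's offset."""
    indices = list(range(num_samples))
    random.Random(seed + epoch).shuffle(indices)  # same order on every rank
    return indices[rank::world_size]

shards = [partition_indices(8, 2, r, epoch=0) for r in range(2)]
# The two ranks' shards are disjoint and together cover all 8 samples.
```

Because every rank shuffles with the same seed, the shards are guaranteed non-overlapping — and changing the epoch changes the shuffle, which is exactly why set_epoch() matters.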

    8. The Training Loop — Where It All Comes Together

    This is the heart of the pipeline. The loop below integrates every component we have built so far: DDP-wrapped model, distributed data loader, mixed precision, gradient accumulation, rank-aware logging, learning rate scheduling, and checkpointing.

    Figure 3: Training loop state machine. The inner step loop handles gradient accumulation; the outer epoch loop handles scheduler stepping and checkpointing.
    Mixed Precision (AMP)

    Automatic Mixed Precision (AMP) keeps master weights in FP32 but runs the forward pass and loss computation in FP16. This halves memory bandwidth requirements and enables Tensor Core acceleration on modern NVIDIA GPUs, often yielding a 1.5–2x throughput improvement with negligible accuracy impact.

    We use torch.autocast for the forward pass and torch.amp.GradScaler for loss scaling. A subtlety: we create the GradScaler with enabled=config.use_amp. When disabled, the scaler becomes a no-op — same code path, zero overhead, no branching.

    Gradient Accumulation

    Sometimes you need a larger effective batch size than your GPU memory allows. Gradient accumulation simulates this by running multiple forward-backward passes before stepping the optimizer. The key is to divide the loss by grad_accum_steps before backward(), so the accumulated gradient is correctly averaged.
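The effective global batch size is simply the product of the per-GPU batch, the number of ranks, and the accumulation steps. For example, with the config defaults on a hypothetical 2-node, 8-GPU setup and an accumulation factor of 4:

```python
batch_size = 64        # per-GPU (config default)
world_size = 8         # 2 nodes x 4 GPUs (assumed topology)
grad_accum_steps = 4   # hypothetical accumulation factor

effective_batch = batch_size * world_size * grad_accum_steps
# 64 * 8 * 4 = 2048 samples contribute to each optimizer step
```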

    def train_one_epoch(model, loader, criterion, optimizer, scaler, ctx, config, epoch, logger):
        model.train()
        tracker = MetricTracker()
        total_steps = len(loader)

        use_amp = config.use_amp and ctx.device.type == "cuda"
        autocast_ctx = torch.autocast("cuda", dtype=torch.float16) if use_amp else nullcontext()

        optimizer.zero_grad(set_to_none=True)

        for step, (images, labels) in enumerate(loader):
            images = images.to(ctx.device, non_blocking=True)
            labels = labels.to(ctx.device, non_blocking=True)

            with autocast_ctx:
                outputs = model(images)
                loss = criterion(outputs, labels)
                loss = loss / config.grad_accum_steps  # scale for accumulation

            scaler.scale(loss).backward()

            if (step + 1) % config.grad_accum_steps == 0:
                scaler.step(optimizer)
                scaler.update()
                optimizer.zero_grad(set_to_none=True)  # memory-efficient reset

            # Track raw (unscaled) loss for logging
            raw_loss = loss.item() * config.grad_accum_steps
            acc = compute_accuracy(outputs, labels)
            tracker.update("loss", raw_loss, n=images.size(0))
            tracker.update("accuracy", acc, n=images.size(0))

            if is_main_process(ctx.rank) and (step + 1) % config.log_interval == 0:
                log_training_step(logger, epoch, step + 1, total_steps,
                                  raw_loss, optimizer.param_groups[0]["lr"])

        return tracker

    Two details worth highlighting. First, zero_grad(set_to_none=True) deallocates gradient tensors instead of filling them with zeros, saving memory proportional to the model size. Second, data is moved to the GPU with non_blocking=True — this lets the CPU continue filling the next batch while the current one transfers, exploiting the pin_memory overlap.

    The Main Function

    The main() function orchestrates the full pipeline. Note the try/finally pattern ensuring that the process group is torn down even if an exception occurs — without this, a crash on one rank can leave other ranks hanging indefinitely.

    def main():
        config = TrainingConfig.from_args()
        ctx = setup_distributed(config)
        logger = setup_logger(ctx.rank)

        torch.manual_seed(config.seed + ctx.rank)

        model = create_model(config, ctx.device)
        model = wrap_ddp(model, ctx.local_rank)

        optimizer = torch.optim.SGD(model.parameters(), lr=config.lr,
                                    momentum=config.momentum,
                                    weight_decay=config.weight_decay)
        scheduler = CosineAnnealingLR(optimizer, T_max=config.epochs)
        scaler = torch.amp.GradScaler(enabled=config.use_amp)

        start_epoch = 1
        if config.resume_from:
            start_epoch = load_checkpoint(config.resume_from, model.module,
                                          optimizer, scaler, ctx.device) + 1

        dataset = SyntheticImageDataset(size=50000, image_size=config.image_size,
                                        num_classes=config.num_classes)
        loader, sampler = create_distributed_dataloader(dataset, config, ctx)
        criterion = nn.CrossEntropyLoss()

        try:
            for epoch in range(start_epoch, config.epochs + 1):
                sampler.set_epoch(epoch)
                tracker = train_one_epoch(model, loader, criterion, optimizer,
                                          scaler, ctx, config, epoch, logger)
                scheduler.step()

                avg_loss = all_reduce_scalar(tracker.average("loss"),
                                             ctx.world_size, ctx.device)

                if is_main_process(ctx.rank):
                    log_epoch_summary(logger, epoch, {"loss": avg_loss})
                    if epoch % config.save_every == 0:
                        save_checkpoint(f"checkpoints/epoch_{epoch}.pt",
                                        epoch, model, optimizer, scaler, ctx.rank)
        finally:
            cleanup_distributed()
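The all_reduce_scalar helper (from utils/metrics.py) is not shown in the article. Given how main() calls it, a plausible sketch looks like this — the single-process shortcut and the exact signature are assumptions of this sketch:

```python
def all_reduce_scalar(value: float, world_size: int, device) -> float:
    """Average a Python float across all ranks via an all-reduce."""
    if world_size == 1:
        return value  # nothing to synchronise
    import torch
    import torch.distributed as dist
    t = torch.tensor([value], dtype=torch.float32, device=device)
    dist.all_reduce(t, op=dist.ReduceOp.SUM)  # sum across ranks
    return t.item() / world_size              # convert the sum to a mean
```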

    9. Launching Across Nodes

    PyTorch’s torchrun (introduced in v1.10 as a replacement for torch.distributed.launch) handles spawning one process per GPU and setting the RANK, LOCAL_RANK, and WORLD_SIZE environment variables. For multi-node training, every node must specify the master node’s address so that all processes can establish the NCCL connection.

    Here is our launch script, which reads all tunables from environment variables:

    #!/usr/bin/env bash
    set -euo pipefail

    NNODES="${NNODES:-2}"
    NPROC_PER_NODE="${NPROC_PER_NODE:-4}"
    NODE_RANK="${NODE_RANK:-0}"
    MASTER_ADDR="${MASTER_ADDR:-127.0.0.1}"
    MASTER_PORT="${MASTER_PORT:-12355}"

    torchrun \
        --nnodes="${NNODES}" \
        --nproc_per_node="${NPROC_PER_NODE}" \
        --node_rank="${NODE_RANK}" \
        --master_addr="${MASTER_ADDR}" \
        --master_port="${MASTER_PORT}" \
        train.py "$@"

    For a quick single-node test on one GPU:

    torchrun --standalone --nproc_per_node=1 train.py --epochs 2

    For 2-node training with 4 GPUs each, run on Node 0:

    MASTER_ADDR=10.0.0.1 NODE_RANK=0 NNODES=2 NPROC_PER_NODE=4 bash scripts/launch.sh

    And on Node 1:

    MASTER_ADDR=10.0.0.1 NODE_RANK=1 NNODES=2 NPROC_PER_NODE=4 bash scripts/launch.sh
    Figure 4: Multi-node architecture. Each node runs 4 GPU processes; NCCL all-reduce synchronizes gradients across the ring.

    10. Performance Pitfalls and Tips

    After building hundreds of distributed training jobs, these are the mistakes I see most often:

    Forgetting sampler.set_epoch(). Without it, data order is identical every epoch. This is the single most common DDP bug and it silently hurts convergence.

    CPU-GPU transfer bottleneck. Always use pin_memory=True in your DataLoader and non_blocking=True in your .to() calls. Without these, the CPU blocks on every batch transfer.

    Logging from all ranks. If every rank prints, output is interleaved garbage. Guard all logging behind rank == 0 checks.

    zero_grad() without set_to_none=True. The default zero_grad() fills gradient tensors with zeros. set_to_none=True deallocates them instead, reducing peak memory.

    Saving checkpoints from all ranks. Multiple ranks writing the same file causes corruption. Only rank 0 should save, and all ranks should barrier before loading.

    Not seeding with rank offset. torch.manual_seed(seed + rank) ensures each rank’s data augmentation is different. Without the offset, augmentations are identical across GPUs.
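The effect of the rank offset is easy to demonstrate with plain-Python RNGs (torch.manual_seed behaves analogously for torch's generators; the two-rank setup here is illustrative):

```python
import random

seed = 42
# Two ranks seeded WITHOUT the offset draw identical "augmentation" streams.
same = [random.Random(seed).random() for _ in range(2)]
# With the rank offset, each rank gets its own independent stream.
offset = [random.Random(seed + rank).random() for rank in range(2)]
```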

    When NOT to use DDP

    DDP replicates the entire model on every GPU. If your model does not fit in a single GPU’s memory, DDP alone will not help. For such cases, look into Fully Sharded Data Parallel (FSDP), which shards parameters, gradients, and optimizer states across ranks, or frameworks like DeepSpeed ZeRO.

    11. Conclusion

    We have gone from a single-GPU training mindset to a fully distributed, production-grade pipeline capable of scaling across machines — without sacrificing readability or maintainability.

    But more importantly, this wasn’t just about making DDP work. It was about building it correctly.

    Let’s distill the most important takeaways:

    Key Takeaways

    • DDP is deterministic engineering, not magic
      Once you understand process groups, ranks, and all-reduce, distributed training becomes predictable and debuggable.
    • Structure matters more than scale
      A clean, modular codebase (config → data → model → training → utils) is what makes scaling from 1 GPU to 100 GPUs feasible.
    • Correct data sharding is non-negotiable
      DistributedSampler + set_epoch() is the difference between true scaling and wasted compute.
    • Performance comes from small details
      pin_memory, non_blocking, set_to_none=True, and AMP together deliver large throughput gains.
    • Rank-awareness is essential
      Logging, checkpointing, and randomness must all respect rank — otherwise you get chaos.
    • DDP scales compute, not memory
      If your model doesn’t fit on one GPU, you need FSDP or ZeRO — not more GPUs.

    The Bigger Picture

    What you have built here is not just a training script — it is a template for real-world ML systems.

    This exact pattern is used in:

    • Production ML pipelines
    • Research labs training large models
    • Startups scaling from prototype to infrastructure

    And the best part?

    You can now:

    • Plug in a real dataset
    • Swap in a Transformer or custom architecture
    • Scale across nodes with zero code changes

    What to Explore Next

    Once you are comfortable with this setup, the next frontier is memory-efficient and large-scale training:

    • Fully Sharded Data Parallel (FSDP) → shard model + gradients
    • DeepSpeed ZeRO → shard optimizer states
    • Pipeline Parallelism → split models across GPUs
    • Tensor Parallelism → split layers themselves

    These techniques power today’s largest models — but they all build on the exact DDP foundation you now understand.

    Distributed training often feels intimidating — not because it is inherently complex, but because it is rarely presented as a complete system.

    Now you have seen the full picture.

    And once you see it end-to-end…

    Scaling becomes an engineering decision, not a research problem.

    What’s Next

    This pipeline handles data-parallel training — the most common distributed pattern. When your models outgrow single-GPU memory, explore Fully Sharded Data Parallel (FSDP) for parameter sharding, or DeepSpeed ZeRO for optimizer-state partitioning. For truly massive models, pipeline parallelism (splitting the model across GPUs layer by layer) and tensor parallelism (splitting individual layers) become necessary.

    But for the vast majority of training workloads — from ResNets to medium-scale Transformers — the DDP pipeline we built here is exactly what production teams use. Scale it by adding nodes and GPUs; the code handles the rest.

    The complete, production-ready codebase for this project is available here: pytorch-multinode-ddp

    References

    [1] PyTorch Distributed Overview, PyTorch Documentation (2024), https://pytorch.org/tutorials/beginner/dist_overview.html

    [2] S. Li et al., PyTorch Distributed: Experiences on Accelerating Data Parallel Training (2020), VLDB Endowment

    [3] PyTorch DistributedDataParallel API, https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html

    [4] NCCL: Optimized Primitives for Collective Multi-GPU Communication, NVIDIA, https://developer.nvidia.com/nccl

    [5] PyTorch AMP: Automatic Mixed Precision, https://pytorch.org/docs/stable/amp.html


