
    AI in Multiple GPUs: Understanding the Host and Device Paradigm

    By Editor Times Featured, February 13, 2026


    This article is part of a series about distributed AI across multiple GPUs:

    • Part 1: Understanding the Host and Device Paradigm (this article)
    • Part 2: Point-to-Point and Collective Operations (coming soon)
    • Part 3: How GPUs Communicate (coming soon)
    • Part 4: Gradient Accumulation & Distributed Data Parallelism (DDP) (coming soon)
    • Part 5: ZeRO (coming soon)
    • Part 6: Tensor Parallelism (coming soon)

    Introduction

    This guide explains the foundational concepts of how a CPU and a discrete graphics card (GPU) work together. It is a high-level introduction designed to help you build a mental model of the host-device paradigm. We will focus specifically on NVIDIA GPUs, which are the most commonly used for AI workloads.

    For integrated GPUs, such as those found in Apple Silicon chips, the architecture is slightly different, and it won't be covered in this post.

    The Big Picture: The Host and the Device

    The most important concept to understand is the relationship between the Host and the Device.

    • The Host: This is your CPU. It runs the operating system and executes your Python script line by line. The Host is the commander; it is in charge of the overall logic and tells the Device what to do.
    • The Device: This is your GPU. It is a powerful but specialized coprocessor designed for massively parallel computations. The Device is the accelerator; it does nothing until the Host gives it a job.

    Your program always starts on the CPU. When you want the GPU to perform a task, like multiplying two large matrices, the CPU sends the instructions and the data over to the GPU.
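As a minimal sketch of this division of labor (falling back to the CPU when no GPU is present), the host below issues all the commands while the device does the arithmetic:

```python
import torch

# Run on the GPU if one is present, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# The host allocates the operands on the device and enqueues the work...
a = torch.randn(1024, 1024, device=device)
b = torch.randn(1024, 1024, device=device)

# ...and the device performs the massively parallel multiplication.
c = a @ b
print(c.shape)  # torch.Size([1024, 1024])
```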

    The CPU-GPU Interaction

    The Host talks to the Device through a queuing system.

    1. CPU Initiates Commands: Your script, running on the CPU, encounters a line of code meant for the GPU (e.g., tensor.to('cuda')).
    2. Commands Are Queued: The CPU doesn't wait. It simply places this command onto a special to-do list for the GPU called a CUDA Stream (more on this in the next section).
    3. Asynchronous Execution: The CPU doesn't wait for the actual operation to be completed by the GPU; the host moves on to the next line of your script. This is called asynchronous execution, and it is key to achieving high performance. While the GPU is busy crunching numbers, the CPU can work on other tasks, like preparing the next batch of data.
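The asynchrony is easy to observe by timing. In this sketch, on a GPU the loop returns to the host almost immediately because each matmul is merely enqueued, while torch.cuda.synchronize() blocks until the queued work finishes; on a CPU-only machine the two timings simply coincide, since CPU tensor ops run synchronously:

```python
import time
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.randn(512, 512, device=device)

start = time.perf_counter()
for _ in range(10):
    x = x @ x  # each matmul is merely enqueued on the GPU's stream
enqueue_time = time.perf_counter() - start

if device.type == "cuda":
    torch.cuda.synchronize()  # block the host until the queue drains
total_time = time.perf_counter() - start

# On a GPU, enqueue_time is tiny compared to total_time: the host returned
# as soon as the commands were queued, long before the math finished.
print(f"enqueue: {enqueue_time:.4f}s  total: {total_time:.4f}s")
```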

    CUDA Streams

    A CUDA Stream is an ordered queue of GPU operations. Operations submitted to a single stream execute in order, one after another. However, operations across different streams can execute concurrently: the GPU can juggle multiple independent workloads at the same time.

    By default, every PyTorch GPU operation is enqueued on the current active stream (usually the default stream, which is created automatically). This is simple and predictable: each operation waits for the previous one to finish before starting. For most code, you never notice this. But it leaves performance on the table when you have work that could overlap.

    Multiple Streams: Concurrency

    The classic use case for multiple streams is overlapping computation with data transfers. While the GPU processes batch N, you can simultaneously copy batch N+1 from CPU RAM to GPU VRAM:

    Stream 0 (compute): [process batch 0]────[process batch 1]───
    Stream 1 (data):   ────[copy batch 1]────[copy batch 2]───

    This pipeline is possible because compute and data transfer happen on separate hardware units inside the GPU, enabling true parallelism. In PyTorch, you create streams and schedule work onto them with context managers:

    import torch

    compute_stream = torch.cuda.Stream()
    transfer_stream = torch.cuda.Stream()
    
    with torch.cuda.stream(transfer_stream):
        # Enqueue the transfer on transfer_stream; for the copy to be truly
        # asynchronous, next_batch_cpu should live in pinned (page-locked) memory
        next_batch = next_batch_cpu.to('cuda', non_blocking=True)
    
    with torch.cuda.stream(compute_stream):
        # This runs concurrently with the transfer above
        output = model(current_batch)

    Note the non_blocking=True flag on .to(). Without it, the transfer would still block the CPU thread even when you intend it to run asynchronously.

    Synchronization Between Streams

    Since streams are independent, you need to explicitly signal when one depends on another. The blunt tool is:

    torch.cuda.synchronize()  # waits for ALL streams on the device to finish

    A more surgical approach uses CUDA Events. An event marks a specific point in a stream, and another stream can wait on it without halting the CPU thread:

    event = torch.cuda.Event()
    
    with torch.cuda.stream(transfer_stream):
        next_batch = next_batch_cpu.to('cuda', non_blocking=True)
        event.record()  # mark: transfer is done
    
    with torch.cuda.stream(compute_stream):
        compute_stream.wait_event(event)  # don't start until transfer completes
        output = model(next_batch)

    This is more efficient than stream.synchronize() because it only stalls the dependent stream on the GPU side — the CPU thread stays free to keep queuing work.

    For day-to-day PyTorch training code you won’t need to manage streams manually. But features like DataLoader(pin_memory=True) and prefetching rely heavily on this mechanism under the hood. Understanding streams helps you recognize why those settings exist and gives you the tools to diagnose subtle performance bottlenecks when they appear.
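As an illustration of how these pieces fit together, a typical training loop pairs pin_memory=True with non_blocking copies. This is a sketch using a hypothetical toy dataset; the loop body is a placeholder for the real forward/backward pass:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy dataset: 256 samples of 32 features each, with class labels.
dataset = TensorDataset(torch.randn(256, 32), torch.randint(0, 10, (256,)))

# pin_memory=True stages each host-side batch in page-locked RAM, which is
# what allows the later non_blocking=True copy to run asynchronously.
loader = DataLoader(dataset, batch_size=64,
                    pin_memory=torch.cuda.is_available())

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
for features, labels in loader:
    features = features.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    # ... forward pass, loss, backward pass would go here ...
```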

    PyTorch Tensors

    PyTorch is a powerful framework that abstracts away many details, but this abstraction can sometimes obscure what is happening under the hood.

    When you create a PyTorch tensor, it has two parts: metadata (like its shape and data type) and the actual numerical data. So when you run something like t = torch.randn(100, 100, device=device), the tensor's metadata is stored in the host's RAM, while its data is stored in the GPU's VRAM.

    This distinction is important. When you run print(t.shape), the CPU can immediately access this information because the metadata is already in its own RAM. But what happens if you run print(t), which requires the actual data living in VRAM?
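The distinction can be sketched like this: the metadata queries below never leave the host, while pulling a value back forces a device-to-host copy (.item() is the usual way to bring a single scalar to the CPU):

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
t = torch.randn(100, 100, device=device)

# Metadata lives in host RAM: these queries never touch the device.
print(t.shape, t.dtype)

# The values live in device memory: reading one forces a device-to-host
# copy (and a synchronization if the GPU is still computing them).
first_value = t[0, 0].item()
```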

    Host-Device Synchronization

    Accessing GPU data from the CPU can trigger a Host-Device Synchronization, a common performance bottleneck. This occurs whenever the CPU needs a result from the GPU that isn’t yet available in the CPU’s RAM.

    For example, consider the line print(gpu_tensor) which prints a tensor that is still being computed by the GPU. The CPU cannot print the tensor’s values until the GPU has finished all the calculations to obtain the final result. When the script reaches this line, the CPU is forced to block, i.e. it stops and waits for the GPU to finish. Only after the GPU completes its work and copies the data from its VRAM to the CPU’s RAM can the CPU proceed.

    As another example, what’s the difference between torch.randn(100, 100).to(device) and torch.randn(100, 100, device=device)? The first method is less efficient because it creates the data on the CPU and then transfers it to the GPU. The second method is more efficient because it creates the tensor directly on the GPU; the CPU only sends the creation command.
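Side by side, the two creation paths look like this. A 100×100 float32 tensor is about 40 KB, so the first version fills and ships a 40 KB host buffer that the second never allocates:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Less efficient: fill ~40 KB of host RAM with random numbers,
# then copy the whole buffer over to the device.
t1 = torch.randn(100, 100).to(device)

# More efficient: the host only enqueues a creation command and the
# random values are generated directly in device memory.
t2 = torch.randn(100, 100, device=device)
```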

    These synchronization points can severely impact performance. Effective GPU programming involves minimizing them to ensure both the Host and Device stay as busy as possible. After all, you want your GPUs to go brrrrr.

    Image by author: generated with ChatGPT

    Scaling Up: Distributed Computing and Ranks

    Training large models, such as Large Language Models (LLMs), often requires more compute power than a single GPU can offer. Coordinating work across multiple GPUs brings you into the world of distributed computing.

    In this context, a new and important concept emerges: the Rank.

    • Each rank is a CPU process which gets assigned a single device (GPU) and a unique ID. If you launch a training script across two GPUs, you will create two processes: one with rank=0 and another with rank=1.

    This means you are launching two separate instances of your Python script. On a single machine with multiple GPUs (a single node), these processes run on the same machine but remain independent, without sharing memory or state. Rank 0 commands its assigned GPU (cuda:0), while Rank 1 commands another GPU (cuda:1). Although both ranks run the same code, you can leverage a variable that holds the rank ID to assign different tasks to each GPU, like having each one process a different portion of the data (we'll see examples of this in the next blog post of this series).
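A minimal sketch of how a rank picks its device, assuming a torchrun-style launcher that sets the RANK, LOCAL_RANK, and WORLD_SIZE environment variables for each process it spawns (the defaults let the same script also run standalone as a single rank; the 100-sample dataset is hypothetical):

```python
import os
import torch

# A launcher such as torchrun sets these for every process it spawns;
# the defaults let the same script run standalone as a single rank.
rank = int(os.environ.get("RANK", 0))
local_rank = int(os.environ.get("LOCAL_RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

# Each rank drives exactly one GPU on its node.
device = torch.device(f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")

# The rank ID lets identical scripts do different work: here each rank
# takes every world_size-th sample of a (hypothetical) 100-sample dataset.
shard = list(range(rank, 100, world_size))
print(f"rank {rank}/{world_size} on {device} handles {len(shard)} samples")
```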

    Conclusion

    Congratulations for reading all the way to the end! In this post, you learned about:

    • The Host/Device relationship
    • Asynchronous execution
    • CUDA Streams and how they enable concurrent GPU work
    • Host-Device synchronization

    In the next blog post, we will dive deeper into Point-to-Point and Collective Operations, which enable multiple GPUs to coordinate complex workflows such as distributed neural network training.


