Introduction
Modern machine learning calls for large-scale models and data, pushing compute hardware to its limits. Whether you're training models on complex images, processing long-context documents, or running high-throughput reinforcement learning environments, maximizing your GPU efficiency is essential. It's not uncommon to be training or running inference on models with billions of parameters across terabytes of data. An unoptimized setup can turn a quick experiment into hours or days of waiting.
When training or inference crawls, our instinct is often to blame the model size or mathematical complexity. Modern GPUs are fast calculators, but they depend on the CPU to allocate work and on where data is stored on the device. Usually, computation on the GPU is not the bottleneck. If your CPU is struggling to load, preprocess, and transfer your batches across the PCIe bridge, your GPU sits idle, starving for data.
The good news? You don't need to write custom CUDA kernels or debug low-level GPU code to fix it. If you are an ML researcher, engineer, or a hobbyist interested in optimizing GPU pipelines, this blog is for you! In this post, we explore the mechanics of this bottleneck and walk through actionable engineering decisions to maximize GPU utilization. We will cover everything from fundamental PyTorch pipeline tweaks to more advanced hardware optimizations and Hugging Face integrations.
💡 Note
We will assume a basic working knowledge of Python and PyTorch DataLoaders going forward. No deep understanding of GPU architecture is required, as we will provide a high-level overview of the GPU and how it works. All techniques discussed apply to both training and inference unless explicitly stated.
GPU Overview
To start, Graphics Processing Units (GPUs) exploded in popularity alongside deep learning due to their ability to train and run models at lightning speed with parallelizable operations. But what does a GPU actually do? Before optimizing our code, we need a shared mental model of what happens under the hood in a GPU, how it differs from a CPU, and the dataflow between the two.
What is a GPU? How does it differ from a CPU?
GPUs don't universally outperform CPUs. CPUs are designed to solve highly sequential problems with low latency and complex branching (ideal qualities for running an operating system). GPUs, on the other hand, contain thousands of cores optimized to complete basic operations in parallel [1]. While a CPU would need to process a thousand matrix multiplications sequentially (or up to the limit of its cores), a GPU can run all those operations in parallel, in a fraction of a second. In machine learning, we rarely deal with highly sequential problems requiring a CPU. Most operations are matrix multiplications, a highly parallelizable task.
At a high level, a GPU consists of thousands of tiny processing cores grouped into Streaming Multiprocessors (SMs) designed for massive parallel computation. SMs manage, schedule, and execute hundreds of threads concurrently. A high-bandwidth memory pool, Video RAM (VRAM), surrounds the compute units alongside ultra-fast caches that temporarily hold data for quick access. VRAM is the main warehouse where your model weights, gradients, and incoming data batches live. CPUs and GPUs communicate with each other via an interface bridge, which we will analyze in more depth below, as this bridge is where bottlenecks occur.
💡 Note
If you are using NVIDIA GPUs, there is another component within the GPU called Tensor Cores, which accelerate the mixed-precision matrix math used in machine learning. They will come up again when we discuss Mixed Precision.
PCIe Bridge
As we just mentioned, data travels from the CPU to the GPU across an interface bridge called Peripheral Component Interconnect Express (PCIe). Data originates on your disk, loads into CPU RAM, and then crosses the PCIe bus to reach the GPU's VRAM. Every time you send a PyTorch tensor to the device using .to('cuda'), you are invoking a transfer across this bridge. If your CPU is constantly sending tiny tensors one at a time instead of large, contiguous blocks, it quickly clogs the bridge with latency and overhead.
What’s GPU Utilization?
Now that we have now lined GPU anatomy, we have to perceive the metrics we’re monitoring. When utilizing nvidia-smi, Weights and Biases, PyTorch profiler, NVIDIA Nsight Programs, or some other technique of GPU monitoring, we usually analyze two foremost percentages: Reminiscence Utilization and Unstable GPU-Util.
- Reminiscence Utilization (VRAM): VRAM is the GPU’s bodily reminiscence. Your VRAM may be at 100% capability whereas your GPU is doing nothing. Excessive VRAM utilization solely means you will have efficiently loaded your mannequin weights, gradients, and a batch of knowledge onto the GPU’s bodily reminiscence.
- Unstable GPU-Util (Compute Utilization): That is the essential metric. It measures the proportion of time over the previous pattern interval (normally 1 second) that the GPU’s computing kernels had been actively executing directions. The objective is to constantly maximize this share!
CPU-GPU Bottleneck
Now that we have covered CPUs and GPUs, let's look at how the CPU-GPU bottleneck occurs and what we can do to fix it. The GPU has thousands of cores ready to parallelize operations, but it needs the CPU to delegate tasks. When you train a model, your GPU cannot read directly from your SSD. The CPU must load and decode the raw data, apply augmentations, batch it, and hand it off. If your CPU takes 50 milliseconds to prepare a batch, and your GPU only takes 10 milliseconds to compute the forward and backward passes, your GPU spends 40 milliseconds idling.
Roofline Model
This problem is formalized in the Roofline Model. It plots performance (FLOPs/second) against arithmetic intensity (FLOPs/byte), FLOPs being floating-point operations. When arithmetic intensity is low (you load an enormous amount of data but do very little math with it), you hit the slanted "Memory-Bound" roof. When arithmetic intensity is high (you load a small amount of data but do an enormous amount of matrix multiplication with it), you hit the flat "Compute-Bound" roof.
GPU parallelism is rarely the bottleneck for research experiments. Typically, slowdowns occur in the memory-bound regime: CPU data parsing, PCIe bus clogging, or VRAM bandwidth limits. The key here is almost always better dataflow management.
Optimizing the Data Pipeline
Monitoring GPU Utilization
Before we can optimize the data pipeline, we must understand how to monitor GPU utilization and VRAM. The easiest way is to use nvidia-smi to get a table with all available GPUs, current VRAM usage, and Volatile GPU-Util.
nvidia-smi. The CUDA and driver versions are shown in the header. Each row of the table represents a GPU. The columns show GPU ID, Power Usage, GPU Utilization, and Memory Usage. Image by Author.
With watch -n 1 nvidia-smi, metrics can be monitored and updated every second. However, the best way to get more detailed GPU metrics is either the PyTorch Profiler or Weights and Biases. NVIDIA Nsight Systems is also a great monitoring tool that is relatively quick to set up. Weights and Biases provides the simplest visualization of GPU utilization graphs for our purposes, and these graphs are the easiest way to diagnose poor GPU optimization. An example setup for Weights and Biases is shown below (taken directly from Weights and Biases' documentation [2]):
import wandb

# Project that the run is recorded to
project = "my-awesome-project"

# Dictionary with hyperparameters
config = {"epochs": 1337, "lr": 3e-4}

# The `with` syntax marks the run as finished upon exiting the `with` block,
# and it marks the run "failed" if there is an exception.
#
# In a notebook, it may be more convenient to write `run = wandb.init()`
# and manually call `run.finish()` instead of using a `with` block.
with wandb.init(project=project, config=config) as run:
    # Training code here
    # Log values to W&B with run.log()
    run.log({"accuracy": 0.9, "loss": 0.1})
The clearest indicator of an unoptimized GPU pipeline is a sawtooth GPU utilization graph. This is where GPU utilization idles at 0%, briefly spikes to 100%, and then idles back at 0%, signifying a CPU-to-GPU bottleneck. Hitting periodic 100% utilization is not a sign that GPU utilization is maximized. The GPU is tearing through the available data in a fraction of a second, and the 0% valleys represent the wait for the CPU to prepare the next batch. The goal is steady utilization: a flat, unbroken line near 100%, meaning the GPU never has to wait. An example of a sawtooth GPU utilization graph in the same format as Weights and Biases is shown below:

Let's see why this happens in a basic PyTorch DataLoader. By default, a PyTorch DataLoader could be defined as follows:
DataLoader(dataset, batch_size=32, shuffle=True, num_workers=0, pin_memory=False)
With num_workers=0 and pin_memory=False (the default values in a DataLoader), the main Python process has to do everything sequentially:
- Fetch the files from the disk.
- Apply image augmentations or preprocess text.
- Move the batch to the GPU.
- GPU computes the forward and backward passes.
This is the worst-case scenario for GPU utilization. During steps 1-3, GPU utilization sits at 0%. When step 3 is complete, GPU utilization spikes to 100%, and then steps 1 and 2 are repeated for the next batch.
The next few sections discuss how to optimize the data pipeline.
num_workers (Parallelizing the CPU)
The most impactful fix is parallelizing data preparation. By increasing num_workers, you tell PyTorch to spawn dedicated subprocesses that fetch and prepare batches in the background while the GPU computes. However, more workers don't always mean more speed. If you have an 8-core CPU and set num_workers=16, you'll slow down your training due to context-switching overhead and Inter-Process Communication (IPC). Each worker creates a copy of the dataset in memory, so too many workers can cause memory thrashing and crash your system. A good rule of thumb is starting at num_workers=4 and profiling from there.
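As a minimal sketch (using a toy in-memory TensorDataset in place of a real disk-backed Dataset), parallel loading is a single constructor argument:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset standing in for a real, disk-backed Dataset
dataset = TensorDataset(torch.randn(256, 3, 32, 32), torch.randint(0, 10, (256,)))

# num_workers=4 spawns 4 subprocesses that fetch and collate batches
# in the background while the main process feeds the GPU;
# persistent_workers avoids re-forking the workers every epoch
loader = DataLoader(dataset, batch_size=32, shuffle=True,
                    num_workers=4, persistent_workers=True)

images, labels = next(iter(loader))
print(images.shape)  # torch.Size([32, 3, 32, 32])
```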
💡 Note
Optimal num_workers won't fix a slow Dataset implementation. To keep __getitem__ efficient, avoid instantiating objects, opening per-item DB connections, or doing heavy preprocessing in the function call. Its only job is to fetch raw bytes, convert them to a tensor, and return.
pin_memory=True (Optimized Data Transfer)
Even with background workers, how the data physically transfers across the PCIe bridge matters. Normally, tensors don't go directly to the GPU when you transfer them. Data is first read from disk into pageable system RAM and copied by the CPU into a special non-pageable (or page-locked) area of RAM before crossing the PCIe bus to GPU VRAM.
Setting pin_memory=True creates a data fast lane. It instructs your DataLoader to allocate batches directly in page-locked memory. This allows the GPU to use Direct Memory Access (DMA) to pull the data straight across the bridge without the CPU acting as a middleman for the final transfer, significantly reducing latency.
pin_memory=True comes with a hardware trade-off. The operating system cannot swap page-locked memory to the hard drive if it runs out of space. Normally, data can be swapped to disk when RAM fills up, but if you run out of page-locked RAM, an Out of Memory error will be thrown. If you hit an OOM in your script, check the pin_memory flag before moving on to more complex debugging. Additionally, be careful combining pin_memory=True with a high num_workers count: because each worker process is actively producing and holding batches in memory, this can rapidly inflate your system's locked RAM footprint.
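A short sketch of the pinned-memory path, hedged for machines without a GPU by gating on torch.cuda.is_available():

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(256, 128), torch.randint(0, 10, (256,)))

# Pinning only pays off (and only works) when a CUDA device is present
use_cuda = torch.cuda.is_available()
loader = DataLoader(dataset, batch_size=32, pin_memory=use_cuda)

device = torch.device("cuda" if use_cuda else "cpu")
x, y = next(iter(loader))
# non_blocking=True lets the DMA engine overlap the copy with CPU work,
# but only when the source batch lives in page-locked memory
x = x.to(device, non_blocking=True)
y = y.to(device, non_blocking=True)
print(x.shape)  # torch.Size([32, 128])
```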
prefetch_factor (Queueing up Data)
Sometimes the bottleneck isn't the CPU's processing power but the disk itself. When reading thousands of files from a network drive, there can be sudden spikes in I/O latency through no fault of the user.
The prefetch_factor argument dictates how many batches each worker should prepare and hold in a queue on the CPU in advance. If you have 4 workers and prefetch_factor=2, the CPU will always try to keep 8 ready-to-go batches queued up. If the disk suddenly hangs for half a second on a corrupted file, the GPU won't starve; it just pulls from the prefetch queue while the worker catches up.
💡 Note
Be careful not to set prefetch_factor too high, as it can cause the GPU to wait for the CPU to catch up. A good rule of thumb is to set prefetch_factor to 2 or 3. Setting it too high can also trigger CUDA Out of Memory errors.
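Putting the three knobs together, a tuned loader might look like the following (the values are illustrative starting points, not universal optima):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(512, 64), torch.randn(512, 1))

loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4,                         # parallel CPU-side preparation
    pin_memory=torch.cuda.is_available(),  # page-locked staging for DMA
    prefetch_factor=2,                     # 4 workers x 2 = 8 batches queued
    persistent_workers=True,               # keep workers alive across epochs
)

num_batches = sum(1 for _ in loader)
print(num_batches)  # 512 / 32 = 16
```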
By adjusting these parameters, you can smooth sawtooth utilization into a high, steady utilization curve. This is an example of what you should be aiming for:

The GPU utilization is now a high, steady line near 100%, meaning the GPU never has to wait! Just by tuning data loading parameters, we went from inefficient GPU utilization to effective, steady utilization.
Compute and Memory on the GPU
Once the DataLoader is optimized, data is flying across the PCIe bridge without creating the sawtooth GPU utilization bottleneck. But once the data reaches GPU VRAM, how do we make sure it is being used efficiently?
Batch Size
Let's revisit the Roofline Model. To escape the slanted "Memory-Bound" roof and reach the flat "Compute-Bound" maximum performance of your GPU, you need high arithmetic intensity. The easiest way to increase arithmetic intensity is to increase batch size. Loading a single large 1024×1024 matrix and doing the math all at once is more efficient for the GPU's streaming multiprocessors than loading 32 smaller matrices sequentially.
In theory we should just load all of our data at once in a single batch, but in practice this ends in the dreaded CUDA Out of Memory error. This means you are trying to load more data into VRAM than there is space to allocate on the GPU.
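One pragmatic way to pick a batch size is to probe for the largest one that fits. The helper below is a hypothetical utility (not part of PyTorch) that doubles the batch until the forward pass raises CUDA Out of Memory; on a CPU-only machine it simply climbs to the limit:

```python
import torch

def max_batch_size(model, sample_shape, device, start=8, limit=4096):
    """Double the batch size until the forward pass OOMs (or `limit` is hit);
    return the largest size that ran successfully."""
    best, bs = None, start
    while bs <= limit:
        try:
            with torch.no_grad():
                model(torch.randn(bs, *sample_shape, device=device))
            best, bs = bs, bs * 2
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()  # release the failed allocation
            break
    return best

model = torch.nn.Linear(32, 2)
print(max_batch_size(model, (32,), "cpu", start=8, limit=64))  # 64
```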
💡 Note
You may be wondering why batch size, num_workers, and many other deep learning parameters are often a power of two. It's not just a de facto rule but a result of NVIDIA hardware design.
- Inside an SM, a GPU doesn't execute threads individually. It groups them into units called Warps, and on NVIDIA GPUs, a warp always contains exactly 32 threads.
- Beyond the 32-thread warp size, the physical memory controllers on a GPU fetch data from VRAM in power-of-2 byte chunks. If your tensor dimensions are not aligned with these chunks, the GPU has to perform multiple memory fetches to grab the overflowing data.
For maximum efficiency, choose multiples of 32 or 64 (or 8 if you have memory limitations) for batch size and powers of two for other deep learning parameters.
Mixed Precision
By default, PyTorch initializes all model weights and data in FP32 (32-bit floating point). For almost all deep learning tasks, this is overkill. The solution is quantization: casting tensors down to FP16 or BF16 (16-bit). Why does this matter for utilization?
- It halves the memory bottleneck: only half as many bytes move across the PCIe bridge and through the GPU's internal VRAM.
- It unlocks the hardware: modern NVIDIA GPUs possess specialized silicon called Tensor Cores. These cores sit entirely idle if you pass them FP32 math; they are specifically engineered to execute 16-bit matrix multiplications at high speed.
Before casting to FP16, verify on a subsampled dataset that performance stays the same, to confirm the task at hand doesn't require FP32. Another option is PyTorch's torch.autocast. This built-in context manager wraps the forward pass and automatically figures out which operations are safe to cast to 16-bit (like matrix multiplications) and which need to stay in 32-bit for numerical stability (like Softmax or LayerNorm). It is essentially a free 2x speedup.
However, FP16 is not always the best choice. On modern NVIDIA architectures (A100s or H100s), BF16 should be used instead of FP16 to avoid NaN losses and gradient underflow. Another good option is NVIDIA's proprietary TF32 (TensorFloat-32) format, a 19-bit floating-point format that maintains FP32-level accuracy with up to a 10x speedup over FP32 on A100s and H100s [3].
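A minimal autocast sketch (device-agnostic: CPU autocast supports BF16, so the same code runs without a GPU):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Sequential(
    torch.nn.Linear(512, 512), torch.nn.ReLU(), torch.nn.Linear(512, 10)
).to(device)
x = torch.randn(64, 512, device=device)

# Weights stay in FP32; autocast runs eligible ops (matmuls) in BF16
# while numerically sensitive ops remain in FP32
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    logits = model(x)

print(logits.dtype)  # torch.bfloat16
```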
Gradient Accumulation (Training Only)
When training a model, instead of trying to force a large batch size into VRAM and crashing, use a smaller "micro-batch". With micro-batching, instead of updating the model's weights immediately (calling optimizer.step()), you accumulate gradients (loss.backward()) over several consecutive forward passes, creating an "effective batch size". For example, a batch size of 8 with 8 steps of gradient accumulation yields the same mathematical update as a batch size of 64 with a single update. Micro-batching can stabilize training without blowing up your VRAM footprint.
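A sketch of the accumulation loop (toy model and random micro-batches for illustration):

```python
import torch

model = torch.nn.Linear(16, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()

accum_steps = 8  # 8 micro-batches of 8 samples = effective batch size 64
micro_batches = [(torch.randn(8, 16), torch.randn(8, 1)) for _ in range(accum_steps)]

w0 = model.weight.detach().clone()
opt.zero_grad()
for step, (x, y) in enumerate(micro_batches, start=1):
    loss = loss_fn(model(x), y) / accum_steps  # scale so gradients average
    loss.backward()                            # gradients accumulate in .grad
    if step % accum_steps == 0:                # one update per effective batch
        opt.step()
        opt.zero_grad()

print(torch.equal(model.weight, w0))  # False: exactly one update was applied
```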
Kernel Efficiency
The final efficiency concept concerns kernel efficiency in training and inference. This is an exploration of how CUDA kernels work inside a GPU for those aspiring to work with custom architectures. PyTorch 2.0+ (which is almost always used) abstracts this away from the user with a simple command shown below. Let's start with a deeper dive into the GPU and how kernels function.
Every time you execute an operation in PyTorch (we will use d = a + b + c as a simple example), you are launching a "kernel" on the GPU. Abstracting away the intricacies of a GPU, it can't do the math in one go. The GPU must:
- Read a and b from VRAM into the SM cache.
- Compute a + b.
- Write that intermediate result back to VRAM.
- Read that intermediate result and c from VRAM.
- Compute the final addition.
- Write d back to VRAM.
This is kernel overhead. When building custom architectures from scratch, it's easy to accidentally create hundreds of tiny, sequential reads and writes, and the GPU cores spend all their time waiting on internal VRAM rather than doing math. Fusing custom CUDA kernels can reduce this overhead, but luckily PyTorch 2.0+ handles it implicitly with torch.compile(). PyTorch analyzes the entire computational graph and uses OpenAI's Triton to automatically write highly optimized, fused kernels that can shave hours off a long training run by shortening memory round-trips.
While torch.compile() is phenomenal for automatic, general-purpose operator fusion, sometimes squeezing out performance gains requires highly specialized kernels. Historically, integrating hand-written CUDA or Triton kernels into your research meant wrestling with complex C++ build systems and matching CUDA toolkit versions. Luckily, the Hugging Face kernels library [4] treats low-level compute operations like pretrained models: instead of compiling from source, you can fetch pre-compiled, hardware-optimized binaries directly from the Hub with a single Python function call. The library automatically detects your exact PyTorch and GPU environment and downloads the correct match in seconds.
A simple example of using the Hugging Face kernels library is shown below (from the kernels library documentation [4]):
import torch
from kernels import get_kernel

# Download optimized kernels from the Hugging Face Hub
activation = get_kernel("kernels-community/activation")

# Random tensor
x = torch.randn((10, 10), dtype=torch.float16, device="cuda")

# Run the kernel
y = torch.empty_like(x)
activation.gelu_fast(y, x)
print(y)
Hugging Face Transformers
As a quick aside, the Hugging Face transformers library [5] includes all the GPU optimization functionality we have discussed through its TrainingArguments and Trainer classes. For the greatest abstraction, you can simply use the example below to train a model through Hugging Face.
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    # Specify an output directory
    output_dir="./results",
    # Data pipeline
    dataloader_num_workers=4,
    dataloader_pin_memory=True,
    dataloader_prefetch_factor=2,
    # Compute and memory
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    # Precision and hardware
    bf16=True,
    tf32=True,
    # Kernel efficiency
    torch_compile=True,
)
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()
Conclusion
Optimizing your GPU pipeline comes down to two core principles: keeping the GPU constantly fed with data and making every operation count once the data arrives.
On the data pipeline side, tuning the DataLoader by increasing num_workers, enabling pin_memory, and setting a prefetch_factor yields steadier GPU utilization. On the compute side, maximizing batch size (or employing gradient accumulation), dropping to mixed precision (FP16/BF16 or TF32), and fusing operations via torch.compile() or the Hugging Face kernels library drastically reduce VRAM traffic and kernel overhead.
Together, these tweaks turn hours of wasted, memory-bound idle time into a high-speed, fully utilized research pipeline.
References
[1] CPU vs. GPU layout – NVIDIA CUDA Programming Guide
[2] Weights and Biases Setup – Weights and Biases GitHub
[3] TensorFloat-32 Precision Format – NVIDIA Blog
[4] Hugging Face kernels library – Hugging Face GitHub
[5] Hugging Face transformers library – Hugging Face GitHub

