When we launched Gaudi accelerators on Amazon's EC2 DL1 instances, we faced a problem that threatened the whole deployment. The performance numbers weren't merely disappointing; they were disastrous. Models in training were losing up to 50% of their performance when scaling across multiple nodes. The culprit? A network topology that routed every byte of data through host memory, creating a bottleneck that undermined everything Gaudi was designed to do.
I led the engineering effort to address this problem, which ultimately resulted in the development of what we now call Peer Direct. It's a feature that transformed how Gaudi accelerators communicate in cloud environments, and its story holds some useful lessons about distributed AI training at scale.
The Problem with Host NICs
Gaudi was designed with the NIC (Network Interface Card) embedded directly in the silicon. Each chip has ten network interfaces that run at 100 Gbps and support RDMA over RoCE v2, allowing devices to access each other's memory directly without involving the CPU. This architecture is highly efficient for AI training workloads, where collective operations like AllReduce must accumulate gradients from dozens or hundreds of devices on every training iteration.
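To make the collective pattern concrete, here is a minimal Python simulation of a ring AllReduce, the canonical bandwidth-optimal way to sum gradients across workers. This is a didactic sketch of the general algorithm, not HCCL's implementation; the hardware runs the same pattern over RDMA instead of Python lists.

```python
def ring_allreduce(grads):
    """Simulate ring AllReduce: grads is a list of per-worker gradient
    vectors (equal length); returns the element-wise sum on every worker."""
    p = len(grads)
    n = len(grads[0])
    assert n % p == 0, "for simplicity, vector length must divide evenly"
    c = n // p  # chunk size; each worker 'owns' one chunk of the result
    bufs = [list(g) for g in grads]

    def chunk(buf, i):
        return buf[i * c:(i + 1) * c]  # copy of chunk i

    # Reduce-scatter: at step s, worker r sends chunk (r - s) % p to its
    # right neighbour, which accumulates it. After p-1 steps, worker r
    # holds the fully reduced chunk (r + 1) % p.
    for s in range(p - 1):
        sends = [(r, (r - s) % p, chunk(bufs[r], (r - s) % p)) for r in range(p)]
        for r, idx, data in sends:
            dst = bufs[(r + 1) % p]
            for j, v in enumerate(data):
                dst[idx * c + j] += v

    # All-gather: each worker forwards the reduced chunk it most recently
    # completed or received; the receiver overwrites its stale copy.
    for s in range(p - 1):
        sends = [(r, (r + 1 - s) % p, chunk(bufs[r], (r + 1 - s) % p)) for r in range(p)]
        for r, idx, data in sends:
            bufs[(r + 1) % p][idx * c:(idx + 1) * c] = data

    return bufs
```

Each worker sends and receives 2(p-1)/p of the data volume regardless of worker count, which is why the per-link bandwidth, and thus the data path, dominates scaling behaviour.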
But cloud deployments don't always accommodate ideal architectures. When Amazon evaluated Gaudi for DL1 instances, they chose to use ordinary host NICs rather than Gaudi's built-in networking. The reasons were pragmatic: cost savings, and the logistics of fitting a new network topology into existing data centre infrastructure. From their business perspective, leveraging established network infrastructure made perfect sense.
From a performance standpoint, it was a disaster. Instead of peer-to-peer RDMA transfers between Gaudi cards, all communication went the long way round. Data had to be copied out of Gaudi's high-bandwidth memory into host DRAM, processed by the host CPU, sent out through the host NIC over TCP/IP, received by the remote host, and copied back into the remote Gaudi's memory. All those extra hops added latency, stole CPU cycles, and imposed bandwidth limits that completely ruined the scalability of distributed training.
The performance shortfall was so severe that it called into question whether the deployment would be worthwhile at all. This wasn't a matter of minor optimisation; it was an existential threat to the entire arrangement with AWS.
Why Performance Matters This Much
It's worth understanding why a 50% performance loss is so disastrous for model training, especially for large models such as GPT-5. Training huge language models takes weeks or months even on enormous clusters. When you're working with models that have billions or trillions of parameters, every percentage point of performance translates directly into time and money.
Consider the economics. If it takes 30 days to train a model instead of 15, you're not only waiting longer; you're paying for double the compute time. At cloud scale, with hundreds or thousands of accelerators in continuous use, this adds up to millions of dollars. Worse, it halves your iteration velocity. In a competitive AI landscape where companies are racing to develop better models, running twice as many experiments in the same timeframe can be the difference between leading and trailing.
Environmental cost matters too. Large models consume enormous amounts of electricity to train. Better performance means less compute time, which reduces energy consumption and carbon emissions. As pressure mounts on the AI industry to cut its carbon footprint, efficiency gains are no longer a luxury but a necessity.
The solution we designed, Peer Direct, delivered RDMA-like performance even though the physical network architecture wasn't suited to conventional RDMA. We needed direct memory access between Gaudi devices on different systems without traversing host memory, over host NICs that were never designed for this in the first place.
The enabler was AWS Elastic Fabric Adapter (EFA), a high-performance network interface for HPC and AI workloads on EC2. EFA provides low-latency, OS-bypass communication, typically with sub-10-microsecond latency. It exposes RDMA-like semantics through libfabric, a user-space communication library that offers a common interface across multiple networking technologies.
The task was to integrate libfabric with Habana's Collective Communication Library (HCCL), which handles all distributed training communication. HCCL was built on the assumption of native RDMA over Gaudi's on-chip NICs. We needed a bridge that would let HCCL use libfabric transparently, without compromising its performance guarantees or its communication semantics.
The solution required several technical advances. First, we introduced a memory registration scheme that allowed libfabric to access Gaudi's high-bandwidth memory directly. We used the Linux kernel's DMA-BUF framework, which provides a standard mechanism for sharing buffers between device drivers. When HCCL needs to transfer data, the Gaudi driver exports a DMA-BUF file descriptor for the memory region, which libfabric can use to perform RDMA transfers directly from device memory.
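The real flow runs through the Gaudi kernel driver and libfabric's C API; the Python sketch below only models the ownership handshake. `gaudi_export_dmabuf` and `fabric_register_dmabuf` are hypothetical stand-in names (not real driver or libfabric functions), and the values they return are dummies, used purely to show how a device-memory region becomes an RDMA-capable handle without ever being copied to host memory.

```python
def gaudi_export_dmabuf(device_ptr, length):
    """Hypothetical: ask the device driver for a DMA-BUF file descriptor
    covering [device_ptr, device_ptr + length) in HBM. fd is a dummy."""
    return {"fd": 42, "offset": 0, "len": length}

def fabric_register_dmabuf(dmabuf):
    """Hypothetical: hand the fd to the fabric provider, which pins the
    region and returns an RDMA memory-region handle plus a remote key."""
    return {"mr": object(), "rkey": 0x1234, "len": dmabuf["len"]}

def register_for_rdma(device_ptr, length):
    # 1. The device driver exports the HBM region as a DMA-BUF.
    dmabuf = gaudi_export_dmabuf(device_ptr, length)
    # 2. The NIC-side library imports the fd and registers the region, so
    #    later transfers read/write device memory with no host-memory copy.
    return fabric_register_dmabuf(dmabuf)

# Register 1 MiB of (pretend) device memory for RDMA.
handle = register_for_rdma(device_ptr=0x7F000000, length=1 << 20)
```

The key design point is that the file descriptor, not a host-memory copy, is what crosses the driver boundary.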
Second, we added an LRU cache for memory registrations. Registration is expensive; it involves kernel calls and setup work that can cause significant overhead. By caching the mapping from memory addresses to their libfabric handles, we could reuse registrations for hot regions, eliminating most registration overhead from actual training.
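The idea can be sketched in a few lines. This is an illustration of the caching pattern, not HCCL's actual data structure: keys are (address, length) ranges, values are opaque registration handles, and the expensive register/deregister calls are injected so the cache itself stays transport-agnostic.

```python
from collections import OrderedDict

class RegistrationCache:
    """LRU cache mapping (addr, length) ranges to registration handles,
    so hot buffers skip the kernel round-trip on every transfer."""

    def __init__(self, capacity, register, deregister):
        self.capacity = capacity
        self.register = register        # slow path, e.g. wraps fi_mr_reg
        self.deregister = deregister    # e.g. closes the memory region
        self.entries = OrderedDict()    # insertion order == LRU order

    def get(self, addr, length):
        key = (addr, length)
        if key in self.entries:
            self.entries.move_to_end(key)   # mark as most recently used
            return self.entries[key]
        handle = self.register(addr, length)  # miss: pay the kernel cost once
        self.entries[key] = handle
        if len(self.entries) > self.capacity:
            _, old = self.entries.popitem(last=False)  # evict LRU entry
            self.deregister(old)
        return handle
```

Repeated transfers from the same buffer, which is the common case in iterative training, then hit the cache and never touch the kernel.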
The result was a communication pipeline that looked like this: HCCL calls the OFI wrapper, which uses the cached libfabric handle to perform an RDMA transfer straight from the source Gaudi's memory to the destination Gaudi's memory, with neither CPU ever being involved. The OFI wrapper was introduced to keep the codebase clean and avoid direct header dependencies: it's a lightweight library that links dynamically into HCCL and enables the use of libfabric without requiring direct integration.
Once the transfer completes, libfabric reports it through a completion queue, and HCCL continues computation with the newly received data.
The Development Experience
Building Peer Direct meant venturing into new territory on a tight schedule. Libfabric wasn't yet mainstream in the AI accelerator space. There was little public documentation and sparse discussion, so much of the work came down to diving into the libfabric source code and reverse-engineering behaviour through experimentation.
Communication with the AWS engineers was essential but time-zone constrained. Working with a team twelve hours ahead meant debug iterations had 24-hour turnarounds. Every issue needed careful documentation and precise communication, because real-time collaboration wasn't possible.
The stakes were high: the entire DL1 deployment was riding on this functionality working. Delays would have derailed a major product launch. Nobody on our team had deep background knowledge of libfabric internals, so we were learning a complex codebase while simultaneously designing a critical integration.
The Results
When we finally deployed Peer Direct, the performance improvements were worth all the effort. We measured a 1.5 to 2x throughput increase for collective operations at a 32MB message size, and the gains held up at larger messages, with up to 1.76x better throughput at a 256MB message size. The CPU overhead that had created the bottleneck disappeared entirely.
Most importantly, these microbenchmark improvements translated directly into real model-training performance. Training Habana's DeepSpeed BERT model with 5 billion parameters across 128 Gaudi devices, we saw substantial throughput gains. Models using more aggressive memory-optimisation techniques such as ZeRO-2, which depend more heavily on collective operations, benefited disproportionately from Peer Direct.
Peer Direct was one of the key enablers of Gaudi performance on AWS DL1 instances, allowing large-scale distributed training to run smoothly on launch day. Beyond that initial impact, the effort laid the groundwork for future high-performance communication features and proved that AI accelerators in the cloud could remain competitive despite the constraints of cloud infrastructure.
The experience reinforced an important lesson in systems engineering: sometimes the most significant performance improvements come not from optimising the fast path, but from eliminating unnecessary detours altogether. In distributed AI training, having data travel directly between accelerators, with no needless copies and no CPU intervention, is the difference between a system that works and one that scales.
Key takeaways? One important lesson from this project is that assumptions about network topology should be tested at the earliest possible stage of distributed training work. Many accelerator stacks were built for an idealised environment and don't account for the extra hops, translation layers, and cost-driven compromises that exist in cloud environments. So before focusing on model-level or kernel-level optimisation, engineers should run simple collective microbenchmarks across the intended topology. If scaling efficiency drops dramatically as node counts or message sizes grow, the likely culprit is the data path, not the kernels. By identifying a host-memory detour early, engineers can focus their efforts where they will have the greatest impact.
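The kind of check suggested above is cheap to automate. The sketch below computes scaling efficiency relative to the smallest configuration from measured aggregate collective throughput; the numbers in the example are invented for illustration, and real values would come from running a collective benchmark on the target topology.

```python
def scaling_efficiency(samples):
    """samples: {node_count: aggregate_throughput_GBps}.
    Returns {node_count: per-node efficiency relative to the smallest run},
    where 1.0 means perfect linear scaling."""
    base_nodes = min(samples)
    per_node_base = samples[base_nodes] / base_nodes
    return {n: (tput / n) / per_node_base for n, tput in sorted(samples.items())}

# Illustrative (made-up) measurements of an AllReduce sweep.
measured = {2: 20.0, 4: 36.0, 8: 48.0, 16: 40.0}
eff = scaling_efficiency(measured)
# A collapse like the 16-node figure here (0.25 of linear) points at the
# data path, i.e. a host-memory detour, rather than the compute kernels.
```

Run early, this one-screen check would have exposed the DL1 host-NIC bottleneck before any kernel tuning began.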
Another important lesson was the need to treat memory registration, not just data transfer, as a first-class performance concern. Registration overhead can vastly exceed the time spent communicating if every transfer requires a fresh registration. The LRU cache for registered memory was an unglamorous addition to HCCL, yet it eliminated a systemic source of latency and made the RDMA path viable for real-world workloads. When building distributed systems, engineers should profile not only the available network bandwidth but also the lifecycle costs of allocating buffers, registering them, and tearing those registrations down. Small changes to these control paths can yield large gains in end-to-end throughput.
Finally, the integration approach used in this project offers a reusable pattern. Instead of rewriting HCCL to use libfabric directly, we created a thin abstraction layer that preserved existing semantics while swapping the underlying transport. This brought several benefits: it minimised risk, reduced code churn, and allowed incremental testing. Teams facing a similar challenge (adapting accelerator-native communication libraries to cloud-native fabrics) should aim to isolate the transport layer, preserve collective semantics, and build small, testable interfaces between the two. This enables faster development and simpler support for future transport backends.
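The pattern can be sketched in miniature. Nothing below reflects HCCL's real interfaces; it simply shows the shape of the idea: collective code depends only on a small `Transport` interface, so backends (native RDMA, a libfabric/OFI wrapper, or an in-process loopback used for tests) can be swapped without touching the collective logic.

```python
from abc import ABC, abstractmethod

class Transport(ABC):
    """Minimal transport interface the collective layer codes against."""
    @abstractmethod
    def send(self, peer: int, buf: bytes) -> None: ...
    @abstractmethod
    def recv(self, peer: int) -> bytes: ...

class LoopbackTransport(Transport):
    """In-process backend: handy for testing collective logic incrementally
    before any real fabric is wired in."""
    def __init__(self):
        self.mailboxes = {}  # peer id -> FIFO of pending messages
    def send(self, peer, buf):
        self.mailboxes.setdefault(peer, []).append(buf)
    def recv(self, peer):
        return self.mailboxes[peer].pop(0)

def exchange(transport: Transport, peer: int, payload: bytes) -> bytes:
    # Collective code never names a concrete backend, so moving from
    # on-chip NICs to a libfabric-backed Transport needs no changes here.
    transport.send(peer, payload)
    return transport.recv(peer)
```

The OFI wrapper played exactly this role for HCCL: a small, dynamically linked seam that kept libfabric types out of the core library while leaving the collective semantics untouched.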
Disclosure: I work as an AI Runtime Team Manager at Intel. The views shared in this article are my own.

