Accept packet loss on purpose. Spray every transfer across hundreds of random paths. If someone handed you this list of design choices for a network connecting 131,000 GPUs, you’d assume it was written by someone who had never operated a production network.
A consortium of OpenAI, AMD, Broadcom, Intel, Microsoft, and NVIDIA built exactly this, and quietly inverted three decades of consensus about how high-performance data center networks should work.
The protocol is called MRC, short for Multipath Reliable Connection. It was released on May 5, 2026 through the Open Compute Project. The accompanying research paper (Araujo et al., 2026) details its deployment across OpenAI’s largest NVIDIA GB200 supercomputers, including the Stargate site with Oracle Cloud Infrastructure in Abilene, Texas, and Microsoft’s Fairwater supercomputers. MRC has been used to train the latest frontier models behind ChatGPT and Codex.
What’s most striking on a close reading of the paper is something the press coverage has not surfaced: MRC effectively eliminates the entire Layer 3 control plane from the data center fabric. No OSPF. No BGP. No IS-IS. No FIB. The switches in the deployment maintain zero dynamic forwarding state. To the author’s knowledge, this is the most aggressive elimination of dynamic routing in any production AI training fabric publicly documented to date.
The paper’s core argument is that at 100,000+ GPU scale, tail latency from network congestion and failures dominates training performance, and the conventional networking stack cannot solve this without fundamental changes to how packets move between GPUs. MRC is those fundamental changes, implemented in 800 Gb/s NICs from three different silicon vendors and deployed in production.
What makes MRC worth studying carefully is not that it is fast. It is that the design choices behind it contradict several principles that the networking community has treated as settled for decades. Understanding why these choices work at this scale, and where they might not, matters for anyone building or operating AI infrastructure.
Left: conventional RoCE with single-path routing. A congested T1 link triggers PFC PAUSE that propagates backward, blocking GPU 2 even though its own path was clear. All 100,000 GPUs idle until GPU 2’s transfer completes. Right: MRC sprays packets across 8 independent planes. When a link fails in Plane 2, the NIC retires that entropy value and redistributes traffic to the remaining 7 planes in microseconds. No GPU ever stalls. The five numbered design choices at the bottom are the subject of this article.
[Source: Author (SVG created using Inkscape) – Reference:arXiv:2605.04333 (2026)]
Each of MRC’s choices is individually familiar to anyone who has followed networking research. The combination is what is radical. The networking community has explored every one of these ideas in isolation: multi-plane fabrics, source routing, packet spraying, lossy transports with selective retransmission, ECN as a load-balancing signal. What makes MRC worth careful study is that the OpenAI consortium committed to all of them, simultaneously, in production at 131,000 GPUs.
The problem: one straggler blocks 100,000 GPUs
Synchronous pretraining runs in lock-step. Each training step involves millions of data transfers across thousands of GPUs performing a mix of tensor parallelism, pipeline parallelism, data parallelism, and expert parallelism. The step cannot advance until the slowest transfer completes. At 100,000 GPUs, the duration of each communication round is determined by the tail of the transfer latency distribution, not the mean.
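To make the tail argument concrete, here is a minimal sketch. It assumes every transfer independently has the same small chance of being slow, which is an illustrative simplification and not a model taken from the paper. The function names and the example probability are hypothetical.

```python
def straggler_probability(num_transfers: int, p_slow: float) -> float:
    """Chance that at least one of num_transfers independent transfers is slow."""
    return 1.0 - (1.0 - p_slow) ** num_transfers


def step_time(transfer_times_ms: list[float]) -> float:
    """A lock-step round finishes only when its slowest transfer finishes."""
    return max(transfer_times_ms)


if __name__ == "__main__":
    # Assumed per-transfer slow probability of 1 in 100,000 (illustrative only).
    for n in (100, 10_000, 1_000_000):
        print(f"{n:>9} transfers -> P(at least one straggler) = "
              f"{straggler_probability(n, 1e-5):.4f}")
```

With a fixed per-transfer risk, the chance that a round contains at least one straggler climbs toward certainty as the number of transfers grows, which is why the round time tracks the tail and not the mean.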
The paper frames this precisely: “As computations scale, communication becomes increasingly outlier-dominated.” A single congested link, a single flow collision, a single switch buffer overflow can stall thousands of GPUs for milliseconds. At the hourly cost of 100,000 H100-class GPUs (roughly $300,000 per hour at cloud rates), a 10-millisecond stall that occurs once per training step and repeats across thousands of steps is not a rounding error. It is a line item.
Network failures compound the problem. At this scale, link flaps, optic failures, and switch reboots are not rare events. They are statistical certainties that occur multiple times per day across a fabric with hundreds of thousands of links. The paper reports a production incident where an optical transceiver on a T0 switch “suffered a glitch, and flapped all its 4 links in rapid succession,” affecting three active training nodes simultaneously. In a conventional network, this would have crashed the training job.
MRC’s design goal was not just higher bandwidth. It was predictable bandwidth, even in the presence of failures, with a control plane simple enough that a small team can manage multiple supercomputers simultaneously.
The topology: 131,000 GPUs in two switch tiers
The first design decision is architectural, not protocol-level. Instead of treating an 800 Gb/s NIC as one fat pipe, MRC splits it into eight 100 Gb/s links, each connecting to a different switch. This creates eight parallel network planes, each operating independently.
Consider a conventional approach. Today’s fastest datacenter Ethernet switches offer 51.2 Tb/s of switching capacity, yielding 64 ports at 800 Gb/s. In a standard fat-tree Clos topology, each Tier-0 (T0) switch connects down to 32 NICs and up to 32 Tier-1 (T1) switches. Each T1 switch connects to 64 pods. That gives you a 3-tier network supporting roughly 64,000 GPUs at full bisection bandwidth. To reach 100,000, you need a fourth tier, which adds latency, cost, and failure domains.
Now split the NIC. The same 51.2 Tb/s switch at 100 Gb/s per port gives you 512 ports instead of 64. Each T0 switch connects down to 256 NIC ports and up to 256 T1 switches. Each T1 connects to 512 T0s. A single two-tier plane supports 131,072 GPUs at full bisection bandwidth.
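The port arithmetic above can be checked with a short sketch. It uses the standard full-bisection fat-tree capacity formulas (radix²/2 endpoints for two tiers, radix³/4 for three). The function names are illustrative and do not come from the paper.

```python
def ports_per_switch(switch_tbps: float, port_gbps: int) -> int:
    """Number of ports a switch ASIC exposes at a given per-port speed."""
    return int(round(switch_tbps * 1000 / port_gbps))


def two_tier_hosts(radix: int) -> int:
    """Full-bisection two-tier Clos: radix T0 switches, each with radix/2 down ports."""
    return radix * (radix // 2)


def three_tier_hosts(radix: int) -> int:
    """Full-bisection three-tier fat tree built from radix-port switches."""
    return radix ** 3 // 4


if __name__ == "__main__":
    fat = ports_per_switch(51.2, 800)    # one 800 Gb/s pipe per NIC
    thin = ports_per_switch(51.2, 100)   # eight 100 Gb/s links per NIC
    print("800G radix:", fat, "-> 3-tier endpoints:", three_tier_hosts(fat))
    print("100G radix:", thin, "-> 2-tier endpoints per plane:", two_tier_hosts(thin))
```

Because each GPU’s NIC contributes one 100 Gb/s port to each of the eight planes, the per-plane endpoint count is also the GPU count.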
The paper quantifies the savings:
Conventional 3-tier (800 Gb/s):
- 3 switch tiers, 64-port switches
- Max ~64K GPUs at full bisection BW
- 5-hop or 7-hop worst-case path
Multi-plane 2-tier (8 × 100 Gb/s):
- 2 switch tiers, 512-port switches
- 131K GPUs at full bisection BW
- 3-hop worst-case path
- 2/3 the optics of a 3-tier network
- 3/5 the number of switches

The resilience math is equally compelling. In the 800 Gb/s single-plane design, each T0 has 32 uplinks, so losing one T0-to-T1 link costs about 3% of that switch’s uplink capacity. In the 100 Gb/s multi-plane design, each T0 has 256 uplinks, so the same failure costs about 0.4%. More importantly, with eight independent planes, a NIC that loses one of its own links can continue operating on the remaining seven while the failed link is repaired. The training job does not have to stop.
This tradeoff is not free. Eight separate planes mean eight times as many links to monitor, eight times as many potential failure points in aggregate, and a transport protocol that must load-balance intelligently across all of them. That is where MRC itself comes in.
Packet spraying with entropy values
Conventional RDMA transports (RoCEv2, InfiniBand RC) pin each connection to a single network path. The path is selected by hashing the flow’s five-tuple (source/destination IP, source/destination port, protocol) at each switch. Once pinned, every packet in that connection follows the same path until the connection is torn down.
This works at moderate scale. It fails at 100,000+ GPUs because of flow collisions. When two connections hash to the same path through the same bottleneck link, both suffer. The probability of collision increases with scale, and the tail latency impact is disproportionate.
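The collision risk follows the familiar birthday-problem pattern. The sketch below assumes each flow is hashed uniformly and independently onto one of a fixed number of equal-cost paths, which idealizes real ECMP hashing. The function name and the example sizes are illustrative.

```python
def collision_probability(num_flows: int, num_paths: int) -> float:
    """Chance that at least two flows land on the same path under uniform hashing."""
    if num_flows > num_paths:
        return 1.0  # pigeonhole: more flows than paths forces a collision
    p_no_collision = 1.0
    for i in range(num_flows):
        # The (i+1)-th flow must avoid the i paths already taken.
        p_no_collision *= (num_paths - i) / num_paths
    return 1.0 - p_no_collision


if __name__ == "__main__":
    for flows in (2, 4, 8, 16):
        print(f"{flows:>2} flows over 32 paths: "
              f"P(collision) = {collision_probability(flows, 32):.3f}")
```

Even a handful of long-lived flows sharing a few dozen uplinks is more likely than not to produce at least one collision, and a pinned flow has no way to move off the shared link.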
MRC eliminates flow pinning entirely. Instead, it assigns each Queue Pair (QP) a set of 128 to 256 entropy values (EVs) at connection setup. Each EV encodes a specific path through a specific network plane. The sender rotates through its EV set packet by packet, spraying consecutive packets across hundreds of different paths across all eight planes. No two consecutive packets from the same transfer take the same route.
The EV is a 32-bit value split across the UDP source port and the IPv6 flow label in each MRC packet. Switches hash on these fields, so changing the EV changes the path. The sender does not need to know the topology. It only needs to know that different EVs produce different paths.
Per-QP state:
  EV set: 128-256 entropy values (32-bit each)
  Per-EV health: {active, congested, suspected_failed, confirmed_failed}
Packet sending:
  for each packet in transfer:
    ev = next_active_ev(qp.ev_set)
    packet.udp_src_port = ev[0:15]
    packet.ipv6_flow_label = ev[16:31]
    send(packet)
Each EV carries a few bits of health state. When the receiver detects congestion on a path (via ECN marking from switches), it echoes this back to the sender, which temporarily avoids that EV. When a packet is actually lost (not trimmed), MRC assumes the path has failed and immediately stops using that EV. Background probes periodically test retired EVs to determine whether the failure was transient, resurrecting them if probes succeed.
The load-balancing quality of this scheme is high. Because different senders independently generate random EV sets, the aggregate traffic distribution across paths is near-uniform. Small imbalances are smoothed by the ECN feedback loop: if one path accumulates slightly more traffic, ECN marks increase on that path, and senders redistribute to less-loaded alternatives.


Static source routing with SRv6
This is the most counterintuitive decision in the paper. Virtually every production datacenter network runs dynamic routing protocols (BGP, OSPF, IS-IS) that compute forwarding tables, react to topology changes, and converge after failures. MRC disables all of them.
Instead, MRC uses IPv6 Segment Routing (SRv6) to encode the full path each packet should take. The sender embeds the sequence of switch identifiers directly into the packet’s destination address. Each switch along the path checks if its identifier is present, removes it by shifting the address, and forwards to the next hop. No routing table lookup. No forwarding information base. No control plane convergence.
The paper explains the logic: “We took the unusual position of disabling dynamic routing in the switches because we didn’t want two adaptive routing mechanisms interacting with each other and dynamic routing wasn’t adding anything.”
MRC’s transport-layer adaptation (EV management, ECN feedback, path probing) already handles failures at microsecond timescales. Dynamic routing protocols converge in seconds to minutes. Running both creates a risk of conflicting decisions: MRC avoids a failed path at the transport layer while the routing protocol is still converging to a new forwarding state, potentially creating routing loops or oscillations.
By removing dynamic routing entirely, MRC gets three operational benefits:
First, deterministic forwarding. Every packet follows a known, pre-computed path. If something goes wrong, you can trace exactly which switches the packet traversed. The paper notes that this “gives us excellent observability” because the path is encoded in the packet itself.
Second, eliminated convergence failures. Dynamic routing protocols can misconfigure, loop, or partition the network during convergence. With static SRv6 routes, these failure modes do not exist. The switches are stateless packet forwarders.
Third, simplified operations. The paper emphasizes that “very small teams of people need to be able to manage the networks of multiple supercomputers.” Removing routing protocols removes an entire class of operational complexity, configuration drift, and debugging surface area.
The tradeoff is that path computation moves to the NIC. The MRC NIC must know enough about the topology to generate valid SRv6 paths for its EV set. In OpenAI’s deployment, this is handled at QP setup time using a simple topology database. The paths are static and pre-computed. Runtime adaptation happens at the EV selection level, not at the routing level.

Running lossy: why MRC disables PFC
This is the decision that will surprise most networking practitioners. RDMA networks have traditionally relied on Priority Flow Control (PFC) to create lossless Ethernet fabrics. When a switch buffer fills, PFC sends a pause frame upstream, stopping the sender from transmitting until the buffer drains. InfiniBand has a similar credit-based flow control mechanism. The entire “lossless fabric” paradigm exists to support RDMA’s assumption that packets do not get dropped.
MRC explicitly disables PFC and runs on standard best-effort (lossy) Ethernet.
The reason is head-of-line blocking. When a PFC pause frame fires on one port, it can block traffic destined for other ports that share the same ingress buffer. In a large training cluster running multiple collectives simultaneously, a PFC pause triggered by one collective’s incast can delay transfers from a completely unrelated collective. This cross-collective interference creates exactly the tail latency outliers that MRC is designed to eliminate.
The paper’s solution is a combination of three mechanisms:
First, selective retransmission. MRC tracks which packets have been received using Selective ACKs (SACKs). When loss is detected, only the missing packets are retransmitted, not the entire window. This is faster than the go-back-N retransmission used in some RoCE implementations.
Second, packet trimming. When a switch would drop a packet due to buffer overflow, it instead trims the payload and forwards just the header as a priority packet. The receiver gets the trimmed header, recognizes the gap, and sends a NACK to trigger immediate retransmission. This eliminates the timeout delay between loss detection and retransmission. It also lets MRC distinguish between congestion drops (trimmed packets) and link failures (no packet at all), enabling different recovery strategies for each.
Third, out-of-order memory placement. Every MRC data packet carries the RDMA virtual address and remote key. The receiving NIC can write each packet directly to its final memory location regardless of arrival order. This is critical because packet spraying across hundreds of paths guarantees that packets will arrive out of order. Without direct placement, the receiver would need reorder buffers, adding latency and memory overhead.
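The three mechanisms fit together on the receive side roughly as sketched below. This is a toy model with hypothetical names. A byte offset into a `bytearray` stands in for the RDMA virtual address and remote key, and the `nacks` list is a simple log of NACKs the receiver would send, not a wire format.

```python
from dataclasses import dataclass, field


@dataclass
class Packet:
    psn: int               # packet sequence number
    offset: int            # destination byte offset (stand-in for the RDMA address)
    payload: bytes = b""   # empty when a switch trimmed the packet
    trimmed: bool = False


@dataclass
class Receiver:
    buffer: bytearray
    received: set[int] = field(default_factory=set)
    nacks: list[int] = field(default_factory=list)

    def on_packet(self, pkt: Packet) -> None:
        if pkt.trimmed:
            # Header-only arrival: the payload was cut under congestion,
            # so request that one packet again right away.
            self.nacks.append(pkt.psn)
            return
        # Direct placement: write at the carried offset, whatever the arrival order.
        self.buffer[pkt.offset:pkt.offset + len(pkt.payload)] = pkt.payload
        self.received.add(pkt.psn)

    def missing(self, total_packets: int) -> list[int]:
        """Sequence numbers a selective ACK would report as still outstanding."""
        return [psn for psn in range(total_packets) if psn not in self.received]


if __name__ == "__main__":
    rx = Receiver(buffer=bytearray(8))
    rx.on_packet(Packet(psn=2, offset=4, payload=b"EF"))   # arrives first
    rx.on_packet(Packet(psn=0, offset=0, payload=b"AB"))
    rx.on_packet(Packet(psn=1, offset=2, trimmed=True))    # trimmed in the fabric
    rx.on_packet(Packet(psn=3, offset=6, payload=b"GH"))
    print("NACKed:", rx.nacks, "missing:", rx.missing(4))
    rx.on_packet(Packet(psn=1, offset=2, payload=b"CD"))   # selective retransmit
    print(bytes(rx.buffer), rx.missing(4))
```

Only the trimmed packet is resent, and no reorder buffer is needed because each payload already knows where it belongs.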

ECN repurposed: load balancing, not congestion control
In conventional networks, Explicit Congestion Notification (ECN) signals congestion to the sender, which responds by reducing its transmission rate (similar to TCP congestion control). MRC repurposes ECN entirely.
In MRC’s multi-plane topology with full bisection bandwidth, aggregate congestion should not exist under normal operation. The total available bandwidth exceeds the total demand. What does exist is local path imbalance: some paths may be slightly more loaded than others due to the random EV selection across different senders.
MRC uses ECN as a per-path load signal. Switches mark packets with ECN in the standard randomized manner, but MRC disables ECN marking on the last hop to the receiver (to avoid confusing last-hop incast with fabric congestion). The receiver echoes ECN marks back to the sender, tagged with the specific EV that was marked. The sender then temporarily avoids that EV, shifting traffic to less-loaded paths.
This transforms ECN from a rate-control mechanism into a routing-level load-balancing signal. The sender does not slow down. It redirects. The distinction matters because reducing rate wastes GPU time (the transfer takes longer), while redirecting maintains throughput while smoothing out imbalances.
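The redirect-instead-of-slow-down behavior can be sketched as a sender-side selector. The cooldown length, the class name, and the fallback rule when every EV is marked are all assumptions made for illustration. The article says only that a marked EV is avoided temporarily.

```python
class EcnLoadBalancer:
    """Round-robin over EVs, skipping any EV with a recent ECN echo."""

    def __init__(self, evs: list[int], cooldown_us: float = 50.0) -> None:
        self.evs = list(evs)
        self.cooldown_us = cooldown_us             # assumed avoidance window
        self.avoid_until: dict[int, float] = {}    # EV -> time it becomes usable
        self._cursor = 0

    def on_ecn_echo(self, ev: int, now_us: float) -> None:
        """Receiver reported an ECN mark on this EV: steer away from it for a while."""
        self.avoid_until[ev] = now_us + self.cooldown_us

    def pick(self, now_us: float) -> int:
        """Choose the EV for the next packet; the send rate itself never changes."""
        count = len(self.evs)
        for _ in range(count):
            ev = self.evs[self._cursor]
            self._cursor = (self._cursor + 1) % count
            if self.avoid_until.get(ev, 0.0) <= now_us:
                return ev
        # Every EV is cooling down: keep sending on the one that recovers soonest.
        return min(self.evs, key=lambda ev: self.avoid_until.get(ev, 0.0))


if __name__ == "__main__":
    lb = EcnLoadBalancer([10, 20, 30])
    lb.on_ecn_echo(20, now_us=0.0)
    print([lb.pick(now_us=1.0) for _ in range(3)])   # EV 20 is skipped
    print(lb.pick(now_us=60.0))                      # EV 20 is usable again
```

Nothing in `pick()` delays a packet. The ECN echo only changes which path the next packet takes.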

What the production evidence shows
The paper reports results from two contexts: production frontier model training and controlled testbed experiments.
In production, MRC allowed training jobs to ride out network failures that previously would have crashed the job. The paper describes the optical transceiver glitch mentioned earlier: four links flapped in rapid succession on three active training nodes. MRC detected the path failures, stopped using the affected EVs, and redistributed traffic across remaining paths. The training job continued without interruption. In a conventional RoCE deployment, this event would have triggered PFC storms, NCCL timeouts, and a job restart costing hours of GPU time.
The testbed experiments quantify MRC’s performance characteristics:
Point-to-point bandwidth: MRC achieves near-line-rate throughput on 800 Gb/s links with packet spraying. The paper reports a comparison with standard RoCE showing MRC’s advantage under multi-path conditions.
Link failure recovery: when a link goes down, MRC detects it and redistributes traffic in tens of microseconds. No sender-side timeouts. No routing protocol convergence. The EV that mapped to the failed path is retired immediately, and the remaining EVs absorb the traffic.
Load balancing across EVs: the paper measures traffic distribution across planes and paths, showing near-uniform utilization under production workloads.
NCCL collective performance at scale: the paper evaluates MRC’s performance on all-reduce operations, which are the dominant communication pattern in data-parallel training. MRC’s packet spraying eliminates the flow-collision problem that degrades all-reduce performance at scale with conventional ECMP hashing.
The operational evidence supports the static routing decision. The paper reports that T1 core switches were rebooted during active training runs without disrupting the job. In a conventional network with dynamic routing, rebooting a core switch triggers reconvergence across the fabric. With static SRv6, the switch simply reloads its static forwarding state and resumes. MRC’s transport layer handled the temporary loss of paths through that switch by redistributing traffic to other planes.

Where these design choices are strongest
MRC was designed for a specific workload profile: synchronous pretraining with all-reduce dominated communication, running on a single-tenant fabric with full bisection bandwidth. Within those constraints, the three design choices are well-matched to the problem:
Static routing works because the topology is fixed and known at deployment time. Training clusters do not add or remove switches during a training run. The failure modes are link-level (handled by MRC’s EV management), not topology-level (which would require routing protocol reconvergence).
Lossy Ethernet works because the selective retransmission and packet trimming mechanisms recover faster than PFC pause frames propagate. The cross-collective head-of-line blocking that PFC creates is more damaging to tail latency than the occasional retransmission overhead.
ECN-as-load-balancing works because the multi-plane topology provides full bisection bandwidth, ensuring that aggregate congestion does not occur. Local imbalances are the only congestion source, and ECN-guided EV avoidance is a precise, low-overhead mechanism for smoothing them.

The boundary conditions: where MRC works and where it doesn’t
MRC is a production-proven protocol for its target workload. The natural questions for the broader AI infrastructure community concern the boundary conditions.
First, multi-tenancy. OpenAI’s training clusters run a single training job at a time across the full fabric. Most cloud providers and enterprise deployments share GPU clusters across multiple workloads. MRC’s static routing assumes a stable topology database at the NIC level. In a multi-tenant environment where workloads are dynamically placed, the topology visible to each NIC changes frequently. Whether MRC’s path-generation logic adapts to this or requires modifications is an open engineering question.
Second, inference workloads. MRC was designed for synchronous training’s all-reduce communication pattern: large bulk transfers between known sets of GPUs. Inference workloads, particularly disaggregated inference with KV cache transfers between prefill and decode pools, have a different communication profile: smaller transfers, point-to-point rather than collective, and latency-sensitive at the individual request level rather than the aggregate step level. Packet spraying across hundreds of paths adds jitter to individual transfer latency, which may or may not matter depending on the SLO requirements.
Third, oversubscribed networks. MRC’s ECN-as-load-balancing mechanism relies on full bisection bandwidth. In oversubscribed networks (common in cloud environments where cost optimization drives topology choices), aggregate congestion is real, not just local imbalance. ECN would need to function as a true congestion signal in this case, which changes MRC’s flow control dynamics.
Fourth, interoperability. MRC is currently implemented in specific NIC silicon (NVIDIA ConnectX-8, AMD Pollara/Vulcano, Broadcom Thor Ultra) and specific switch platforms (NVIDIA Spectrum-4/5, Arista EOS on Broadcom Tomahawk 5). The OCP release of the specification enables broader implementation, but silicon-level protocol support takes 12-18 months to develop and validate. Near-term adoption will be limited to organizations using these specific hardware platforms.
These are not criticisms of MRC. They are the engineering questions that arise naturally when a protocol designed for a specific, well-defined environment meets the diversity of the broader infrastructure market. The fact that MRC solved the tail latency problem at 131,000-GPU scale is a genuine achievement. The question for the rest of the community is which of its design choices generalize and which are specific to the constraints of single-tenant, full-bisection-bandwidth training fabrics.
What MRC signals about the future of AI networking
MRC represents a broader shift in how AI infrastructure thinks about networking. The conventional approach treats the network as a transparent pipe: packets go in one end and come out the other, and the transport protocol’s job is to fill the pipe as efficiently as possible. MRC treats the network as a managed resource with observable, per-path health signals that the transport protocol actively exploits.
This is not a new idea in networking research. Multipath TCP, Valiant load balancing, and ECMP have explored variations of it for years. What is new is the scale at which MRC operates, the aggressiveness of its design choices (no PFC, no dynamic routing, full packet spraying), and the production evidence that it works on the largest AI training clusters in the world.
For networking practitioners, MRC validates a thesis that has been debated for a decade: at sufficient scale, endpoint intelligence beats network intelligence. Making the NIC smarter and the switch simpler produces a more resilient system than making the switch smarter and the NIC simpler. Whether you agree with every design decision or not, the production evidence from OpenAI and Microsoft makes this argument harder to dismiss than it was a week ago.
The MRC specification is available through OCP under an open license. The research paper provides detailed experimental results. For anyone building GPU clusters at scale, both are worth reading carefully. The three rules MRC breaks may be the same three rules holding your network back.

