Why GPU clusters need NVLink and InfiniBand
Training a frontier model means thousands of GPUs taking the same step at the same time. Ethernet wasn't built for that, and PCIe gave up a long time ago.
Why it exists
Imagine 10,000 students all editing the same Google Doc, and every 30 seconds the system forces them to pause, share their changes with each other, and only resume once everyone has the same copy. With normal home Wi-Fi this would be hopeless — by the time the slowest student finishes syncing, the rest are sitting idle. Training a frontier AI model is exactly this problem at thousands-of-GPUs scale. Every few steps, every GPU has to compare notes with every other GPU, and the “comparison” is hundreds of gigabytes of numbers. Ordinary network cables choke on this. NVLink and InfiniBand are special, much fatter cables built so the comparison step doesn’t dominate the whole training run.
Open the spec sheet of a top-end AI training server — a DGX or HGX H100 SXM box, or one of the OEM clones — and you will find two networks you would not expect to find on a server.
Inside the box, eight GPUs are wired to each other through something called NVLink, not the PCIe slots you would normally expect to carry a GPU’s traffic. Between boxes, the cluster runs over InfiniBand or a carefully tuned variant of Ethernet — never the same boring 10/25/100G fabric the rest of the data center uses.
The obvious question is: why? PCIe is the universal interconnect. Ethernet runs the entire internet. They are both fast, both standardized, both cheap. Why does training one neural network require building two extra networks on top?
The honest answer is that training a frontier model is one of the strangest networking workloads ever invented. Every few hundred milliseconds, every GPU in the cluster has to stop, sum its results with every other GPU, and only then can it take the next step. There is no “fast path” and “slow path” — the slowest link in the cluster sets the training speed for everybody. PCIe and ordinary Ethernet were designed for traffic that averages out. AI training is traffic that has to synchronize. Those are different problems, and they need different hardware.
Why it matters now
If you write software, you probably never touch NVLink directly. But almost everything you’d care about downstream is shaped by it.
- Cluster cost. A surprisingly large slice of a frontier-training bill is interconnect, not silicon. Networking gear, optics, switches, and cables make up a non-trivial fraction of an AI data center’s capex — though I don’t have a single citable number that holds across Meta, Microsoft, xAI, and the rest, since the breakdowns aren’t public.
- Why GPUs ship in groups of 8. A “node” — DGX, HGX, and the OEM clones — is built around the unit that fits inside one non-blocking NVLink fabric. Eight isn’t a marketing number; it’s the size of the domain where every GPU can talk to every other GPU at full bandwidth.
- Why training scales the way it does. “Scaling laws” assume you can keep all the GPUs in lockstep. The reason 100k-GPU runs are even possible is that the network was upgraded, not just the chips.
The short answer
GPU cluster network = PCIe replacement (NVLink) + Ethernet replacement (InfiniBand or RoCE) + collectives that match how training actually communicates
Inside a server, NVLink replaces PCIe as the GPU-to-GPU link because PCIe is roughly an order of magnitude too slow and not built for direct GPU-to-GPU traffic. Between servers, InfiniBand (or carefully tuned Ethernet) replaces ordinary networking because training is bottlenecked on the tail of latency, not the average — and on collective operations like all-reduce, which have to finish before the next step can begin.
How it works
The thing to hold in your head is what training actually does on the network.
In standard data-parallel training, every GPU has a copy of the model. Each step, every GPU does a forward and backward pass on a different slice of the batch and produces its own gradients. Before the optimizer can take a step, those gradients have to be averaged across every GPU in the world that holds a copy of that parameter. The collective operation that does this is called all-reduce.
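The shape of that sequence — local gradients first, then a mandatory averaging step before anyone may move — is easy to see in a toy. Here is a minimal pure-Python sketch with four simulated "GPUs" and a one-parameter linear model; no real framework API is used, and the all-reduce is just an average over a list:

```python
# Toy data-parallel step: 4 "GPUs", each holding the same model replica w,
# each seeing a different slice of the batch. Loss per example: (w*x - y)^2.
w = 0.0                                                    # identical replica everywhere
lr = 0.1
batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]   # (x, y) pairs; one per "GPU"

# Each GPU computes d/dw of (w*x - y)^2 on its own slice, independently.
local_grads = [2 * (w * x - y) * x for (x, y) in batch]

# The all-reduce: every GPU must end up with the same averaged gradient
# before any of them is allowed to apply the optimizer step.
g = sum(local_grads) / len(local_grads)

w -= lr * g   # every replica applies the identical update, staying in sync
```

The point of the sketch is the barrier in the middle: nothing about the optimizer step can start until the slowest participant has contributed to `g`.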
All-reduce has two properties that wreck normal networking:
- Everybody waits for the slowest GPU. A 1% tail-latency spike on one link stalls the entire cluster for that step. There is no “retry later” — the next step literally cannot start.
- The volume is roughly the size of the model, every step. For a 100B-parameter model in bfloat16, that’s ~200 GB of gradients to reconcile per optimizer step. If your compute step is fast, the network has milliseconds to move that data, or it becomes the bottleneck.
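The arithmetic behind that 200 GB figure is worth doing once. The parameter count and bfloat16 width are from the text above; the time budget at the end uses a hypothetical effective bandwidth, purely for illustration:

```python
# Gradient volume per optimizer step for a 100B-parameter model with
# bfloat16 gradients (2 bytes each).
params = 100_000_000_000
gradient_bytes = params * 2                 # bfloat16 -> 2 bytes per gradient
print(gradient_bytes / 1e9)                 # 200.0 (GB to reconcile, every step)

# Illustrative only: at an assumed 400 GB/s of effective all-reduce
# bandwidth per GPU, moving that volume costs about half a second per
# step unless it overlaps with compute.
hypothetical_bw = 400e9                     # bytes/s, assumed for this sketch
print(gradient_bytes / hypothetical_bw)     # 0.5 (seconds)
```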
Now compare what’s available.
Inside the box: NVLink vs. PCIe. PCIe Gen5 ×16 gives you 128 GB/s total bidirectional (64 GB/s each way). NVLink 4 on an H100 SXM gives each GPU 900 GB/s of bidirectional bandwidth — NVIDIA’s number, and the one most third-party write-ups repeat. (The PCIe-form-factor H100 NVL is lower; this post is about the SXM parts that go into HGX/DGX nodes.) That’s roughly 7× per GPU. NVLink is also topologically better: GPU-to-GPU traffic over PCIe has to traverse switches and sometimes a shared root complex, and the bandwidth is shared with everything else on that PCIe tree — NICs, NVMe, the other GPU you’re trying to talk to. NVLink is a dedicated mesh of GPU-to-GPU links with its own switch fabric.
In an HGX H100 8-GPU board, every GPU is wired to four NVSwitch chips, and the NVSwitches are wired to each other in a way that gives you a non-blocking all-to-all fabric: every GPU can simultaneously talk to every other GPU at full NVLink speed. NVIDIA quotes 3.6 TB/s of bisection bandwidth for that 8-GPU domain. The point isn’t the headline number; it’s that there are no contention surprises within a node.
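The two headline numbers hang together arithmetically, which makes for a decent sanity check. One caveat: the bisection accounting here — counting each GPU's full bidirectional bandwidth across the cut — is my reading of how NVIDIA's figure is derived, not something their docs spell out:

```python
# Per-GPU: NVLink 4 vs PCIe Gen5 x16, both as bidirectional totals.
pcie_gen5_x16_gbps = 128                      # GB/s (64 each way)
nvlink4_gbps = 900                            # GB/s per H100 SXM GPU
print(nvlink4_gbps / pcie_gen5_x16_gbps)      # ~7.03 — the "roughly 7x"

# Bisection: split the 8-GPU domain into two halves of 4. "Non-blocking"
# means each GPU can drive its full NVLink bandwidth across the cut.
gpus_per_half = 4
bisection_gbps = gpus_per_half * nvlink4_gbps
print(bisection_gbps / 1000)                  # 3.6 (TB/s), matching NVIDIA's number
```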
Between boxes: InfiniBand vs. Ethernet. Ordinary Ethernet has two problems for this workload. The first is latency: training cares about microsecond tail latencies, and a switch chain that occasionally drops packets and recovers is fine for HTTP and disastrous for all-reduce. The second is that ordinary Ethernet is lossy — it expects upper layers to retransmit. InfiniBand was designed lossless from day one, with credit-based flow control: you never send a packet the receiver isn’t ready for.
In practice, InfiniBand is usually a little lower-latency and more predictable out of the box, while RoCE (RDMA over Converged Ethernet) can match it on throughput with careful fabric tuning — flow control, ECN, congestion control, switch buffer sizing. Specific microsecond numbers floating around (e.g. ~1 µs vs 1.5–2.5 µs) are vendor- and tuning-dependent and I’d be cautious about treating any single pair of numbers as canonical.
InfiniBand also ships with a feature called SHARP that does part of the all-reduce sum inside the network switches, so the GPUs only see the final result. That’s the kind of thing you can’t bolt onto general-purpose Ethernet without reinventing it — and that reinvention, under the name “Ultra Ethernet” plus AI-specific silicon like NVIDIA’s Spectrum-X and Broadcom’s Tomahawk 6, is exactly what’s been happening for the last couple of years. In mid-2024 Meta publicly described training Llama 3 across two 24K-GPU clusters — one on InfiniBand, one on RoCE — tuned to equivalent performance, with the RoCE cluster used for the largest model. So the “InfiniBand mandatory” answer is becoming “either, with care.”
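To see why in-network reduction is attractive, compare traffic shapes: with a summing switch in the path, each GPU sends its buffer up once and receives one finished result back, independent of how many peers there are. The following is a conceptual sketch of that idea only — not the SHARP protocol, and `switch_allreduce` is a made-up name:

```python
def switch_allreduce(contributions):
    """Toy 'switch': sum the buffers elementwise, return one finished
    copy of the result to every sender. Per-GPU traffic is one buffer
    up and one buffer down, regardless of the number of GPUs."""
    k = len(contributions[0])
    total = [sum(c[i] for c in contributions) for i in range(k)]
    return [list(total) for _ in contributions]

grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # 3 "GPUs", buffers of 2
results = switch_allreduce(grads)
assert all(r == [9.0, 12.0] for r in results)   # everyone gets the same sum
```

Contrast this with the ring decomposition below, where the GPUs themselves shuttle partial sums around and per-GPU traffic approaches twice the buffer size.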
The seam most posts skip. All of this only matters because of how all-reduce decomposes. The classical implementation, ring all-reduce, splits the gradient buffer into chunks and pipelines them around a ring of GPUs in two passes (reduce-scatter, then all-gather). The clever part: per-GPU bytes-on-the-wire is 2(N-1)/N × K for a buffer of size K and N GPUs — which approaches 2K from below as N grows, and is essentially independent of cluster size. So the cost of doubling the cluster is dominated by latency (more hops around the ring), not bandwidth — which is why microsecond tail behavior matters so much, and why you build the cheap fast network inside the box (NVLink, where the ring is short) and the expensive predictable network between boxes (InfiniBand or tuned RoCE, where the ring is long). Ring isn’t the only collective shape in use anymore — at very large scale, recursive halving/doubling and hierarchical schemes are common — but the ring analysis is the cleanest way to see why the bandwidth/latency split shows up in the hardware.
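That decomposition is easy to verify in a toy simulation — four simulated GPUs, buffers of eight elements, and a per-GPU counter of elements put on the wire, to check the 2(N-1)/N × K claim. Pure Python, no real collective library:

```python
# Toy ring all-reduce: N "GPUs", each with a buffer of K elements, in two
# phases. Reduce-scatter: after N-1 steps each GPU owns the full sum of one
# chunk. All-gather: the finished chunks circulate until everyone has all K.
N, K = 4, 8
chunk = K // N                                               # K divisible by N
bufs = [[float(g * K + i) for i in range(K)] for g in range(N)]
expected = [sum(b[i] for b in bufs) for i in range(K)]       # the true all-reduce result
sent = [0] * N                                               # elements each GPU transmits

# Phase 1: reduce-scatter. In step s, GPU g sends chunk (g - s) mod N to
# GPU g+1, which adds it into its own copy of that chunk.
for s in range(N - 1):
    outgoing = [((g - s) % N, bufs[g][((g - s) % N) * chunk:((g - s) % N + 1) * chunk])
                for g in range(N)]
    for g in range(N):
        sent[g] += chunk
        c, data = outgoing[(g - 1) % N]                      # receive from left neighbor
        for i, v in enumerate(data):
            bufs[g][c * chunk + i] += v

# Phase 2: all-gather. GPU g now owns finished chunk (g + 1) mod N; in step s
# it forwards chunk (g + 1 - s) mod N, and the receiver overwrites instead of adds.
for s in range(N - 1):
    outgoing = [((g + 1 - s) % N, bufs[g][((g + 1 - s) % N) * chunk:((g + 1 - s) % N + 1) * chunk])
                for g in range(N)]
    for g in range(N):
        sent[g] += chunk
        c, data = outgoing[(g - 1) % N]
        bufs[g][c * chunk:(c + 1) * chunk] = data

assert all(b == expected for b in bufs)                      # everyone has the full sum
assert sent[0] == 2 * (N - 1) * K // N                       # = 2(N-1)/N * K elements per GPU
```

With N = 4 and K = 8, each GPU sends 12 elements — 2 × (3/4) × 8 — and the per-GPU figure stays below 2K no matter how large N gets, while the number of ring steps (and thus the latency exposure) grows with N.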
Famous related terms
- NVLink = direct GPU-to-GPU link + non-blocking switch fabric (NVSwitch) — replaces PCIe for in-box GPU traffic; ~900 GB/s bidirectional per H100.
- NVSwitch ≈ Ethernet switch, but for NVLink — what makes the 8-GPU all-to-all fabric inside a DGX/HGX node non-blocking.
- InfiniBand = lossless fabric + RDMA + microsecond latency — the historical default between AI servers; designed for HPC long before “AI cluster” was a phrase.
- RoCE = RDMA semantics + lossless-tuned Ethernet — Ethernet’s answer to InfiniBand for AI fabrics.
- All-reduce = sum-across-everyone + result-to-everyone — the collective op that gates every training step.
- Ring all-reduce ≈ reduce-scatter + all-gather around a ring — the pattern that makes per-GPU traffic approach 2× the buffer regardless of cluster size, at the price of being latency-sensitive (more hops as N grows).
Going deeper
- NVIDIA, Introducing NVIDIA HGX H100 (developer blog) — the canonical, if self-interested, source for the 8-GPU topology, NVSwitch count, and 3.6 TB/s bisection number.
- Meta Engineering, RoCE networks for distributed AI training at scale (Aug 2024) — the writeup of training Llama 3 over Ethernet rather than InfiniBand; useful for the “InfiniBand isn’t strictly required” half of the story.
- Patarasuk & Yuan, Bandwidth Optimal All-reduce Algorithms for Clusters of Workstations (J. Parallel Distrib. Comput., 2009) — the paper behind ring all-reduce. Predates the AI boom and is more readable for it.
- Andrew Gibiansky’s Bringing HPC techniques to deep learning (originally Baidu Research, 2017) — the post that introduced ring all-reduce to a lot of ML practitioners; still one of the clearest explanations I know.