
Driving Down The AI System Roadmap With Nvidia

By primereports · March 21, 2026 · 14 Mins Read
In the early years of the GPU acceleration of application performance – really from the "Kepler" datacenter GPUs in May 2012 to "Volta" in May 2017 – Nvidia, the world's most important technology company and still the overwhelmingly dominant supplier of hardware and systems software for the GenAI revolution, was very good about putting out roadmaps.

For a few years until 2021, however, the company kept its roadmaps folded up in the front left inside pocket of co-founder and chief executive officer Jensen Huang's leather jacket. But as the GenAI boom went from chemical to nuclear, the company correctly surmised that with everybody trying to synchronize money, land, power, cooling, and systems to all come together in the largest infrastructure buildout the IT market has ever seen, everybody needed a real roadmap, extending out a few years, so they could plan.

The first such new era roadmap came out at the end of 2023 in a financial presentation, not in a GTC conference slide from Huang, and we edited the heck out of it to add missing components like some of the GPUs and DPUs and to put the correct calendar years on the columns, all along being grateful that Nvidia was outlining where it was at and where it was going. We gathered up all of the roadmaps we could find between 2021 and 2023 and put them in this story so you would have them for reference.

That October 2023 roadmap reveal was also when we got the first wind of the annual cadence of updates that Nvidia was scheduling for its AI system components. On this late 2023 roadmap, the 2025 products were called the GX200, the GX200NVL, the X100, and the X40, which had us thinking they were going to recycle the "Xavier" codename from the gaming side of the house, but we conceded that X could also be a variable. The 2025 products turned out to be the "Blackwell" GPUs that were detailed by Huang at the Computex conference in June 2024 in the new-style roadmap we have by now seen a bunch of times with additions. (The font is pretty small for those of us of a certain age, so you might have to squint a little.)

Nvidia unfolded its datacenter roadmap out to 2027 in June 2024, when we learned about the "Vera" CV100 Arm server CPUs and the "Rubin" R200 GPU accelerators for the first time. And then Huang folded out another year and showed us the datacenter roadmap out to 2028 at the GTC conference last year.

At the GTC 2026 conference, Huang added some more details on the machinery coming between 2026 and 2028, but he did not talk about a future and likely "Feynman Ultra" GPU, which is expected along with updated ConnectX-10 SmartNICs and maybe even an updated Groq LPU that could come that year, too.

Nvidia Mostly Owns Training, And Can Compete On Inference

These roadmaps are important to the OEMs and ODMs that convert Nvidia's technology into the systems that run AI training and inference for the vast majority of the world. They matter to customers, too, who as we all know invest in roadmaps and do not simply acquire point products. Despite all of the glorious competition from the Cambrian explosion in AI compute engines and networking, Nvidia still has by far the dominant market share and will for many years to come. How much remains to be seen.

If you do some rough math, which as you know I love to do, the total server market in 2025 generated somewhere between $420 billion and $450 billion based on the limited data we see out of IDC and Gartner, and about $190 billion of the bill of materials for those systems passed through to Nvidia as revenue. Moreover, the machines sold by the OEMs and ODMs that had at least Nvidia GPUs (and very likely more components) installed in them probably represented somewhere between $275 billion and $325 billion in revenues in 2025. That gives machines based on Nvidia technologies somewhere around a low of 61 percent share to a high of 77 percent share of the overall systems market. We have to use a quantum probabilistic distribution to get any more accurate than that (you were supposed to laugh there), or see all of the financials of all the public and private server makers and add everything up.
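That back-of-the-envelope share math can be reproduced in a few lines; the dollar figures are the estimates quoted above, not audited numbers:

```python
# Rough bounds on Nvidia's share of the 2025 server market, using the
# estimated revenue ranges cited in the text (not audited figures).
server_market_low, server_market_high = 420e9, 450e9     # total server revenue
nvidia_systems_low, nvidia_systems_high = 275e9, 325e9   # revenue of machines with Nvidia GPUs

# Worst case: smallest Nvidia-based figure over the largest market figure.
share_low = nvidia_systems_low / server_market_high
# Best case: largest Nvidia-based figure over the smallest market figure.
share_high = nvidia_systems_high / server_market_low

print(f"share range: {share_low:.0%} to {share_high:.0%}")  # prints "share range: 61% to 77%"
```

Taking the widest spread of the two ranges is what gives the 61 percent to 77 percent bracket in the text.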

I guess the point is that damned near all of the profits for AI systems are going to Nvidia, as its gross, operating, and net income figures clearly show.

Truly amazing.

Which brings us to the 2026 roadmap as presented at Huang's GTC keynote address:

[Roadmap slide from Huang's GTC 2026 keynote]

This time around, the evolution of the "Oberon" and "Kyber" racks is being explicitly called out alongside the evolution of the compute and networking engines.

You will also note that Quantum InfiniBand is not mentioned, and that is not because Nvidia is stopping development on InfiniBand, but because Nvidia does not expect AI factories to deploy InfiniBand, even if there are cases where HPC centers running smaller clusters or even some AI centers might opt for it.

Furthermore, as we pointed out in our prior coverage of Huang's keynote, the "Rubin" CPX long context and attention processing engine, which was unveiled last September, is not on the roadmap. The Rubin CPX was expected to be delivered at the end of this year for AI context windows of 1 million tokens or more, as well as for helping with video generation for models that do pictures instead of words. It may be premature to count CPX out of the picture for such workloads. In fact, you might see a combination of Nvidia CPX and Groq LPU compute engines handling both kinds of inference – with Vera-Rubin compute complexes not involved. (Nvidia did not say this, but I am.)

The Vera-Rubin systems are locked and loaded for volume shipments in the second half of 2026, as planned. The Vera Arm server CPU has 88 custom Nvidia "Olympus" cores with two threads per core and a 1.8 TB/sec NVLink chip-to-chip interconnect that can be used as a high speed connection to one or more "Rubin" R200 GPU accelerators. Rubin is, as we know from this time last year, a pair of reticle-sized GPU chips linked by NVLink C2C ports inside a single socket that has 288 GB of HBM4 memory and delivers 50 petaflops of FP4 performance on its tensor cores, compared to 10 petaflops for the "Blackwell" B200 and 15 petaflops for the B300. These B200 and B300 GPUs have 288 GB of HBM3E stacked memory. Rubin is expected to be etched using the 3 nanometer N3E or N3P process from Taiwan Semiconductor Manufacturing Co. As far as we know, the Oberon racks will have the same 72 GPU sockets and the same 36 CPU sockets in NVL72 rackscale systems as were used in the Blackwell generation with the B200 and B300. (For a while, Nvidia was calling these NVL144 by counting GPU chips, not sockets, confusing itself and more than a few customers.)
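To put the generational FP4 jump in one place, here is a quick sketch using only the per-socket petaflops figures quoted above:

```python
# FP4 tensor core throughput per GPU socket, in petaflops, as quoted above.
fp4_petaflops = {"Blackwell B200": 10, "Blackwell B300": 15, "Rubin R200": 50}

# Rubin's jump over each Blackwell variant, and what a 72-socket rack adds up to.
for name, pf in fp4_petaflops.items():
    if name != "Rubin R200":
        print(f"Rubin vs {name}: {fp4_petaflops['Rubin R200'] / pf:.1f}x")

rack_fp4_exaflops = 72 * fp4_petaflops["Rubin R200"] / 1000
print(f"NVL72 rack of Rubin: {rack_fp4_exaflops} exaflops FP4")  # 3.6 exaflops
```

That is a 5X jump over the B200 and a 3.3X jump over the B300 at the socket level, before any rack-scale changes.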

Alongside Vera and Rubin, the Groq LP30 will ship inside of dedicated racks with a regular Spectrum Ethernet spine (sometimes called a backplane). As far as we know, this Ethernet spine does not use the Spectrum-6 ASICs with co-packaged optics, but it could use optics in the spine and copper in the chip-to-chip spine connectors coming off the Groq chips.

Nvidia calls this an Oberon ETL256 configuration, which means that either 256 Vera CPUs or 256 Groq LPUs can be linked to this backplane.

The Groq sleds coming this year have eight LP30s in four sockets per sled, and they look like this:


A rack of the LP30s is called the Groq 3 LPX system, and it has 32 sleds with a total of 315 petaflops of FP8 inference computing, 128 GB of SRAM across the 256 chips, 40 PB/sec of aggregate SRAM bandwidth, and 640 TB/sec of aggregate scale up bandwidth across the Spectrum ETL backplane. (Again, it is not clear if this is Spectrum-5 or Spectrum-6 with the CPO removed. We suspect it is Spectrum-5, which is simpler.)
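Dividing those rack aggregates by the chip count gives the implied per-chip figures – our arithmetic from the quoted totals, not numbers Nvidia has published per device:

```python
# Per-chip figures implied by the Groq 3 LPX rack aggregates quoted above.
chips = 256
sram_total_gb = 128
sram_bw_total_pbs = 40        # PB/sec of aggregate SRAM bandwidth
fp8_total_pflops = 315
scaleup_bw_total_tbs = 640    # TB/sec across the Spectrum ETL backplane

print(f"SRAM per chip:     {sram_total_gb / chips * 1024:.0f} MB")       # 512 MB
print(f"SRAM BW per chip:  {sram_bw_total_pbs * 1000 / chips:.2f} TB/sec")  # 156.25 TB/sec
print(f"FP8 per chip:      {fp8_total_pflops / chips:.2f} petaflops")    # 1.23 petaflops
print(f"Scale-up per chip: {scaleup_bw_total_tbs / chips:.1f} TB/sec")   # 2.5 TB/sec
```

The standout ratio is the SRAM bandwidth: over 150 TB/sec per chip against half a gigabyte of capacity, which is the whole point of an SRAM-based inference engine.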

Later this year, it will also be possible to get whole racks of Vera server CPUs in the Oberon racks with the ETL spine. (Meta Platforms is going to be an early customer for this.) If you do the math, that is eight Vera CPUs (possibly four two-way Vera-Vera nodes) in each sled, with 32 sleds in a Vera ETL rack. That is 256 CPUs with a total of 22,528 cores, 512 TB of main memory, and 300 TB/sec of bandwidth across that memory.
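That math checks out directly from the per-sled and per-CPU figures given above:

```python
# Sanity check on the Vera ETL rack: 8 CPUs per sled, 32 sleds per rack.
cpus = 8 * 32
cores = cpus * 88                  # 88 "Olympus" cores per Vera CPU
mem_per_cpu_tb = 512 / cpus        # 512 TB of main memory across the rack
bw_per_cpu_tbs = 300 / cpus        # 300 TB/sec of aggregate memory bandwidth

print(cpus, cores)                                            # prints "256 22528"
print(f"{mem_per_cpu_tb:.0f} TB and {bw_per_cpu_tbs:.2f} TB/sec per CPU")
```

Which works out to 2 TB of memory and about 1.17 TB/sec of memory bandwidth per Vera CPU.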


Presumably this will be called a Vera CPX rack, with CPX being short for Compute Processing Rack (not to be confused with the Rubin CPX processor). The storage racks based on the BlueField-4 DPUs, running various distributed storage software stacks from a dozen or so partners, are called BlueField STX racks, and similarly a rack of Spectrum-6 switchery is called a Spectrum SPX rack.

Perhaps the X was not a good idea in the naming. Perhaps, just perhaps, these should have been called CPR, STR, and SPR? Naming matters. They are all based on the MGX modular server architecture, and MGX is not to be confused with the private equity firm in the Middle East that is underwriting a lot of AI facilities around the globe these days.

As we roll forward into 2027, the "Rubin Ultra" GPU, presumably called the R300, is just a doubling up of the GPU chips inside of the Rubin socket, from two to four chips, supplying 100 petaflops of FP4 performance. Nvidia is also going to double up the number of sockets, to 144, inside the new "Kyber" rack, which will have a copper midplane instead of huge spaghettis of thousands of copper cables cross-coupling the GPU sockets in the rack. Nvidia will put sixteen banks of HBM4E memory against those four Rubin GPU chips, for 1 TB of capacity and 32 TB/sec of bandwidth. (In theory, that HBM4E memory could run at 64 TB/sec, and we are wondering why Nvidia is gearing it down – perhaps for power consumption and heat dissipation reasons.)
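Splitting the Rubin Ultra socket figures across its four chiplets – our division of the quoted totals, not an Nvidia breakdown – gives a feel for the compute-to-bandwidth balance:

```python
# Rubin Ultra socket: four GPU chiplets, sixteen HBM4E stacks, per the figures above.
chips_per_socket = 4
hbm_stacks = 16
fp4_pflops = 100          # petaflops per socket
hbm_bw_tbs = 32           # TB/sec as geared down; 64 TB/sec in theory

print(f"per chiplet: {fp4_pflops / chips_per_socket} PF, "
      f"{hbm_stacks // chips_per_socket} HBM stacks, "
      f"{hbm_bw_tbs / chips_per_socket} TB/sec")

# Bytes of HBM bandwidth available per FP4 operation at full tilt:
print(f"{hbm_bw_tbs * 1e12 / (fp4_pflops * 1e15):.5f} bytes/flop")  # 0.00032
```

That ratio of roughly 0.3 millibytes per flop is part of why gearing the HBM4E down to 32 TB/sec is surprising; compute this dense is usually starved for bandwidth, not power budget.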

Let’s talk about NVLink ports and NVSwitch memory fabric interconnects for a second. The names were out of phase a bit because the initial NVLink 1.0 that debuted with the “Pascal” P100 GPUs back in 2016 did not have a switch, but rather used a mesh interconnect to share memory across the Pascal GPUs. The names of the ports and switches were brought into lockstep with the Blackwell B300 GPUs (I think), and going forward the chip and port generations are named in sync. Like this:


There are many ways the NVSwitch memory fabric ASICs can be enhanced, but I think it is safe to say that the radix – the number of ports per ASIC – is getting to be too low, and I think there is a more than even chance that Nvidia will start thinking not about chiplets but about waferscale designs for these ASICs. (And might even do that for future Groq LPUs, now that I think on it.) These would not have to be full wafer scale, but it would mean doing away with the C2C interconnects everywhere, as well as all of the buffering that needs to be done as data moves from one chip through the C2C interconnects and on to the adjacent chips. (This is what we think secretive networking chip startup Eridu is already working on, and Cerebras has shown how well it works for parallel compute.)

Suffice it to say, NVLink 6 ports on the Rubin GPUs will double their bandwidth over NVLink 5 ports, to 3,600 GB/sec, and they will double yet again with the Rubin Ultra GPUs, which stands to reason given the performance doubling and the HBM4 memory bandwidth almost tripling between Rubin and Rubin Ultra.
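The doubling cadence implied by that statement looks like this (the NVLink 7 figure is our extrapolation of "double yet again", not a number Nvidia has published):

```python
# Per-GPU NVLink bandwidth progression implied by the text, in GB/sec.
nvlink5 = 1800            # Blackwell generation
nvlink6 = nvlink5 * 2     # Rubin: 3,600 GB/sec, as stated above
nvlink7 = nvlink6 * 2     # Rubin Ultra: doubling again (our extrapolation)

print(nvlink5, nvlink6, nvlink7)  # prints "1800 3600 7200"
```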

In the Rubin generation, the Spectrum-6 Ethernet ASICs will have co-packaged optics, and that generation of 102.4 Tb/sec switches will also support the scale out network needs of the Rubin Ultra systems. The 2027 Rubin Ultra product line will see the Groq LP35 chip get FP4 floating point processing in the NVFP4 format so it matches the precision of the Blackwell and Rubin GPUs. And with the Groq LP40 compute engine coming in the Rosa-Feynman systems in 2028, NVLink ports will be added to the Groq engines so they can have memory coherency with the Rosa Arm server CPUs (named after Rosalyn Sussman Yalow, a Nobel Prize winning medical physicist who developed the radioimmunoassay method of detecting tiny amounts of chemicals in blood or tissue) and with the Feynman GPUs (named after famous physicist and bongo player Richard Feynman).

You will see in the roadmap that in 2028 Nvidia will add CPO to NVLink 8 ports, and presumably to the other end at the NVSwitch ASIC. While we are always pushing for compute engine makers to do CPO on their devices, they can be copper on that end with a multitier network of switch ASICs using CPO on the other side. You do not have to have CPO on both sides. (Nvidia seems to be using NVSwitch and NVLink loosely in this chart, so be careful.) We think that CPO for NVSwitch is interesting because it will allow for fast, high-bandwidth, two-tier NVSwitch networks to create larger GPU compute memory domains for models to play in.

With the Hopper GPUs, the official scalability was eight GPUs with linked memory, but the unofficial scalability was 256 GPUs using a two tier network. With Blackwell, the official GPU memory domain size is 72 GPUs, but with multiple tiers of NVSwitch it can, in theory, be boosted to 576 GPUs. With the Kyber racks, which cram twice as many GPUs onto vertical sleds and have a copper backplane, the rackscale domain will be 144 GPUs. Eventually, with the advent of NVSwitch 8 CPO (I know the chart says NVLink 8 CPO), the single rack will remain at 144 GPUs, but across a multitier network (we think two tier, but we can't know that without knowing the radix of the NVSwitch 8 device) Nvidia will have a domain size of 1,152 GPUs.
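Lining up those rack-level and network-level domain sizes shows a consistent 8X multiplier for the switched generations (the grouping of the 144-GPU Kyber rack with the 1,152-GPU NVSwitch 8 CPO domain is ours, since the rack arrives in 2027 and the CPO fabric in 2028):

```python
# NVLink memory domain sizes, rack (official) vs multi-tier network, per the text.
domains = {
    "Hopper":      {"rack": 8,   "network": 256},   # mesh-era baseline, two-tier unofficial
    "Blackwell":   {"rack": 72,  "network": 576},   # NVL72, multi-tier in theory
    "Kyber/NVSw8": {"rack": 144, "network": 1152},  # copper midplane rack, CPO fabric
}
for gen, d in domains.items():
    print(f"{gen:12s} rack: {d['rack']:4d} GPUs   network: {d['network']:5d} GPUs "
          f"({d['network'] // d['rack']}x)")
```

Hopper's 32X ratio reflects its tiny eight-GPU official domain; once the rack itself became the domain, the network multiplier settled at 8X.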

Decades ago, Cray supercomputers had copper backplanes in the rack and optical links coming off their routers to interlink the racks. We suspect that Nvidia will do much the same. The rule is always: Copper when you can, optics when you must. And that is an economic rule as much as a technical one. But with Nvidia accounting for such a large share of AI systems investment, if any workload can get CPO to ramp up in volumes and therefore push down unit prices, it is GenAI inference, and if any company can drive that effort and coordinate it across a supply chain, it is Nvidia. One might argue that only Nvidia can make this happen, and if it does, all systems will benefit.

The factor of 16X more GPU sockets, combined with the performance increases expected with the Feynman GPUs – all Nvidia is saying is that it will have die stacking and custom HBM memory with this generation of chips – will result in a massive throughput boost for CPU-GPU hybrid systems.

If the die stacking is just for SRAM cache (which is relatively easy to do), that still allows for more 2D GPU cores to be added to a socket. Nvidia will probably move to 2 nanometer processes or smaller for Feynman, which also means a move to gate all around (GAA) transistors and the High NA EUV process, which in turn means the maximum reticle size shrinks from 858 mm2 to 429 mm2 because the chips can only be half as tall. So whatever Feynman is or isn't, it will probably have at least eight GPU chips in a socket, compared to the four chips in a Rubin Ultra socket, and use the process shrink to add more circuitry.
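The reticle arithmetic behind that eight-chip guess is simple: halving the reticle means doubling the chiplet count just to hold total silicon area per socket constant. This is our reasoning from the numbers above, not an Nvidia disclosure:

```python
# High-NA EUV halves the reticle, so Feynman needs twice the chiplets of
# Rubin Ultra just to keep total silicon area per socket constant.
reticle_now, reticle_high_na = 858, 429   # maximum reticle sizes in mm^2
rubin_ultra_chips = 4
feynman_chips = 8                         # our minimum guess from the text

print(rubin_ultra_chips * reticle_now)    # prints "3432" (mm^2 per Rubin Ultra socket)
print(feynman_chips * reticle_high_na)    # prints "3432" (mm^2 per Feynman socket)
```

Any performance gain beyond that area parity has to come from the 2 nanometer shrink, the die stacking, and the custom HBM.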

There is always a chance that Nvidia might be stacking both SRAM and compute with the Feynman devices, of course. That would be very interesting, indeed.
