GTC Nvidia will use Groq’s language processing units (LPUs), a technology it paid $20 billion for, to boost the inference performance of its newly announced Vera Rubin rack systems, CEO Jensen Huang revealed during his GTC keynote on Monday.
Using this technology, the GPU giant can now serve massive trillion-parameter large language models (LLMs) at hundreds or even thousands of tokens a second per user, Ian Buck, VP of Hyperscale and HPC at Nvidia, told press on Sunday, ahead of Huang’s keynote.
Until now, ultra-low-latency inference has been dominated by a handful of boutique chip slingers like Cerebras, SambaNova, and, of course, Groq, the last of which Nvidia all but absorbed as part of an acquihire late last year.
Demand for these so-called premium tokens has grown over the past year. OpenAI is using Cerebras’ dinner-plate-sized accelerators to achieve near-instantaneous code generation for models like GPT-5.3 Codex-Spark.
By combining its GPUs with Groq’s LPUs, Nvidia wagers inference providers will be able to charge as much as $45 per million tokens generated. To put that in perspective, OpenAI currently charges about $15 per million output tokens for API access to its top GPT-5.4 model.
To be clear, LPUs won’t replace Nvidia’s GPUs but rather augment them.
Speed for decode
LLM inference encompasses two stages: the compute-heavy prefill phase, in which the input prompt is processed, and the bandwidth-heavy decode phase, during which the response is generated.
With up to 50 petaFLOPS each, Nvidia’s newly announced Rubin GPUs aren’t hurting for compute, but at 22 TB/s of HBM4 memory bandwidth, they can’t keep pace with Groq’s latest chip tech, which is nearly 7x faster at 150 TB/s apiece.
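Why bandwidth matters so much here: during decode, every new token requires streaming the model’s active weights out of memory, so a single user’s token rate is roughly memory bandwidth divided by the bytes read per token. A back-of-envelope sketch in Python (the 50-billion active-parameter figure is an illustrative assumption, not an Nvidia or Groq spec):

    # Back-of-envelope: single-user decode speed is bounded by how quickly the
    # model's active weights can be streamed from memory for each new token.
    # The active-parameter count is an illustrative assumption, not a vendor figure.
    active_params = 50e9                 # e.g. a sparse MoE with ~50B params active per token
    bytes_per_param = 0.5                # 4-bit weights
    weight_bytes = active_params * bytes_per_param

    for name, bandwidth_bytes_per_s in [("Rubin HBM4", 22e12), ("Groq LPU SRAM", 150e12)]:
        # Assumes the weights are sharded so every chip streams its slice at full tilt
        print(f"{name}: ~{bandwidth_bytes_per_s / weight_bytes:,.0f} tokens/s per user, tops")

By that math, the same hypothetical model tops out at roughly 880 tokens per second per user on HBM4-class bandwidth versus about 6,000 on Groq’s SRAM, which is the gap Nvidia is chasing.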
This makes Groq’s LPU an ideal decode accelerator. Nvidia plans to cram 256 of the chips into a new LPX rack system that’ll be connected via a custom Spectrum-X interconnect to a neighboring Vera Rubin NVL72 rack system. The GPUs will handle the compute-intensive prompt processing, while the LPUs spew out tokens.
The GPU giant needs that many chips because, while SRAM may be fast, the chips are neither capacious nor compute-dense.
Each Groq 3 LPU is capable of 1.2 petaFLOPS of FP8 compute and contains 500 MB of on-board memory. That’s about 1/500th of the memory capacity of Nvidia’s Rubin GPU.
“The LPU is optimized strictly for that extreme, low-latency token generation, offering token rates in the 1000s of tokens per second. The trade-off, of course, is that you need many chips in order to perform that kind of performance,” Buck explained. “The tokens per second per chip is actually quite low.”
In other words, to do anything interesting, Nvidia is going to need a lot of them.
Even with 256 chips per rack, that’s only 128 GB of ultra-fast memory, which is nowhere near enough for trillion-parameter models like Kimi K2. At 4-bit precision you’d need at least 512 GB of memory, or about a thousand LPUs, to hold a 1-trillion-parameter model in memory.
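The capacity arithmetic, using the figures above, works out roughly as follows (a weights-only sketch; KV cache and other runtime state push the real requirement higher):

    # Weights-only capacity math from the figures quoted above.
    lpu_sram_gb = 0.5                        # 500 MB of SRAM per Groq 3 LPU
    lpus_per_rack = 256
    rack_gb = lpu_sram_gb * lpus_per_rack    # 128 GB per LPX rack

    params = 1e12                            # a 1-trillion-parameter model
    model_gb = params * 0.5 / 1e9            # ~500 GB of weights at 4-bit precision

    print(f"{rack_gb:.0f} GB per LPX rack")
    print(f"~{model_gb / lpu_sram_gb:.0f} LPUs, roughly {model_gb / rack_gb:.0f} racks, for the weights alone")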
Nvidia says multiple LPX racks can be ganged together to support these larger models.
The integration of Groq’s latest LPUs into Nvidia’s LPX racks represents a bit of a course correction for the AI infrastructure behemoth. Nvidia had previously announced a dedicated prefill processor called Rubin CPX at Computex last year. The basic idea was to use GDDR7-equipped Rubin CPX processors for prefill processing and HBM-equipped Rubin GPUs for decode. However, that project appears to have been abandoned in favor of Groq’s LPU-based decode accelerators.
“Integrating LPU and LPX into our Rubin platform to optimize the decode, that’s where we’re focused right now,” Buck said.
Nvidia isn’t the only one looking to pair its compute-heavy AI accelerators with an SRAM-heavy architecture like Groq’s.
On Friday, Amazon Web Services (AWS) announced a collaboration with Cerebras to develop a combined inference platform, not unlike Nvidia’s Groq 3 LPX. In this case, the platform will use AWS’ Trainium 3 accelerators for prompt processing and Cerebras’ WSE-3 ASICs, each of which packs 44 GB of SRAM onto a wafer-sized chip, to generate low-latency tokens.
Nvidia’s Groq-based LPX systems are expected to ship alongside its Vera Rubin rack systems later this year, though it appears both access and software support may be somewhat limited. At least initially, Nvidia is focusing on model builders and service providers that need to serve trillion-plus parameter models with high token rates.
Buck also noted that while Nvidia is using Groq’s ASICs to accelerate its inference platform, they don’t support CUDA natively just yet.
“There are no changes to CUDA at this time. We are leveraging the LPU as an accelerator to the CUDA that’s running on the Vera NVL 72 platform,” he explained. ®



