
Improving AI Inference with AMD EPYC Host CPUs | Signal65 Webcast

On this episode of Signal65, Ryan Shrout and Russ Fellows discuss how CPU selection and system-level design are influencing AI inference performance as workloads scale.

AI performance gains are increasingly determined by what happens before and after the GPU.

In this Signal65 webcast, Ryan Shrout, Russ Fellows, and Mitch Lewis are joined by Madhu Rangarajan, Corporate VP, Compute and Enterprise AI Products at AMD, and Curt Waltman, Senior Director, Compute and Enterprise AI Products at AMD, to explore how AMD EPYC processors are improving AI inference performance in enterprise environments.

As AI workloads move from experimentation to production, the efficiency and scalability of the host platform become critical. This discussion breaks down how EPYC CPUs support AI acceleration, optimize data movement, and deliver measurable performance improvements in real-world deployments.

Key Takeaways:

🔹 Inference is infrastructure-bound: AI performance is heavily influenced by host CPU architecture, not just accelerators.

🔹 Data movement is a bottleneck: Memory bandwidth, I/O, and interconnects significantly impact AI workload efficiency.

🔹 CPU + GPU synergy matters: Optimizing inference requires tight integration between EPYC CPUs and AI accelerators.

🔹 Enterprise AI requires balance: Power efficiency, core density, and scalability determine real-world deployment success.

🔹 Platform-level optimization wins: AI performance is achieved through system-level engineering, not component-level thinking.

To learn more about how EPYC CPUs are shaping AI inference performance in enterprise data centers, visit AMD: https://www.amd.com/en.html

Read the Signal65 research paper: https://signal65.com/research/ai/improving-ai-inference-with-amd-epyc-host-cpus/


Disclaimer: Six Five Media is for information and entertainment purposes only. Over the course of this webcast, we may talk about companies that are publicly traded, and we may even reference that fact and their equity share price, but please do not take anything that we say as a recommendation about what you should do with your investment dollars. We are not investment advisors, and we ask that you do not treat us as such.

Transcript

Ryan Shrout:
Hey everybody, welcome to a Signal65 Video Insights. I’m your host, Ryan Shrout, President at Signal65. I’m joined by Mitch and Russ from our performance analysis team. Guys, thanks for joining me today, right before New Year’s. And we’ve got two great guests from AMD: Madhu, Vice President of Server Products at AMD, and Curt, Senior Director of EPYC AI Products at AMD. Welcome to you both, and thanks for coming in and joining us right before New Year’s.

Madhu Rangarajan:
All right. No, thank you. Great to be here. Thank you, Ryan.

Ryan Shrout:
So we’ve got a really interesting topic to talk about today. As with most things we’ve dealt with at Signal65 over the last two years, it has to do with AI and AI performance testing and analysis. But this one’s a little bit different, and I think it’ll be an interesting conversation for our viewers and listeners. There are a lot of components that go into AI systems, and we talk about them all the time here. Whether it’s for inference or training, you’re looking at GPUs and accelerators and networking and memory, but also CPUs. I’ll start this question with you, Madhu, you can take the first stab at it, but anybody, feel free to jump in: how would you generally describe the role that a CPU plays in AI compute in the data center today, despite the constant attention that GPUs like AMD’s own Instinct series tend to draw?

Madhu Rangarajan:
Yeah, I think you already touched upon it a little bit, Ryan. AI is a workload that involves a CPU, it involves GPUs, it involves networking, and it involves software that ties it all together. It’s critical to make sure that no individual piece of that system becomes a bottleneck. While GPUs get a lot of the press when it comes to an AI workload, there are other things in the system that you need to tune very finely. What we’re going to talk about a lot more today is the CPU, which, from a very oversimplified perspective, plays the role of a traffic cop. Going one click down, it’s doing things like memory copies, kernel dispatches, and things that I’ll let Curt dig deeper into. What we want to do is make sure that work doesn’t take up so much time that it keeps the GPUs waiting. That’s why it’s important to have a CPU that’s performant, with extremely high single-threaded performance and high frequency, so it does its thing, gets out of the way, and lets the GPU run at as high a utilization as possible. Especially given the price of GPUs, you want that utilization to be as high as possible.
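The traffic-cop framing can be made concrete with a simple back-of-envelope model (an illustrative sketch, not AMD's measurement methodology; the millisecond figures below are hypothetical): if every GPU kernel launch is gated by serial host-side work such as dispatch and memory copies, GPU utilization is bounded by the ratio of kernel time to total time, so shrinking the host overhead with a faster CPU raises utilization directly.

```python
def gpu_utilization(kernel_ms: float, host_overhead_ms: float) -> float:
    """Fraction of wall time the GPU spends computing when each kernel
    is preceded by serial host-side work (dispatch, memory copies)."""
    return kernel_ms / (kernel_ms + host_overhead_ms)

# Hypothetical numbers: a 2 ms kernel behind 0.5 ms vs 0.1 ms of host overhead.
slow_host = gpu_utilization(kernel_ms=2.0, host_overhead_ms=0.5)
fast_host = gpu_utilization(kernel_ms=2.0, host_overhead_ms=0.1)
print(f"slow host: {slow_host:.1%}, fast host: {fast_host:.1%}")
# → slow host: 80.0%, fast host: 95.2%
```

The model ignores overlap tricks like CUDA/HIP graphs and asynchronous launch queues, but it captures why single-threaded host performance shows up directly in end-to-end GPU throughput.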

Ryan Shrout:
Yeah, yeah. Curt, is traffic cop a good high-level descriptor, do you think?

Curt Waltman:
Yeah, absolutely. And building on that: to maximize the overall end-to-end performance of these AI systems, you need a high-performance CPU. Many of the tasks Madhu mentioned are CPU bound while these AI workloads run, so selecting the right processor is critical to ensuring customers get the most out of their AI investments.

Ryan Shrout:
It’s interesting, in a lot of the testing that we do, both in my prior life and at Signal65, there’s this idea of constantly shifting bottlenecks: finding the current bottleneck and making the system as efficient and performant as possible. That comes into play whether we’re looking at storage testing or GPU testing, whether for AI or for graphics; the engineering side of it is always finding that next little bit of performance you can pull out of the system. It’s super interesting to follow. Curt, I’m curious, when you look at the EPYC processor lineup today, would you say AMD is focused on developing SKUs that specifically target this compute segment and workload? And what traits of a CPU are most important? Is it the highest core count? The highest frequency? The most cache? What should we be looking at in that regard?

Curt Waltman:
So AMD has developed a family of processors aimed directly at this AI host application market. It includes the EPYC 9575F, a 64-core CPU that runs at up to 5 GHz max boost frequency. That single-threaded performance is critical for getting the most out of these AI workloads. What we find is that the end-to-end workloads frequently need that single-core performance to retire the CPU-bound portions of the workload as quickly as possible. So by delivering products that maximize single-core frequency, we’re able to help our customers get the most out of the investment dollars they’re making.

Ryan Shrout:
So what’s interesting about that is it might be a different configuration than what you’d want if you were doing AI inference on the CPU itself, right? I’m curious, are there different parameters, different vectors of core count, frequency, and cache, that are important in those different workloads?

Curt Waltman:
Yeah, absolutely, Ryan. If you’re running inference directly on the CPU, you’re going to tend to do better with higher core count processors; that lets you get more work out of the socket. That’s different from AI host applications, where the CPU is supporting GPUs; that’s where a lower core count, higher frequency CPU really shines. What we find is that when the CPU is hosting GPUs, these AI workloads don’t need as many cores, but they do need those cores to run very, very fast, so you can get the CPU out of the way of the GPUs and the overall end-to-end workload.

Madhu Rangarajan:
And maybe just to add to that a little bit, Ryan: I think of these in three different buckets. There’s running workloads on the CPU itself, like CPU inference, which Curt just spoke about; lots of cores tend to do better there. There’s a second bucket, which I think of as AI pipeline workloads: pre-processing and post-processing, doing RAG, doing speculative decode, things like that. That one’s going to be pretty workload dependent; RAG tends to do really well with a lot of cores. And then the third bucket is the CPU being involved in the control flow of the GPU, making sure the GPU is kept fed. That’s where frequency and single-threaded performance make the biggest difference.

Ryan Shrout:
Which leads me into some of the work and testing that Signal65 has done. We did some testing and validation to look at CPU head node performance, particularly comparing EPYC versus competitive Intel Xeon platforms. Mitch, you and Russ were working on this project together. As you look through the data and results, give me a little background on what the testing parameters were and what you wanted to look at, and whether there were any particularly interesting results across a specific model or configuration that stood out to you.

Mitch Lewis:
Yeah, so we tested a whole bunch of different models, different shapes and configurations. The goal was really to isolate the impact of the CPU: same configuration, same GPUs they’re running with, really looking to see how the AMD CPU compares to the Intel CPU. We pretty consistently saw an advantage for the AMD EPYC processor we were running with. One model that really sticks out to me is GPT-OSS 120B, obviously a new, interesting model, but with that one there was a pretty big advantage across the board. Throughput ranged from 12% to 14% higher, on both request throughput and output throughput. Time to first token ranged from 10% to 36% faster. And even token latency was around 11% better. That right there is a pretty good example that the CPU you’re using really does matter, even if so much of the focus is on the GPUs.

Ryan Shrout:
Yeah, it’s interesting, because GPT-OSS 120B is an MoE, mixture of experts, model, which, Madhu, correct me if I’m wrong on this, requires a little extra effort from the CPU side in terms of scheduling and load balancing across those different experts, right?

Curt Waltman:
Yeah, that’s exactly right, Ryan. In a mixture of experts model, you’ve got multiple different sections of the model responding to the queries from the end user, and the CPU’s role is helping to orchestrate the work across each of those experts.

Ryan Shrout:
I find it really interesting, as we’ve seen this progression and change of AI models and the technologies used in them, that we’re seeing an oscillation in where the performance bottlenecks lie: more impact from networking, more impact from CPU head node selection. Russ, I’m curious if anything stood out to you from the testing you did, or the analysis we’ve done to this point, on the overall trend this data might be showing.

Russ Fellows:
Well, as Mitch said, I was involved in the testing; we split up the testing duties, so we were pretty hands-on with the individual test results. And like Mitch said, we saw a general across-the-board advantage for AMD. But going back to the earlier points about the CPU being important in a lot of different parts of the whole AI pipeline: we focused really just on the inferencing portion, but other common workflows, like RAG and preparing your data, which as anybody knows is a big part of the whole AI data workflow, are very CPU dependent. A lot of the data prep is completely handled by the CPU. So there are plenty of portions of the day-to-day work that we didn’t necessarily test that would be impacted by the CPU, and the advantages AMD has here would translate to those as well. KV cache paging, for example, is a big deal in production inferencing right now; we didn’t really test it, but it’s completely bound by the CPU, so there could be additional benefits there too.

Ryan Shrout:
Really interesting. Good point. The next thing I want to talk about is when we hear about performance advantages on like-for-like GPU infrastructure just by changing out the CPU head node. Mitch was talking about performance numbers that go from 5 to 7 to 10 percent and above, differences you’ll see when you look up our Signal65 report on this. That has real financial benefits for these customers as well. So, Madhu, I’ll pose this question to you: when you see performance results like that, how do you position those benefits to customers in terms of TCO, savings over time, and getting the most out of the sometimes very significant millions or billions of dollars of AI infrastructure spend that a lot of these companies are making?

Madhu Rangarajan:
Yeah, I think that’s a great question, because at first glance you look at 10% more performance, 8% more performance, and ask what that actually means. Let me first talk about it at a single node level and then zoom out to the data center level. A single GPU node with, say, eight GPUs goes for hundreds of thousands of dollars. So even if you get 10% more performance out of it just by picking the right CPU SKU, that’s tens of thousands of dollars of performance per dollar, which is very non-trivial, especially with the cost of GPU infrastructure going up steadily. Now, if you really zoom out to the data center or cluster level, some of the hyperscalers are deploying 100-megawatt data centers. In the context of a 100-megawatt data center, if you look at how many dollars this translates into, it easily goes to tens of millions or even over 100 million dollars, just by picking the right CPU to run these workloads. Given the scale of AI deployments, these kinds of performance improvements quickly approach really large numbers.
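That arithmetic scales linearly, so it is easy to sanity-check with hypothetical figures (the node price, uplift, and cluster size below are illustrative assumptions, not quoted AMD numbers): a performance uplift is worth roughly the hardware spend you would otherwise need to buy that same extra throughput.

```python
node_cost_usd = 300_000   # hypothetical price of one 8-GPU node
perf_uplift = 0.10        # assumed 10% more throughput from a better host CPU

# Value of the uplift per node: the extra throughput you would otherwise
# have to purchase as additional hardware.
value_per_node = node_cost_usd * perf_uplift

nodes_in_cluster = 4_000  # hypothetical large-scale deployment
cluster_value = value_per_node * nodes_in_cluster

print(f"per node: ${value_per_node:,.0f}, cluster: ${cluster_value:,.0f}")
# → per node: $30,000, cluster: $120,000,000
```

Even with conservative inputs, the per-node figure lands in the "tens of thousands" range and the cluster figure crosses into nine digits, consistent with the scale described above.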

Curt Waltman:
And just building on top of that: a critical factor for many of our customers is cost efficiency, but also power efficiency. Our customers can be constrained at the top level by how many gigawatts they can source off the grid. So if you’re able to get 5% or 10% more performance out of your investment at a given power level, that translates directly into service capacity, which is really what customers care about as a top-level metric.

Ryan Shrout:
I’ll add another one on there, Curt: physical space capacity. They’re all interrelated, but how much power you have in your data center and how much physical footprint you have can matter. If you can squeeze that extra 10% or more out of some of these workloads in the same footprint, that’s another big advantage. And Madhu, you talked about a single node of an 8-GPU system running in the hundreds of thousands of dollars; if you’re spending a billion dollars on infrastructure and you get 10% more out of it, that’s a pretty significant chunk of money you can reinvest, use for extra performance, or distribute however you see fit. When we’re used to seeing performance claims of 5x or 50% through generational improvements on GPUs, it’s easy to lose in the noise how much this 5, 10, 15% can really add up from a dollars perspective.

Madhu Rangarajan:
Oh yeah, exactly. And getting those 5x gains often means using different precisions, re-quantizing, and things like that. In this case, you don’t actually have to do much: you just change the CPU SKU, and because the CPU runs at a higher frequency with higher single-threaded performance, you get that higher performance without any additional tuning. That makes it very low-hanging fruit for incremental performance.

Ryan Shrout:
Which is pretty rare in the AI landscape today, so I think most people will take it. Okay, I’m going to close out on this question, and anybody who has input, I’d love to hear it. We’ve talked about the financial benefits and the performance advantages, but if a customer or a hyperscaler is standing up a new inference cluster going into 2026, what practical guidance would AMD give around head node selection, accelerators, and rack-scale solutions? How do you balance all of this going into what will likely be another massive year for AI infrastructure investment?

Curt Waltman:
So I think, Ryan, first, we’re not in a one size fits all world, right? Customers are building these AI systems. Some of them are going into data centers. Some of them are going into edge locations. Client devices are using AI to give better user experiences. AMD is uniquely positioned. We’ve got a broad portfolio of solutions that span CPUs, GPUs, networking gear, and more. What we’re able to do with that is help our customers get the right solution for their workloads, for their environment, so that they’re able to get the most out of their investments in AI.

Madhu Rangarajan:
Yeah, I totally agree with Curt. We’ve got CPUs, GPUs, and networking. Instead of making everything a GPU problem or everything a CPU problem, we don’t really have religion on this one. We want to deliver the best solutions to our customers, and it’s the right combination of CPU plus GPU plus networking that gets us there. You can see us active on all of those fronts, and also embracing open standards across the board to deliver networking solutions, open sourcing our software stacks, and so on. Because the other part of this is that AI is too big for one company to do alone, and open standards are what will let everyone innovate together and move this industry faster than it would otherwise.

Ryan Shrout:
It’s incredibly interesting. I agree. I’m really looking forward to seeing what AMD has in store for us through 2026. I think there will be some new product announcements and new architectures, and we’re all very excited here, especially Russ and Mitch, to be able to go test all of it. For everybody watching, please go to Signal65 and check out our report on the head node performance evaluations we’ve talked about here. You can see all the data, testing configurations, and results that we only scratched the surface of in this interview. Thanks to Curt, Madhu, Russ, and Mitch for joining me, and we look forward to seeing you all on the next Video Insights from Signal65. Thanks.
