Inside the Memory Tech Powering Today’s AI and HPC Workloads – Six Five On The Road
Girish Cherussery, VP/GM AI Solutions at Micron, joins David Nicholson to explore the latest memory technologies powering AI and HPC, breaking down current bottlenecks, emerging architectures, and what data center leaders need to know now.
AI and HPC are outgrowing yesterday’s memory architectures, and the next performance breakthroughs won’t come from GPUs alone. How are memory and storage innovations reshaping how data centers scale for the AI era?
From SC25, host David Nicholson, Global Technology Advisor at Futurum, is joined by Micron's Girish Cherussery, Vice President and General Manager, AI Solutions for Micron’s Cloud Memory Business Unit, for a conversation on the advanced memory technologies driving efficiency, bandwidth, and scalability for AI and HPC workloads. The discussion brings practical insights into the real-world implications of memory bottlenecks and emerging architectures and interconnects shaping next-gen data center performance. What evolving strategies around composable infrastructure are helping data center architects plan for memory-intensive computing?
Key Takeaways Include:
🔹Memory bottlenecks in AI and HPC: Why memory architectures are now the critical limiting factor in AI/HPC system performance and scaling.
🔹Emerging solutions and trade-offs: How higher bandwidth, increased capacity, new module designs, and innovative interconnects are addressing performance needs, and the challenges and roadblocks system designers still face.
🔹Modular, composable, and CXL-driven infrastructure: The practical benefits of modular and composable memory, and how technologies like CXL are enabling more dynamic, agile, and efficient memory use for modern workloads.
🔹Real-world benchmarking & ecosystem collaboration: Insights from industry-standard benchmarks and testing with partners, revealing how latency and bandwidth behave under authentic AI/HPC conditions.
Learn more at Micron
Watch the full video at sixfivemedia.com, and be sure to subscribe to our YouTube channel, so you never miss an episode.
Disclaimer: Six Five On The Road is for information and entertainment purposes only. Over the course of this webcast, we may talk about companies that are publicly traded, and we may even reference that fact and their equity share price, but please do not take anything that we say as a recommendation about what you should do with your investment dollars. We are not investment advisors, and we ask that you do not treat us as such.
David Nicholson:
Welcome to Six Five On The Road, coming to you from St. Louis, Missouri at SC25. This is the Supercomputing High-Performance AI Conference to End All HPC AI Conferences. I've got a very special guest with me today from Micron. He's the Vice President and General Manager of AI Solutions for Micron's Cloud Memory Business Unit, Girish Cherussery. Welcome. Thank you. So Girish, I just listed your title, but how would you describe what you do at Micron?
Girish Cherussery:
Oh, thank you for asking that question. And first of all, I'm excited to be here at Supercomputing, one of my favorite conferences, where you get to actually discuss what the future could look like. At Micron, I work on high-performance memory, HBM, and all the solutions that are related to AI architectures and AI infrastructure, with HBM, high bandwidth memory, being the predominant one that we focus on. I manage that business for Micron.
David Nicholson:
So speaking of high bandwidth memory, it's not a shock to anyone that AI and high-performance compute workloads have become more and more demanding. What are you seeing from your perspective as we move into a future of greater and greater performance demands? Where are you seeing bottlenecks?
Girish Cherussery:
Right. I think it's important to recognize that AI about 10 years ago was a single class of workloads. AI in the future is going to be multiple classes of workloads. What I mean by that is you're going to have training, you're going to have inference, and within training and inference, different breakouts. Inference today has prefill and decode: prefill is more compute intensive, decode is more memory intensive. As you start looking at this, and as AI becomes so predominant in the world, the infrastructure needed for these kinds of solutions is going to be disaggregated, or is going to be very different from one to the next. So at the heart of all of that, you're going to start to see memory capacity, memory bandwidth, latency, and power efficiency become the key critical things that people look at in memory.
David Nicholson:
Now let me ask you, are you describing different kinds of memory, or are you saying you can get all of those things, you can get all of the best of those things in one device?
Girish Cherussery:
Yeah, let me give you two things. The interesting thing is, I tell my kids to go get something from the dishwasher, I ask them to put the clothes into the washing machine, and then I ask them to go bring a book. They can only remember two out of the three things, right? That's because their memory is not as good yet; they're still young. As they grow up, their memory gets bigger, their ability to compute and understand these things grows, and as a result, at some point in time they become very smart human beings who can process a lot of information. Machines, especially in the AI world, are very similar, right? We're talking about artificial general intelligence and how machines are going to be your agents or your secretary, going out to do some work for you, helping you out. As you think about that, they're going to need memory. And is there one type of memory that has all of this? Yes, absolutely. But do we really need all of that in one type of memory if it's too expensive? We may want to think about it as: what is the right workload, and what is the right memory type for it? What I mean by that is, if you take, for example, the decode phase of inference, it's extremely memory bandwidth intensive; these are memory-bound problems. And when you start thinking about memory-bound problems, you're really trying to reduce your latency, and you want as much bandwidth as possible. What are you willing to trade off? Well, some amount of capacity. High bandwidth memory is a perfect solution for that, because you are restricted in the number of placements around your XPU, GPU, SoC, whatever you want to call that compute engine. Now you go out and look at prefill, or long context lengths, where memory capacity is important. You want to load up a book, you want to be able to read from that book and generate your first token. Now you need memory capacity. Well, you've got something like SOCAMM2, which uses a low-power memory device, which we call LPDDR5, built into a module form factor that goes onto the same board as your compute engine or your CPU. And if you need higher capacity for your main memory, for your general-purpose database kind of applications, you can now go to an RDIMM or an MRDIMM, which offers really large volumes of memory across multiple modules that you can stick into the slots. And beyond that, you've got CXL memory, which is more like a disaggregated memory sitting isolated away from the compute in some ways, but it is a large pool of memory that you can actually access on a need basis.
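To make that prefill-versus-decode trade-off concrete, here is a minimal back-of-envelope sketch. It is not from the interview, and every number in it is an assumed, illustrative figure: decode throughput is roughly capped by how fast the model weights can stream out of memory, while long-context work is dominated by KV-cache capacity.

```python
# Back-of-envelope sketch: why decode tends to be memory-bandwidth bound while
# long-context prefill is capacity bound. All figures are illustrative
# assumptions, not Micron specifications or numbers from this interview.

def decode_tokens_per_sec_ceiling(params_billion, bytes_per_param, mem_bw_gb_s):
    """Each generated token streams roughly all model weights from memory,
    so memory bandwidth caps single-stream decode throughput."""
    model_bytes = params_billion * 1e9 * bytes_per_param
    return mem_bw_gb_s * 1e9 / model_bytes

def kv_cache_gib(layers, kv_heads, head_dim, context_len, batch, bytes_per_elem=2):
    """KV cache grows linearly with context length, which is what pushes long
    contexts toward capacity tiers (SOCAMM2, RDIMM/MRDIMM, CXL)."""
    return 2 * layers * kv_heads * head_dim * context_len * batch * bytes_per_elem / 2**30

# Hypothetical 70B-parameter model at 2 bytes/param on ~8 TB/s of HBM:
print(decode_tokens_per_sec_ceiling(70, 2, 8000))   # ~57 tokens/s per stream
# One 128k-token stream for a hypothetical 80-layer model with 8 KV heads:
print(kv_cache_gib(80, 8, 128, 128_000, 1))         # ~39 GiB of KV cache
```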
David Nicholson:
Yeah, I want to talk about CXL, but I want to put a thumbtack in that for a moment, because I want to double-click on what you're saying here in terms of the different kinds of memory. Do you think of that in terms of tiering, or is that not the right way to look at it? Because it is often the case that different types of memory have different places in the environment, and it isn't just about most recently accessed data or something like that.
Girish Cherussery:
So traditionally, when people think about tiering in the computer-architecture sense, you have the L1, L2, L3 cache memory, different memory sitting right next to the host with different latencies and capacities. And as you go out from L1 to L2 and L3, the capacity goes up. Once you go into the DRAM world, traditionally in the x86 architecture, that's been a DRAM module sitting right next to the host CPU. In the new world of AI, you've now got the GPUs or the XPUs with HBM, high bandwidth memory, sitting right next to them. That alone is not enough to service all the needs of the compute engine. If you've got a large model, you load it up into your HBM and you start computing. As you're generating these results, you've got to store them somewhere. And sometimes, let's say you're in ChatGPT, you've asked a bunch of questions and you decide, you know what, my mom's calling me for dinner, so let me just go get some dinner. Or my wife's got some chore for me to finish. You don't want to lose all of the work that you've done, and it doesn't necessarily need to stay resident in memory. You can store it off into a storage element, or into an aggregated memory pool that is sitting there, that you can then retrieve as soon as you come back. You may not care about an extra bit of latency when you come back, but you want all of that context to remain. So this is another tier, where you need a large capacity of memory and you may be willing to sacrifice a little bit of the bandwidth or the latency associated with it. Whereas when you're actually chatting with ChatGPT or any of those AI tools like Gemini, you want the response time to be super good. They typically say human attention span is about 2.5 seconds; I can't even get 0.5 seconds out of my kids. But if you do have that as the SLA with your customers for time to response, you start to see that bandwidth is going to be super important. And especially as you go into chain-of-thought or reasoning models, which try to figure out all the different possible solutions and then come back and say, okay, this is the right solution based on all the different things I investigated in parallel, guess what? That means memory bandwidth and compute are both going to be bottlenecks, because you're really pushing the limits of those systems. So when we talk about tiering, to get back to what you were asking, it depends on how you think about tiering. The way I would think about it is: if you're looking for very high bandwidth, high bandwidth memory is the perfect solution for you. If you're okay trading off some amount of bandwidth but you want more capacity, an LPDDR memory or a DDR MRDIMM sitting right next to the host is a really good choice. If you're looking for a vast amount of memory or storage, then you can go to CXL or to a storage element sitting in that same system. So if you think about it from that perspective, you've got an entire stack of different memory and storage solutions that you can bring to the market.
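As a rough illustration of that stack, the sketch below shows how a designer might reason about which tier satisfies a workload's bandwidth and capacity needs. The tiers mirror the ones Girish describes, but the bandwidth and capacity figures are order-of-magnitude assumptions, not product specifications.

```python
# Illustrative sketch of the memory/storage tiering described above. The
# bandwidth and capacity figures are assumed orders of magnitude only.
from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    bandwidth_gb_s: float   # approximate deliverable bandwidth
    capacity_gib: float     # approximate reachable capacity

TIERS = [  # ordered from highest bandwidth to largest capacity
    Tier("HBM (on-package)",          8000,  192),
    Tier("LPDDR SOCAMM (near CPU)",    700,  768),
    Tier("DDR5 RDIMM/MRDIMM",          600, 2048),
    Tier("CXL-attached pool",          100, 8192),
]

def pick_tier(needed_bw_gb_s, needed_cap_gib):
    """Return the first tier that satisfies both the bandwidth and the capacity need."""
    for t in TIERS:
        if t.bandwidth_gb_s >= needed_bw_gb_s and t.capacity_gib >= needed_cap_gib:
            return t.name
    return "spill to storage"

print(pick_tier(4000, 100))   # decode-style, bandwidth-first  -> HBM (on-package)
print(pick_tier(50, 4000))    # paused-context / pooled memory -> CXL-attached pool
```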
David Nicholson:
Yeah, and just a quick clarification. When someone refers to being memory bound today, especially in the context of an AI discussion, that typically means performance bound as opposed to capacity bound, so the solution to being memory bound isn't just more memory, right, in the way the term is used now? Or do people sometimes refer to being memory bound when what they mean is, I need more?
Girish Cherussery:
So it's a bit of both. In the pure roofline models that you look at, it is focused on the memory bandwidth side of it, like you said. But if you are loading up a large model, say a 405-billion-parameter model, and you're trying to do it with 96-gigabyte RDIMMs that it may or may not fit into, then you need more capacity to actually solve that problem, because the first step is to load up the model. In that particular case, you can't even run the model if you don't have enough memory capacity. So assume at a baseline that you're able to load the model; then the question is, is the workload more compute intensive or more memory intensive? Usually it's a mix of both, but you look at which one dominates. If I push the memory performance up, am I able to get better performance? Or if I push my compute up, am I able to get more performance? If you're able to push your memory bandwidth up and you get more performance, you traditionally say that's a memory-bound problem.
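For readers who want the "which one dominates" test in concrete terms, here is a minimal roofline-style sketch with assumed hardware numbers. It illustrates the general technique, not a Micron tool: compare a kernel's arithmetic intensity to the machine's compute-to-bandwidth ratio.

```python
# Minimal roofline-style check (a sketch with illustrative numbers): a kernel is
# memory bound when its arithmetic intensity falls below the machine balance.

def attainable_tflops(flops, bytes_moved, peak_tflops, mem_bw_tb_s):
    """Roofline: attainable perf = min(peak compute, bandwidth x intensity)."""
    intensity = flops / bytes_moved                 # FLOPs per byte of the kernel
    machine_balance = peak_tflops / mem_bw_tb_s     # FLOPs per byte the machine can feed
    bound = "memory-bound" if intensity < machine_balance else "compute-bound"
    return min(peak_tflops, mem_bw_tb_s * intensity), bound

# Illustrative GEMV-like decode step (low data reuse) on an assumed
# 1000 TFLOP/s, 8 TB/s accelerator:
print(attainable_tflops(flops=2e12, bytes_moved=1e12, peak_tflops=1000, mem_bw_tb_s=8))
# -> (16.0, 'memory-bound'): adding bandwidth, not compute, raises performance here.
```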
David Nicholson:
Okay, excellent. Okay, I'm about to reach up and grab the CXL sticky note off of my virtual monitor here. But before we get there, what is Micron's view on this idea of composable or modular infrastructure in the context of HPC and AI moving forward? We've been talking about this for a while, but is this something that you see happening moving forward? Is this a real thing?
Girish Cherussery:
Oh, yeah. Yeah, absolutely. The classic example is the SOCAMM2 that Micron recently announced. If you think about what SOCAMMs are, these are modules with LPDDR memory in them, and they sit right next to the host CPU in this particular case. We've worked with our partners over the last five years to define what that looks like. If you look at the module, it's a thin module with multiple placements of LPDDR, and within each LPDDR package you have multiple die. The beauty of it is you're able to plug and play that memory as and when you need it, meaning it acts like a module: you can remove it and replace it if needed. But the most interesting piece here is that LPDDR typically provides better performance and higher capacity per placement than DDR in this particular case, because DDR is a single component, whereas here you've got a module with multiple placements. You don't have room for that kind of module lying down on the board; it's usually a DIMM slot that you place DDR modules in. In the case of LPDDR, instead of soldering it down onto the board, you're able to use a module form factor, which allows you to, to use the word that you used, compose and decompose the amount of memory that you need in the system. As new memory modules come in with higher capacity, you're able to replace them. The beauty of LPDDR is that it provides higher performance as well, so now when you use LPDDR memory in this module, you're able to push the performance and you're able to get the capacity. And as you think about how this moves forward into the rack-scale architecture, your HBM kind of becomes the first tier of memory that the XPU or GPU is using, and then this becomes supplemental memory. Some people call it the fast memory sitting right next to the GPU, but it's actually talking through the CPU to the GPU; there's very little latency penalty because it's cache coherent, and you're able to access that memory more seamlessly between the two.
David Nicholson:
Okay, so let's get to CXL. The first time I really started looking at CXL was a couple of years ago, actually here at SC23. In my mind, I would have thought that maybe we would be further along in terms of deployments. But first, define CXL for us. What is CXL? And what does it mean to Micron? And what do you think it means moving forward in this context of supporting HPC workloads and AI?
Girish Cherussery:
So CXL is Compute Express Link. In this particular case, when we talk about CXL memory, it's an aggregated pool of memory, be it DDR or LPDDR or whatever memory technology, that's abstracted away from the host. The host talks to that memory over the protocol defined by CXL, the CXL.mem protocol defined on top of it. What it allows you to do, basically, is that if you need large swaths of memory, you don't need to have all of it sitting right next to the host. You can have it at a distance.
David Nicholson:
Like how far away?
Girish Cherussery:
It can be on a different node. It can be on a different rack. It depends on what your end-use application is.
David Nicholson:
But in the data center?
Girish Cherussery:
In the data center. Absolutely inside the data center. And the way it was originally envisioned, multiple hosts can talk to the same pool of memory, which gives you the ability to enhance the redundancy of your memory. If there's X amount of memory available, all of them can access that memory, so you're not wasting it; you're able to utilize the memory more effectively. More recently, in the AI world, CXL is transforming into applications that are more catered towards: how do I make sure these large AI models and large workloads can be enhanced with the larger pool of memory that is there? Part of the problem the ecosystem is trying to solve today is how to integrate the hardware and software ecosystems more closely, and CXL was one of the ways the market started looking at how to solve this more effectively and take advantage of this large pool of memory that is available. In the ecosystem today, when we think about CXL, that hardware and software ecosystem that needs to work well together is still evolving; people are still optimizing and trying to find the right place where this can be very effective. And as AI grows at this rapid scale, the focus has been more about getting the compute, and the memory right next to the compute, sorted out today. I definitely think CXL will continue to evolve. One of the things we did announce recently was collaborative work we had done with Pacific Northwest National Laboratory. PNNL has a machine called Krete, I think that's the right way to pronounce it. It was a hardware architecture optimized around memory, with a large pool of memory per node, about two terabytes, that was talking to the host using the CXL protocol.
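As a concrete, hedged illustration of how an application can reach such a pool today: on Linux, a CXL Type-3 memory expander is typically exposed as a CPU-less NUMA node, so a latency-tolerant job can simply be bound to that node. The node id and the script name below are assumptions for illustration, not details from the interview.

```python
# A hedged sketch of how software commonly sees CXL memory today: a CXL Type-3
# expander typically shows up to Linux as a CPU-less NUMA node. The node id and
# the staging script below are assumptions for illustration only.
import subprocess

CXL_NUMA_NODE = 1  # assumed; check `numactl --hardware` for the real topology

def run_on_cxl_memory(cmd):
    """Launch a latency-tolerant workload with its allocations bound to the CXL node."""
    return subprocess.run(
        ["numactl", f"--membind={CXL_NUMA_NODE}", *cmd],
        check=True,
    )

# Example: park a capacity-hungry but latency-tolerant stage (say, saved context
# or KV-cache offload) on the pooled memory while hot data stays in HBM/DRAM.
# run_on_cxl_memory(["python", "offline_staging_job.py"])  # hypothetical script
```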
David Nicholson:
Okay.
Girish Cherussery:
And this was specifically focused for high-performance science applications. And you can begin to see the amount of co-design that needs to happen to make sure that the system can take advantage of the large pool of memory that is available there. And CXL was a great protocol for them to run on top of that layer. So it's just the beginning of that world beginning to show up. But along with CXL, you will continue to have other memory technologies that sit in the AI hardware space and HPC hardware space, in my opinion.
David Nicholson:
Yeah, speaking of optimization, the AI world at large seems to have one optimization point right now, which is: throw as much money at me as you possibly can, and trust us, it'll be worth it in the end. But you've been involved with some actual, objective benchmarking work that's been done. So when you talk about latency and throughput and things that can be measured, those are things that roll up through the stack and contribute to the ultimate metric, which is: how much am I getting for my dollar? So if we focus on what Micron really focuses on, what are some of those objective metrics, and what should people understand about what's important when they're evaluating memory?
Girish Cherussery:
Yeah, I would probably first talk about a benchmark result we recently published for the financial market, on STAC-M3 and STAC-A2. These are benchmarks that look at tick analytics and risk-assessment simulations in the financial markets. This was using our 96-gigabyte and 128-gigabyte RDIMMs; per node it was about two terabytes of memory, sitting with an Intel Xeon processor, the Granite Rapids SP, along with our SSDs, and it was deployed in a petaflop server built by Supermicro. They ran about 25 benchmarks, and we basically set a record on all 25 of them. So this was big news. It was optimized both at the node level and at the full-deployment level, they measured it, and it knocked the records down pretty seamlessly. Remember, we talked about memory bound and compute bound: if you're able to push your performance up 30%, that helps you get better TCO. You're able to get better tokens per second for your inference applications, the overall throughput is better, so automatically that means you're getting better TCO. So you take something as simple as what happens in a component, you take it to a node level, and then you can scale it to a rack level, and you start to see these become bigger numbers as they go through. So overall, you will continue to see this performance scaling happen. And there's going to be demand for it, right? Because, as you mentioned, a lot of people are building these large, gigawatt-scale data centers, and for them to be able to scale, they're going to need these kinds of power-efficient, high-performance solutions in order to monetize them.
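To see why a roughly 30% throughput gain flows straight into TCO, here is a simple sketch of the arithmetic with assumed, illustrative inputs (the dollar and token figures are not from the interview).

```python
# A simple sketch of the TCO arithmetic described above: if node throughput rises
# ~30% at roughly the same hourly cost, cost per token falls proportionally.
# All inputs are illustrative assumptions.

def cost_per_million_tokens(tokens_per_sec, node_cost_per_hour):
    tokens_per_hour = tokens_per_sec * 3600
    return node_cost_per_hour / tokens_per_hour * 1e6

baseline = cost_per_million_tokens(tokens_per_sec=10_000, node_cost_per_hour=40.0)
faster   = cost_per_million_tokens(tokens_per_sec=13_000, node_cost_per_hour=40.0)  # +30%
print(f"${baseline:.2f} vs ${faster:.2f} per million tokens "
      f"({1 - faster/baseline:.0%} lower cost per token)")
```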
David Nicholson:
This all sounds really good. Let's talk about what really matters, the primary constraint in all of this if we're honest: power, power consumption. What are you doing to address power consumption?
Girish Cherussery:
Yeah, I think we talked about our power-efficient HBM3E solution, where we are about 30% better than the competition. In addition, if you look at the SOCAMM2 that we are coming out with, it uses a 1-gamma-node LPDDR5 component, which is about 20% better than the previous generation. The third thing we're doing is making sure our leadership in process-node innovation continues, and as that continues, you're going to keep seeing power-efficient solutions come out across the entire DRAM stack. Micron is very, very focused on that, and we pride ourselves on being the most power-efficient solution out there.
David Nicholson:
Well, if this system can tell me whether I should be buying gold or selling gold, then I will truly appreciate it. We want to get some practical work out of these systems. Girish, thanks so much for joining us. It's a pleasure. For Six Five Media and Six Five on the Road at SC25, I'm Dave Nicholson. Stay tuned for more interesting commentary from St. Louis, Missouri.