Storage Is the New Foundation of AI Inference

Storage Is the New Foundation of AI Inference - Six Five On the Road

AI inference performance depends on storage architecture more than most enterprise infrastructure teams have accounted for. Analyst Ryan Shrout and Avi Shetty, VP of AI Ecosystem and Market Enablement at Solidigm, break down why growing context windows force GPU recompute when storage is under-provisioned, how the three-tier inference storage architecture addresses that constraint, and why Jensen Huang's projection that the context memory tier will consume the entire TAM of storage signals the scale of infrastructure commitment enterprises need to be planning for now.

GPU utilization is the metric every AI infrastructure team is chasing. What enterprises are discovering is that storage architecture determines whether or not they hit their targets. Context windows are growing. KV cache is expanding. Every time an overloaded HBM forces a GPU to recompute context, utilization drops and inference latency climbs. The compute investment doesn’t change, but the output does.

At COMPUTEX 2026 in Taipei, Signal65 President Ryan Shrout sat down with Avi Shetty, VP of AI Ecosystem and Market Enablement at Solidigm, to break down why storage has moved from a peripheral infrastructure consideration to a core variable in AI inference performance.

They start with a simple eight-word query that reframes the discussion, "where's the best dumpling in Taipei?" This generates roughly 42,000 tokens and 12 to 13 gigabytes of KV cache that has to be stored somewhere. Scale that to an engineering environment asking an AI agent to resolve a Jira ticket against years of project history, and the storage demand compounds quickly. Shetty walks through the three-tier storage architecture Solidigm recommends for inference environments: G3 direct-attach for GPU feed speed, G3.5 for context memory to eliminate recompute, and G4 shared storage for scalable density. He also addresses Jensen Huang's assertion from CES and GTC 2026 that the context memory tier will consume the entire TAM of storage going forward, and what that trajectory means for enterprise infrastructure decisions being made today.
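As a rough sanity check on those figures, the sketch below estimates KV cache size from a token count. Only the 42,000-token figure comes from the discussion; the model shape (80 layers, 8 key/value heads, head dimension 128, 16-bit values) is an illustrative assumption and is not taken from Solidigm's study.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value):
    """Estimate KV cache size.

    Each token stores one key vector and one value vector (the factor of 2)
    for every layer and every key/value head.
    """
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return tokens * per_token

TOKENS = 42_000  # token count cited in the episode

# Hypothetical model shape, chosen only for illustration.
size = kv_cache_bytes(TOKENS, layers=80, kv_heads=8, head_dim=128, bytes_per_value=2)
size_gib = size / 2**30

print(f"{size_gib:.1f} GiB")  # prints "12.8 GiB"
```

Under these assumed dimensions the estimate lands inside the 12 to 13 gigabyte range quoted in the episode. A different model shape or numeric precision would shift the result proportionally.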

Key Takeaways:

🔹 GPU recompute is the hidden cost of under-provisioned storage. When context windows exceed HBM capacity, GPUs recompute rather than retrieve, pulling utilization down and driving up token TCO. Storage architecture directly determines how efficiently the primary data center asset performs.

🔹 A single eight-word query generates 12 to 13 gigabytes of KV cache. At enterprise inference scale, across complex agentic workflows and long-context tasks, storage demand is not linear. It compounds with every increase in context window size.

🔹 Three storage tiers define the inference architecture. G3 direct-attach feeds GPUs at speed. G3.5 context memory eliminates recompute overhead. G4 shared storage provides the density and scalability that hard drive arrays cannot match at modern data center efficiency ratios.

🔹 Jensen Huang has called the context memory tier the entire future TAM of storage. That projection from CES and GTC 2026 signals the scale of infrastructure commitment enterprises will need to make as inference workloads grow across the next several years.

🔹 Solidigm's 122TB U.2 drive puts four petabytes in a single rack at 24 drives. The density shift from hard drives to high-density QLC changes the power and space economics of shared storage at data center scale.
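The capacity arithmetic behind that density figure can be checked directly. This is a minimal sketch that assumes decimal units (1 PB = 1,000 TB) and raw capacity with no RAID, filesystem, or spare-capacity overhead; those assumptions are not stated in the episode.

```python
import math

DRIVE_TB = 122  # per-drive capacity from the takeaway above
DRIVES = 24     # drive count from the takeaway above

def raw_capacity_pb(drives, drive_tb):
    """Raw capacity in decimal petabytes, before any overhead."""
    return drives * drive_tb / 1000

def drives_needed(target_pb, drive_tb):
    """Smallest whole number of drives whose raw capacity reaches target_pb."""
    return math.ceil(target_pb * 1000 / drive_tb)

print(raw_capacity_pb(DRIVES, DRIVE_TB))  # prints 2.928
print(drives_needed(4, DRIVE_TB))         # prints 33
```

Under these assumptions, 24 drives come to just under three petabytes of raw capacity, and a full four petabytes would take about 33 drives.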

Inference is moving everywhere, from edge deployments and small data centers to local and co-located environments. The enterprises provisioning storage architecture now to match their latency and recompute tolerance are the ones whose GPU investments actually perform to spec.
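One way to frame that latency and recompute tolerance is to compare the time to rebuild a context on the GPU with the time to read a saved KV cache back from a storage tier. The sketch below uses the token and cache figures cited in the takeaways above; the prefill rate and read bandwidth are hypothetical placeholders, not measured or vendor numbers.

```python
def recompute_seconds(tokens: int, prefill_tokens_per_s: float) -> float:
    """Time for the GPU to rebuild the KV cache by re-running prefill over the context."""
    return tokens / prefill_tokens_per_s

def retrieve_seconds(cache_bytes: float, read_bytes_per_s: float) -> float:
    """Time to read a previously saved KV cache back from a storage tier."""
    return cache_bytes / read_bytes_per_s

TOKENS = 42_000                 # context size cited in the episode
CACHE_BYTES = 12.5e9            # midpoint of the cited 12 to 13 GB range
PREFILL_TOKENS_PER_S = 5_000.0  # hypothetical GPU prefill rate
READ_BYTES_PER_S = 10e9         # hypothetical tier read bandwidth (10 GB/s)

recompute = recompute_seconds(TOKENS, PREFILL_TOKENS_PER_S)
retrieve = retrieve_seconds(CACHE_BYTES, READ_BYTES_PER_S)
cheaper = "retrieve" if retrieve < recompute else "recompute"

print(f"recompute: {recompute:.2f} s, retrieve: {retrieve:.2f} s, cheaper: {cheaper}")
# prints "recompute: 8.40 s, retrieve: 1.25 s, cheaper: retrieve"
```

With real measurements substituted for the two placeholder rates, the same comparison indicates whether a given tier is fast enough to make retrieval preferable to recompute for a particular workload.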

Watch the full video at sixfivemedia.com, and be sure to subscribe to our YouTube channel so you never miss an episode.

Disclaimer: Six Five Media is for information and entertainment purposes only. Over the course of this webcast, we may talk about companies that are publicly traded, and we may even reference that fact and their equity share price, but please do not take anything that we say as a recommendation about what you should do with your investment dollars. We are not investment advisors, and we ask that you do not treat us as such.


Transcript

Avi Shetty:
The whole goal, the whole game is to make sure that the GPUs are fully utilized. With increasing context memories, you have to ensure storage architecture is part of your data center architecture.

Ryan Shrout:

Hey, everybody. Welcome to Six Five On The Road. I'm your host, Ryan Shrout, here at Computex 2026 in Taipei. And we're going to talk about how AI inference is impacting the world of storage and compute. And I'm joined by a good friend of mine, Avi Shetty. You are the Vice President of AI Ecosystem and Market Enablement at Solidigm. Welcome. Thanks for joining us.

Avi Shetty:

Oh, thank you, and great to be here, and great to be here at Computex. I think there'll be a lot of cool technology announcements happening from all the ecosystem partners here, which I'm excited about.

Ryan Shrout:

It's been interesting to see the evolution of Computex here in Taipei over the last handful of years as kind of the AI evolution, revolution has kind of reset a whole bunch of expectations, so it's been interesting.

Avi Shetty:

Yeah, you've seen how this entire conference has fundamentally changed its value prop and mission, right? This is, I was telling your team, I think my 14th or 15th year coming here, but I used to come here in a client role. It used to be about the PC and all the coolest PC tech, it was the hotbed for those announcements, but now it's all about AI.

Ryan Shrout:

So, I want to ask you some questions about the data center world, this AI inference build out. I think one of the things that's been interesting as we've, you know, it feels like forever ago, but it's only been a couple of years, we started moving from training to inference. And I think a lot of people started to assume that maybe storage and capacity and performance were only limited to being important in that training space. In inference, we're still seeing storage as a critical value and performance part of that story. Why is that still the case?

Avi Shetty:

I think you'll see a lot more. If you've not heard much, you'll see a lot more of storage as key to inference-related architectural discussions. And the fundamental reason is inference is all about responsiveness to the end user. Right, and as a result, you have data which is located in different tiers, depending on the hotness and depending on whether it's in the prefill cache, whether it's in your KV cache, whether it's in HBM, whether it's evicted. All of those add to, in the end, token TCO, which is essentially the parameter which inference requires. You know, you can't fit everything in the HBM. And as a result, you need new tiers to essentially offload, to read up from caches so that you give better responsiveness back to whatever you're doing. Well, we did a study, Solidigm did a study of a simple LLM request, right? Let's talk about, hey, we are in Taipei, where's the best dumpling available, right? It's a simple eight-word, eight- or nine-word query. We did a full study, which we kind of talk about on our website. It's called the anatomy of a token. And this eight-word search translates to around 42,000 tokens and roughly 12 to 13 gigabytes of KV that has to be stored somewhere. That's one such query. Now imagine you are in an engineering environment and you are asking, hey, fix this ticket on Jira, look for our history, whether we've solved it. Now the query's becoming a lot more complex, and as a result, you'll see a lot more tokens. I think all of that is part of the inference workload, and you'll see data just continuously growing.

Ryan Shrout:

You mentioned KV cache, I know that's an important one, and what about, like, context windows? Where are the other areas that enterprises might be seeing that?

Avi Shetty:

Yeah, context windows are growing, and guess what happens when the context window grows and fully utilizes your HBM? Now your GPU has to recompute. And what that means is, recompute means the GPU is not fully utilized. Your primary asset in your data center is your GPU. The whole goal, the whole game is to make sure that the GPUs are fully utilized. And with increasing context memories, you have to ensure storage architecture is part of your data center architecture. You ensure that you've provisioned storage at different tiers, as well as the ability to scale, which will add value to your end token output, which ensures that your GPU continuously remains utilized.

Ryan Shrout:

When I think about enterprises that are getting quotes, they're looking at RFPs, they're trying to build out their infrastructure, how do you see the balance shifting? You know, I used to only worry about how many GPUs I needed, and now we're talking about, with agents, how many CPUs do I need? Where does storage fit into that? Like, how do you recommend or kind of suggest that these enterprises look at including storage in that decision?

Avi Shetty:

The difference between training and inference: training happens at these megawatt data centers, big foundational companies, but inference can happen anywhere, right? It can happen at your back office, can happen at your small data center, which is in your basement, or at a local colo location where you set up. So it depends upon what the individual enterprise's usage is. NVIDIA has done the blueprint, like this year, 2026, is the year of storage, right? At the start of the year, Jensen talked about storage at CES and then followed it up at GTC. This whole KV tier, his quote, which I think resonates with all our storage vendors, as well as us at Solidigm, he said, the context memory tier will use up the entire TAM of storage going forward. So that's the amount of context memory scale which you'll see with inference over the next few years. And for enterprises who are determining what's the best way to use it, I think the question they need to ask themselves is, how much latency can I afford? How much GPU recompute time can I afford? If those are very critical parameters, you need to ensure you have the right storage architecture at the three levels: G3, which is your direct attach, G3.5, which is your context memory, and then G4, which is your shared storage.

Ryan Shrout:

When you look out those next couple of three years, and it's hard to do in this space and really kind of have any accurate predictions, but I'm curious, what role does storage play two years from now or three years from now? Is it a capacity game? Is it a performance game? Is it just, you know, how does it change?

Avi Shetty:

I think it's all of the above, right? There is no one size fits all in storage. I think, you know, storage, like the whole memory hierarchy or the memory wall, is a function of economics, right? Ideally, if you ask any, you know, GPU architect or CPU architect, what do you want? They'll say, hey, give me one petabyte of persistent storage that's non-volatile with SRAM-like latency, but that's not feasible. That's why you have tiers. You have your SRAM, your caches, you have your HBM, your DRAM, your tiered storage, G3, G3.5, and G4. And every one of those will see innovations. The focus on G3, the whole purpose of that whole section, is to ensure that your GPU is fed at high speeds, and that results in GPU utilization being high. Your G3.5 is to ensure that you don't have to recompute your context every time. So there you need a function of performance as well as density. And that's a function of how big your workloads are and the context is. And when it comes to shared storage, I think we're now in a world where enough data points have been shown: no more hard drives. I think it's purely a math of whether you want nine racks of hard drives, or do you want one rack with high-density QLC storage, which I've got for you here. This is our 122 terabyte solution in one U.2 form factor. We were the first ones to introduce this back in Q4. That's a lot of storage, but not large. When you look at data center efficiencies, you put 24 of these in one, you now have four petabytes, low power and scalable for the end customers.

Ryan Shrout:

It's amazing to see how this revolution has kind of changed what you and I came from in the client space where we would look at drives like this all the time and kind of how they're being repurposed and where the bottlenecks are really lying. Avi, thanks for joining me. Really appreciate the conversation. Thank you, everybody, for tuning in to Six Five On The Road at Computex 2026 in Taipei. I'm Ryan Shrout. And make sure you follow us on social media and find all of our other content at sixfivemedia.com.
