
AI Inferencing at Scale: What Enterprises Need to Know – Six Five On The Road

Arun Narayanan, SVP, Compute & Networking at Dell, joins David Nicholson to discuss enterprise AI inferencing, tackling scaling challenges, flexible infrastructure design, and key real-world deployment insights across industries.

How are enterprises navigating the rapid adoption of AI inferencing while balancing flexibility, performance, and operational efficiency?

From SC25, host David Nicholson, Global Technology Advisor at The Futurum Group, is joined by Dell Technologies' Arun Narayanan, Senior Vice President, Compute & Networking, for a conversation on AI inferencing at scale. They explore how inferencing is becoming more mainstream, transitioning from experimental projects to core business operations. Find out what's driving the next phase of enterprise demand, and strategies for future-ready infrastructure that avoids vendor lock-in.

Key Takeaways Include:

🔹Surge in AI Inferencing Demand: Enterprise inferencing workloads are rapidly expanding, driven by tangible use cases across manufacturing, healthcare, finance, and retail.

🔹Operational & Architectural Challenges: Scaling inferencing introduces new pressures around infrastructure flexibility, efficiency, and supporting a mix of accelerators without single-vendor constraints.

🔹Real-World Deployment Lessons: Success in production settings hinges on anticipating practical design factors, managing data gravity, and aligning IT strategies for cost-efficient, sustainable growth.

Learn more at Dell Technologies.

Watch the full video at sixfivemedia.com, and be sure to subscribe to our YouTube channel, so you never miss an episode.

Disclaimer: Six Five On The Road is for information and entertainment purposes only. Over the course of this webcast, we may talk about companies that are publicly traded, and we may even reference that fact and their equity share price, but please do not take anything that we say as a recommendation about what you should do with your investment dollars. We are not investment advisors, and we ask that you do not treat us as such.

Transcript

David Nicholson: Welcome to Six Five On The Road, coming to you from SC25, that is the Supercomputing Conference, here this year in St. Louis, Missouri. I'm Dave Nicholson for Six Five Media and The Futurum Group, and I'm here with a very special, distinguished guest. Arun, tell us about what you do at Dell Technologies.

Arun Narayanan: Dave, firstly, thank you for having me here. I'm excited to be here, and I appreciate the opportunity. At Dell, I lead all our computing, AI, and networking products. So think about everything from designing the product, getting it on the roadmap, working with all our key partners, getting the products out, and driving the go-to-market on those products. That's what I do at Dell.

David Nicholson: So you've had a front seat to watching how the workloads that your customers and partners need to deliver upon have changed over time. What are you seeing now?

Arun Narayanan: Yeah, great question, right? I mean, let's click back to the beginning of this AI moment, right? I mean, go back to ChatGPT in 2023. And if you look at the time frame of the last three years, a lot has changed, right? I mean, you start with the first point: ChatGPT was a moment in time. And then immediately after that, what we saw is a lot of CSPs or neoclouds trying to build these massive training clusters, right? That was the first generation of what we saw. You look at inferencing itself at that point in time, it was all single-shot inference models, right? I mean, you look at it, you submitted a token, and you got a token back, right? That was the model. And everybody was using a chatbot interface. That was the world we lived in. Click forward three years, and everything has changed, right? So the massive neoclouds are still doing those large training models. You've gone from 50 billion, 100 billion, 200 billion to trillion-parameter models. But at the same time, inference has scaled up, right? I mean, you've gone from what you call single-shot models to tree-of-thought models, to mixture-of-experts models, to chain-of-thought models. So you're seeing a lot of evolution in the models themselves. Combine that with uptake at the edge, and what we're beginning to see is a pretty big explosion in what inference should be and what it can potentially be, right? So that's where we are. And we are at the precipice of the next big wave, which I think is the inference wave.

David Nicholson: So as we move into this era of inferencing, as you said, obviously training is still important at the macro scale, but businesses are going to be getting value out of inferencing. Where do you see that inferencing happening physically?

Arun Narayanan: Great question. First, let me step back and say where we are from an inference lifecycle standpoint, right? I mean, when I think about how a business gets value, we have to really step back and think about what processes the business is going to change, right? Think of Simplify, Standardize, Automate. That's where inferencing needs to start. We have to understand what the processes are and how a process can either be improved with AI to reduce cost or be sped up to generate revenue. That's the starting point. When I look at that, it really depends on the industry vertical. If you're in a hospital setting, you're going to do inference in a clinical place. It's an edge inference use case. If you're on a factory floor and you're doing detection of defective parts, you need to do that inference at the edge of the factory floor. Then if you're in a financial center doing quantitative trading, you're probably doing it in a high-speed data center. And then there's a last piece of it. You can do inference on the device itself, right? I mean, think of where it's going. You're going to have cars and robots doing inference at the device itself. So you're going to see inference everywhere, from the core of the data center to near the edge to at the device itself. And we're seeing all of those options coming out today.

David Nicholson: Yeah, I do a fair amount of playing around with inferencing at the edge, and I'm running a 4 billion parameter model on my mobile device, a 235 billion parameter model on a desktop device. And it begs the question: as we see this massive data center build-out, when you hear about 10-gigawatt data centers, are they primarily talking about training what they hope to be artificial general intelligence frontier models? Or are they thinking that that's where a lot of inferencing will be happening also?

Arun Narayanan: I think it's a combination, right? I mean, if you think of those big gigawatt data centers, the first use case is training foundational models, right? I mean, 1 trillion, 10 trillion, 100 trillion, whatever the parameter count is, that's definitely the first use case for those data centers. But I also see that over time, you're going to find that there are these multi-tenant inferencing solutions where latency is not the most critical element for some of these workloads, and you can get multiple users doing it at the same time. Then, from a dollar-per-token-per-watt standpoint, a gigafactory is probably a very, very efficient place to do it. So it depends on the use case. It depends on what you're doing. Maybe if you're doing a content generation use case or a code generation use case, you can do it in a gigafactory. I mean, it's not latency sensitive. But if you're operating on somebody and you need a robot to do that, that's extremely latency sensitive. So you're not going to do that there. You're going to do that at the edge. So it depends on the use case.

David Nicholson: Clearly, as you're saying, inferencing is directionally where a lot of mindshare is and should be going. Back to training, what are you seeing? It wasn't that long ago that we might have a conversation about the fine-tuning or the training of a model. In other words, I'm a private business. I have my own data. I'm going to train the model on my data. It feels like we're moving a little bit away from that in a lot of cases in the direction of more variations of RAG. But what are your thoughts there? Would you agree that maybe it's not as much "I'm going to train my own model" as we expected?

Arun Narayanan: I think you're 100% right. I don't think most enterprises, and I never say "never" for anything, but 95% of enterprises have no reason to train a model, right? I mean, they're going to use either an open source model that's already available or a closed source model that they can leverage. Depending on the use case, there might be a lot of RAG variants that you're using, or there might be some people doing some fine-tuning of that model. But ultimately, I think most enterprises are going to use a standard off-the-shelf foundational model and use RAG on it to get most of the use cases solved. So I don't see a scenario where a standard enterprise is going to have to train a model. It's very rare.

David Nicholson: And you've made the point. There are no absolutes, and it varies. But because of your position, you're able to see from a high perch varying industry verticals, healthcare, manufacturing, et cetera. Are there some common themes that you can recommend to folks, some best practices for setting themselves up for success in supporting inferencing? Or is it too much of a case-by-case basis?

Arun Narayanan: I don't think so. I'll break it out. If you're an enterprise, it doesn't matter whether you're in the healthcare vertical or the manufacturing vertical; the way to think about building an inference AI stack is the same, right? I mean, let's think about it like that. The first and most important thing is, what is the business process you are trying to transform, right? I mean, let's understand that. And let's understand what value that is going to deliver, right? One classic example I've seen is in a clinical setting, right? A doctor-patient interaction takes about 30 minutes. Can you save 10 minutes of that? Saving that 10 minutes is a 30% productivity saving. It's time saved for the doctor, or you can see more patients in the day. There's a revenue outcome or a cost savings outcome. So you need to understand that. Once you understand that business outcome, then the real question is, OK, now I'm focused on saving 10 minutes of this patient interaction. What do I need to do? What is the solution? Am I looking for prescription automation? Am I looking for transcription automation? So let's understand what it is. Then based on that, let's build out: do you have the data to actually do that? Where is all the data? Do you have the history of prescriptions the doctors issued? Have you looked at all the data that's available from the notes you've written? Can you take a foundational model and train it with that data? That's the next question to ask. Then, OK, how are you going to do it? What is the software solution stack you need? What ISV is going to help you do this the best? What models are you going to use? What is the software stack? Once you figure out all of these pieces, then you end up deciding what infrastructure you need. So this is the process you need to take. It doesn't matter which industry vertical you're in. I gave you a clinical setting. You could apply the same to a manufacturing setting. You can apply it to a financial setting. So there's a standard process. Let's apply it repeatedly. Of course, the use cases are different and the data is different, but the process should be the same.

David Nicholson: Yeah. One relatively standard thing, almost a truism, is that data tends to be at the center of these things. We like to talk about data having gravity. What are your views on the platforms where data will be residing in the future as people are trying to support inferencing workloads, as an example?

Arun Narayanan: Yeah, I mean, great question, right? I mean, I really think your point about data having gravity cannot be overstated, right? It's the most important point. That's why, at Dell, we firmly believe that inference is an on-prem or at-the-edge play more than, of course, a centralized data center or cloud-based solution. That's why I told you that there's a spectrum there. So think about that, right? I mean, in order to work with the data, you are going to need a data platform that brings together all this disparate data. You have data in your databases. You have unstructured data sitting in your files. You have PowerPoint files. You have video files. There needs to be a data platform that can bring all the data together, organize it, classify it, and then give it to the model so that you can actually get good insights out of it. At Dell, we built this AI data platform, which brings in a combination of things. It brings in a search agent, it brings a RAG agent, it brings a data analytics agent, all sitting on PowerScale and ObjectScale, in order to be able to do that effectively and feed the GPUs to get the best outcome from an inference standpoint.
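To make the data pipeline Arun describes a bit more concrete, here is a minimal retrieval-augmented generation (RAG) sketch in Python. The documents, the bag-of-words "embedding", and the scoring are toy placeholders, not Dell's AI data platform or any vendor's implementation; a production system would use a trained embedding model and a vector store backed by a platform like the one described above.

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would use a trained embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# 1. Ingest disparate sources (database exports, transcripts, files) into one searchable corpus.
corpus = [
    "Prescription notes: patient prescribed 10mg lisinopril daily.",
    "Visit transcript: patient reports dizziness in the morning.",
    "Clinical guideline: review dosage if dizziness is reported.",
]
index = [(doc, embed(doc)) for doc in corpus]

# 2. Retrieve the most relevant context for a question.
question = "Should the lisinopril dosage be reviewed?"
q_vec = embed(question)
top = sorted(index, key=lambda pair: cosine(q_vec, pair[1]), reverse=True)[:2]

# 3. Build the prompt the foundation model would actually see.
prompt = (
    "Answer using only this context:\n"
    + "\n".join(doc for doc, _ in top)
    + f"\n\nQuestion: {question}"
)
print(prompt)  # in production, this prompt is sent to the chosen foundation model
```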

David Nicholson: A lot of us have talked about this idea of, as we've gone through the dawn of the cloud era into the maturity of the cloud era, from a time when people would see something like an AWS or an Azure as a science experiment and not enterprise ready, to a point in time where people are pretty comfortable with a lot of enterprise workloads there. We've used the term repatriation for a long time. Sometimes it felt like, when I was using the term, it was aspirational. I was being hopeful. I was trying to convince myself that, in fact, you know what? On-premises data centers, there's still a place for this stuff that I have been involved with for so many decades. Are you actually seeing that now, though, because of this idea of data gravity and then the idea of sovereignty in terms of wanting to maintain control? Are you seeing people actually pulling things out of hyperscalers into an environment where you might be leveraging a data platform like you're talking about?

Arun Narayanan: I mean, we've seen three use cases where this potential repatriation is happening. The first one is, as you said, sovereign use cases, where the data is extremely regulated. You don't want it to be in the cloud; it was never in the cloud in the first place. So in those places, absolutely, you're seeing that. The second potential use case is where there are high confidentiality or high security requirements on the data itself. We're seeing those kinds of workloads repatriated. And the third, we don't talk about it as much, but you're going to see this emerging, is the cost in dollars per token. The cloud cost per token is incredibly expensive. Build your own platform, and you're going to see 10x to 20x savings on a price-per-token basis by doing it on prem.

David Nicholson: Well, why is that? Is that an arbitrary thing? Or is it practical? Are there practical costs involved?

Arun Narayanan: It's very practical. I mean, it's a simple thing, right? The cost of buying a GPU-based server and amortizing it over a three-year period to run a workload is a lot less than paying a price per token to the cloud, right? That's simple economics.

David Nicholson: Yeah.

Arun Narayanan: Layer on top of that the fact that you can customize to your needs. You can buy one PCIe GPU platform. If you have a small workload, you can start at $50,000. You don't need to pay $500,000. You don't need to sign a multi-year contract on a price per token. So cost per token is a big thing. The second is one you already talked about. Data security and data privacy are very important. And doing it on prem means your data is protected. It's in your own walled garden. It's not in the cloud. And then finally, and probably most importantly, how quickly can you do the inference? If it's latency sensitive, you want it all next to each other.
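As a rough illustration of the dollars-per-token argument above, the sketch below amortizes a hypothetical server purchase over three years and compares the result with a hypothetical cloud API price. Every number is a placeholder, and the math deliberately ignores power, cooling, and operations costs.

```python
# Back-of-the-envelope comparison of on-prem amortized cost per token vs. a
# cloud API price per token. All figures are hypothetical placeholders.

server_cost_usd = 50_000          # hypothetical PCIe GPU server purchase price
amortization_years = 3            # typical enterprise depreciation window
tokens_per_second = 2_000         # assumed sustained throughput for the workload
utilization = 0.5                 # assume the server is busy half the time

seconds = amortization_years * 365 * 24 * 3600
lifetime_tokens = tokens_per_second * utilization * seconds
on_prem_per_million = server_cost_usd / lifetime_tokens * 1_000_000

cloud_per_million = 2.00          # hypothetical API price per million tokens

print(f"on-prem: ${on_prem_per_million:.2f} per million tokens")
print(f"cloud:   ${cloud_per_million:.2f} per million tokens")
print(f"ratio:   {cloud_per_million / on_prem_per_million:.1f}x")
```

With different utilization and pricing assumptions the ratio moves around considerably; the comparison hinges on how well the on-prem hardware stays utilized, not on the sticker price alone.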

David Nicholson: Yeah. Continuing with this idea of practical design factors, I teach an AI course for senior executives, and one of the big concerns is the question around the life cycle and how to manage that life cycle for new generation GPUs and things that are coming out. You have CIOs who, for decades, have understood how often they might be refreshing CPU-based servers. It's like, hey, every 18 to 36 months we're coming in and maybe we're going from Gen 12 Dell servers to Gen 14 Dell servers and everything is good. This rhythm was developed. Do you have a view on what we should expect moving forward from the latest generation architectures? Of course, we've got AMD, Intel, NVIDIA, all of them together. Will people be able to get the same kind of time horizon of value out of these things?

Arun Narayanan: What do I think? I mean, I strongly believe that, again, you have to think about the problem you're trying to solve, an enterprise inference use case. In those cases, the lifecycle of your GPU is three to four years, right? Maybe even five years. I mean, the technology is evolving very fast, and of course, NVIDIA is launching new things every year. But many of those things are for the top 1%, the people who are building the trillion-parameter models. Of course, you get the most efficient dollar per token if you use the latest. But because you've invested in an asset, as long as it delivers the SLA you want, in terms of how many tokens you need and how fast those tokens come back, you can keep running it. I mean, I'm sure people can run inference on the Hopper generation, today we are in Blackwell, and run it for the next two years and meet almost all their business needs, right? So I don't think you need to refresh your GPUs every year. That's absolutely not the case for enterprise inference use cases.

David Nicholson: Yeah, these folks have gone from fear of missing out to fear of screwing up.

Arun Narayanan: Yeah.

David Nicholson: And there's a little bit of that. And so one of the reasons why people might turn to a cloud service provider is the fear that their investment is only going to be worth it for a year or two.

Arun Narayanan: I think we as a community of technology professionals need to do a better job of educating customers and asking: what is the key measure? Your measure is how many concurrent users you have, what token rate you want, and what the latency of those tokens is. As long as you can achieve that, you should be able to continue to use the same GPU for a long time. And yes, your needs might grow, but then you add incrementally. You don't need to subtract what you have and add new stuff. You can just add incrementally. And the technologies we have with PCIe-based platforms allow you to do exactly that. You start with one GPU. If your needs grow, you add another GPU to that platform. You don't need to go buy the next biggest server out there. So there are capabilities to incrementally step into this.
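To show how those three measures, concurrent users, token rate, and latency, translate into hardware, here is a small sizing sketch. The per-GPU throughput figure is a made-up assumption; real numbers depend on the model, quantization, batch size, and serving stack.

```python
# Rough capacity sizing from concurrent users and per-user token rate.
import math

concurrent_users = 200            # users hitting the service at the same time
tokens_per_user_per_sec = 20      # roughly interactive reading speed per user
gpu_tokens_per_sec = 1_500        # assumed aggregate throughput of one GPU

required_throughput = concurrent_users * tokens_per_user_per_sec
gpus_needed = math.ceil(required_throughput / gpu_tokens_per_sec)

print(f"required throughput: {required_throughput} tokens/s")
print(f"GPUs needed today:   {gpus_needed}")

# If demand grows, add GPUs to the same PCIe platform incrementally rather than
# replacing the server, as long as the latency SLA is still being met.
```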

David Nicholson: Yeah. Yeah. Well, so Dell has a significant presence here at SC25. What are some of the highlights from your perspective of what Dell is offering? And then also, have you bumped into anything that you found personally fascinating or interesting or any trends that you're picking up on here?

Arun Narayanan: Great question. I'll take number one first, and then we'll come to number two. Number one, we are extremely proud of the portfolio we have, right? I mean, if you walk the floor, you can see everything from our PCs to our biggest rack-scale systems. I don't know if you've seen our racks on the floor. It's a thing of beauty, right? I mean, of course, it's my product, and it's a thing of beauty. That is our 50U rack that we're developing for some of the biggest HPC customers. It has the highest GPU density anywhere in the world; we can do 144 GPUs. It's incredibly dense, so we're extremely proud of that. Fitting it into that form factor, getting the thermal envelope right, getting the liquid cooling on it, it's incredible. A lot of engineering went into doing that, right? But not only that. That's the marquee platform, and then for every customer and every customer use case, we build platforms, right? Think about our eight-way GPU platforms. The next generation of eight-ways is on the floor there. That's built not on a monolithic design, but on the same rack-scale architecture with disaggregated power. I think we are the only people in the industry doing that on that class of platform. So that's a differentiated outcome. Then I talked to you about our PCIe GPU platform. This is where we think the bulk of our enterprise customers are going to be. We have those platforms. So we've built the entire spectrum from the smallest to the largest. In addition to that, we've done the entire networking portfolio. Think of AI: AI is a system, and it needs not just a GPU server, but networking and storage. We've built our networking portfolio with NVIDIA. If customers want InfiniBand, we've enabled InfiniBand. If customers want Spectrum, or Tomahawk from Broadcom on Ethernet with RoCE, we've done both of those things. We've also enabled both on the same OS platform, with SONiC now covering the Spectrum platforms as well as the Tomahawk platform. So we are trying to get everybody to think of SONiC as the Linux of the networking world. That's an innovation we are leading the market with. And then finally, if you think of our storage, we just talked about the AI data platform. Enabling this entire ecosystem with data is that storage-based data platform. So I'm very, very proud of the breadth of the portfolio we've enabled, and it's all on the floor there for you to see. So that's great.

David Nicholson: It is. And any curiosities?

Arun Narayanan: Yeah, curiosities. Across from the NVIDIA booth is Vertiv, one of our very key partners who help us with some of the liquid cooling technologies. I looked at their behemoth next-generation CDU, right? I mean, it's massive. And that's what is going to be needed when you get to the next generation of these NVL576 systems, right? I mean, what NVIDIA is launching. You look at that and the kind of CDU capacities that they need. That's very, very interesting to me. I think the next big frontier of massive gigascale data center innovation is in how you scale liquid cooling to support one-megawatt racks, and in data center power. I mean, what is the new technology for power? Because if you want a megawatt rack, you're going to need a new data center, and the power technology in the data center is another very exciting thing. I see a lot of people on the floor with some of the cool data center power technology. So those are the things I'm really interested in as we see how this future evolves.

David Nicholson: Someone was explaining to me the amount of liquid flow necessary to cool the theoretical one-megawatt rack as of right now, with zero improvements, and it's pretty crazy. It's pretty crazy. You're talking about fire hoses of liquid pumping through. But it's an exciting time. Arun, you've got a front seat to it. I just want to go back to when I asked you to introduce yourself. It's an amazing position to be in, because you're talking about the real core of the infrastructure that people are going to be deploying to do AI moving forward. The fact that you have these offerings where you can really be the Switzerland of AI, it's been fascinating for third parties to track and keep up with. Any final thoughts on the show so far?

Arun Narayanan: You've been keeping busy? No, the show's incredible. I mean, every year it gets bigger. I went last year in London; I didn't think it would get bigger, and it's bigger here, right? The floor's bigger. The booths are bigger. Everybody's doing bigger and better things. So I think we're in an incredible time in technology. And I'm just grateful to be here and be part of this, right? I mean, that's all I can say. And it's going to be a fantastic ride.

David Nicholson: Well, we're grateful to get a chance to spend time with you, Arun. For Six Five Media On The Road, I'm Dave Nicholson from SC25. That was a conversation with Dell about all things inference. Thanks for joining us. Stay tuned for more exciting content.
