Agentic Infrastructure at Scale: Inside Google Cloud’s AI Hypercomputer and TPU-8 Infrastructure

Mark Lohmeyer, VP/GM of AI and Computing Infrastructure at Google Cloud, joins Patrick Moorhead live at Google Cloud Next 2026 to examine the infrastructure architecture behind the agentic era. The conversation covers the TPU-8T and TPU-8I split, the Virgo accelerator network, Managed Lustre storage performance, NVIDIA Vera Rubin integration, and the evolution of GKE into an agent-native orchestration platform built for bursty, high-parallelism workloads.

The shift from chat-based AI to agentic workflows isn’t just a software challenge. It’s about compute, storage, and orchestration, and the companies that recognize that early will move through AI iteration cycles much faster than those that don’t.

At Google Cloud Next 2026 in Las Vegas, Patrick Moorhead sits down with Mark Lohmeyer, VP/GM of AI and Computing Infrastructure at Google Cloud, to examine how Google is engineering its AI infrastructure stack to meet the specific demands of agentic workloads. The conversation moves from the strategic framing of the AI Hypercomputer architecture through the technical specifics of TPU-8T and TPU-8I, the Virgo accelerator network, Managed Lustre storage, and the evolution of GKE from cloud-native to agent-native orchestration.

The throughline is co-design: every layer of the stack, from compute to networking to storage, built to work as a balanced system rather than as a collection of components optimized in isolation.

Key Takeaways:

  • Agentic workloads generate 20-50 times more tokens than chat-based interactions. This fundamentally changes the infrastructure requirements for latency, cost per token, and parallelism at scale.
  • TPU-8T is built for large-scale training. It can link up to 9,600 chips in a single super pod with two petabytes of memory, and scale to 134,000 chips across a full data center using the Virgo network, cutting model training cycles from months to weeks.
  • TPU-8I is architected specifically for inference economics. It triples on-chip SRAM and increases HBM capacity by 50% to serve the KV cache requirements of low-latency, high-volume token generation at the lowest cost per transaction.
  • Google's NVIDIA partnership extends beyond chip access. The Virgo network scales NVIDIA Vera Rubin deployments to 80,000 chips within a single data center and up to 960,000 chips across multiple data centers, giving customers a unified high-performance fabric across both TPU and GPU paths.
  • Managed Lustre 10T delivers a major jump in training storage performance: roughly 10x faster than Google's prior offering and up to 20x faster than alternatives. It is powered by the C4NX platform with eight SmartNICs and by HyperDisk ExaPools that can scale up to 80 petabytes.
  • GKE is evolving from cloud-native to agent-native orchestration. Improvements across the full container stack make it easier to spin up and tear down tens of thousands of agents rapidly in response to bursty, event-driven demand patterns.

Enterprises weighing AI infrastructure decisions should recognize that the training-to-inference ratio is shifting exponentially toward inference. The architectural choices made now around specialization, storage throughput, and orchestration agility will determine how far and how fast agentic applications can scale.

Watch now and subscribe to Six Five Media’s YouTube channel for analyst-led coverage from Google Cloud Next 2026.

Disclaimer: Six Five Media is a media and analyst firm. All statements, views, and opinions expressed in this program are those of the hosts and guests and do not represent the views of any companies discussed. This content is for informational purposes only and should not be construed as investment advice.

Transcript

MARK LOHMEYER:

If you work back from what the customer cares about, they care about accelerating innovation. And what does that mean? It means shrinking the time to train a model from maybe months to weeks, or shrinking the time to do an experiment from weeks to days. And so that's all about sort of the highest level of scale, the highest level of computing performance to really shrink those innovation cycles.

PATRICK MOORHEAD: 

The Six Five is On The Road here at Google Cloud Next 2026 in my second home, Las Vegas, Nevada. It has been a great event so far. Literally everything from infrastructure to applications, agents, and everything in between. And one of the biggest things that Google has been talking about is full stack, literally providing all of it to their customers. Thomas Kurian was talking with us today and saying they're the only company out there that does that. And factually, that is correct. As you know, from the Six Five, we love infrastructure. And there's been a lot of discussion on the stack inside of infrastructure, the hypercomputer. So it's my pleasure to bring Mark Lohmeyer back in. He's been on The Six Five before. So, Mark, welcome.

MARK LOHMEYER: 

Thanks, Patrick. It's great to be here.

PATRICK MOORHEAD: 

Yeah. Two years ago, not a whole lot of people were talking about agents. And I know there's a lot of research being done. I know that you were setting the table. I even met with your CTO, who was saying, hey, agents are on our roadmap. But how has this agentic era changed the way that you're approaching infrastructure, but more meaningfully, how your customers are looking at addressing it?

MARK LOHMEYER: 

Yeah, absolutely. I mean, I think it's a really exciting time, I would say, in infrastructure. And in many ways, the requirements of agents are really driving that and driving how the infrastructure needs to transform to meet those requirements. One way I like to contextualize it is, you know, we're sort of moving from this chat phase, let's call it, where you ask a question, you get a response. It's kind of a linear back and forth. Now we're moving to this agentic phase where you express your intent, and then that intent goes to an agent. That agent might spin off multiple sub-agents. Each of those individual agents is preserving state as it works through its part of the task. Maybe they're calling other tools in that process, ultimately coming back to deliver on whatever your original intent was. And that's a radically different thing, right?

PATRICK MOORHEAD: 

It is. It's a multiplier. And even, you know, I'm not a professional coder, I'm one of these vibe coders. And I love to see so many agents spawning and doing so many different things at the same time, in parallel.

MARK LOHMEYER: 

Right. And so think about what that means for the infrastructure now. Before, you might have had to generate a thousand tokens to respond to a chat request. Now you're generating maybe 20 to 50 times more tokens across those multiple agents. And so that's something we've been spending a lot of time thinking about: how do we deliver the infrastructure that allows those experiences to be super interactive and low latency, but also at a very low cost? And that's really the foundation of what we're investing in across our AI Hypercomputer platform.
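
To make that 20-50x multiplier concrete, here is a minimal back-of-envelope sketch in Python. The per-token price is purely an illustrative assumption, not a Google Cloud rate or a figure from the conversation; only the 1,000-token chat baseline and the 20-50x agentic range come from Mark's comments above.

```python
# Back-of-envelope sketch of the chat-vs-agentic token multiplier described above.
# The per-token price is an illustrative assumption, not a Google Cloud rate.

CHAT_TOKENS_PER_REQUEST = 1_000            # "a thousand tokens to respond to a chat request"
AGENTIC_MULTIPLIERS = (20, 50)             # "20 to 50 times more tokens"
ASSUMED_PRICE_PER_MILLION_TOKENS = 0.50    # hypothetical $ per 1M generated tokens

def request_cost(tokens: int, price_per_million: float) -> float:
    """Dollar cost of generating `tokens` output tokens at a given $/1M-token price."""
    return tokens / 1_000_000 * price_per_million

print(f"Chat: {CHAT_TOKENS_PER_REQUEST:,} tokens -> "
      f"${request_cost(CHAT_TOKENS_PER_REQUEST, ASSUMED_PRICE_PER_MILLION_TOKENS):.4f}")

for m in AGENTIC_MULTIPLIERS:
    tokens = CHAT_TOKENS_PER_REQUEST * m
    print(f"Agentic ({m}x): {tokens:,} tokens -> "
          f"${request_cost(tokens, ASSUMED_PRICE_PER_MILLION_TOKENS):.4f}")
```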

PATRICK MOORHEAD: 

It is impressive. And what a lot of people forget sometimes is that Google, I used to be a supplier to Google. And so I was there for the journey to build up what you're calling planet-scale capabilities today. But part of that was TPUs. Ten years ago, I was at your Google I/O event when the TPU first, I would say, surfaced, because you had actually had it in production inside of Google for many, many months. But there were not a lot of technical details. It was very stealthy. But I couldn't help thinking, what is this going to mean for the industry in the future? And I wrote some things about it. You brought out not just one new TPU, but two TPUs, one for training and one for inference. Talk about how these uniquely help solve business challenges. And I know there's a lot of real estate between the chip and solving business challenges. But that is what, quite frankly, your Google Cloud customers care about.

MARK LOHMEYER: 

Absolutely. Absolutely. So I think, as you said, this is the first time with TPU-8 that we've actually come out with two specialized architectures: TPU-8T, designed for training, and TPU-8I, designed for inference. And it's really based on the fundamental premise you were articulating, that there are now significantly different requirements for those two different parts of the AI lifecycle. So for TPU-8T, if you work back from what the customer cares about, they care about accelerating innovation. And what does that mean? It means shrinking the time to train a model from maybe months to weeks, or shrinking the time to do an experiment from weeks to days. And so that's all about the highest level of scale, the highest level of computing performance, to really shrink those innovation cycles. And so TPU-8T was designed exactly for that. Think about very high performance chips that we can network together into a single super pod with 9,600 TPU chips. That super pod has two petabytes of memory in it, connected over our inter-chip interconnect networking technology. And then we can even scale further from there with our Virgo backend network, designed specifically for accelerators. We can scale to 134,000 TPU chips, literally filling an entire data center. Ultimately, this is about the scale of compute performance you need to shrink the time to train models and iterate faster. That's super powerful for Google, of course, as we train the next generation of Gemini. But it's also really critical for our customers, because it's how they can innovate faster and compete better in the market. But then on the other side, you've got inference. And as we were talking about before, agents put a tremendous amount of pressure on token generation and on what's required for inference. And so TPU-8I was designed specifically for those requirements. And if you think about what customers care about there, they care about the lowest level of latency, so super high performance, quick response, but also at the lowest cost per token. Because if you think about the tokens increasing 50x, you need to drive down costs at the same rate or more to meet the needs of the business. And so maybe we can go a little bit deeper, but TPU-8I was designed specifically for the lowest levels of latency and maximum performance, at the lowest cost per transaction.
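
Taking the TPU-8T figures in that answer at face value, here is a quick sanity-check sketch of what they imply per chip and per data center. The derivation, and the decimal-unit assumption, is ours, not an official spec.

```python
# Rough arithmetic implied by the TPU-8T super pod figures quoted above:
# 9,600 chips and ~2 PB of memory per super pod, up to 134,000 chips via Virgo.
# Decimal units are assumed; this is a sanity check, not an official spec sheet.

CHIPS_PER_SUPERPOD = 9_600
SUPERPOD_MEMORY_PB = 2
MAX_CHIPS_VIA_VIRGO = 134_000

memory_per_chip_gb = SUPERPOD_MEMORY_PB * 1_000_000 / CHIPS_PER_SUPERPOD
superpods_per_datacenter = MAX_CHIPS_VIA_VIRGO / CHIPS_PER_SUPERPOD

print(f"Implied memory per chip:    ~{memory_per_chip_gb:.0f} GB")       # ~208 GB
print(f"Super pods per data center: ~{superpods_per_datacenter:.1f}")    # ~14
```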

PATRICK MOORHEAD: 

Yeah, I am fascinated. We've gone through these cycles where, hey, one chip can do both. And then I've seen people split off into two. And then we went back to talking about how one was the right way to go. And here we are. And I always just say, first of all, it is going to change based on where you are in the technology and what you're trying to accomplish. And it's clear that a split architecture makes sense here. I mean, listen, getting more granular is always better if you can afford to do it and the benefit outweighs the cost, effort, and investment to do it. So I'm really excited to see where this goes. The other thing that I think is catching people by surprise, and I don't think it should, is, remember the old days of machine learning? Eight years ago, we were talking about it, and we went from 90-10 training-to-inference to 10-90. Do you have any thoughts on where we are on this map? Meaning, what does the equation look like in, I don't know, six months or a year?

MARK LOHMEYER: 

Yeah, I would say, I mean, we're still in the early days of what we're going to see on the inference and reinforcement learning side. Think about how AI is getting embedded in every application. They're all becoming agentic to a certain degree. The number of users leveraging these services is increasing rapidly. More use cases, more applications, more agents, more usage. That's going to continue to drive exponential growth in the requirements for inference and for reinforcement learning. So while training obviously is not going away, and it's going to continue to grow, where the exponential growth curve is going to be, in my mind, for many, many years to come, is on the inference side. And that's why it's so important that we have these architectures that are specialized specifically for that. We've actually hit the threshold where it makes sense to have that degree of specialization. You think about things like, on the TPU-8I chip itself, tripling the amount of SRAM and increasing the HBM by 50%. That is really, really critical for low-latency inference, because that's where you store the KV cache, which holds the context of the model. It only makes sense to do that if you believe in this exponential growth curve, which we obviously do.
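
To ground the KV-cache point, here is a minimal sizing sketch using the standard transformer KV-cache formula (two cached tensors, keys and values, per layer per token). The model dimensions below are hypothetical and say nothing about Gemini or TPU-8I specifics.

```python
# Minimal KV-cache sizing sketch: the cache holds one key and one value tensor
# per layer for every token in context, so memory scales with context length
# and with the number of concurrent requests. All model dimensions are hypothetical.

def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache: 2 tensors (K and V) x layers x heads x head_dim x tokens."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value

# Hypothetical large model: 80 layers, 8 KV heads of dimension 128, bf16 values.
per_request = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128,
                             seq_len=32_768, batch=1)
print(f"KV cache for one 32k-token request: ~{per_request / 2**30:.1f} GiB")   # ~10 GiB

# Serving many concurrent agent requests multiplies that footprint, which is
# why more SRAM and HBM per accelerator matters for inference economics.
concurrent = 64
print(f"For {concurrent} concurrent requests: ~{concurrent * per_request / 2**30:.0f} GiB")
```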

PATRICK MOORHEAD: 

Yeah, it's just, gosh, I remember 10 years ago when people were talking about the commoditization of infrastructure, and where we are today is so exciting, and I'm sure it is for you too. So, Google infrastructure has its feet in two fundamentally different camps. One side is very dialed in, proprietary. If I look at a TPU rack system, I'll call it a fleet, a pod, multiple pods, it's very much bespoke. And I'm sure the trade-off is: leverage as much industry standard as you can, but it's very dialed in. And then on the other side, you also work very closely with partners like NVIDIA. And, you know, I saw an announcement with Intel as well. So how does the strategy benefit customers who want to avoid, you know, quote unquote, lock-in?

MARK LOHMEYER: 

Absolutely. So I think, you know, fundamental to our strategy is this idea of, as you said, and Thomas highlighted, being open at every level of the stack. And that's important to customers because it enables choice for them. So they may have different applications or different models, or even different teams within a large company that are optimizing for different things. And so we want Google Cloud to be the best platform for them across all those different scenarios. NVIDIA is a great example of that. We've got a super deep partnership with them across many levels, certainly on the engineering side, very, very deep collaboration. And so we're able to enable the best of NVIDIA within Google Cloud. And we try to even go, I would say, one step beyond in terms of how we work with them. For example, we're super excited that we think we'll be one of the first hyperscalers to deliver Vera Rubin and NVLink 72 as a cloud service within Google Cloud. But we're not just stopping there. For example, we're leveraging that Virgo network that I mentioned before, this accelerator-optimized network, where you can take multiple of these NVLink 72 racks and connect them together to scale to 80,000 Vera Rubin chips, fully filling a Google data center with Vera Rubin chips. By the way, that network is a non-blocking, collapsed fabric, so super high performance in a single data center. We've also designed it so we can scale across multiple Google data centers, and the architecture actually supports up to 960,000 Vera Rubin chips. Massive, if you think about it.

PATRICK MOORHEAD: 

And this is across Virgo?

MARK LOHMEYER: 

Yeah, this is across Virgo, yeah. And in this case, it would be across multiple data centers to fit that many chips in and power that many chips. Ultimately, a cloud customer comes to us and purchases the A5X product, but it's powered by that amazing Vera Rubin platform, plus our networking technologies, plus many other things we do together. Ultimately, that customer choice is incredibly valuable. You heard that talked about on the main stage, to a massive round of applause, because we have so many customers that have optimized for NVIDIA, and we want to be the best place for those workloads and for those customers as well.

PATRICK MOORHEAD: 

A lot of the clients that I talk to need direction on what to use when. What are some of the tools that you've provided so a customer doesn't have to run a complete workload, test this out here, and test that out over there?

MARK LOHMEYER: 

Yeah, for sure. So I'd say there are two parts to what we're looking to do here for customers. The first is, through software on top of these specific hardware platforms, we want to make them feel operationally the same and make it easy to leverage them together, or to move between one and the other, so customers have that level of flexibility. So we're investing in things like native PyTorch on TPUs and vLLM on TPUs, working in the same way across TPUs and GPUs. So that's maybe the first layer of it. The second is your specific question. We want to make it easy for customers to benchmark and understand, for their specific model, whether it's a Google model or an open source model or a third-party model, how that model performs on these different options. And so we just recently introduced something called Prism. It's based on the llm-d project, so it's an open, community-based effort, not Google only, where you can come to this website and see how we've benchmarked different workloads. And then, probably most importantly, you don't have to just trust us; you can actually deploy and validate those benchmarks in the customer's own environment, with the specifics of their workloads, the platforms they're running on, and how it connects to the rest of their infrastructure. So this is super important, because that's where the rubber really meets the road. What is it actually going to deliver in the customer's environment?
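
As a small illustration of the "same software across TPUs and GPUs" point, here is a minimal vLLM offline-generation sketch. The model name and sampling settings are placeholders, the accelerator backend is selected by the environment vLLM is installed into rather than by this code, and benchmarking a real workload would go through the Prism / llm-d tooling Mark describes, not a toy script like this.

```python
# Minimal vLLM offline-generation sketch. The same application code can run against
# GPU- or TPU-backed vLLM builds; the accelerator is chosen by the installed backend.
# Model name and sampling parameters are illustrative placeholders.

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")   # any open or third-party model

params = SamplingParams(temperature=0.7, max_tokens=256)
prompts = ["Summarize the trade-offs between training- and inference-optimized accelerators."]

for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```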

PATRICK MOORHEAD: 

I want to do a little bit of a deep dive. I know we've been going high level with the occasional drill-down, but let's talk about data, right? Because the compute's great, right? We talked about Virgo and networking, but you have to have all three of these working, and the data side is an issue for AI training. How are you helping to solve some of those bottlenecks?

MARK LOHMEYER: 

I'll start at the high level and then we can dive in. I think this is a great example of the AI Hypercomputer architecture and the importance of co-design to deliver a balanced system, let's say, that works across all the different elements. At the lowest level of the stack, we call it purpose-built hardware. And purpose-built hardware includes compute, storage, and networking. So we talked a lot about compute. We talked a little bit about the networking. Storage is the third leg of that stool. And, you know, let's talk about training for a sec. You think about these very large-scale training clusters that we talked about before. It's super important to keep those clusters well fed with storage for those training jobs. And so we're significantly expanding our storage options and offerings for that. One announcement we had was Managed Lustre 10T. A lot of customers love Lustre for training storage. With this offering, we are now 10 times faster than we were last year and actually 20 times faster than the competition today. So huge, huge gains. But that is all reinforcing this idea of shrinking those training cycles from, let's say, months to weeks. And one of the really cool things here is that this is based on fundamental technical building blocks from Google Cloud that allow us to deliver this level of performance. Managed Lustre needs to run on servers, ultimately. And so we built a whole new Google Compute Engine server platform called C4NX. It's got eight SmartNICs in it, so a lot of SmartNICs. And that is what we use to drive super high performance for the data flows between the storage and the GPUs or TPUs. We do that over RDMA, so you can bypass the CPU as you're moving that data across. And then that whole thing is back-ended by something called HyperDisk ExaPools, which is a scalable block storage service, so you can get really, really large storage capacities, up to 80 petabytes of storage in this thing. So, yeah, that completes, let's say, the three-legged stool, powered by some of these fundamental Google technologies.
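
To give a feel for why filesystem throughput shows up directly in training cycle time, here is a rough checkpoint-time sketch. Every number in it is an illustrative assumption, not a Managed Lustre, C4NX, or HyperDisk ExaPools figure.

```python
# Rough sketch of how aggregate storage throughput bounds checkpoint save/restore time
# for a large training job. Model size and bandwidth figures are illustrative assumptions.

def checkpoint_seconds(params_billion: float, bytes_per_param: float,
                       aggregate_gb_per_s: float) -> float:
    """Time to move one full checkpoint at a given aggregate throughput (decimal GB/s)."""
    checkpoint_gb = params_billion * bytes_per_param   # 1B params * N bytes ~= N GB
    return checkpoint_gb / aggregate_gb_per_s

MODEL_PARAMS_B = 1_000     # hypothetical trillion-parameter model
BYTES_PER_PARAM = 4        # weights-only snapshot; checkpoints with optimizer state are larger

for bandwidth in (100, 1_000, 10_000):   # aggregate filesystem throughput in GB/s
    t = checkpoint_seconds(MODEL_PARAMS_B, BYTES_PER_PARAM, bandwidth)
    print(f"{bandwidth:>6,} GB/s aggregate -> one checkpoint in ~{t:,.1f} s")
```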

PATRICK MOORHEAD: 

Final question. GKE, you talked about it a little bit. I'm curious, does it improve operational efficiency and performance?

MARK LOHMEYER: 

Yeah, so I think it's interesting. You've worked with us for so many years, and for many years GKE was the engine for cloud-native applications and cloud-native orchestration. What we're seeing now in the AI era is that we're transforming GKE to be agent-native orchestration. So think about going from cloud-native to agent-native. Take a customer like Lovable. When they open up a vibe coding event to their customers, all of a sudden they need to spin up tens of thousands of agents to respond to those customer demands. With GKE, we're basically optimizing every layer of that stack to be able to rapidly spin up thousands and thousands of containers, and then rapidly spin them down, to respond to the needs of these events. This is super, super important for customers like Lovable to provide a great experience as they do that. So GKE is going to be more and more important going forward for, let's call them, agent-native workloads.
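
As a rough illustration of that spin-up/spin-down pattern, here is a minimal sketch using the official Kubernetes Python client against a GKE cluster. The deployment name, namespace, and replica counts are hypothetical, and in practice this kind of burst would usually be driven by autoscaling rather than imperative calls like these.

```python
# Minimal sketch of the bursty scale-up / scale-down pattern described above, using the
# official Kubernetes Python client (pip install kubernetes) against a GKE cluster.
# Deployment name, namespace, and replica counts are hypothetical.

from kubernetes import client, config

config.load_kube_config()          # assumes kubectl is already authenticated to the cluster
apps = client.AppsV1Api()

def scale(deployment: str, namespace: str, replicas: int) -> None:
    """Patch the replica count of an existing Deployment."""
    apps.patch_namespaced_deployment_scale(
        name=deployment,
        namespace=namespace,
        body={"spec": {"replicas": replicas}},
    )

# Burst: fan agent workers out for an event, then scale back down when it ends.
scale("agent-worker", "agents", replicas=10_000)
# ... event traffic runs ...
scale("agent-worker", "agents", replicas=0)
```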

PATRICK MOORHEAD: 

Oh, excellent. Mark, I really appreciate your time. Congratulations on the announcements. And I really appreciate you sitting down with me and the Six Five each of the last three years. I sincerely appreciate that.

MARK LOHMEYER: 

Yeah. No, thank you, too. I always really, really enjoy these discussions. We can go from the high level strategic down into the technology and connect them together.

PATRICK MOORHEAD: 

No, I love it. Experience matters, I think. I mean, it's fun and I love it. And I can tell that you love it, too. Absolutely. Excellent. So this is Pat Moorhead with The Six Five here at Google Cloud Next 2026. Great discussion on the Hypercomputer that really fills in some of the blanks on a full-stack Google offering for enterprise customers, but also for model makers and customers like that. Check out all of our Google Cloud Next 2026 content and all of our infrastructure coverage; we love infrastructure, you know that, on The Six Five. Hit that subscribe button, be part of our community. Take care.
