AI Infrastructure Connectivity: Strategies and Recommendations from Broadcom

How is accelerated computing driving the next wave of AI innovation from its very foundation? 

Discover the answer in this dynamic conversation from the Six Five Summit: AI Unleashed! We're excited to feature Jas Tremblay, GM, Data Center Solutions Group at Broadcom, as one of our Semiconductor speakers! He joins host David Nicholson for an in-depth discussion on the critical role of accelerated computing in the AI era!

Key takeaways include:

🔹Foundational Role in Accelerated Computing: Explore how Broadcom is uniquely positioned at the bedrock of the AI technology stack, continuously pushing the boundaries of what's possible in accelerated computing.

🔹Addressing the Challenges of AI at Scale: Dive into the significant hurdles facing the industry in deploying large-scale AI, including the growing need for specialized infrastructure and enhanced performance.

🔹The Interplay of Hardware, Software, and Systems: Understand Broadcom’s holistic approach to innovation, emphasizing the critical synergy between advanced hardware, intelligent software, and integrated systems to deliver groundbreaking AI capabilities.

🔹Future Trends and Strategic Vision: Glimpse into the most exciting future trends in AI and accelerated computing, along with Broadcom’s strategic vision for continuing to lead and shape the industry's direction.

Learn more at Broadcom.

Watch the video below on Six Five Media, and be sure to subscribe to our YouTube channel, so you never miss an episode.

Or listen to the audio here:

David Nicholson: Welcome to the Six Five Summit: AI Unleashed. I'm Dave Nicholson with Six Five Media, and I am joined today for this special Semiconductor Spotlight by Jas Tremblay. Jas is general manager of the Data Center Solutions Group at Broadcom. Welcome, Jas. How are you?

Jas Tremblay: Thank you, Dave. I'm doing great. It's always good to see you and thanks for the invite to this event.

David Nicholson: Yes, of course. Well, we have a great subject to unpack with you: specifically, Broadcom's role in accelerating AI. Let's start from the top. Broadcom, like other very large companies, has multiple divisions that touch on all sorts of aspects of AI. Take us through that. What does that look like from a Broadcom perspective?

Jas Tremblay: Well, what that looks like is a semiconductor technology platform where you have multiple types of products, custom and standard, that use that platform for AI. The first product in that category is the custom XPU. So if you're a hyperscaler and you want to build your own accelerator for AI that's tailored to your software requirements, your environment, your workload, we can help with that. We've got a whole platform for that.

David Nicholson: So why wouldn't someone just buy good old-fashioned COTS (commercial off-the-shelf) generic GPUs instead of going through the work to create something custom?

Jas Tremblay: So, the scale of the deployments is massive, the spend is massive, and the requirements are unique. It's good to have a general-purpose GPU to cover everybody's application. But if you have the capabilities and the means to develop your own that's tailored to your software environment, your infrastructure, your requirements, you can build something that's a lot more efficient and a lot more cost-effective. The hyperscalers have done this in other parts of the semiconductor space, and now they're applying all their might and capabilities to AI. So what we provide is a partnership: they provide the core logic for the accelerator, and we have the building blocks, the HBMs, the networking I/O, the packaging, complete chiplet interconnects, foundational SerDes, and we bring all these together and instantiate it as a custom silicon offering. All of this is part of our 3D AI silicon packaging. And we've announced publicly that we have seven customers building AI accelerators with us on this. So that's the first key product category that we have.

David Nicholson: What do the timelines look like when someone comes in and works with Broadcom to create a custom XPU? This isn't a "have this done by next week" sort of thing. What kinds of lead times are involved with semiconductors like this?

Jas Tremblay: There are two lead times. One is the design phase, and the other is the actual manufacturing lead time. We're building these AI chips on the lowest geometries possible with the most advanced technology, and one of these AI chips can have multiple chiplets and ingredients in it. You have to build all the individual chiplets through different manufacturing processes (some of these chiplets are on different core technologies), bring in the HBMs, and then bring all of this together. So the manufacturing cycle time is several months. That has gone up, because we're building in lower geometries and integrating multiple components on an interposer. The design piece has actually gone down, and the key to reducing it is shifting activities left. I'll give you an example. Historically, when we built custom chips, you'd build the whole chip, tape it out, sample it, test it, find issues, create another version of the chip, and then go to the production cycle. If you're on the cutting edge of technology like we are with these large AI chips, you need to de-risk. So, for example, before we tape out the customer's version of their chip, we'll tape out multiple test chips. We'll tape out a very large package that consumes large quantities of power and de-risk all the packaging, the chip warpage, all these elements we can address beforehand. So by the time the customer tapes out, their confidence that it will work out of the gate, the first time, is much improved. So the manufacturing cycle goes up, but the design cycle goes down. We have to do a lot of work ahead of time, which is a significant investment, but that's part of the whole platform we're offering for these custom XPUs. In this world of AI, we're finding ways to go faster. Things that used to take multiple years, we're doing in one year. Things that used to take a year, we're doing in months.
So it's really pushing our engineers to find ways to go faster and get this technology into people's hands as fast as possible.

David Nicholson: So that's one way you're accelerating the AI universe: these custom XPUs. What's something else that Broadcom does?

Jas Tremblay: Well, the other big category is connectivity. Connectivity is super important because, if you look at the big picture, a rack now has tens of thousands or even hundreds of thousands of components in it: chips, cables, pieces of software. All of these things need to come together, and connectivity is super important in an AI server. A few years ago it was all about the CPU-centric data center, and connectivity was not as important. You would basically hook up all your servers to the top-of-rack switch with 25-gig, 50-gig SerDes and have plenty of bandwidth. Now, in an AI world, people are trying to create the million-node XPU cluster. The percentage of connectivity in a rack used to be around 10%, but in an AI rack it's now going to 20, 30-plus percent. So much more connectivity. And if you think of a training workload, these things take time. It's not like a database transaction. You need a super reliable, high-performance network to cut down the overall training job time and make sure you don't have to restart jobs. So connectivity is more important than ever, from both a performance and a system reliability perspective. And you hear about these new racks: we have 3,000, 4,000, 5,000 cables in one rack. All of that needs to be connected through chipsets. So that's one key element. The other part that's very dear to our heart is giving customers choices and keeping things open. We've been working in an open industry for decades. You couldn't tell a customer: you've got to use this software with this chipset, or you've got to buy this chipset because you bought that chipset. People had choices. And we've built up a code of conduct in the data center space to keep things open, have industry-level interoperability, and let people pick and choose what they want.
And that model is being challenged in the AI world. That model needs to thrive in the AI world. It's even more important there, given the amount of money that's going to be spent. And we see open-standards connectivity as fundamental to making that vision happen.

David Nicholson: Yeah, we're still in the early days of AI. A year or two ago we were in the even earlier days of AI, and I would say people then were overwhelmingly driven by FOMO, fear of missing out. The quickest way they felt they could get from where they were to some semblance of an AI solution was to buy a complete stack. It's the never-ending argument of the walled garden: the completely integrated stack of stuff where you just give it power, give it access to the Internet, and you're off and running. Do you think we're at a point now where people are able to take a deep breath and start taking advantage of some of the work that's been done in the interoperability space? Are we there yet?

Jas Tremblay: I think we are. First, people who have been in this industry a long time trust that openness will prevail and win. They trust that's going to happen long term. And I'll share one data point that gives me confidence: my team tracks the companies building either standard or custom XPUs globally. I'm not going to pretend we have the complete list, but there are over 40 companies on it. Forty companies, some very large, some very small, in different geographies, coming out with either custom or standard XPU silicon. And some of these companies have multiple versions, multiple SKUs, and different generations. So the amount of innovation being focused on this problem is huge. And this notion that I'm going to build a data center that can only take one type of accelerator is just not sustainable. We feel really strongly about this. We feel we have a big responsibility to help the ecosystem make that happen. But it's going to happen.

David Nicholson: Yeah. It's always interesting to see that an individual company may be innovative and lead in a space for a period of time, but ultimately industries out-innovate individual companies. I think history has taught us that. So specifically, under the heading of open networking standards: if you were to go through what it means to be open in this context, what are the characteristics of an open environment for AI?

Jas Tremblay: At this point, for every interconnect there's a standards body that defines how those connections should be made, so there can be multiple participants for every piece of technology. Whether I'm connecting a CPU to a drive or connecting an Ethernet NIC to an Ethernet switch, this has to be governed by standards bodies, and a lot of us take that for granted. For example, you don't need to drag around a compatibility list of Wi-Fi clients and Wi-Fi access points; it just works. Ethernet: you just plug in the port, it works. PCIe is behind the scenes, only for the people building systems; it works. But in some cases, some companies are making it difficult, and they want to lead you to believe that using proprietary technology will lead to better performance and so forth. And, as we both know, sometimes it's healthy to start with a closed system, get things out there, keep going, but eventually you need to open it up.

David Nicholson: You mentioned PCIe as an example. I just want to double-click on that. This is the semiconductor track, so we have an audience that's probably heard that acronym before. But where are we? You hear people talk about what's coming versus what's on the shelf now. Where are we in that four-to-five-to-six-and-beyond timeline? Where are we actually today?

Jas Tremblay: So let's go back in history a little bit. So PCIe is 35 years old.

David Nicholson: Okay.

Jas Tremblay: And before PCIe, some of my employees have actually been around that long, and they saw how chaotic it was before PCIe emerged and got strong. If you wanted to put together a desktop, there were different protocols to connect devices, and it was a bit chaotic. Now we have PCIe, which is one of the best, most organized standards bodies. But it was slow. Going from PCIe Gen 3 to Gen 4 took us eight years as an industry. You know, shame on us. Going from four to five took us three years, and five to six has taken us two years. And we're going to keep that pace: two years for six to seven also. So we're really accelerating things. And the other thing, when you asked at the beginning what we're doing to go faster and shrink the schedules, and what the timelines are: the products are getting more complicated, but our team is delivering first-time-out quality much better than before. Complexity is increasing, so we're staffing up and adopting better methodology. An example is our sixth-generation PCIe switch. We built the A0 silicon, it came back, and within six weeks we declared production on it. And we've been shipping it to dozens and dozens of companies. We didn't have that before. So you've really got to go faster, deliver better quality up front, and then help everybody in the ecosystem solidify and mature their products. I work with some of my customers, some of my ecosystem partners, and my competitors: we've got to get this stuff working, and we rolled up our sleeves. We go to the industry plugfests. We've already been to three plugfests with PCIe Gen 6. We've built an interop development kit, and we're shipping that to anybody who wants it. You're welcome to send us a note if you haven't gotten one.
So those are the types of things we need to do: not only take care of our own stuff, but help with these industry transitions and make them go faster.

David Nicholson: So Gen 6 is sort of state of the art right now. Let's assume for a moment that every server OEM, every component manufacturer that might want to connect their stuff to other people's stuff, has the ability to work in a Gen 4 environment today; let's just say 100%. Would it be fair to say 100% are Gen 5 capable, or are there any laggards even in Gen 5 today?

Jas Tremblay: Well, one of the differences is which applications are on the leading edge and the bleeding edge of these technology transitions. And AI has taken over everything.

David Nicholson: Okay.

Jas Tremblay: In the sense that AI is setting the pace, we are going Gen 6 because of AI, and it's going to be that way for probably two years. For example, with PCIe Gen 6, our new switch is a 9.2-terabit switch, and you can do 1 terabit per second on one port of PCIe Gen 6. That's pretty fast. And the latency, I'm going to brag here a little bit, but we have 93-nanosecond latency on our PCIe Gen 6 switch, which is 25% less than before, and better than any other protocol or technology out there that we know of. But yeah, we have to go faster, and the industry needs it, especially with these protocols that are so fundamental to all these devices connecting to each other.
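For readers who want to check the math, the per-port figure follows from the PCIe Gen 6 per-lane rate. A back-of-the-envelope sketch in Python (the x16 port width and the lane count implied by the 9.2 Tb/s aggregate are illustrative assumptions, not details stated in the conversation):

```python
# PCIe Gen 6 signals at 64 GT/s per lane (PAM4), i.e. roughly
# 64 Gb/s of raw bandwidth per lane in each direction.
lane_gbps = 64

# A standard x16 port therefore carries about 1 Tb/s per direction,
# consistent with the "1 terabit per second on one port" figure above.
port_gbps = lane_gbps * 16
print(port_gbps)  # 1024

# A 9.2 Tb/s (9216 Gb/s) switch implies roughly this many lanes:
print(9216 // lane_gbps)  # 144
```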

David Nicholson: Yeah, it makes a lot of sense. And you highlight the importance of connectivity: there is no such thing as AI without connectivity at this stage of the game. I can't think of a single application where training models, especially, are doing anything with one physical box with some GPUs in it. They're all interconnected. It's all memory bandwidth and bandwidth between all the components.

Jas Tremblay: Actually, Dave, what we should do in one of our future videos is have you do an in-depth, close-up video of a rack with 4,000 cables. How crazy is that? Just the physicality of it, and all the signal integrity.

David Nicholson: And can we at least color-code them so they're not all the same blue, or the same, you know...

Jas Tremblay: You're gonna need a lot of colors. A lot of colors.

David Nicholson: Yeah. So it's interesting. When I look inside a cabinet and I see all of those cables, or I do the math on how much heat has to be dissipated, we can pat ourselves on the back and pretend we're advanced, but eventually what we want is the zero-heat, cable-less design. Do you think you'll be retired before then, Jas?

Jas Tremblay: I think we're gonna need a lot more cables before that happens.

David Nicholson: Yeah, a lot more cables. So we hit on some of the pillars that Broadcom is contributing. Is that the way you divide it up, or are there others?

Jas Tremblay: We have the custom XPUs we talked about. I talked about the internal connectivity within the AI server, which is based on PCIe; the industry is in alignment on that, no questions. Now, the other big piece I like to talk about is the scale-out network. How do I interconnect my rack to other racks and bring all this together inside a data center? If you go back two years, there was a debate in the industry: should it be InfiniBand, should it be Ethernet? And Ethernet is older than PCIe, not 35 years but 50 years. And for scale-out, Ethernet has won the hearts and minds of engineers across the world. That's going to be the standard and the ecosystem for building scale-out.

David Nicholson: Why, though? Just because it's old? Do the InfiniBand people argue, well, ours is newer, therefore better? What are the fundamental advantages of moving forward with next-gen Ethernet?

Jas Tremblay: Fundamentally, you need to pick the technology that'll be better from a technical perspective, and Ethernet does that. But it's more than that. You need to invest in healthy, thriving ecosystems that are based on competition, multiple companies, open standards, and then you need that ecosystem to go faster than the other ecosystem. Think of how many protocols Ethernet has gobbled up over the past 50 years. We could probably name 10 off the top of our heads, but there are many more that are forgotten. For example, Broadcom just announced Tomahawk 6, the world's first 102-terabit-per-second switch. You can get the switch in a 100-gig SerDes flavor, a 200-gig SerDes flavor, and a co-packaged optics flavor. And the chip is one of the biggest chips; I think it's the biggest chip I've seen. It's very impressive. You have up to 1,024 physical SerDes running at 100 gig on one chip. Just that drives massive cost savings and a massive performance boost in the network. And we've been doing that for 10 years: every two years, double the bandwidth, and we push the ecosystem to go faster. So the Ethernet ecosystem is just stronger now. There's a lot of work going into the standards bodies; you've heard of the Ultra Ethernet Consortium, things like that. So there's a bunch of innovation to continue to enhance the protocol, but fundamentally it stays on Ethernet.
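For readers who want to check the math on Tomahawk 6, the quoted figures line up. A quick illustrative sketch, taking the 1,024 SerDes at 100 Gb/s as given (the 200-gig lane count is inferred, not stated in the conversation):

```python
# 1,024 physical SerDes lanes, each running at 100 Gb/s,
# give the quoted aggregate switching bandwidth.
lane_rate_gbps = 100
num_lanes = 1024
aggregate_gbps = lane_rate_gbps * num_lanes
print(aggregate_gbps)  # 102400, i.e. 102.4 Tb/s

# At 200-gig SerDes, the same aggregate needs half the lanes,
# which is why a link can use one cable instead of two.
print(aggregate_gbps // 200)  # 512
```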

David Nicholson: Yeah, and there's obviously precedent for making the assertion that this is the way to go. If you look at what has developed in general-compute servers, and compare the cost of a general-compute server to the cost of an AI server today, there's something out of whack. And that thing that's out of whack is how much participation the industry has versus an individual company. So it makes intuitive sense. But are there any differentiators in terms of power consumption or things like that, or is the largest benefit the overall coopetition that happens?

Jas Tremblay: To give you a simple example with the Tomahawk 6 launch: if you take the 102-terabit switch and compare it to the biggest alternative-protocol switch, the number of switches you need to build the same cluster is 6 to 1. So it's a dramatic cost savings. It's the first switch with 200-gig SerDes: instead of two cables, one cable. These aren't incremental benefits; they're order-of-magnitude benefits. And that's just the physicality of it. Then you've got congestion control and everything at the network layer. One of the biggest advantages of Ethernet is that every switch port is attached to another switch or a NIC. To go from the AI servers to the switch, you go through NICs, and the number of NICs in AI servers is significant. It could be one for every accelerator, one for every two accelerators; in some topologies it's multiple NICs for one accelerator. So as a hyperscaler or an enterprise customer, you don't want to have just one NIC option. You want multiple Ethernet NIC options, and Ethernet gives you that freedom of choice.

David Nicholson: Where can people buy Ethernet NICs? Who's in the business of making those things?

Jas Tremblay: We keep an inventory of this too. There are fewer companies than are building XPUs, but there are multiple companies building standard-product NICs, with three main providers. We're one of them, and we're investing heavily in this. And there are multiple companies building their own NICs. The hyperscalers want differentiation at the NIC level. They want it based on open standards, but they still want to tailor it to their requirements. And our Ethernet infrastructure is any NIC, any switch. We're open.

David Nicholson: Which is an important point, because yes, Broadcom can provide them, but your feet are constantly held to the fire because you adhere to open standards. Others are building things; you're constantly competing, constantly being driven to make those improvements. Which is the whole pitch from an open perspective. So, we talked about the concept of the XPU, the accelerator that's custom-built by Broadcom. We talked about connectivity from two angles: internal connectivity, and then external connectivity, the scale-out, plus the choice you bring by adhering to open standards up and down the stack. Any other pillar of what Broadcom does, or is that how you think about it?

Jas Tremblay: There's a lot more.

David Nicholson: Okay, okay.

Jas Tremblay: There's a lot more to this. The other aspect is innovation in the NIC, and we have a very unique strategy there, with two parts. The first is that we're building a series of AI NICs that are standard products. We were the first to come out with a 5-nanometer product, and we put a lot of innovation into our foundational SerDes. For example, with some of the competition you need to use an optical cable, which takes huge amounts of power. With ours, you can do in excess of 4 meters with copper. That sounds like a simple thing, but it's fundamentally important to the cost, the power, and the reliability of a rack. If you have thousands of optical cables, those transceivers will fail at a much higher rate than a direct-attach cable, what we call DAC in the industry. So the reliability is much higher, and that has an impact on training times. It's not just about going out to replace them: you're running along, and training gets interrupted because of link flaps. So keeping things in copper is very important. And if you need more reach than that 4 meters, we were pioneers in a technology called linear pluggable optics, which is a simplified way to do optics; again, much more reliability, much less cost. We were the first to support that, and we're putting pressure on others: hey, if you want to participate, you need to innovate. The other part of the NIC strategy is that we actually offer NICs in a chiplet form factor to put in the custom XPU. So instead of having your XPU and your NIC as two different pieces of silicon, we can integrate them by offering a chiplet that comes with the silicon, the device drivers, the firmware, and so forth. And again, you can take any one of these NICs and connect it to the Broadcom switch or anybody else's Ethernet switch. Fully open.

David Nicholson: When you talk about things like cabling, the difference between copper and optical, and the advances you've made: I've had conversations with a few large server OEMs, and they don't run on really fat margins in their businesses. Often they're running on what we might call razor-thin margins. And as you scale these clusters out to massive sizes, the difference between their business serving their customers profitably or not sometimes comes down to things like whether they're focusing on open standards or the right kinds of cables. The devil is in the details with all of this. I just wanted to echo that sentiment.

Jas Tremblay: That's so important, Dave. And we take great pride in building high-quality products to make sure the system providers integrating all these chipsets can be successful. And we believe the other pillar is scale-up networking. There are a lot of networks: the internal network, the scale-out, and the scale-up. The scale-up is the one that directly interconnects XPUs to XPUs. And there's more debate on where things are going to land. You've got NVLink as one option, which is controlled by one company, and you've got Ethernet. We believe strongly that, to get the maximum scale, openness, and innovation at the industry level, Ethernet will prevail in scale-up. But scale-up has unique requirements. So in April 2025, at the OCP Dublin Summit, we submitted the Scale-Up Ethernet white paper, opening this up so everybody can see what we're up to and we can do this in an open way. So that's going to be interesting. For scale-up you've got more options; one of them is actually using PCIe, if you need a very small scale. But we're confident that Ethernet is going to prevail in scale-up as well.

David Nicholson: So, a fascinating broad view of what Broadcom does in AI. And by the way, Jas, you're general manager of the Data Center Solutions Group. Some of the things you referenced from the Broadcom perspective are not strictly DCSG things, correct?

Jas Tremblay: Most of them are not, that's correct. The custom silicon is a division. Switching is a division. SerDes and retimers is a division. We have a division focused on optical. And I cover PCIe switches, some of the Ethernet NICs, and storage connectivity. So we've got a lot of brothers and sisters working on this jointly.

David Nicholson: Well, we're going to send links to this video to your colleagues, who will high-five you for making sure they get their love when it comes to all the different things Broadcom is doing. Jas Tremblay, do you have any final thoughts on the way forward? Maybe predictions for what's coming down the line, or challenges that people are going to have?

Jas Tremblay: If you're building systems or building out your infrastructure, get ready for a lot of innovation at the semiconductor and software levels coming at you at a fast pace. Take the time to build out a strategy that will enable you to take advantage of these innovations and not be locked within one vertical silo of innovation. Things are going to change pretty fast over the next few years.

David Nicholson: Sage advice from a well-respected leader in this space, Jas Tremblay of Broadcom. Thanks so much for joining us for this Semiconductor Spotlight at The Six Five Summit. Stay connected with us on social and explore more conversations at SixFiveMedia.com/summit. More insights coming up next.

Disclaimer: Six Five Summit is for information and entertainment purposes only. Over the course of this webcast, we may talk about companies that are publicly traded, and we may even reference that fact and their equity share price, but please do not take anything that we say as a recommendation about what you should do with your investment dollars. We are not investment advisors, and we ask that you do not treat us as such.

Speaker

Jas Tremblay
GM, Data Center Solutions Group
Broadcom

Jas Tremblay is Vice President and General Manager of the Data Center Solutions Group at Broadcom, responsible for developing and delivering silicon, software and adapters for data center connectivity. DCSG delivers Storage, Ethernet and PCIe connectivity for cloud service providers, enterprise customers and embedded OEM systems. Mr. Tremblay joined the company through the LSI acquisition, where he was Vice President of North America Sales responsible for hyperscale, networking, telecom and storage accounts. Before joining LSI, Mr. Tremblay held various marketing, system engineering and software engineering positions at CGI and Nortel. Mr. Tremblay received a B.S. in electrical engineering from the University of Sherbrooke, Canada and an MBA from the University of Montreal, Canada.
