Unlocking AI Innovation Requires Agile Infrastructure

The AI stress test is revealing cracks in a converged system architecture that traps expensive compute resources. That’s why AI is driving the next evolution in data center computing.


AI is everywhere, in everything from driverless cars to predictive biomarkers and fraud detection. Developers are already incorporating GPT-4 into increasingly sophisticated generative AI creations. The massive data sets and deep learning models behind these applications are driving new products and business models – and creating unpredictable, escalating demand on data centers to process, manage and store it all.

The AI stress test is revealing cracks in the converged system architecture that underpins today’s data center infrastructure. Designed more than 30 years ago for capacity-based workloads, the current system model is struggling to handle compute-intensive applications with specialized accelerators, memory scaling and complex traffic flows.

VIDEO: Cerio CTO Matthew Williams talks to Dr. Ryan Grant from Queen’s University about advanced AI infrastructure.

System model locks AI resources

The bottleneck is in the box, an artifact of a converged architecture created to simplify the management of homogeneous, CPU-centric workloads. It made sense at the time, but the model no longer applies. With AI, it's all about GPUs and accelerators, yet getting access to more compute means adding surplus CPU cores and overhead that increase costs and cut efficiency.

Need more GPUs? Buy more of these systems, in fixed increments of 4, 8 or 10 GPUs per chassis. Then anticipate your processing demands for the next three years, pay for extra CPUs and memory, and get creative with scheduling to reserve processing time for the jobs that really need it.
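
As a rough illustration of that overhead, here is a minimal sketch; the per-server figures (8 GPUs, 64 cores, 1 TB of memory per chassis) are assumptions chosen for the example, not vendor specifications.

```python
# Hypothetical illustration: GPUs can only be bought inside converged servers,
# so every increment of GPU capacity drags CPU cores and memory along with it.
GPUS_PER_SERVER = 8        # assumed fixed GPU count per chassis
CPU_CORES_PER_SERVER = 64  # cores bundled with every server purchase
MEMORY_GB_PER_SERVER = 1024

def converged_purchase(gpus_needed: int) -> dict:
    """Size a purchase when GPUs come only in fixed server-sized bundles."""
    servers = -(-gpus_needed // GPUS_PER_SERVER)  # ceiling division
    return {
        "servers": servers,
        "gpus": servers * GPUS_PER_SERVER,
        "cpu_cores": servers * CPU_CORES_PER_SERVER,
        "memory_gb": servers * MEMORY_GB_PER_SERVER,
    }

# A workload that needs 20 GPUs forces 3 servers: 24 GPUs, 192 CPU cores and
# 3 TB of memory are bought, whether or not the job can ever use them.
print(converged_purchase(20))
```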

AI creates unpredictable spikes in demand, but there's no flexibility to adapt the accelerator workload mix to the jobs actually being run, including non-GPU jobs. That means tying up expensive resources that sit idle when they could be used more efficiently in other parts of the data center.

Navigating the capacity trap

The costs and waste from the unintended inefficiency of converged systems are creating a capacity problem that even overprovisioning can't fix.

Overprovisioning was a common workaround in the data center even before AI exploded – it’s estimated that about 40% of AI infrastructure is overprovisioned – but it’s a blunt and expensive instrument that leads to underutilization of critical AI resources.

System infrastructure is very costly to replace, which stretches refresh cycles to 3-5 years. With innovations in memory and acceleration arriving on a 6-12-month cycle, that lag creates a significant competitive disadvantage.

Power is also expensive. AI applications are driving up both the cost and the environmental impact of the energy and space needed to power, cool and house these systems.

Moving infrastructure to the Cloud has been one strategy for outsourcing the capacity problem, but AI workloads are proving too expensive to run offsite. According to a recent report from Hyperion, more than 38% of data center operators are exploring repatriation this year (Trends and Forecasts in HPC Storage and Interconnects, 2022).

Next Evolution – Composable Disaggregated Infrastructure

Whether on-prem or in the Cloud, carrying the static system model into AI transformation is simply too expensive, rigid and inefficient. Every minute you're not using AI resources translates into real cost.

Composable disaggregated infrastructure (or CDI) is a resource-centric architecture that pairs the agility of cloud with the performance of bare metal. GPUs, specialized accelerators and memory are disaggregated from servers into shared resource pools. CDI makes it possible to deploy specialized accelerators that don't make sense to buy inside a single system, utilize them at close to 100%, and right-size resources to fit the application, improving both efficiency and utilization.
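
Here is a minimal sketch of that composition model, assuming a hypothetical ResourcePool abstraction and compose_node helper rather than any particular CDI product's API.

```python
# A toy model of composable disaggregated infrastructure: devices live in a
# shared pool and attach to a bare-metal server only for the job's lifetime.
from dataclasses import dataclass, field

@dataclass
class ResourcePool:
    """Row-scale pool of disaggregated devices, keyed by device type."""
    free: dict = field(default_factory=dict)  # e.g. {"gpu": 32, "fpga": 4}

    def allocate(self, kind: str, count: int) -> list:
        available = self.free.get(kind, 0)
        if available < count:
            raise RuntimeError(f"only {available} {kind}(s) free in the pool")
        self.free[kind] = available - count
        return [f"{kind}-{i}" for i in range(count)]

    def release(self, kind: str, handles: list) -> None:
        self.free[kind] = self.free.get(kind, 0) + len(handles)

def compose_node(pool: ResourcePool, server: str, needs: dict) -> dict:
    """Attach just the devices a workload needs to a bare-metal server."""
    return {"server": server,
            "devices": {kind: pool.allocate(kind, n) for kind, n in needs.items()}}

# Right-size a node for one training job: six GPUs, no surplus CPUs purchased.
pool = ResourcePool({"gpu": 32, "fpga": 4})
node = compose_node(pool, "server-01", {"gpu": 6})
print(node["devices"], pool.free)            # six GPUs attached, 26 left free
pool.release("gpu", node["devices"]["gpu"])  # job done: GPUs rejoin the pool
```

In a real deployment the attach and release steps would go through a fabric manager rather than an in-process dictionary, but the same allocate, compose and release lifecycle is the general pattern.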

Why Scale Matters

Disaggregating resources into an external chassis doesn’t really solve the capacity problem if you can’t get past the rack. It’s like providing an extension cord for the current system model.

Getting to row-scale disaggregation provides the flexibility to have resources and servers available to deploy specialized accelerators when they're needed. Row scale not only gives you a larger pool of resources to configure into the system a specific workload needs, it also lets highly specialized accelerators serve more jobs because they are no longer trapped in servers.
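
Here is a toy illustration, with hypothetical job and server names, of why pooling a scarce accelerator at row scale lets it serve more jobs than leaving it installed in a single server.

```python
# Hypothetical sketch contrasting a specialized accelerator pooled at row scale
# with the same device installed inside one server. Names are illustrative.
JOBS = [
    {"name": "genomics",        "host": "server-01", "needs_fpga": True},
    {"name": "fraud-scoring",   "host": "server-03", "needs_fpga": True},
    {"name": "video-transcode", "host": "server-07", "needs_fpga": True},
]

def servable_when_pooled(jobs):
    """Row-scale pool: any server in the row can attach the device."""
    return [j["name"] for j in jobs if j["needs_fpga"]]

def servable_when_installed(jobs, host_with_device="server-03"):
    """Converged model: only jobs scheduled onto the device's host can use it."""
    return [j["name"] for j in jobs
            if j["needs_fpga"] and j["host"] == host_with_device]

print(servable_when_pooled(JOBS))     # all three jobs get accelerator time
print(servable_when_installed(JOBS))  # only 'fraud-scoring' does
```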

CDI is the next evolution in data center architecture, and AI is just the first step towards a more agile infrastructure.

Check out the next episode in our CTO Leadership Series: The Disaggregated Data Center.



Video transcript

Hi, welcome to the first episode of our CTO Leadership Series. Today we'll be focusing on unlocking AI innovation. Hi, I'm Matthew Williams. I'm Chief Technical Officer of Rockport Networks, and I'm very pleased to welcome and introduce Dr. Ryan Grant, one of the world's leading experts on high performance computing and large-scale systems. Thank you for joining us.

Hi, Matt. Thanks for having me.  

So, AI/ML is being adopted across almost all industries, whether it's generative AI, deep learning, imaging, or fraud detection. What's common is that the primary compute device has moved away from the CPU. People are now using GPUs and custom accelerators to get more efficiency and faster time to results.

The challenge is that the cost of those GPUs and memory makes resource utilization and overall efficiency key concerns. But the existing system models can strand some of these expensive resources, increase cost, reduce overall performance and limit innovation. So Ryan, in your work with organizations around the world, what are some of the biggest issues that you're seeing in meeting the needs of these new applications?

So Matt, the workloads that we are starting to see on the systems now have drastically changed from what they were in the past. So in the past we had workloads that could generally be categorized into capacity-based workloads. These were simple tasks that didn’t require anything fancy, like accelerators or a lot of compute. 

Think about this as resolving a search that you might have typed into your favorite search engine and returning a result to you with some pages, as opposed to capability-class problems. Those are things like training AI/ML models or solving scientific simulations, things that are very compute heavy and use resources for a long time.

Well, now we’re starting to see those capacity elements become more capability or more compute enhanced intensive as well. So when you’re doing a request, Chat GPT or Bard or one of the large language models or doing anything related to AI, those requests now are a lot more compute heavy, requiring those solutions to have large amounts of accelerators, memories sitting out there that just weren’t required in large data centers for these tasks in the past. 

And I want to mention that these accelerators are gaining some ubiquity, and there's real heterogeneity in the types of accelerators used for AI as well. There's specialized AI hardware, with the vast majority of it being GPU-type hardware, in the network and the data center as well.

So, we’re seeing new demands on the new types of hardware that need to go into the systems, as well as new workloads that are being run on the systems and new mixes and requirements for compute capabilities.  

Very interesting. So let's talk a little about this 30-year-old system infrastructure, the system model that is really trapping some of these expensive resources.

GPUs and accelerators have to live somewhere. They can't just be placed in a rack on their own. So typically you put them inside servers, and because they're high-power devices, you have to put them in custom servers that have the power and cooling required to support them. What people typically leverage is this two-socket, rack-optimized server, which is the 30-year-old system architecture.

The reason you want two sockets for general purpose workloads is you need the memory bandwidth or you just need the CPU cores, or you need just the number of slots for devices. When you’re talking about GPUs and accelerators, it’s really about how many of those can I fit within this chassis to get the efficiency I need in terms of how many I can deploy within a rack. 

The challenge, of course, is that you don't just get those slots, you have to get all the other things, the CPU cores and the memory. One of the big challenges is that the ratio between the GPUs and the sockets is pretty much fixed, four to one, and you can get eight-device chassis. But really what you're talking about is that fixed ratio limiting the flexibility, limiting the types of accelerators that can be deployed efficiently within the data center. What do you think?

I think that's right. I mean, tying our accelerators and the other elements of the system that are doing significant compute to the server itself is just a legacy of the old system model, right? In the past we didn't have accelerators; CPUs and memory were the unit of compute being used, and so it made sense to allocate them that way.

This gets even worse if you think about these things as not just general-purpose accelerators like GPUs. Let's say now I have some specialized FPGA accelerator, or an AI accelerator, or some other type of accelerator for specific things having to do with security and other elements of the network. I can essentially trap those inside the box because I'm allocating the resources that are not those accelerators to other work that needs to be done, just because I don't have other servers available to do it. So by doing that, I'm not allowing enough flexibility to be able to adapt my accelerator workload mix to the jobs that I'm actually doing, which means I'm tying up resources that are just sitting there idle that could otherwise be used more efficiently in other parts of the data center.

And that's inefficient on the cloud front: either you're renting something to a client that they're not using and paying for unnecessarily, or you're not renting it to them but it's tied up anyway when you could be renting it to somebody else.

And for on-prem solutions, it also ties up your hardware, so you may end up having to buy more hardware than you otherwise would to provide the services your own organization needs, just because things occasionally get trapped in nodes that aren't being used for them. And unfortunately, what we see in workloads is that a lot of the time we're not using CPUs and GPUs simultaneously.

It's very much a yin and yang, moving the workload between one and the other. The workload moves over to the GPUs, we solve it, then we bring it back to the CPUs, do some serial compute for a little while, and then we move the workload back to the GPUs. So even in a single box, with the flexibility there, we're always sort of trapping some resources in the current workflows that we have with the applications that we have.

And from an architectural limitation point of view, with this two-socket system that you're using to get the density of devices, there's a lot of east-west traffic between those devices when you're talking about GPUs and accelerators working together to solve a problem, and you're using those slots for IO.

You know, having four slots on one socket and four slots on another really isn't much of a limitation, but as soon as you have to cross that socket boundary, with east-west traffic across all eight slots, once again there's a pretty significant performance bottleneck. So this overall architecture has inefficiencies, it strands resources, and it really doesn't give truly optimized behavior to address the needs of these applications.

So let's look at the advantages of disaggregated infrastructure. This is when we take those expensive resources, the GPUs, the accelerators and even the memory, and move them outside of that chassis so they're no longer stranded within it. You get tremendous flexibility, tremendous scalability.

And you’re able to very efficiently add the kinds of accelerators, the kinds of compute required for the different kinds of workloads and even efficiently make use of expensive memory resources as some workloads require large memory pools, some require small. That flexibility really does drive high efficiency and high utilization of those resources and it really gives end users the right kind of combination of efficiency and power utilization and cost to optimize the design of a data center.  

And in this case too, Matt, you can have a single system, right? A single node that has a number of accelerators hooked to it that are just physically impossible to do today, right? 

So we see demand for lots of different mixtures of CPU core types, memory channels, accelerators. You can see this variety by looking at a major provider like Amazon and just looking at how many instance types they have in their cloud, how many types of systems their different clients want and need.

There's no reason you necessarily need to trap those into specific boxes when you're disaggregating the system. So you can put together those instance types on the fly if you're a cloud provider, without having to have a specific box sitting around waiting for a client to come use it.

Especially if it's a specialized box, whether it has very high memory requirements or a ton of storage or a lot of accelerators in it, it doesn't really matter: you don't necessarily want to trap all of those things. But you also have the idea that you can have a node attached to, you know, 32 or 64 accelerators like you're showing here, which is just impossible to provide in that old system model.

So there's nobody who can put together an instance for you today where there's 64 GPUs attached to two CPUs. That's just not possible without disaggregation. So I think it's leading to some interesting new potential system models here. And even when you look at efficiency, yes, disaggregation costs you a little bit of efficiency.

But when you have 64 accelerators, what does that really mean? Would I rather have 64 GPUs that are a few single percentage points less efficient than the solution with four or eight of them in a server? Or would I want eight servers with eight GPUs each? They're going to have their own inefficiencies from having to communicate with each other.

So the disaggregation in and of itself, because you can right size your accelerators, has some interesting ramifications as well to potentially increase efficiency because you’re using the right hardware, even though there’s some small costs to physically moving those devices out of your node itself. 

Thanks, Ryan. So as we're talking about disaggregation, one thing that's pretty apparent is the need for scale, the need to have flexibility. How many servers can get access to that same pool of resources? How many different kinds of resources can I have in that pool? Are they general-purpose GPUs?

Are they custom AI/ML accelerators, as we've talked about? How does that map to my needs today, and what are my needs going to be? Having that scale gives you the ability to add the right kinds of accelerators and have the right combinations of CPUs and devices to meet the needs of the different jobs.

So it really is scale that drives the flexibility, it drives the efficiency and it drives the overall utilization of these resources. And as you mentioned, whether it’s an enterprise or a cloud provider, it really is that flexibility that drives the overall economics and innovation capability. 

And I think that's really important, Matt, where we think about accelerators as not just being a homogeneous GPU-type solution, right? So when I start adding in accelerators that are extremely custom, think about this too: it could be something as exotic as a quantum solution acting as an accelerator.

Well, that's not something you're going to need or want to deploy at rack scale. There are just some physical impossibilities to actually doing that in a data center. So you want to be able to have that sort of flexibility to mix in those highly specialized accelerators. But tying them to a box or a rack really is going to get them trapped most of the time, right?

So being able to put them out on a row means that you have a lot more flexibility to make sure that you always have the resources and servers available to deploy those specialized accelerators when they’re needed, but not trap them when they’re not, just because you want to use the resources that were attached to them in the first place. 

So row scale really gives you not only a larger number of pieces that you can fit together to make all the individual puzzles for the individual systems that you want, but it lets you use those very specialized pieces more liberally because they don’t get trapped into servers.  

You know, they only need that quantum accelerator a small amount of the time. I want to be able to shop that around and give it to many different users for the time periods that they need it, rather than reserving it and keeping it off to the side as something special that we only take out on special occasions. We want that to be available in the everyday workflow as well.

So it leads to a lot more flexibility and potentially efficiency improvements for the cloud providers that are using it. And it gives access to clients to be able to rent some of those very specialized accelerators very easily out of their row and have those things available. So, you can really put together some interesting system solutions here when you’ve got that entire row worth of hardware. 

It's sort of that cohesive unit there. And from an economic point of view, not having to buy additional dedicated servers for new kinds of compute, and again have them only utilized a portion of the time, but instead being able to just add accelerators to the pool, really does drive that efficiency home and allows innovation to occur on a much more rapid cycle than waiting for budgets to deploy new racks of servers for new kinds of compute.

And that row scale is probably perfect in terms of the limits of physics, right? Row is physically dense enough that you don’t have a really long run to other solutions with very long network connections where you’re going to start losing efficiency. Rack is very short. 

It's a very small space, right? Only a few feet. But you can expand to a row-type scale and still get good efficiency in a disaggregated solution. You're probably not going to see disaggregation happening at the whole-data-center scale, though. For the gigantic data centers, you know, warehouse-level disaggregation probably won't make sense for physics reasons.

But row is probably that perfect combination: as big as you can go without hurting from the physics ramifications, while keeping the greatest amount of flexibility in putting the pieces together correctly. So I think row scale is really the right target here for disaggregation, both from a performance and physics point of view and from a flexibility point of view. It seems like it's the sweet spot.

Yeah, you're right. It's about finding that balance, the economic sweet spot, the performance sweet spot. And as you said, the laws of physics must be obeyed. Yes, we can't make light go faster than it already does, unfortunately.

All right. So Ryan, I wonder if you want to kick off, you know, talking about some of the key takeaways from our talk today.  

Sure. So I think we talked through most of these elements already. The first one being that it’s really hard to predict exactly what the demand is going to be on our workflows. 

So I talked about the current workflows and how they've changed quite significantly just in the relatively recent past, with AI, large language models, and this shift from capacity over to capability-type workloads. You need to be able to adapt to what the software is doing, and to the next generation of software, and to whatever the mixtures of accelerators might be as new hardware comes out. With the heterogeneity of the accelerator mix that you might have across jobs, it's really hard to design that perfect box today, deploy it, and have it used for its entire lifetime in the data center. So what you really want is the flexibility to adapt your system to the workflow demands as they happen.

And that really changes how you can run your overall data center, how you're running your cloud, how you're running your on-prem solutions, because you're free to buy new hardware and put it in the system knowing that it's broadly accessible to the entire system, without creating custom, bespoke little systems off to the side for particular user sets or clients that need very specific items.

Absolutely. It's really about breaking that system model of expensive resources stranded within devices, within servers, and instead having the flexibility to scale on demand, to add new kinds of innovation. And again, as you said before, it's about finding the right balance of cost, performance and flexibility to really make the best use of the overall budget, get the fastest time to results, and the most usefulness out of those expensive resources.

So I really want to thank you, Ryan, for your awesome insights today. I look forward to our next episode where we’ll be talking about how we’re actually going to accomplish some of these goals we’ve talked about and really disaggregate the data center at large scale. 

Great. Thanks for having me.