Congestion is a major issue affecting high-performance computing networks in multiple ways – but there’s hope on the horizon. The HPCwire panel “Quieting Noisy Neighbors: Addressing Network Congestion in Multi-Workload Environments” was moderated by Bob Sorensen from Hyperion Research and featured Erik Smith (Dell), Hari Subramoni (OSU), John McCalpin (TACC) and Matt Williams (Rockport).
Here are our top 5 findings from the HPCwire Congestion Panel held earlier in November:
1. Congestion is a growing problem in high-performance clusters
To start, the panel agreed on a definition of congestion: demand for network resources exceeds the capacity of the system to satisfy that demand, reducing the quality of service experienced by users.
Beyond this “intrinsic” congestion, John McCalpin described another, more frustrating form that occurs when links go unused in a production environment, leaving hardware idle – an issue exacerbated by the lack of visibility into what’s happening inside the network.
John said, “It is very difficult to use more than about 20 nodes in a rack communicating externally without seriously overloading some of the uplinks and not using some of the others. This is what I consider to be the ‘zero order’ problem that we’re working with. We aren’t actually able to use all of the wires that we paid for.”
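John’s point about unused wires often comes down to how flows are statically hashed onto rack uplinks. Here is a minimal sketch of that imbalance; the node count, uplink count, and random-hash model are illustrative assumptions, not figures from the panel:

```python
import random

def uplink_distribution(num_flows, num_uplinks, seed=0):
    """Hash each flow onto a random uplink (as static ECMP-style routing does)
    and return the number of flows landing on each uplink."""
    rng = random.Random(seed)
    counts = [0] * num_uplinks
    for _ in range(num_flows):
        counts[rng.randrange(num_uplinks)] += 1
    return counts

# 20 externally communicating nodes in a rack with 8 uplinks: by pigeonhole,
# at least one uplink carries 3 or more flows, and some may carry none at all.
counts = uplink_distribution(num_flows=20, num_uplinks=8, seed=1)
```

With flows pinned to uplinks like this, adding nodes overloads some uplinks long before the aggregate uplink capacity is exhausted – the “zero order” waste John describes.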
And while many people think congestion occurs over longer intervals (e.g., 5 minutes), Matt Williams pointed out it can happen at much shorter timescales (e.g., fractions of a second), even on seemingly empty networks.
Hari Subramoni confirmed, saying “Congestion can be visible, even with just two nodes. You don’t need 100,000 processes talking at the same time. It’s directly proportional to the load on the links. And it is visible for any kind of packets, even for very small packets. You don’t really have to go to very large scales to see these issues with congestion.”
2. The composition of traffic on the network is changing and networks aren’t keeping up
HPC clusters have grown from roughly 1,000 cores a decade ago to half a million or more today. In addition, the size and type of traffic have changed: applications like distributed deep learning have pushed message sizes from 128 kilobytes to 128 megabytes in hard-to-manage, all-to-all traffic patterns, which makes the performance of very large jobs much worse.
This is further complicated by networks being shared by different kinds of workloads, some of which are bandwidth-sensitive while others are latency-sensitive. Although networks carry both types of traffic, they have traditionally been optimized for only one or the other.
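The all-to-all patterns mentioned above are punishing because total traffic grows with the square of the participant count and linearly with message size. A rough sketch of the arithmetic (the rank count is chosen purely for illustration):

```python
def all_to_all_bytes(num_ranks, message_bytes):
    """Total bytes injected when every rank sends one message to every other rank."""
    return num_ranks * (num_ranks - 1) * message_bytes

# 1,024 ranks, comparing 128 KB messages with today's 128 MB messages
small = all_to_all_bytes(1024, 128 * 1024)
large = all_to_all_bytes(1024, 128 * 1024 * 1024)
# 1,024x larger messages mean 1,024x more traffic from the same collective
```

Scaling the rank count hurts even more: doubling the participants roughly quadruples the bytes the fabric must move.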
3. Workload completion times are important, but consistency is key
Bob Sorensen shared a story from when he was conducting a performance evaluation of an HPC environment. “One of the things we found was that users were relatively accepting of performance and job turnaround time if it was predictable or stable – if they submitted a job and it took an hour, and the next time they ran it, it also took an hour.” On the other hand, he said, “If they submit a job and one time it takes an hour, the next time it takes 45 minutes, and the time after that 1 hour 15 minutes, even though the average job took an hour, the variability was much more concerning.”
This mattered because when users attempted to change their code to improve performance, they had no way of knowing whether the code or the network was responsible for the change in workload completion time.
John pointed out a ripple effect of congestion: because contention for resources is variable, users see variability in their job execution times, which leaves them uncertain about how much time to request for a job or how much usable research their allocated hours will produce. As a result, users locally optimize their workloads, doing as little network communication as possible – but a side effect is that they gain little experience with scaling, which limits what they can do as far as the science is concerned.
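The predictability Bob describes can be put in numbers with the coefficient of variation of run times; the runtime samples below are invented to match his hour-long example:

```python
import statistics

def coefficient_of_variation(runtimes_minutes):
    """Run-to-run variability relative to the mean (0 means perfectly predictable)."""
    return statistics.stdev(runtimes_minutes) / statistics.mean(runtimes_minutes)

stable = [60, 60, 60, 60]      # every run takes an hour: easy to plan around
variable = [45, 60, 75, 60]    # same 60-minute average, far less trustworthy
```

Both series average an hour, but only the first lets a user confidently size an allocation request or judge whether a code change actually helped.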
4. More visibility into the network is needed
Matt summed up the issue, saying “If you’re an administrator of a large-scale system, you need to know what’s happening on the network. You need to have the tooling, you need to have the ability to visualize, to gather a lot of data so that you can make workload placement choices. You can make optimization choices on the applications. If it’s a black box, there’s nothing you can do. It’s ‘I hope this works well’, rather than ‘I know it will’.”
Hari said more visibility would help shut down bad actors, saying “Something that I would like to see is the ability to trace back. For example, when you are running an 8,000 node cluster, there could be one or two bad users who are screwing up the entire system. But how do you find that one node, which is affecting the other 7,999 nodes so you could isolate that one node and shut it down? That would be very useful from my perspective.”
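Hari’s trace-back wish amounts to ranking nodes by the traffic they inject. Assuming per-node counters were available from fabric telemetry (the counter values and node names below are made up), the lookup itself would be trivial:

```python
def noisiest_node(injection_bytes):
    """Return the node injecting the most traffic, given per-node byte counters."""
    return max(injection_bytes, key=injection_bytes.get)

# Hypothetical counters for a small slice of an 8,000-node cluster
counters = {"node0001": 2_000_000, "node0002": 1_500_000, "node7999": 950_000_000}
offender = noisiest_node(counters)  # "node7999"
```

The hard part is not this `max()` – it is getting trustworthy per-node counters out of the fabric in the first place, which is exactly the visibility gap the panel identified.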
Hari added that sharing visibility between cluster components would also be helpful. “I’d like a network fabric that gives regular users some sort of visibility. So that, for instance, middleware can make an informed decision on whether I should pump in an all-to-all, for example, or wait for some time and then pump it in. It could lead to a lot more intelligent designs at the software middleware level, so that you can stop congestion before it becomes a problem.”
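Hari’s middleware example could look like a simple admission check before launching a bandwidth-heavy collective; the fabric-load metric and threshold here are invented for illustration, since no real fabric API was named on the panel:

```python
def should_start_all_to_all(fabric_load, threshold=0.7):
    """Proceed with the collective only when reported fabric load (0.0-1.0)
    is below a threshold; otherwise the middleware waits and retries later."""
    return fabric_load < threshold

go_now = should_start_all_to_all(0.2)    # quiet fabric: pump in the all-to-all
go_busy = should_start_all_to_all(0.9)   # busy fabric: back off for now
```

Even a check this crude would let a library spread out collectives instead of piling them onto an already saturated fabric.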
5. Some promising developments should help address congestion
For Erik, it’s all about notifications. “Regardless of the workload type, whatever they happen to be doing, it’s about notifying end devices that there’s something going on in the network. That will have an impact on the ability of end devices or systems to automatically remediate – to adjust the amount of load that they’re putting into the network. And I think that has a tremendous amount of promise.”
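Erik didn’t name a specific scheme, but notification-driven remediation is commonly sketched as additive-increase/multiplicative-decrease (AIMD), the idea behind TCP-style congestion control. A generic sketch with made-up parameters, not a description of any particular product:

```python
def adjust_rate(current_gbps, congested, max_gbps=100.0,
                decrease_factor=0.5, increase_step=5.0):
    """On a congestion notification, cut the injection rate sharply;
    otherwise probe gently back upward toward the link limit (AIMD)."""
    if congested:
        return current_gbps * decrease_factor
    return min(current_gbps + increase_step, max_gbps)

rate = adjust_rate(80.0, congested=True)    # backs off to 40.0 Gbps
rate = adjust_rate(rate, congested=False)   # recovers to 45.0 Gbps
```

The sharp decrease drains queues quickly, while the gentle increase avoids immediately re-triggering the congestion that caused the notification.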
The rest of the panel emphasized that ongoing work on visibility and on sharing information between different components of the network should help. Hari pointed to “the ability for the MPI library to see into the network and hopefully adapt. Once that flexibility is there, the library can adapt so it is more dynamic.”
John agreed, saying “I think the real key to be successful is to be able to bring in people who could address each level of these many interacting levels so that you could work your way up and down, and understand the nonlinear feedback, understand the application requirements and how they map all the way down to the hardware and the hardware limitations and how they map all the way back up.”
The participants summed up the lively discussion by agreeing that to really start addressing the issue of congestion, the network could no longer be an afterthought, but had to become a first-class citizen alongside memory and compute resources – something we at Rockport Networks strongly agree with!