Dr. Basden is the technical manager for the COSMA High Performance Computing system at Durham University in the UK. He sat down with us to talk about some of the challenges he and other researchers face as they try to model the universe and unravel its origins, and why he’s excited to add a Rockport Scalable fabric to his tier 1 national supercomputing facility.
I’m a manager of a supercomputer here called COSMA, which is part of a tier 1 national facility. It’s primarily used for research in cosmology, particle physics and astronomy, all things that are related to one particular research council. So, we do large simulations, huge simulations of the universe.
So, we start with the Big Bang and we propagate the universe forward in time, really looking for things that we can put into our models of the universe: tweaks to the science, things like our understanding of dark energy and dark matter. We then change these simulations in a way that allows them to match what we see with real telescopes when we look at the sky.
COSMA is our large supercomputer that we have here. We actually have several generations of COSMA in service at the moment. The latest one is an AMD system with about 50,000 cores and 360 terabytes of RAM, so it’s a large system for doing these simulations. We have COSMA7 in operation as well as COSMA6, which is older technology. We also have DINE. DINE is quite interesting. It’s a much smaller cluster, only 24 nodes, but it’s our test cluster for exploring new technologies. And it’s on DINE that we’ve been exploring things like the Rockport switchless network fabric.
So obviously no network is fast enough for us. We’ve got these large simulations, and they’ve got bits of the universe on all different nodes scattered throughout the data center. And, of course, particles over here have some sort of gravitational impact on particles over there, because gravity is a long-range force.
So, what we have to do is transfer this information between all these different nodes, which obviously means we are using a lot of networking. We’re very dependent, not only on network bandwidth, but also on network latency. What we want is to be able to run our simulations consistently, even when there’s a lot of congestion on the network, even when there’s a lot of other traffic flying around the network. We really need to be able to keep these simulations moving forward, otherwise we start to waste compute cycles whilst waiting for the network to complete.
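A rough way to see why both bandwidth and latency matter is the simple cost model in which sending one message takes a fixed latency plus the message size divided by the bandwidth. The sketch below is illustrative only: the function names and the numbers in the usage lines are assumptions chosen for the example, not measurements from COSMA or from the Rockport fabric.

```python
def transfer_time(message_bytes, latency_s, bandwidth_bytes_per_s):
    """Estimated time to send one message: fixed latency plus size / bandwidth."""
    return latency_s + message_bytes / bandwidth_bytes_per_s


def latency_share(message_bytes, latency_s, bandwidth_bytes_per_s):
    """Fraction of the transfer time that is spent on latency alone."""
    total = transfer_time(message_bytes, latency_s, bandwidth_bytes_per_s)
    return latency_s / total


# Hypothetical link: 1 microsecond latency, 10 GB/s bandwidth.
LATENCY_S = 1e-6
BANDWIDTH = 10e9

small = latency_share(1_000, LATENCY_S, BANDWIDTH)          # 1 kB message
large = latency_share(1_000_000_000, LATENCY_S, BANDWIDTH)  # 1 GB message
print(f"latency share, small message: {small:.3f}")
print(f"latency share, large message: {large:.6f}")
```

Under these assumed figures, small messages are dominated by latency and large ones by bandwidth, which is why a simulation that exchanges many small updates between nodes is sensitive to both.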
If there’s heavy use of the network that’s external to an application that’s being run, then that particular application can, depending on the code, take significantly longer. We’re talking 10, maybe even 20% longer to complete, which then of course means it’s eating up the project allocations, because projects are given a particular amount of time on this system that they can use, over a year or whatever, and so it’s eating through that budget.
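To make the budget effect concrete, here is a minimal sketch of the arithmetic. The allocation size and per-run cost are hypothetical figures chosen for illustration; only the 10% and 20% slowdowns come from the interview.

```python
def runs_within_allocation(allocation_core_hours, run_core_hours, slowdown):
    """Number of complete runs that fit in an allocation.

    `slowdown` is the fractional increase in run time caused by network
    congestion (0.10 means each run takes 10% longer).
    """
    cost_per_run = run_core_hours * (1.0 + slowdown)
    return int(allocation_core_hours // cost_per_run)


# Hypothetical project: 1,000,000 core-hours, 10,000 core-hours per run.
ALLOCATION = 1_000_000
RUN_COST = 10_000

for slowdown in (0.0, 0.10, 0.20):
    runs = runs_within_allocation(ALLOCATION, RUN_COST, slowdown)
    print(f"slowdown {slowdown:.0%}: {runs} runs")
# slowdown 0%: 100 runs
# slowdown 10%: 90 runs
# slowdown 20%: 83 runs
```

With these assumed numbers, a 20% slowdown costs the project 17 of its 100 planned runs without any change to the science being done.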
So, we looked at one of these codes called AREPO, which again is a cosmology code, and it’s known to be very sensitive to network congestion. What we found was that when we put this on the Rockport network, as we introduced more and more congestion, the AREPO code performance didn’t suffer. Performance stayed pretty much constant, regardless of how much other traffic there was flowing around this network.
Whereas on the other network that we were also looking at, as we introduced congestion, this code really started to slow down. We saw quite a big slowdown there. On the back of that, we decided that this is something that’s worth investigating further. And so we got permission and funding to convert half of COSMA7, which had InfiniBand, to the Rockport fabric as well. That will be very interesting, because it allows us to do large cosmology simulations directly comparing the performance of these two networks.
I very much like the very good monitoring environment, where you can really see the network traffic flowing between different nodes. You can replay this in history as well, but as you’re running an application, for example, you can look and say, well, we’ve got a lot of data at the moment that’s coming from this node to this node and traveling around the network. You can really get a lot of insight into how the network is performing by looking at some of these facilities that we’ve got there in real time. I also like the ability to keep things flowing, even when there’s lots of other congestion on the network.
So what we hope to achieve by this is much more consistency in the way that results are delivered, to make sure that simulations can run and that the researchers will know a run is going to be finished by a given time and is going to use this much budget. They can then plan ahead when they’re putting in applications to use time on COSMA. They’ll know much more accurately how much time they’re likely to need, because the simulations aren’t slowing down so much depending on what else is going on.