TACC establishes new Center of Excellence with Rockport Networks 

Rockport Networks has announced a collaboration with the Texas Advanced Computing Center (TACC) at the University of Texas at Austin to create a Center of Excellence in Austin, Texas. TACC houses Frontera, the fastest supercomputer on a university campus and the 10th most powerful supercomputer in the world. TACC has installed 396 nodes on Frontera that run production workloads over Rockport’s switchless network. Those workloads include quantum computing, pandemic-related life sciences research, and rapid-response work for emergencies such as hurricanes, earthquakes, tornadoes, floods, and other large-scale disasters.

“TACC is very pleased to be a Rockport Center of Excellence. We run diverse advanced computing workloads which rely on high-bandwidth, low-latency communication to sustain performance at scale.  We’re excited to work with innovative new technology like Rockport’s switchless network design,” stated Dr. Dan Stanzione, director of TACC and associate vice president for research at UT-Austin. “Our team is seeing promising initial results in terms of congestion and latency control. We’ve been impressed by the simplicity of installation and management. We look forward to continuing to test on new and larger workloads and expanding the Rockport Switchless Network further into our data center.” 

Watch the video to learn more about how Rockport and TACC are fueling innovation in academic supercomputing environments:   

Transcript

Hi, I’m Dan Stanzione. I’m the Associate Vice President for Research here at the University of Texas at Austin and serve as the Executive Director of the Texas Advanced Computing Center, which is our large-scale research computing facility here in Austin. I’m pleased to be talking about our new partnership to form a Center of Excellence with Rockport Networks.

Rockport is offering a new switchless kind of optical network that gets us very high bandwidth and very low latency, which is something we value a lot. We’re always evaluating – not just new companies, but new products from old companies, new versions of products, and interconnect is one where there used to be a lot of options. There are fewer options now. We have good options, they won for a reason – the ones that are out there, but it’s always great to see new players in the space, so we’re very excited about it.

We run all sorts of simulation, AI, and data analytics workflows on our machines here at TACC. It really spans all the forms of science. We’re funded primarily by the National Science Foundation, so we do any sort of unclassified research. So, it can be biology and genomics. It can be human health, it can be astrophysics, it can be climate and weather, it can be chemistry and new materials, it can be energy related. There’s a whole bunch of things.

We’ve done some of our initial testing with mostly weather forecasting and climate codes, and some molecular dynamics codes, because they tend to really stress the network in the kinds of stuff that we’re trying to do. We did a smaller testbed a couple of months ago, a 24-node testbed that was actually fairly easy to install. It’s fairly small. We just have one or two of what are called SHFLs that make the optical connections between all the nodes. That went smoothly, and we tested it out again. We ran some climate codes and weather codes and molecular dynamics codes, and it did as well as our existing InfiniBand networks on the same compute nodes. So very competitive performance with what we have now.

We’ve just done a much larger install. We’ve expanded it to about 400 nodes and we’ve just finished that up. We’ve done a lot of low-level performance testing and things like that. But we’re very optimistic based on what we’ve seen up to 24 nodes. We expect to continue going forward with this plan.

So we’ve now put in this bigger testbed. We’re going to start evaluating how it does with our codes today, and maybe look at some AI codes and some other workflows, and then work with Rockport on how we might plan to make a much larger scale. I mean, future systems, we’re looking at 10,000, 20,000, 50,000 endpoints instead of a few hundred. So we’re going to figure out and start to really understand the performance characteristics of what goes on in the network – where is there congestion, where do we need flow control, quality of service, how does it do when we start overlaying storage with message passing? There’s a lot of research topics we want to explore and then think about how we might help sort of co-design the scale out of the future versions of the network.
