At Durham University’s Institute for Computational Cosmology (ICC) in the UK, COSMA supercomputers run cosmological simulations forward from the Big Bang, tracking some 130 billion particles and generating more than a petabyte of data. Simulations of everything from planet formation to black holes can have runtimes of months. Compared against telescope observations, the resulting data helps construct a more complete picture of the evolution and structure of the universe.
This kind of exascale modeling consumes a lot of compute. But it also needs a high-performance network with the speed and ultra-low latency to connect the nodes and handle busy inter-node traffic as data flows across the cluster. Data in one part of a simulation depends on updates from another, so a change in one star affects stars elsewhere. Messages need to make their way across the model as quickly, and as predictably, as possible to ensure accuracy.
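This dependency pattern can be sketched with a toy model (the latencies and probabilities below are illustrative assumptions, not measurements from COSMA): each synchronized timestep has to wait for the slowest inter-node message to arrive, so even a rare congestion "tail" comes to dominate once thousands of messages are in flight per step.

```python
import random

# Toy model (all numbers are assumptions for illustration): each timestep
# of a distributed simulation completes only when every inter-node update
# has arrived, so the step time is gated by the slowest message.
random.seed(42)

def step_time(n_messages, typical_us=10.0, tail_us=500.0, tail_prob=0.001):
    """Time to finish one synchronized step: the max over all inter-node
    message latencies, where a small fraction hit a congestion tail."""
    latencies = [
        tail_us if random.random() < tail_prob else typical_us
        for _ in range(n_messages)
    ]
    return max(latencies)

# With few messages per step, tail hits are rare; with thousands in
# flight, almost every step is stalled by at least one slow message.
small = sum(step_time(8) for _ in range(1000)) / 1000
large = sum(step_time(8000) for _ in range(1000)) / 1000
print(f"avg step, 8 messages:    {small:.1f} us")
print(f"avg step, 8000 messages: {large:.1f} us")
```

The point of the sketch is that scaling out multiplies exposure to the latency tail: average latency barely changes, but the per-step maximum, which is what the simulation actually waits on, gets much worse.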
Congestion at the breaking point
Port congestion and long-tail latency create significant performance issues when code runs on thousands of nodes in an exascale system. Slower and less predictable workload completion times can delay research and add cost by leaving expensive compute and storage resources underutilized. The 30-year-old centralized switching architecture is constrained by switch radix and by inter-process communication (IPC) latencies that rise sharply as workloads scale. The wiring complexity and high operational cost of endlessly adding switching capacity make the centralized switching approach unworkable for performance-intensive applications.
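The cost of stacking switching tiers can be made concrete with a back-of-the-envelope sketch (hop counts and per-hop latencies here are illustrative assumptions, not measured figures for any product or cluster): every added tier adds switch hops to cross-cluster paths, and under load each hop also adds queueing delay.

```python
# Illustrative only: hop counts and latencies are assumptions, not
# measurements of any specific fabric.
def path_latency_ns(switch_hops, per_hop_ns=500, queueing_ns=0):
    """One-way latency for a message crossing `switch_hops` switches,
    each adding forwarding delay plus any queueing delay."""
    return switch_hops * (per_hop_ns + queueing_ns)

# A small cluster may cross a single leaf switch; a large multi-tier
# cluster can route leaf -> spine -> core -> spine -> leaf (5 hops).
print(path_latency_ns(1))                    # small cluster, idle
print(path_latency_ns(5))                    # large cluster, idle
print(path_latency_ns(5, queueing_ns=2000))  # large cluster, congested
```

Even with idealized numbers, latency grows multiplicatively with tiers and congestion, which is why adding switching capacity alone does not tame the tail.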
Scale fabric improves performance, economics
Tackling congestion has moved beyond adding layers of switches and more bandwidth to addressing the root cause: the network architecture itself. Moving to a distributed fabric delivers better performance, resource utilization and network economics. In the COSMA cluster, the Rockport scale fabric distributes the network switching function to the endpoint devices (nodes), which themselves become the network. The result is 25-nanosecond switching latency, versus upwards of 15,000 nanoseconds for centralized architectures. Eliminating layers of switches ensures that compute and storage resources are no longer starved, and gives researchers more predictable workload completion times.