Evaluating next generation HPC interconnection networks

Wolfe, Noah
Other Contributors
Shephard, Mark S.
Carothers, Christopher D.
Ross, Robert B., 1972-
Slota, George M.
Issue Date
Computer science
Terms of Use
Attribution-NonCommercial-NoDerivs 3.0 United States
This electronic version is a licensed copy owned by Rensselaer Polytechnic Institute, Troy, NY. Copyright of original work retained by author.
Full Citation
The common theme throughout this thesis is quantifying and understanding the performance of application workloads on current and potential future HPC interconnection networks. To achieve this goal, network systems are modeled and analyzed using discrete-event simulation. Unlike traditional cycle-accurate simulation, the discrete-event modeling methodology, and specifically parallel discrete-event modeling, makes it possible to execute large ensembles of simulations and generate the comprehensive set of results necessary to perform exhaustive network design/performance studies in a reasonable amount of time. The discrete-event based network models are used to evaluate and predict the performance of large-scale HPC systems of various theoretical configurations under a wide range of workloads, including synthetic traffic, CPU applications, and neuromorphic computing applications.
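As a rough illustration of the discrete-event idea mentioned above, the following minimal sketch advances simulated time by jumping from one scheduled event to the next instead of stepping through every clock cycle. It is not the simulator used in the thesis: the function name `simulate`, the `(send_time, hops)` message format, and the fixed `hop_latency` are all assumptions made for this example, and it runs sequentially rather than in parallel.

```python
import heapq


def simulate(messages, hop_latency=1.0):
    """Hypothetical sequential discrete-event loop (illustrative only).

    messages: list of (send_time, hops) tuples, one per message.
    Returns (arrivals, processed), where arrivals maps message index to
    arrival time and processed counts the events handled.
    """
    events = []  # min-heap of (time, seq, msg_id, hops_left)
    seq = 0      # tie-breaker so events with equal timestamps stay ordered
    for msg_id, (send_time, hops) in enumerate(messages):
        heapq.heappush(events, (send_time, seq, msg_id, hops))
        seq += 1

    arrivals = {}
    processed = 0
    while events:
        # The clock jumps straight to the earliest pending event.
        now, _, msg_id, hops_left = heapq.heappop(events)
        processed += 1
        if hops_left == 0:
            arrivals[msg_id] = now
        else:
            # Forwarding a message schedules a new event one hop later.
            heapq.heappush(events, (now + hop_latency, seq, msg_id, hops_left - 1))
            seq += 1
    return arrivals, processed


if __name__ == "__main__":
    arrivals, processed = simulate([(0.0, 3), (1.0, 1)], hop_latency=2.0)
    print(arrivals, processed)
```

In this toy model the work done scales with the number of events (one per hop per message) rather than with the length of the simulated time span, which is the property that makes large ensembles of runs affordable.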
In the first part of this thesis, we describe a subset of network topologies chosen for evaluation. The networks are chosen because they are either currently used in a deployed HPC system or they possess characteristics, such as a low diameter, that make them a promising option as the interconnection network in a next generation supercomputer. We describe the Fat-Tree network and extensions made to represent pruned multi-rail configurations. Additionally, we discuss two approaches to Dragonfly networks selected for comparison that leverage all-to-all connections and 2D grid connectivity within router groups. We also cover a recently proposed theoretical network topology called the Slim Fly. The topology layouts, connectivity and routing algorithms, as well as model validation are discussed to provide a clear picture of each network's theoretical capabilities and simulator accuracy.
In the second part of this thesis, we perform numerous evaluations analyzing the scaling performance of the simulation framework as well as the performance of these networks at large scale in response to various workloads and HPC environment conditions. The back-end discrete-event simulator is analyzed, showing the effectiveness of the approach in speeding up simulation run times by running in parallel. The discrete-event based network models are then used to perform a number of studies to predict and quantify performance of the networks. We test the Slim Fly at large scale under CPU workloads to observe the effect of routing on end-time performance. We study the performance benefits of additional rails in the Fat-Tree network by analyzing rail-scaling, job placements, multi-job execution, and increased computational power per compute node. We then test equally provisioned Dragonfly, Fat-Tree and Slim Fly networks under synthetically generated workloads as well as real CPU application and novel neuromorphic application trace workloads to provide a fair comparison across a wide range of traffic workloads. Lastly, the results of the comparisons are summarized and compared with physical system costs in an attempt to provide a single figure of merit for comparing each network's performance as an HPC system interconnect.
May 2019
School of Science
Dept. of Computer Science
Rensselaer Polytechnic Institute, Troy, NY
Rensselaer Theses and Dissertations Online Collection
CC BY-NC-ND. Users may download and share copies with attribution in accordance with a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 License. No commercial use or derivatives are permitted without the explicit approval of the author.