From weeks to days: accelerating HPC network simulations through event reduction in parallel discrete event simulation

Authors
Cruz Camacho, Elkin Alejandro
Issue Date
2025-12
Type
Electronic thesis
Thesis
Language
en_US
Keywords
Computer science
Abstract
High-fidelity Parallel Discrete Event Simulation (PDES) of modern HPC systems is slow. Simulating just 1.65 seconds of virtual time on an 8,448-node dragonfly network model using 33 CPU cores requires processing 49.6 billion events over 2 hours of wall-clock time. This is about a 5,000:1 slowdown between virtual time and wall-clock time. At that cost, comprehensive network studies become prohibitively expensive and may require weeks-long simulation campaigns, which limits our ability to design and optimize next-generation HPC systems. This thesis presents the Model-Predictor-Director (MPD) architecture for accelerating PDES through event reduction. The MPD architecture decomposes acceleration into three pluggable components: the Model (the core PDES simulation), the Predictor (subroutines that estimate future states), and the Director (which orchestrates switching between fidelity levels). We categorize event reduction into two fundamental approaches: agglutination (aggregating events within processes) and extermination (fast-forwarding through virtual time). Through three concrete HPC network use cases, we demonstrate that event reduction yields acceleration gains of 1.96x to 2,671x while maintaining high accuracy. Agglutination on a neuromorphic simulator achieves 82x acceleration. HPC network simulations achieve 76x acceleration using bandwidth-delay surrogacy on dragonfly topologies. Application-level extermination delivers 11x acceleration with 99.45% accuracy across multi-scale validation (72 to 8,448 nodes). To exterminate events at the application level, we fast-forwarded entire global states; with a near-zero-cost state approximation function, this let us run the simulation 2,671x faster than the original network model. We found that simple average-based predictors suffice to capture steady-state behavior with 96.4% reliability (< 5% error).
Because PDES models are parallel, implementing changes to them is arduous and bug-prone. We therefore also introduce a reverse handler validation methodology, which discovered latent bugs in PDES and allowed us to build deterministic and robust models.
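The Model, Predictor, and Director split described in the abstract can be illustrated with a toy Python sketch. This is not the thesis implementation: the names `AveragePredictor` and `run_with_director`, the sliding-window steady-state test, the 5% tolerance default, and the fixed skip length are all assumptions made for illustration. The model is any function that returns one observed value per step at full fidelity, the predictor averages recent observations, and the director fast-forwards with the predicted value whenever the window looks steady.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class AveragePredictor:
    """Hypothetical predictor: estimates future values as a sliding-window mean."""
    window: int = 4
    history: List[float] = field(default_factory=list)

    def observe(self, value: float) -> None:
        # Keep only the most recent `window` observations.
        self.history.append(value)
        self.history = self.history[-self.window:]

    def predict(self) -> float:
        return sum(self.history) / len(self.history)

    def is_steady(self, tolerance: float) -> bool:
        # Steady when the window is full and every sample lies within
        # a relative `tolerance` of the window mean.
        if len(self.history) < self.window:
            return False
        mean = self.predict()
        return all(abs(v - mean) <= tolerance * abs(mean) for v in self.history)


def run_with_director(
    model_step: Callable[[int], float],
    total_steps: int,
    events_per_step: int,
    predictor: AveragePredictor,
    tolerance: float = 0.05,
    skip: int = 10,
) -> Tuple[List[float], int]:
    """Hypothetical director loop: alternates between full-fidelity steps
    and fast-forwarded stretches filled in by the predictor."""
    trace: List[float] = []
    events = 0
    step = 0
    while step < total_steps:
        if predictor.is_steady(tolerance):
            # Fast-forward: reuse the prediction and process no events.
            n = min(skip, total_steps - step)
            trace.extend([predictor.predict()] * n)
            step += n
            # Force a fresh full-fidelity window before the next skip.
            predictor.history.clear()
        else:
            # Full fidelity: run the model and pay its event cost.
            value = model_step(step)
            events += events_per_step
            predictor.observe(value)
            trace.append(value)
            step += 1
    return trace, events
```

For example, calling `run_with_director(lambda s: 10.0, 100, 1000, AveragePredictor())` on a perfectly steady toy model runs only part of the 100 steps at full fidelity and fills in the rest from the average. Clearing the window after each skip is one simple way to make the director re-check the model periodically; a real director would likely use a more careful switching criterion.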
Description
December 2025
School of Science
Publisher
Rensselaer Polytechnic Institute, Troy, NY