Efficient distributed graph algorithms for high performance computing contexts

Bogle, Ian
Other Contributors
Devine, Karen
Szymański, Bolesław
Zaki, Mohammed J., 1971-
Slota, George M.
Issue Date
Computer science
Terms of Use
This electronic version is a licensed copy owned by Rensselaer Polytechnic Institute (RPI), Troy, NY. Copyright of original work retained by author.
Abstract
There exists a rich and long-standing body of literature on efficient parallel graph algorithms. For many problems, an optimal shared memory approach has been described and refined by numerous studies. However, as our scientific computing applications increase in scale, so too do the graphs intrinsic to their representation of the world. The large scale of graph data poses a significant problem for analytics that rely on well-established, efficient shared memory approaches: these approaches have not been tested in distributed memory environments. To make matters worse, it is not clear that every shared memory approach would be efficient in distributed memory, let alone easy to implement.
We explore the problem space of distributed memory graph algorithms targeting scientific computing applications running on High Performance Computing (HPC) systems. Through this work, we have seen immense value gained from tailoring graph algorithms specifically to the problem at hand. By making use of mesh data and other application data, our approach executes in a fraction of a percent of the time of a single simulation step of our target application. We also note anecdotally that incorporating efficient pre-processing into the simulation pipeline can save valuable time that would otherwise be spent exporting meshes and running the pre-processing outside of the HPC platform.
Additionally, we explore how to efficiently leverage all computing power at our disposal. We implemented a hierarchically parallel graph coloring framework that executes efficiently on HPC systems with GPU resources or with only CPU resources. We use an architecture-aware approach to select the algorithms that we experimentally determined to be more efficient on the available hardware. Assembling such a framework is not a trivial feat, as build processes for these large scientific computing libraries can be complex. Our runtime experiments show that, overall, our new approach is faster than a similar MPI-only framework and, importantly, uses few additional colors in general. This capability to adapt to different architectures is vital to performant distributed algorithms, as it allows the users of the algorithm to leverage all hardware at their disposal.
Finally, we explore directly porting shared memory graph algorithms to distributed memory, and find that this approach is not guaranteed to yield efficient implementations. We study implementations of two graph biconnectivity algorithms: a well-known parallel algorithm that is known to be efficient, and a newer algorithm that is competitive in shared memory and was selected for its use of simple subroutines. We find that, in general, it can be difficult to obtain an efficient distributed memory implementation from an algorithm formulated only for shared memory. Additionally, specific implementation details can make certain shared memory algorithms very difficult to implement. A common approach to biconnectivity is the construction of a secondary graph, which is much simpler to achieve in shared memory than in distributed memory. We show runtimes for our distributed memory implementations of several shared memory algorithms, including Breadth First Search (BFS), descendant counting, preorder labeling, and constructing spanning forests. Our implementations that follow simpler but theoretically less efficient approaches tend to outperform the theoretically optimal shared memory algorithms that we implemented in distributed memory. It is important to note that this does not mean that shared memory approaches cannot inform our distributed memory approaches, only that it can be difficult to intuit which approaches will be the most performant in distributed memory.
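For readers unfamiliar with the subroutines named in the abstract, the following is a minimal sequential sketch of a BFS spanning tree, descendant counting, and preorder labeling. It is illustrative only: it runs in a single process, it is not the thesis's distributed memory implementation, and the function names (`bfs_tree`, `descendant_counts`, `preorder_labels`) and the adjacency-dictionary input format are assumptions made for this example.

```python
from collections import deque


def bfs_tree(adj, root):
    """Return (parent, order): the BFS parent map and vertices in visit order."""
    parent = {root: None}
    order = [root]
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in parent:  # first visit: u becomes v's tree parent
                parent[v] = u
                order.append(v)
                queue.append(v)
    return parent, order


def descendant_counts(parent, order):
    """Count the vertices in each subtree, including the subtree's root."""
    count = {v: 1 for v in order}
    # Reverse BFS order visits every child before its parent.
    for v in reversed(order):
        p = parent[v]
        if p is not None:
            count[p] += count[v]
    return count


def preorder_labels(parent, order, count):
    """Assign preorder numbers from subtree sizes, without recursion."""
    children = {v: [] for v in order}
    for v in order:
        if parent[v] is not None:
            children[parent[v]].append(v)
    label = {order[0]: 0}
    # BFS order guarantees a parent is labeled before its children.
    for u in order:
        next_label = label[u] + 1
        for c in children[u]:
            label[c] = next_label
            next_label += count[c]  # skip past c's whole subtree
    return label
```

A usage example under the same assumptions: for `adj = {0: [1, 2], 1: [0, 3, 4], 2: [0, 3], 3: [1, 2], 4: [1]}`, calling `bfs_tree(adj, 0)` and passing the results through the other two functions yields a subtree size and a preorder number for every vertex. In a distributed setting each of these steps requires communication between processes that own different vertices, which is part of why the abstract reports that direct ports are not guaranteed to be efficient.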
August 2022
School of Science
Dept. of Computer Science
Rensselaer Polytechnic Institute, Troy, NY
Rensselaer Theses and Dissertations Online Collection
Restricted to current Rensselaer faculty, staff and students in accordance with the Rensselaer Standard license. Access inquiries may be directed to the Rensselaer Libraries.