There is a rich and long-standing body of literature on efficient parallel graph
algorithms. For many problems, an optimal shared memory approach has been described
and refined by numerous studies. However, as our scientific computing applications increase
in scale, so too do the graphs intrinsic to their representation of the world. The large scale
of graph data poses a significant problem for analytics with well-established efficient shared
memory approaches: they have not been tested in distributed memory environments. To
make matters worse, it is not clear that every shared memory approach would be efficient in
distributed memory, let alone easy to implement.
We explore the problem space of distributed memory graph algorithms targeting scientific
computing applications running on High Performance Computing (HPC) systems.
Through this work, we have seen immense value gained from tailoring graph algorithms
specifically to the problem at hand. We were able to make use of mesh data and other
application data so that our approach executes in a fraction of a percent of the time of a
single simulation step of our target application. We also note anecdotally that incorporating
efficient pre-processing into the simulation pipeline can save valuable time that would
otherwise be spent exporting meshes and running the pre-processing outside of the HPC platform.
Additionally, we explore how to efficiently leverage all computing power at our disposal.
We implemented a hierarchically parallel graph coloring framework that is able to execute
efficiently on HPC systems with GPU resources or with only CPU resources. We use an
architecture-aware approach to selecting algorithms we experimentally determined to be
more efficient given certain hardware resources. Assembling such a framework is not a trivial
feat, as build processes for these large scientific computing libraries can be complex. Our
runtime experiments show that, overall, our new approach is faster than a similar MPI-only
framework and, importantly, uses only a few more colors in general. This capability to adapt to
different architectures is vital to performant distributed algorithms, as it allows the users of
the algorithm to leverage all hardware at their disposal.
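
As a point of reference for the coloring problem itself, the following is a minimal serial sketch of first-fit greedy coloring. It is not the hierarchically parallel framework described above: the function names `greedy_coloring` and `is_proper` and the 5-cycle example are hypothetical and chosen only for illustration. The number of colors used is the quality measure referred to above.

```python
def greedy_coloring(adjacency):
    """Give each vertex the smallest color not used by an already-colored neighbor."""
    colors = {}
    for v in adjacency:
        used = {colors[u] for u in adjacency[v] if u in colors}
        c = 0
        while c in used:
            c += 1
        colors[v] = c
    return colors


def is_proper(adjacency, colors):
    """Return True if no edge joins two vertices of the same color."""
    return all(colors[u] != colors[v] for u in adjacency for v in adjacency[u])


# Illustrative input (an assumption, not data from this work): a cycle on 5 vertices.
cycle = {i: [(i - 1) % 5, (i + 1) % 5] for i in range(5)}
coloring = greedy_coloring(cycle)
num_colors = max(coloring.values()) + 1
```

An odd cycle needs three colors, so first-fit is optimal on this input. In general the color count depends on the order in which vertices are visited.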
Finally, we explore directly porting shared memory graph algorithms to distributed
memory, and find that this approach is not guaranteed to yield efficient implementations.
We study implementations of two graph biconnectivity algorithms: a well-known parallel
algorithm that is known to be efficient, and a newer algorithm that is competitive
in shared memory and was selected for its use of simple subroutines. We find that in
general it can be difficult to obtain an efficient distributed memory implementation from an
algorithm only formulated for shared memory. Additionally, there are specific implementation
details that can make certain shared memory algorithms very difficult to implement.
A common approach to biconnectivity is the construction of a secondary graph, which is
much simpler to achieve in shared memory than it is in distributed memory. We show runtimes
for our implementations of several shared memory algorithms, including Breadth-First
Search (BFS), descendant counting, preorder labeling, and constructing spanning forests.
Our implementations that follow simpler but theoretically less optimal approaches tend to
outperform the optimal shared memory algorithms we implemented in distributed memory.
It is important to note that this does not mean that shared memory approaches cannot inform
our distributed memory approaches, just that it can be difficult to intuit what approaches
will be the most performant in distributed memory.
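
To make the subroutines named above concrete, the following is a minimal serial sketch of a BFS spanning tree, descendant counting, and preorder labeling on a small graph. It is illustrative only and is not one of the distributed memory implementations evaluated in this work. The function names and the example graph are hypothetical.

```python
from collections import deque


def bfs_tree(adjacency, root):
    """Return BFS parent pointers and the order in which vertices were discovered."""
    parent = {root: None}
    order = [root]
    queue = deque([root])
    while queue:
        v = queue.popleft()
        for u in adjacency[v]:
            if u not in parent:
                parent[u] = v
                order.append(u)
                queue.append(u)
    return parent, order


def descendant_counts(parent, order):
    """Count the vertices in each subtree of the BFS tree, including the subtree root."""
    counts = {v: 1 for v in order}
    # Children always follow their parent in BFS order, so a reverse sweep
    # finalizes each child's count before adding it to the parent.
    for v in reversed(order):
        if parent[v] is not None:
            counts[parent[v]] += counts[v]
    return counts


def preorder_labels(parent, order, counts):
    """Label vertices in preorder, using subtree sizes to skip over earlier siblings."""
    children = {v: [] for v in order}
    for v in order:
        if parent[v] is not None:
            children[parent[v]].append(v)
    labels = {order[0]: 0}
    for v in order:  # a parent is labeled before any of its children
        next_label = labels[v] + 1
        for c in children[v]:
            labels[c] = next_label
            next_label += counts[c]
    return labels


# Illustrative input (an assumption, not data from this work).
graph = {0: [1, 2], 1: [0, 3, 4], 2: [0, 4], 3: [1], 4: [1, 2]}
parent, order = bfs_tree(graph, 0)
counts = descendant_counts(parent, order)
labels = preorder_labels(parent, order, counts)
```

In shared memory, each of these passes reads and writes arrays that every thread can see. In distributed memory, each access to a vertex owned by another process becomes communication.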
August 2022
School of Science, Dept. of Computer Science
Rensselaer Polytechnic Institute, Troy, NY
Rensselaer Theses and Dissertations Online Collection
Restricted to current Rensselaer faculty, staff and students in accordance with the
Rensselaer Standard license. Access inquiries may be directed to the Rensselaer Libraries.