New approaches to efficient structural analysis of social and biological networks

Authors
Kuzmin, Konstantin
Other Contributors
Szymański, Bolesław
Adali, Sibel
Carothers, Christopher D.
Gaiteri, Christopher
Korniss, Gyorgy
Issue Date
2017-12
Keywords
Computer science
Degree
PhD
Terms of Use
Attribution-NonCommercial-NoDerivs 3.0 United States
This electronic version is a licensed copy owned by Rensselaer Polytechnic Institute, Troy, NY. Copyright of original work retained by author.
Abstract
Network science is a discipline within the larger area of computer science that studies complex networks using the methodologies of graph theory, statistical analysis, and data visualization. As more and more data become available in the form of networks, the methods of structural graph analysis become common tools for gaining important insights into the real-world objects represented by a network, revealing meaningful patterns, and unveiling features that would not otherwise be discoverable. The process of finding communities (or clusters) of nodes that are more densely connected to each other than to the rest of the network is commonly referred to as community detection or clustering. It provides valuable information about the network structure, its resilience, stability, susceptibility to external disturbances, and many other properties. An important feature of communities is that they may overlap, i.e., nodes may participate in more than one community. Overlapping community detection is substantially more computationally intensive than disjoint community detection, which poses additional challenges to algorithm designers.
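As a small illustration of the definition above, the following sketch counts the edges inside a candidate community and the edges crossing its boundary. The function name `edge_counts` and the toy graph are assumptions made for this example and do not come from the thesis; the graph is two triangles sharing node 2, so that node belongs to both communities.

```python
def edge_counts(edges, community):
    """Return (internal, boundary) edge counts for a set of nodes.

    internal: edges with both endpoints inside the community.
    boundary: edges with exactly one endpoint inside the community.
    """
    members = set(community)
    internal = sum(1 for u, v in edges if u in members and v in members)
    boundary = sum(1 for u, v in edges if (u in members) != (v in members))
    return internal, boundary


# Hypothetical toy graph: two triangles that overlap in node 2.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (2, 4), (3, 4)]
left = {0, 1, 2}
right = {2, 3, 4}

print(edge_counts(edges, left))   # internal vs. boundary edges for the left triangle
print(edge_counts(edges, right))  # node 2 is a member of both communities
```

A disjoint clustering would have to assign node 2 to only one triangle, while an overlapping clustering can keep it in both.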
We believe that suggesting potentially successful high-risk transformative research collaborations, and revealing traces through the data that support those predictions, can be a decisive factor in reducing uncertainty and encouraging researchers to engage in collaborations that would be unlikely to form through conventional channels. As part of the Synergy Landscapes project, we created MoleClue, an application that implements the Synergy principles and will be made available to users. The application consists of a multilayer network that includes molecular, publication, and author graphs and a set of algorithms for performing nontrivial searches and ranking the results. Our experiments show that the potential collaborators recommended by several ranking methods implemented in MoleClue, based on several molecules commonly associated with Alzheimer's Disease, have a high degree of correlation with each other. To further verify the validity of our method, we consider authors who frequently coauthor publications and compute the proximity of the molecules that such authors have in common. Then, we contrast those values with the proximity of random molecules. The results indicate that potential collaborators suggested by our algorithm are at least an order of magnitude more likely to appear than by random chance.
Our previous work on extending the SLPA algorithm led to the development of SpeakEasy, a robust community detection algorithm that combines top-down and bottom-up approaches with the label propagation process and performs multiple runs of consensus clustering. We showed that SpeakEasy can surpass SLPA in the quality of the communities it discovers for a number of representative real-world and synthetic networks. At the same time, since SpeakEasy is a more sophisticated extension of SLPA, its base sequential version does not provide the efficiency needed to analyze billion-scale graphs. In this work, we developed a parallel SpeakEasy algorithm that is capable of efficiently performing community detection on both shared memory and distributed memory machines. Since SpeakEasy requires that certain global data (e.g., the global label histogram) be maintained and made available to all processors, designing an efficient parallel solution requires especially thorough planning. We show that by carefully selecting data structures and communication patterns, and by optimizing the algorithm to take advantage of both specific MPI library features and certain capabilities of the underlying hardware platforms, parallel SpeakEasy can achieve the expected degree of parallel efficiency.
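To make the idea of a global label histogram concrete, the sketch below merges per-processor label counts into a single histogram. It is a plain Python stand-in, not MPI code and not the thesis implementation: the function names and the sample partitions are assumptions for illustration. In a distributed setting the same summation would typically be carried out by a collective reduction.

```python
from collections import Counter


def local_histogram(labels):
    """Count how often each label occurs on one (simulated) processor."""
    return Counter(labels)


def merge_histograms(local_histograms):
    """Sum per-processor histograms into one global histogram."""
    merged = Counter()
    for histogram in local_histograms:
        merged.update(histogram)  # adds counts label by label
    return merged


# Hypothetical labels held by two processors after some propagation step.
partitions = [["a", "a", "b"], ["b", "c"]]
global_histogram = merge_histograms(local_histogram(p) for p in partitions)
print(global_histogram)
```

Every processor needs the merged result, which is why keeping such global data consistent is a communication cost that a disjoint partitioning of the graph alone does not remove.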
Network science is inherently interdisciplinary, as it deals with complex networks that originate from a variety of domains: biology, geology, ecology, social sciences, telecommunication, transportation, power generation and distribution, etc. We propose the Synergy Landscapes project, which combines data from different domains (e.g., molecules and publications in biology) with the multilayer graph representation and analysis algorithms provided by network science. In biology, innovative research is often associated with novel combinations of well-known molecules that are studied in some new context. Although this approach can potentially bring breakthrough results in addressing some of the most challenging common diseases, like cancer or Alzheimer's Disease, a lot of research tends to be incremental, focusing on producing "safe" findings.
The widespread adoption of digital technologies, which penetrate almost all aspects of our lives, has led to enormous growth in both the amount of data collected and generated and the size of individual datasets. Even with the fastest linear time community detection algorithms, networks that contain millions or billions of nodes and edges cannot be efficiently analyzed on single processor computers. We review and compare a group of major parallel community detection solutions and select the near linear time Speaker-listener Label Propagation Algorithm (SLPA) as the basis for our parallel overlapping community detection algorithm. Generally speaking, a large graph cannot be divided between several processors such that each processor performs its share of the work independently of the others. Hence, community detection is not embarrassingly parallel. We first describe our previous work that explores the benefits of a multithreaded programming paradigm and show that it yields a significant performance gain over sequential execution in detecting overlapping communities. Then, we discuss the limitations of the multithreaded solution and propose a parallel community detection method that uses the Message Passing Interface (MPI). This approach does not rely on the availability of memory shared between the processors. Therefore, it is well suited for distributed memory architectures, like the IBM System Blue Gene/Q supercomputer. We show that our MPI SLPA scales better under parallel processing than the multithreaded solution does. We also present evidence that scalability is limited by the properties of the base SLPA algorithm, as described by Amdahl's Law.
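For readers unfamiliar with SLPA, the following is a minimal sequential sketch of speaker-listener label propagation as it is commonly described, not the parallel algorithm developed in the thesis. The function name `slpa`, its parameters (`iterations`, `threshold`, `seed`), the tie-breaking rule, and the toy graph are assumptions for this example. Each node keeps a memory of labels; in every iteration each listener receives one label from each neighbor (drawn in proportion to how often that label occurs in the neighbor's memory) and stores the most popular one. Labels that are rare in a node's memory are discarded at the end, and a node that keeps several labels belongs to several communities.

```python
import random
from collections import Counter


def slpa(adjacency, iterations=20, threshold=0.1, seed=0):
    """Sequential SLPA sketch. adjacency maps each node to its neighbors."""
    rng = random.Random(seed)
    # Every node's memory starts with its own id as the only label.
    memory = {node: Counter({node: 1}) for node in adjacency}
    nodes = list(adjacency)

    for _ in range(iterations):
        rng.shuffle(nodes)  # visit listeners in random order
        for listener in nodes:
            if not adjacency[listener]:
                continue  # isolated node: nothing to listen to
            received = Counter()
            for speaker in adjacency[listener]:
                labels = list(memory[speaker].keys())
                weights = list(memory[speaker].values())
                # Speaker sends a label with probability proportional
                # to its frequency in the speaker's memory.
                received[rng.choices(labels, weights=weights)[0]] += 1
            # Listener keeps the most popular label; ties broken at random.
            top = max(received.values())
            winners = sorted(l for l, c in received.items() if c == top)
            memory[listener][rng.choice(winners)] += 1

    # Post-processing: keep labels whose share of a node's memory
    # reaches the threshold; nodes sharing a label form a community.
    communities = {}
    for node, labels in memory.items():
        total = sum(labels.values())
        for label, count in labels.items():
            if count / total >= threshold:
                communities.setdefault(label, set()).add(node)
    return communities


# Hypothetical toy graph: two disconnected triangles.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
print(slpa(graph, seed=1))
```

The inner loop shows why the work is hard to split: every listener reads the memories of its neighbors, so any partition of the graph across processors must exchange labels along the edges that cross partition boundaries.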
Description
December 2017
School of Science
Department
Dept. of Computer Science
Publisher
Rensselaer Polytechnic Institute, Troy, NY
Relationships
Rensselaer Theses and Dissertations Online Collection
Access
CC BY-NC-ND. Users may download and share copies with attribution in accordance with a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 License. No commercial use or derivatives are permitted without the explicit approval of the author.