An information theoretic approach to graph representations, graph embedding and embedding evaluation

Authors
Eleish, Ahmed Morad
Issue Date
2022-05
Type
Electronic thesis
Thesis
Language
en_US
Keywords
Multidisciplinary science
Abstract
For centuries, mineralogists have sought ways to classify the now more than 5500 mineral species based on physical and chemical attributes. Collectively, these studies have been constrained by tabular representations with a relatively small number of attribute columns. In recent efforts, Earth science researchers have been incorporating mineral co-occurrence data and studying patterns of coexisting mineral species in order to understand the evolution of minerals and their interaction with the environment. Paleontologists rely on the fossil record to learn about the development and behaviors of various species of flora and fauna throughout geologic time. Fossil data are used to reconstruct taxonomic maps and hierarchies describing the relationships between species based on evolutionary and environmental criteria. Studying these relationships is key to learning about the evolution of life and its effect on the planet. Analyses of each of these fundamental natural systems have been constrained by low-dimensional tabular representations and have therefore been mostly descriptive. This constraint has been a barrier to harnessing the power of machine learning as a vehicle for predictive analysis of data from these systems. In this thesis, motivated by such problems as well as others in human systems, we utilized graph-based representations of extant data, and developed and applied graph entropy methods to quantify the structural information content of graphs, in order to improve upon existing graph embedding methods and introduce a new evaluation method for graph embeddings. Many natural and human systems are modeled well as networks of interacting entities and represented mathematically by graphs, which opens the door to applying quantitative methods to extract useful information from these graphs. This is a challenging task, however, given the high-dimensional nature of graphs.
Graph embedding methods aim to learn concise vector representations that accurately preserve the graph structure and are often built around a specific representation of the graph. The performance of these methods varies greatly depending on the characteristics of the input graph, and the most commonly used approach to evaluating this performance is the training and testing of downstream predictive models. Both of these aspects of the embedding process introduce complexity and increase uncertainty, making it difficult for researchers to make decisions while designing and deploying analysis pipelines. Graph embedding methods operate on high-dimensional graph representations, such as graph matrices or sets of sampled graph paths, to obtain low-dimensional vector embeddings by applying various methods including matrix factorization and deep learning. A limitation of current methods is that each is built around one specific graph representation, and thus the outcome is sensitive to that one perspective. Graphs are complex structures that can be viewed from several perspectives and are best described through multiple complementary structures. A graph adjacency matrix, for example, represents distances between nodes, hinting at the strength of relationships between entities, while a node degree vector captures the connectedness of each node, offering potential insight into its role within the graph. Both perspectives provide useful information for downstream analysis tasks such as node classification and link prediction. The main challenge is finding a suitable combination of graph representation, embedding method and machine learning model to maximize predictive performance. In this thesis, we have developed a method to incorporate graph structural information content into the graph embedding process, using graph entropy measures.
We begin by computing the structural information content of a graph using six methods and then selectively combine the resulting value vectors with the adjacency matrix, producing novel graph representations. Each graph information method we have used captures structural information based on a different graph element, and the information functions utilized by these methods can be extended to use new graph measures and metrics. We have also developed a method to obtain vector embeddings from the new structural information matrix using a deep neural network. Deep neural networks are capable of capturing non-linear structure and are thus suitable for modeling complex graph structure. We designed an autoencoder deep neural network to learn low-dimensional vector embeddings from the structural information matrices described above. We utilized labeled graph datasets and an array of machine learning predictive models to show that our embedding method improves upon the predictive accuracy achieved by existing embedding approaches. Finally, we have developed a method to evaluate the suitability of graph embeddings by measuring the loss in information between the original graph and a reconstructed graph. We use the previously mentioned graph entropy measures to quantify and compare this loss in information between different embedding methods. The overall goal of this thesis is to introduce an information theoretic approach to graph representation, embedding and embedding evaluation, in order to improve the predictive performance of downstream machine learning models trained using these representations. We have constructed a software workflow to process graph data and execute all the steps needed to compute structural information content, obtain graph embeddings and then train and test machine learning predictive models.
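As a rough illustration of the first step, the sketch below computes one possible structural information measure and attaches it to the adjacency matrix. It is not the thesis's implementation: the abstract does not name its six methods, so the degree-based entropy used here, the function names `degree_entropy` and `augment_adjacency`, and the choice to append the per-node values as one extra column are all assumptions made for the example.

```python
import numpy as np


def degree_entropy(adj):
    """Return (entropy in bits, per-node contributions) for a graph.

    Assumption: each node gets probability p_i = degree_i / sum(degrees),
    and the graph entropy is the Shannon entropy of that distribution.
    """
    deg = adj.sum(axis=1).astype(float)
    total = deg.sum()
    if total == 0:
        # Graph with no edges: define the entropy as zero.
        return 0.0, np.zeros_like(deg)
    p = deg / total
    contrib = np.zeros_like(p)
    mask = p > 0  # treat 0 * log(0) as 0 for isolated nodes
    contrib[mask] = -p[mask] * np.log2(p[mask])
    return float(contrib.sum()), contrib


def augment_adjacency(adj):
    """Append the per-node information values as an extra column.

    Hypothetical combination rule, chosen only for illustration.
    """
    _, contrib = degree_entropy(adj)
    return np.hstack([adj.astype(float), contrib[:, None]])


# Path graph 0-1-2-3 (degrees 1, 2, 2, 1).
path = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])
entropy, per_node = degree_entropy(path)
augmented = augment_adjacency(path)
```

Under this assumed measure, a graph in which every node has the same degree reaches the maximum entropy log2(n), while uneven degree distributions score lower. The augmented matrix has one more column than the adjacency matrix and could serve as input to an embedding model such as an autoencoder.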
After examining the results of the experiments, we have identified factors in the input representations that affect the performance of the embedding methods and thus reveal the boundaries of the usefulness of information-content-based graph representations. We also discuss the implications of the variability in predictive accuracy between combinations of graph embedding methods and machine learning models. Finally, we introduce a novel method of evaluating embedding fitness based on graph reconstruction and graph entropy methods.
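The reconstruction-based evaluation idea can be sketched as follows. This is an illustrative example under stated assumptions, not the procedure used in the thesis: the reconstruction rule (linking the node pairs with the highest inner-product similarity until the original edge count is reached), the degree-based entropy, and the names `reconstruct_from_embedding` and `information_loss` are all hypothetical.

```python
import numpy as np


def degree_entropy(adj):
    """Shannon entropy (bits) of the degree-proportional node distribution."""
    deg = adj.sum(axis=1).astype(float)
    total = deg.sum()
    if total == 0:
        return 0.0
    p = deg[deg > 0] / total  # isolated nodes contribute nothing
    return float(-(p * np.log2(p)).sum())


def reconstruct_from_embedding(emb, n_edges):
    """Assumed rule: link the n_edges most similar node pairs."""
    n = emb.shape[0]
    sim = emb @ emb.T  # inner-product similarity between embeddings
    rows, cols = np.triu_indices(n, k=1)  # each unordered pair once
    order = np.argsort(sim[rows, cols])[::-1][:n_edges]
    recon = np.zeros((n, n), dtype=int)
    recon[rows[order], cols[order]] = 1
    recon[cols[order], rows[order]] = 1  # keep the graph undirected
    return recon


def information_loss(adj, emb):
    """Absolute entropy difference between original and reconstruction."""
    n_edges = int(np.triu(adj, k=1).sum())
    recon = reconstruct_from_embedding(emb, n_edges)
    return abs(degree_entropy(adj) - degree_entropy(recon))


# Two disjoint edges: 0-1 and 2-3.
adj = np.array([
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
])
good_emb = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
poor_emb = np.array([[2.0], [1.0], [1.0], [1.0]])

good_loss = information_loss(adj, good_emb)
poor_loss = information_loss(adj, poor_emb)
```

In this toy case the first embedding places linked nodes together and reproduces the original graph, so its loss is zero, while the second concentrates edges on one node and loses information. A smaller loss would suggest the embedding preserves more of the structure captured by the chosen entropy measure. Note that equal entropy does not by itself guarantee an identical graph.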
Description
May 2022
School of Science
Publisher
Rensselaer Polytechnic Institute, Troy, NY