Data-efficient machine learning on and with graphs

Authors
Chi, Hongliang
Issue Date
2025-12
Type
Electronic thesis
Thesis
Language
en_US
Keywords
Computer science
Abstract
Graph-structured data provides a uniquely expressive representation of relationships across diverse domains, from social networks to molecular systems. Unlike independent and identically distributed (i.i.d.) data, where samples exist in isolation, graphs encode explicit dependencies through edges connecting related entities. This expressiveness enables powerful modeling but introduces fundamental challenges. Annotating graph data demands substantial resources because domain experts must understand both individual nodes and their relational context. Evaluating how graph elements contribute to model performance requires rethinking standard data valuation frameworks. We define data valuation as quantifying individual data contributions to model performance. Traditional valuation methods assume training samples are independent and identically distributed; this assumption fails for graphs, where a node's value depends on its network position and its neighbors' labels. Conversely, graph structures offer a powerful lens for understanding relationships among i.i.d. samples: the graph representation provides an effective computational structure for modeling how data points substitute for one another. This thesis addresses data efficiency from complementary perspectives: developing methods that reduce labeling requirements for graph machine learning, and leveraging graph structures as computational tools for data valuation in general learning contexts.

We first address learning from graphs with limited supervision. Obtaining labels presents substantial practical barriers across domains: validating molecular properties requires expensive laboratory experiments, social network analysis confronts privacy regulations, and biological networks demand time-consuming expert annotation. Real-world graphs further contain structural noise from measurement errors and adversarial perturbations, which compounds the supervision challenge.
We develop enhanced contrastive learning objectives that improve representation quality through probabilistic modeling of node similarity. This approach achieves better positive-sample diversity and mitigates false negatives through carefully designed anchor-aware sampling distributions. We complement this with an active learning framework that jointly selects informative nodes for labeling while purifying corrupted graph structures through decoupled representation learning. The framework addresses label scarcity and structural noise simultaneously through an iterative process with theoretical grounding in the Expectation-Maximization algorithm.

The second challenge arises from the mismatch between traditional valuation assumptions and graph dependencies. In graph neural networks, node contributions emerge through hierarchical message passing, where information propagates across multiple hops. This creates complex contribution patterns in which both labeled and unlabeled nodes participate in determining prediction quality. We introduce the PC-Winter value, which respects the precedence and level constraints inherent in the computation trees that graph neural networks construct during forward passes. Efficient approximation strategies, including hierarchical truncation and local propagation, enable practical valuation while maintaining theoretical guarantees. At inference time, when test labels remain unavailable, we develop the first framework that quantifies neighbor importance without ground-truth labels. This framework extracts transferable features capturing both data properties, such as graph homophily, and model behaviors, such as prediction confidence. Rather than predicting intermediate accuracy estimates, the approach employs Shapley-guided optimization that directly targets accurate Shapley value prediction for improved efficiency and effectiveness.
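The core idea behind precedence-constrained valuation can be illustrated with a toy sketch: marginal contributions are averaged only over permutations in which every node joins the coalition after its parent in the computation tree. All names, the tree, and the utility function below are invented for illustration and are not the thesis's actual construction:

```python
import itertools

# Toy computation tree: node -> parent (root has parent None).
# Hypothetical structure, standing in for a GNN computation tree.
PARENT = {"a": None, "b": "a", "c": "a", "d": "b"}


def utility(coalition):
    """Stand-in utility (e.g., a validation-accuracy proxy).

    In practice this would require training/evaluating a model on the
    selected nodes; here each node adds a fixed amount, plus a small
    synergy bonus between a parent and its child.
    """
    base = {"a": 0.4, "b": 0.2, "c": 0.2, "d": 0.1}
    score = sum(base[n] for n in coalition)
    if "b" in coalition and "d" in coalition:
        score += 0.05  # synergy between connected nodes
    return score


def respects_precedence(order):
    """A permutation is admissible iff each node appears after its parent."""
    seen = set()
    for n in order:
        if PARENT[n] is not None and PARENT[n] not in seen:
            return False
        seen.add(n)
    return True


def constrained_values(nodes):
    """Average marginal contribution over admissible permutations only."""
    values = {n: 0.0 for n in nodes}
    admissible = [p for p in itertools.permutations(nodes)
                  if respects_precedence(p)]
    for order in admissible:
        coalition = []
        for n in order:
            before = utility(coalition)
            coalition.append(n)
            values[n] += utility(coalition) - before
    return {n: v / len(admissible) for n, v in values.items()}


vals = constrained_values(list(PARENT))
print({n: round(v, 3) for n, v in sorted(vals.items())})
```

Because the marginal contributions along any single permutation telescope, the values still sum to the utility of the full coalition (the efficiency property); the precedence constraint only changes how credit is distributed among nodes.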
Beyond graph-specific challenges, we reformulate data selection as sequential decision-making through a Markov Decision Process (MDP) formulation. This perspective reveals that existing game-theoretic valuation methods represent myopic approximations to optimal dynamic programming solutions. We analyze when these approximations achieve optimality and quantify their performance under different utility properties. To bridge theoretical optimality and practical scalability, we develop bipartite graph approximations that encode training-validation relationships through edge connectivity patterns. This representation enables efficient estimation of how training samples influence validation performance without repeated model retraining. The approach maintains provable performance bounds while achieving computational efficiency. These insights demonstrate how graph-based reasoning enhances data efficiency beyond specialized graph applications. Our contributions span theoretical foundations, algorithmic innovations, and empirical validations. The developed methods reduce annotation costs while maintaining or improving model performance. By establishing connections between graph structures, data valuation, and sequential decision-making, this thesis offers a principled framework for data-centric AI that addresses efficiency challenges across both graph-specific and general machine learning contexts.
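The bipartite training-validation idea can be sketched as a simple credit-splitting scheme: each validation point distributes one unit of credit among the training points connected to it, in proportion to edge weight, so training points that substitute for one another share credit instead of double-counting it. The edge weights and sample names below are purely illustrative; the thesis's actual construction and performance bounds are not reproduced here:

```python
from collections import defaultdict

# Hypothetical bipartite graph: (train, valid) -> similarity weight.
# In practice weights might come from feature similarity or model
# gradients; these numbers are made up for the sketch.
EDGES = {
    ("t1", "v1"): 0.9, ("t1", "v2"): 0.3,
    ("t2", "v1"): 0.8,
    ("t3", "v2"): 0.7, ("t3", "v3"): 0.6,
}


def influence_scores(edges):
    """Score each training point by its share of each validation point.

    Each validation point's unit of credit is split among its training
    neighbors in proportion to edge weight, so no model retraining is
    needed to compare training points.
    """
    by_valid = defaultdict(list)
    for (t, v), w in edges.items():
        by_valid[v].append((t, w))
    scores = defaultdict(float)
    for v, nbrs in by_valid.items():
        total = sum(w for _, w in nbrs)
        for t, w in nbrs:
            scores[t] += w / total  # t's proportional share of v's credit
    return dict(scores)


scores = influence_scores(EDGES)
print({t: round(s, 3) for t, s in sorted(scores.items())})
```

By construction the scores sum to the number of validation points, and a training point connected to an otherwise-uncovered validation point (here `t3` and `v3`) receives full credit for it.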
Description
December 2025
School of Science
Publisher
Rensselaer Polytechnic Institute, Troy, NY