Transformers for graph learning

Authors
Hussain, Md Shamim
Issue Date
2025-12
Type
Electronic thesis
Thesis
Language
en_US
Keywords
Computer science
Abstract
Recently, pure attention-based transformer neural networks have set new state-of-the-art results in various machine learning domains, including natural language processing, computer vision, and audio processing. The success of transformers lies in their use of the flexible and adaptive global attention mechanism, which allows for dynamic long-range connectivity among different parts of the input. Despite their success in representing unstructured data such as text and images, until recently, the adoption of pure transformers for graph-structured data has been limited. In this dissertation, we discuss our contributions toward developing transformer-based frameworks for graph-structured data. We introduce three key innovations to the transformer framework to make it suitable for graphs while preserving its dynamic nature and non-local properties: the effective incorporation and processing of explicit graph structure, an improvement in the computational efficiency of the dense attention mechanism, and the incorporation of higher-order interactions within the transformer framework for better geometric understanding and prediction. We first address the difficulty of incorporating complex structural information of graphs into the basic transformer framework. We propose a simple yet powerful extension to the transformer: residual edge channels. The resultant framework, which we call the Edge-augmented Graph Transformer (EGT), can directly accept, process, and output structural information as well as node information, which is crucial for graph learning. It allows us to use global self-attention, the key element of the transformer encoder, directly on graphs, and comes with the benefit of long-range interaction among nodes. Moreover, the edge channels allow the structural information to evolve from layer to layer, and prediction tasks on edges/links can be performed directly from the output embeddings of these channels.
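To make the edge-channel idea concrete, here is a minimal NumPy sketch of a single EGT-style attention step: edge-channel features bias the attention logits, and the logits in turn update the edge channels, so structural information can evolve from layer to layer. All function names, weight names, and shapes here are illustrative assumptions, not the thesis's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def egt_attention(H, E, Wq, Wk, Wv, We, Wo):
    """One EGT-style global attention step (single head, illustrative).

    H: (n, d) node features; E: (n, n, de) edge-channel features.
    The edge channels add a bias to the attention logits, and the raw
    logits are fed back to update the edge channels residually.
    Weight names (Wq, Wk, Wv, We, Wo) are hypothetical.
    """
    n, d = H.shape
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    logits = (Q @ K.T) / np.sqrt(d) + E @ We   # edge bias added to logits
    A = softmax(logits, axis=-1)               # dense global attention
    H_new = A @ V                              # node update
    E_new = E + logits[..., None] * Wo         # residual edge-channel update
    return H_new, E_new
```

Because the updated edge channels are themselves output features, edge/link prediction can read directly from `E_new`, as the abstract notes.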
We empirically establish that convolutional aggregation is not an essential inductive bias for graphs and that global self-attention can serve as a flexible and adaptive alternative. Additionally, we achieve state-of-the-art results on a variety of graph learning tasks, including node classification, graph classification and regression, and edge classification. Our second contribution aims to make the attention mechanism more efficient, so that transformers can scale to large graphs. We start by observing that the dense self-attention mechanism, which gives the transformer great flexibility for long-range connectivity, is under-utilized in most learning tasks on real-world data, including graphs. Over multiple layers of a deep transformer, the number of possible connectivity patterns grows exponentially; however, very few of these contribute to the performance of the network, and even fewer are essential. We hypothesize that there are sparsely connected sub-networks within a transformer, called information pathways, which can be trained independently. The dynamic (i.e., input-dependent) nature of these pathways makes it difficult to prune dense self-attention during training, but the overall distribution of these pathways is often predictable. We take advantage of this fact to propose Stochastically Subsampled self-Attention (SSA), a general training strategy for self-attention-based transformers that reduces both the memory and computational cost of self-attention by 4 to 8 times during training while also serving as a regularization method, improving generalization over dense training. Finally, we show that an ensemble of sub-models can be formed from the subsampled pathways within a network, which can achieve better performance than its densely attended counterpart.
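The core cost-saving mechanism described above can be sketched in a few lines: during training, each query attends only to a randomly sampled subset of keys, shrinking the logit matrix from n x n to n x keep. This is a minimal illustrative sketch under our own assumptions (function and argument names are hypothetical), not the SSA algorithm as implemented in the thesis.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def subsampled_attention(Q, K, V, keep, rng):
    """Stochastically subsampled self-attention (illustrative sketch).

    Instead of attending to all n keys, each forward pass samples
    `keep` key/value positions at random, so the logit matrix is
    (n, keep) rather than (n, n). Different samples on different
    passes act like a dropout-style regularizer over attention.
    """
    n, d = Q.shape
    idx = rng.choice(n, size=keep, replace=False)  # sampled key/value positions
    logits = (Q @ K[idx].T) / np.sqrt(d)           # (n, keep) instead of (n, n)
    return softmax(logits, axis=-1) @ V[idx]
```

With `keep = n // 4` the logit matrix is a quarter of its dense size, which is the flavor of the 4-8x training-cost reduction the abstract reports; at inference one can attend densely, or average several subsampled passes to form the sub-model ensemble mentioned above.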
We perform experiments on a variety of NLP, computer vision, and graph learning tasks in both generative and discriminative settings to provide empirical evidence for our claims and to show the effectiveness of the proposed training method. Finally, to imbue transformers with a deeper geometric understanding and processing capability, we develop a novel framework for third-order interactions in graph transformers. The original transformer framework lacks higher-order interactions, limiting its geometric understanding, which is crucial for tasks like molecular geometry prediction. We propose the Triplet Graph Transformer (TGT), which enables direct communication between pairs within a 3-tuple of nodes via novel triplet attention and aggregation mechanisms. TGT is applied to molecular property prediction by first predicting interatomic distances from 2D graphs and then using these distances for downstream tasks. A novel three-stage training procedure and stochastic inference further improve training efficiency and model performance. Our model achieves new state-of-the-art (SOTA) results on open challenge quantum chemistry benchmarks. We also obtain SOTA results on multiple molecular property prediction benchmarks via transfer learning, and we demonstrate the generality of TGT with SOTA results on the traveling salesman problem (TSP). Collectively, these contributions advance the state of graph learning by developing transformer-based frameworks that are structurally aware and robust, computationally efficient, and geometrically informed.
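The third-order interaction described above can be illustrated with a small NumPy sketch: to update the pair (i, j), the model attends over intermediate nodes k, scoring each triplet (i, j, k) from the pair features e_ik and e_kj. This is only a sketch in the spirit of triplet attention; the weight shapes and score/value parameterization are our own illustrative assumptions, not the TGT architecture itself.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def triplet_attention(E, Wa, Wv):
    """Third-order (triplet) interaction sketch over pair features.

    E: (n, n, de) pairwise features. Pair (i, j) attends over every
    intermediate node k; the triplet score combines e_ik and e_kj,
    and the value is derived from e_kj. Wa, Wv are hypothetical weights.
    """
    n, _, de = E.shape
    s1 = np.einsum('ikd,d->ik', E, Wa)         # score contribution from e_ik
    s2 = np.einsum('kjd,d->jk', E, Wa)         # score contribution from e_kj
    scores = s1[:, None, :] + s2[None, :, :]   # (i, j, k) triplet scores
    A = softmax(scores, axis=-1)               # attend over intermediate k
    V = np.einsum('kjd,de->kje', E, Wv)        # values from e_kj
    return np.einsum('ijk,kje->ije', A, V)     # updated pair features
```

Note the cubic (n^3) score tensor: this is the price of direct pair-to-pair communication through a shared third node, which second-order attention cannot express.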
Description
December 2025
School of Science
Publisher
Rensselaer Polytechnic Institute, Troy, NY