Optimization and generalization analysis of advanced neural networks and learning algorithms

Authors
Li, Hongkang
Issue Date
2024-12
Type
Electronic thesis
Thesis
Language
en_US
Keywords
Electrical engineering
Abstract
In recent years, deep learning has advanced rapidly. A notable trend in this development is the push to improve the learning efficiency of large foundation models, giving rise to numerous advanced learning algorithms designed for advanced neural networks. However, the theoretical understanding of these algorithms and deep models remains limited. This thesis addresses this gap by delving into the optimization and generalization of advanced neural networks.
The first part of the thesis focuses on the theoretical investigation of the basic building blocks of advanced neural networks. The first work studies one-layer single-head Vision Transformers (ViTs), i.e., a self-attention layer followed by a two-layer perceptron. This work provides the sample complexity and the number of iterations required to achieve zero generalization error on a binary classification task, based on a data model in which each sample contains several label-relevant and label-irrelevant tokens. The sample complexity bound implies that a larger fraction of label-relevant tokens, a smaller token noise level, and a smaller initial model error all enhance generalization. The theoretical findings also verify the common intuition behind the success of attention by proving that training with stochastic gradient descent (SGD) generates a sparse attention map focused on label-relevant tokens. Moreover, we conclude that proper token sparsification can improve performance by removing label-irrelevant or noisy tokens, including those arising from spurious correlations.
We then explore the generalization of Graph Transformers, an emerging architecture that extends Transformers to graph learning. This work is based on a graph data model with discriminative nodes that determine node labels and non-discriminative nodes that are class-irrelevant. The theoretical results quantitatively characterize the sample complexity and the number of iterations for convergence as functions of the fraction of discriminative nodes, the dominant patterns, and the fraction of erroneous labels. Meanwhile, we show that self-attention and positional encoding improve generalization by making the attention map sparse and by promoting the core neighborhood, which explains the superior feature representations of Graph Transformers compared with GCNs and with Graph Transformers without positional encoding.
The second part of the thesis studies modern learning algorithms on basic neural models. The first work studies learning under group imbalance on a one-hidden-layer fully connected neural network with Gaussian-mixture input, and quantifies the impact of individual groups on learning performance. The theoretical results show that when all group-level covariances lie in a medium regime and all group means are close to zero, we can achieve a small sample complexity, a fast training rate, and high average and group-level test accuracy. Moreover, it is shown that increasing the fraction of the minority group in the training data does not necessarily improve the generalization performance of the minority group. The second work studies graph topology sampling for a three-layer Graph Convolutional Network (GCN), a setting that involves both a network deeper than two layers and graph information in the model. This work characterizes sufficient conditions on graph topology sampling under which GCN training achieves a diminishing generalization error on a semi-supervised node classification task. The sample complexity result explicitly characterizes the impact of the graph structure and of topology sampling on generalization performance.
The third work studies in-context learning (ICL), an inference method that uses pairs of data and labels as a prompt to make predictions on test inputs without fine-tuning the model. We theoretically quantify the number of training prompts and iterations, as well as the length and distribution of testing prompts, required for a desired ICL capability on unseen tasks, with and without data distribution shifts. The training-dynamics analysis also characterizes how different components of the learned Transformer contribute to ICL performance. Moreover, this work proves that proper magnitude-based pruning has minimal impact on performance while reducing inference cost. The last work concerns Chain-of-Thought (CoT), a prompting method that incorporates multiple intermediate steps into each context example. This work establishes the first theoretical analysis of training nonlinear Transformers to obtain CoT generalization ability by quantifying the required number of training samples and iterations. It then theoretically characterizes the conditions under which CoT yields an accurate inference output even when the provided reasoning examples contain noise and are not always accurate. Meanwhile, ICL, i.e., one-step CoT without intermediate steps, may fail to provide an accurate output in cases where CoT succeeds.
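As a rough illustration of the architecture analyzed in the first work, the following PyTorch sketch shows one way a one-layer single-head ViT block (a self-attention layer followed by a two-layer perceptron) might be instantiated; the dimensions, mean pooling, and scaling are illustrative assumptions rather than the thesis's exact formulation.

import torch
import torch.nn as nn

class OneLayerViT(nn.Module):
    """Single self-attention head followed by a two-layer perceptron (illustrative sketch)."""
    def __init__(self, token_dim: int, hidden_dim: int):
        super().__init__()
        self.Wq = nn.Linear(token_dim, token_dim, bias=False)  # query map
        self.Wk = nn.Linear(token_dim, token_dim, bias=False)  # key map
        self.Wv = nn.Linear(token_dim, token_dim, bias=False)  # value map
        self.mlp = nn.Sequential(                              # two-layer perceptron head
            nn.Linear(token_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, x: torch.Tensor):
        # x: (batch, num_tokens, token_dim); tokens are label-relevant or label-irrelevant.
        scores = self.Wq(x) @ self.Wk(x).transpose(-2, -1) / x.shape[-1] ** 0.5
        attn = scores.softmax(dim=-1)            # attention map that the analysis shows becomes sparse
        context = attn @ self.Wv(x)              # attention output per token
        logits = self.mlp(context).mean(dim=1)   # pool token-wise outputs into one binary-classification logit
        return logits.squeeze(-1), attn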
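To make the contrast between ICL and CoT prompting concrete, the following Python sketch builds both kinds of prompts; the textual format, the arithmetic example, and the function names are hypothetical and are not taken from the thesis.

def icl_prompt(examples, query):
    # ICL: context examples are plain input-label pairs, followed by the query.
    lines = [f"Input: {x} -> Label: {y}" for x, y in examples]
    lines.append(f"Input: {query} -> Label:")
    return "\n".join(lines)

def cot_prompt(examples, query):
    # CoT: each context example additionally contains intermediate reasoning steps.
    lines = []
    for x, steps, y in examples:
        lines.append(f"Input: {x}")
        lines.extend(f"Step: {s}" for s in steps)
        lines.append(f"Label: {y}")
    lines.append(f"Input: {query}")
    return "\n".join(lines)

print(icl_prompt([("2+3*4", "14"), ("1+2*5", "11")], "3+2*6"))
print(cot_prompt([("2+3*4", ["3*4=12", "2+12=14"], "14")], "3+2*6"))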
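The magnitude-based pruning referenced in the ICL work can be sketched generically as follows; the unstructured, per-matrix thresholding here is an assumption made for illustration, not the thesis's pruning procedure.

import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    # Zero out the `sparsity` fraction of entries with the smallest absolute value.
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight.clone()
    threshold = weight.abs().flatten().kthvalue(k).values
    return weight * (weight.abs() > threshold)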
Description
December 2024
School of Engineering
Publisher
Rensselaer Polytechnic Institute, Troy, NY