Overcoming the challenges of reinforcement learning: safety, sample efficiency and generalization

Authors
Chen, Weiqin
Issue Date
2025-12
Type
Electronic thesis
Thesis
Language
en_US
Keywords
Electrical engineering
Abstract
Reinforcement learning (RL) has emerged as a transformative paradigm, driving remarkable advances across diverse domains such as robotics, healthcare, and large language model (LLM) post-training, and positioning RL at the frontier of modern AI. Yet deploying RL remains challenging: it is beset by several fundamental obstacles, of which safety, sample efficiency, and generalization are paramount. Accordingly, the goal of this thesis is to develop algorithms that systematically overcome these critical challenges, which hinder the adoption and deployment of RL techniques in real-world scenarios.

Safety. This part examines appropriate safety constraints for safety-critical applications and hyperparameter optimization for safe RL. We consider learning safe policies under probabilistic constraints and establish the first explicit gradient expressions for probabilistic-constrained RL. We then extend this to a generalized safety constraint that achieves a better return-safety trade-off than both the probabilistic constraint and the cumulative constraint. Our third work investigates the necessity of adaptive learning rates in safe RL, which arises from the interdependence of the learning rate and the Lagrange multipliers. In each of these works, we develop efficient algorithms accompanied by theoretical guarantees such as convergence, optimality, and feasibility.

Sample Efficiency. This part centers on improving the sample efficiency of RL, drawing on a broad range of tools including optimization, statistical analysis for domain adaptation, and optimal control. The first work considers offline RL settings in which the target dataset contains only limited samples, while auxiliary samples from related source datasets (such as simulators) can be leveraged. We propose the first framework that theoretically characterizes the optimal balance between the limited target dataset and a large-but-biased source dataset.
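The target/source balancing idea above can be illustrated with a minimal bias-variance sketch: weight a small unbiased target sample against a large biased source sample so as to minimize the mean squared error of the combined estimate. This is an illustrative toy, not the estimator developed in the thesis, and the assumption that the source's squared bias is known is made purely for the sketch.

```python
import numpy as np

def mix_estimate(target, source, bias_sq):
    """Combine a small unbiased target sample with a large biased source
    sample using the MSE-optimal mixing weight. `bias_sq` is an (assumed
    known) squared bias of the source mean -- illustrative only."""
    var_t = np.var(target, ddof=1) / len(target)            # variance of target mean
    mse_s = np.var(source, ddof=1) / len(source) + bias_sq  # MSE of source mean
    lam = mse_s / (var_t + mse_s)                           # weight on target data
    return lam * np.mean(target) + (1 - lam) * np.mean(source), lam

# Example: 20 target samples vs. 2000 source samples whose mean is shifted.
rng = np.random.default_rng(0)
target = rng.normal(0.0, 1.0, 20)
source = rng.normal(0.5, 1.0, 2000)
est, lam = mix_estimate(target, source, bias_sq=0.25)
```

The weight on the target data grows as the source bias grows, and shrinks toward zero when the source is nearly unbiased and plentiful, mirroring the trade-off the framework optimizes.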
Our second work, inspired by optimal control, proposes a control-based RL framework that enables direct learning of optimal RL policies with improved sample efficiency.

Generalization. This part explores the zero-shot generalization capability of LLM-powered RL. In particular, we propose the first framework that enables effective in-context RL under random policies and random contexts, requiring no optimal or well-trained policies for the pretraining environments, together with a quantitative analysis of the trustworthiness and the performance guarantees of our approach.

Applications for LLMs. This part highlights the potential of constrained RL for tackling reward hacking, a persistent issue for any RL algorithm. We develop a constrained RL framework for LLM training with provable theoretical guarantees and demonstrate its effectiveness on Text2SQL by incorporating natural, interpretable rewards and constraints while automatically and dynamically balancing the trade-offs among them during training.
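The "automatic and dynamic balancing" of reward and constraints mentioned above is characteristic of primal-dual constrained RL, where a Lagrange multiplier grows while the constraint is violated and shrinks once it is satisfied. The sketch below shows only this generic dual-ascent step on a stand-in cost model; it is a hedged illustration of the general scheme, not the algorithm from the thesis.

```python
def dual_update(lmbda, avg_cost, cost_limit, lr):
    """One dual-ascent step on the Lagrange multiplier: increase lambda
    when the average cost exceeds the limit, decrease it otherwise,
    projected back onto lambda >= 0."""
    return max(0.0, lmbda + lr * (avg_cost - cost_limit))

# Toy dynamics (an assumption for the sketch): a larger penalty lambda
# yields a safer policy with lower average cost.
lmbda = 0.0
for _ in range(200):
    avg_cost = 1.0 / (1.0 + lmbda)   # stand-in for the policy's measured cost
    lmbda = dual_update(lmbda, avg_cost, cost_limit=0.5, lr=0.1)
# lmbda settles near the value where avg_cost meets the limit (here ~1.0)
```

The inter-dependence noted in the Safety part is visible even in this toy: the step size `lr` governs how fast the multiplier tracks constraint violations, motivating adaptive learning rates.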
Description
December 2025
School of Engineering
Publisher
Rensselaer Polytechnic Institute, Troy, NY