Variable selection and manipulation with missing data

Authors
Koyuncu, Deniz
Issue Date
2024-12
Type
Electronic thesis
Thesis
Language
en_US
Keywords
Electrical engineering
Abstract
Both causal modeling and associative feature selection aim to identify essential relationships among the set of modeled variables, albeit with different objectives: causal modeling focuses on functional relationships, while associative feature selection focuses on statistical relationships. In practice, causal or associational relations often must be inferred from a dataset with missing entries, under the bias that missing data can introduce. Traditionally, missingness has been attributed to benign data collection processes, but as datasets are increasingly curated from diverse sources, including untrusted parties, maliciously engineered missingness has become a likely threat. To make reliable inferences, a practitioner must therefore understand how the methods used to extract these causal or associational relationships are affected by both benign and adversarial missingness. This dissertation addresses these challenges in three parts.

First, we examine the impact of benign missing data on the model-X knockoffs framework, a recent method that provides false discovery rate (FDR) control across a broad range of feature selection techniques. We identify how the distribution shift caused by imputing the missing entries or dropping partially observed data points interferes with the model-X knockoffs' FDR guarantees. We then introduce sufficient conditions under which imputation using the generative model originally intended for FDR calibration preserves all assumptions of the model-X framework.

Second, we study the effects of adversarial missing data on causal structure learning from observational data. We introduce the adversarial missingness threat model, in which an attacker selectively omits data entries. Under this threat model, we show that an adversary can asymptotically render a corrupted causal model an optimal solution by concealing a subset of the features in certain observations. We also propose learning-based attacks that are effective with finite data and show that they can successfully obscure adversarially targeted causal relationships in various experimental setups.

Third, we extend our study of adversarial missingness to associative learning tasks through a bi-level optimization approach. To tailor attacks to standard missing-data handling methods, we develop differentiable approximations for three widely used techniques: mean imputation, regression-based imputation, and complete-case analysis. Our results demonstrate that these attacks can effectively manipulate generalized linear models, altering p-values from significant to insignificant by omitting less than 20% of targeted features.
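As a rough illustration of the attack surface the third part studies (a hand-crafted toy, not the dissertation's learned bi-level attack), the sketch below hides a feature exactly on the rows where the outcome is above its mean and then applies the two simplest handling methods the abstract names. The data-generating process and variable names are our own assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y truly depends on x with slope 1.
n = 200
x = rng.normal(size=n)
y = x + rng.normal(size=n)

def fit_slope(x, y):
    """Ordinary least squares slope of y on x (with an intercept)."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

slope_full = fit_slope(x, y)  # close to the true slope of 1

# Adversarial missingness: hide x on the rows where y is above its
# mean, i.e., the rows that carry much of the association.
x_obs = np.where(y > 0, np.nan, x)

# Handling 1: mean imputation -- fill missing x with the observed mean.
x_imp = np.where(np.isnan(x_obs), np.nanmean(x_obs), x_obs)
slope_imputed = fit_slope(x_imp, y)

# Handling 2: complete-case analysis -- drop rows with a missing x.
keep = ~np.isnan(x_obs)
slope_complete = fit_slope(x_obs[keep], y[keep])

# Both handling methods recover a noticeably smaller slope than the
# full-data fit, even though no observed value was altered.
print(slope_full, slope_imputed, slope_complete)
```

In this single-feature toy, mean imputation and complete-case analysis in fact produce the same slope, because the imputed rows sit exactly at the observed mean of x and so contribute neither covariance nor variance; with multiple features the two methods diverge. The dissertation's attacks go further by optimizing which entries to omit, which is how they can flip significance while hiding far fewer entries than this crude mask does.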
Description
December 2024
School of Engineering
Publisher
Rensselaer Polytechnic Institute, Troy, NY