    Supervised cadre models for subpopulation-based learning

    Author
    New, Alexander
    View/Open
    179793_New_rpi_0185E_11534.pdf (6.089Mb)
    Other Contributors
    Bennett, Kristin P.; Mitchell, John E.; Xu, Yangyang; Ji, Qiang, 1963-
    Date Issued
    2019-08
    Subject
    Mathematics
    Degree
    PhD
    Terms of Use
    This electronic version is a licensed copy owned by Rensselaer Polytechnic Institute, Troy, NY. Copyright of original work retained by author.
    URI
    https://hdl.handle.net/20.500.13015/2442
    Abstract
    In this work, we propose a new framework for discriminative supervised learning problems with complicated target variation patterns. This framework, the supervised cadre model (SCM), is based on grouping observations into subpopulations (cadres) that can each be assigned a simple predictive model. Cadres are distinct from existing concepts like cohorts and clusters because they are dynamically learned from data using a supervised metric. Cadre modeling may be applied to any optimization-based supervised learning problem, regardless of loss function. Here, we focus on regression, classification, and survival analysis.
    We consider four primary case studies. The first applies scalar regression to materials-by-design. In it, the SCM provides state-of-the-art prediction of polymer glass transition temperature. The method identifies cadres of polymers that respond differently to structural perturbations, thus providing design insight for targeting or avoiding specific transition temperature ranges. It identifies chemically meaningful polymer subpopulations.
    The second applies binary classification and multivariate regression to precision health. We generalize the environment-wide association study workflow to be able to model heterogeneity in population risk. More than two hundred environmental exposure factors were analyzed for their association with diastolic and systolic blood pressure, and hypertension. We found 25 exposure variables that had a significant association with at least one of our response variables, eight of which were significant for a discovered subpopulation but not for the overall population. Discovered subpopulations were based on easily understood rules.
    We extend this case study by combining the SCM with semantically-encoded domain knowledge. Our system uses a novel analysis ontology and a knowledge graph (KG) to discover informative subpopulations and identify their risk factors. The resulting semantically-targeted analytics (semantalytics) framework makes use of cartridges, application-specific fragments of the underlying KG that extend its analytic capabilities. Semantalytics drives an automated architecture allowing analysts to rapidly and dynamically conduct studies for different health outcomes, risk factors, cohorts, and analysis methods.
    Our third case study applies survival analysis to examine time-to-fill for data science and related job postings. Our model used skill tags, salary, and location to group postings into 7 cadres with dramatically different time-to-fill profiles. Skill categories sought by cadre included data engineering, business data analysis, and healthcare data specialists. This analysis can help guide recruiting efforts and data science education programs.
    Our fourth case study applies survival analysis to study breast cancer mortality. Patients are stratified into risk groups based on demographic and tumor characteristics. The cadre models identified high risk and low risk cadres, with mean survival times of 49 and 130 months, respectively. Cadre membership was based on patient demographics, tumor severity, and geography.
    Throughout, we analyze the supervised cadre model's learning problem and develop metrics to assess cadre goodness and interpretability. For goodness, we consider generalization in the predictive task and robustness of the cadre structure to perturbations in the data. For interpretability, we consider the sparsity of the cadre-assignment mechanism, the diversity of each cadre's predictive models, and the typical uncertainty in the cadre assignment process. We show that the SCM's learning problem is nonconvex and non-differentiable, and we apply several current methods to mitigate these difficulties. Extensive tests on synthetic and existing datasets show that cadre models perform well with respect to these metrics.
    Description
    August 2019; School of Science
    Department
    Dept. of Mathematical Sciences
    Publisher
    Rensselaer Polytechnic Institute, Troy, NY
    Relationships
    Rensselaer Theses and Dissertations Online Collection
    Access
    Restricted to current Rensselaer faculty, staff and students. Access inquiries may be directed to the Rensselaer Libraries.
    Collections
    • RPI Theses Online (Complete)
