
dc.rights.licenseRestricted to current Rensselaer faculty, staff and students. Access inquiries may be directed to the Rensselaer Libraries.
dc.contributorBennett, Kristin P.
dc.contributorMitchell, John E.
dc.contributorXu, Yangyang
dc.contributorJi, Qiang, 1963-
dc.contributor.authorNew, Alexander
dc.date.accessioned2021-11-03T09:11:47Z
dc.date.available2021-11-03T09:11:47Z
dc.date.created2020-06-12T12:29:30Z
dc.date.issued2019-08
dc.identifier.urihttps://hdl.handle.net/20.500.13015/2442
dc.descriptionAugust 2019
dc.descriptionSchool of Science
dc.description.abstractThroughout, we analyze the supervised cadre model's learning problem and develop metrics to assess cadre goodness and interpretability. For goodness, we consider generalization in the predictive task and robustness of the cadre structure to perturbations in the data. For interpretability, we consider the sparsity of the cadre-assignment mechanism, the diversity of each cadre's predictive models, and the typical uncertainty in the cadre assignment process. We show that the SCM's learning problem is nonconvex and non-differentiable, and we apply several current methods to mitigate these difficulties. Extensive tests on synthetic and existing datasets show that cadre models perform well with respect to these metrics.
dc.description.abstractOur fourth case study applies survival analysis to study breast cancer mortality. Patients are stratified into risk groups based on demographic and tumor characteristics. The cadre models identified high risk and low risk cadres, with mean survival times of 49 and 130 months, respectively. Cadre membership was based on patient demographics, tumor severity, and geography.
dc.description.abstractWe extend this case study by combining the SCM with semantically-encoded domain knowledge. Our system uses a novel analysis ontology and a knowledge graph (KG) to discover informative subpopulations and identify their risk factors. The resulting semantically-targeted analytics (semantalytics) framework makes use of cartridges -- application-specific fragments of the underlying KG that extend its analytic capabilities. Semantalytics drives an automated architecture allowing analysts to rapidly and dynamically conduct studies for different health outcomes, risk factors, cohorts, and analysis methods.
dc.description.abstractThe second applies binary classification and multivariate regression to precision health. We generalize the environment-wide association study workflow to be able to model heterogeneity in population risk. More than two hundred environmental exposure factors were analyzed for their association with diastolic and systolic blood pressure, and hypertension. We found 25 exposure variables that had a significant association with at least one of our response variables, eight of which were significant for a discovered subpopulation but not for the overall population. Discovered populations were based on easily-understood rules.
dc.description.abstractWe consider four primary case studies. The first applies scalar regression to materials-by-design. In it, the SCM provides state-of-the-art prediction of polymer glass transition temperature. The method identifies cadres of polymers that respond differently to structural perturbations, thus providing design insight for targeting or avoiding specific transition temperature ranges. It identifies chemically meaningful polymer subpopulations.
dc.description.abstractIn this work, we propose a new framework for discriminative supervised learning problems with complicated target variation patterns. This framework, the supervised cadre model (SCM), is based on grouping observations into subpopulations (cadres) that can each be assigned a simple predictive model. Cadres are distinct from existing concepts like cohorts and clusters because they are dynamically learned from data using a supervised metric. Cadre modeling may be applied to any optimization-based supervised learning problem, regardless of loss function. Here, we focus on regression, classification, and survival analysis.
dc.description.abstractOur third case study applies survival analysis to examine time-to-fill for data science and related job postings. Our model used skill tags, salary, and location to group postings into 7 cadres with dramatically different time-to-fill profiles. Skill categories sought by the cadres included data engineering, business data analysis, and healthcare data specialists. This analysis can help guide recruiting efforts and data science education programs.
dc.language.isoENG
dc.publisherRensselaer Polytechnic Institute, Troy, NY
dc.relation.ispartofRensselaer Theses and Dissertations Online Collection
dc.subjectMathematics
dc.titleSupervised cadre models for subpopulation-based learning
dc.typeElectronic thesis
dc.typeThesis
dc.digitool.pid179792
dc.digitool.pid179793
dc.digitool.pid179794
dc.rights.holderThis electronic version is a licensed copy owned by Rensselaer Polytechnic Institute, Troy, NY. Copyright of original work retained by author.
dc.description.degreePhD
dc.relation.departmentDept. of Mathematical Sciences

