
    Knowledge Graph Construction from Data, Data Dictionaries, and Codebooks: the National Health and Nutrition Examination Surveys Use Case

    Author
    Santos, Henrique; Pinheiro, Paulo; McGuinness, Deborah L.
    View/Open
    Slides (2.617Mb)
    Date Issued
    2022-09-29
    URI
    https://hdl.handle.net/20.500.13015/6287
    Abstract
    CDC’s National Health and Nutrition Examination Surveys (NHANES) is a continuous survey that aims to study the relationship between diet, nutrition, and health and their roles in designated population subgroups with selected diseases and risk factors. Data are acquired through questionnaires (administered by human interviewers or computer-assisted) that collect information about participants’ households and families, medical conditions, substance usage, and more. NHANES data and supporting documentation, including data dictionaries (DDs) and codebooks (CBs), are made publicly available and are used in many data science efforts to support a wide range of health informatics projects. A typical use of NHANES data requires complex human interpretation of the data with the help of the DDs and CBs. For example, to retrieve “diseases treated by a specific drug in households with annual income under $20,000”, one must select all the relevant variables (diseases, drugs, household income, participants) across the relevant datasets (demographic, drug usage) and perform a series of transformations (normalizing disease and income codes) to generate the answer to the query. During data processing, it is not uncommon for data to be misinterpreted, because NHANES may use the same variable for multiple purposes (e.g., the same variable is used for diseases being treated and diseases being prevented by a drug, a distinction that is sometimes critical to applications). Furthermore, the results of this processing may be combined incorrectly (e.g., when harmonized with new data from NHANES or other studies). We present our approach for translating NHANES’ datasets, metadata, and any additional documentation from the surveys into a rich knowledge graph (KG) that maintains semantic distinctions. We leverage the Human-Aware Data Acquisition Infrastructure (HADatAc) [1] and its underlying Human-Aware Science Ontology (HAScO) [2] to systematically represent the complete data acquisition process. Semantic Data Dictionaries (SDDs) [3], which are derived from DDs and CBs, support the elicitation of objects that are not directly represented within NHANES datasets (including the household, the household reference person, drug usage for disease treatment, drug usage for disease prevention, etc.). We demonstrate how we use the KG to generate tailored datasets based on user choice of variables and alignment criteria across multiple NHANES datasets. Our use of SDDs enables the combined use of ontologies and data. We further demonstrate that, once data is encoded into the KG, the KG can support complex automated data harmonization, which until now has been done manually whenever a meta-analysis study based on NHANES required it.
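
    The manual workflow described in the abstract for the example query can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' pipeline and not HADatAc: every column name, code value, drug name, and record below is hypothetical and does not come from NHANES. It shows the three steps the abstract names (selecting variables across two datasets, normalizing income and disease codes through codebooks, and separating treatment from prevention when one variable carries both meanings).

```python
# All names, codes, and records here are hypothetical illustrations,
# not actual NHANES variables or values.

# Hypothetical codebook: income category code -> (lower, upper) bound in USD.
# An upper bound of None means the category is open-ended.
INCOME_CODEBOOK = {
    1: (0, 9_999),
    2: (10_000, 19_999),
    3: (20_000, 49_999),
    4: (50_000, None),
}

# Hypothetical codebook: disease code -> readable label.
DISEASE_CODEBOOK = {"D01": "condition one", "D02": "condition two", "D03": "condition three"}

# Hypothetical "demographic" dataset: one record per participant.
DEMOGRAPHICS = [
    {"participant_id": 1, "household_id": "H1", "income_code": 1},
    {"participant_id": 2, "household_id": "H2", "income_code": 3},
    {"participant_id": 3, "household_id": "H3", "income_code": 2},
]

# Hypothetical "drug usage" dataset. The single disease_code column is used
# for both treated and prevented diseases; "usage" carries the distinction.
DRUG_USAGE = [
    {"participant_id": 1, "drug": "drug_a", "disease_code": "D01", "usage": "treatment"},
    {"participant_id": 2, "drug": "drug_a", "disease_code": "D03", "usage": "treatment"},
    {"participant_id": 3, "drug": "drug_a", "disease_code": "D02", "usage": "prevention"},
    {"participant_id": 3, "drug": "drug_b", "disease_code": "D03", "usage": "treatment"},
]


def diseases_treated(drug, income_limit):
    """Return sorted labels of diseases treated by `drug` among participants
    whose household income category lies entirely below `income_limit`."""
    # Step 1: normalize income codes and select the qualifying participants.
    low_income = set()
    for row in DEMOGRAPHICS:
        _, upper = INCOME_CODEBOOK[row["income_code"]]
        if upper is not None and upper < income_limit:
            low_income.add(row["participant_id"])

    # Step 2: join with drug usage, keep treatment only, normalize disease codes.
    return sorted({
        DISEASE_CODEBOOK[use["disease_code"]]
        for use in DRUG_USAGE
        if use["drug"] == drug
        and use["usage"] == "treatment"
        and use["participant_id"] in low_income
    })


if __name__ == "__main__":
    print(diseases_treated("drug_a", 20_000))
```

    In this toy data, dropping the `usage == "treatment"` check would also return the prevented condition for `drug_a`, which is the kind of misinterpretation the abstract warns about when one variable serves two purposes.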
    Collections
    • Tetherless World Publications