Knowledge Graph Construction from Data, Data Dictionaries, and Codebooks: the National Health and Nutrition Examination Surveys Use Case

Santos, Henrique
Pinheiro, Paulo
McGuinness, Deborah L.
CDC’s National Health and Nutrition Examination Surveys (NHANES) is a continuous survey that studies the relationship between diet, nutrition, and health, and their roles in designated population subgroups with selected diseases and risk factors. Data are acquired through questionnaires, administered either by human interviewers or with computer assistance, that collect information about participants’ households and families, medical conditions, substance usage, and more. NHANES data and supporting documentation, including data dictionaries (DDs) and codebooks (CBs), are publicly available and are used in many data science efforts that support a wide range of health informatics projects. A typical use of NHANES data requires complex human interpretation of the data with the help of the DDs and CBs. For example, to retrieve “diseases treated by a specific drug in households with annual income under $20,000”, one must select all the relevant variables (diseases, drugs, household income, participants) across the relevant datasets (demographic, drug usage) and perform a series of transformations (normalizing disease and income codes) to answer the query. During data processing, it is not uncommon for data to be misinterpreted, because NHANES may use the same variable for multiple purposes (e.g., the same variable is used for diseases being treated and diseases being prevented by a drug, a distinction that is sometimes critical to applications). Furthermore, the results of this processing may be combined incorrectly with other data (e.g., when harmonized with new data from NHANES or other studies). We present our approach for translating NHANES datasets, metadata, and any additional documentation from the surveys into a rich knowledge graph (KG) that maintains semantic distinctions. We leverage the Human-Aware Data Acquisition Infrastructure (HADatAc) [1] and its underlying Human-Aware Science Ontology (HAScO) [2] to systematically represent the complete data acquisition process.
Semantic Data Dictionaries (SDDs) [3], which are derived from DDs and CBs, support the elicitation of objects that are not directly represented within NHANES datasets (including the household, the household reference person, drug usage for disease treatment, and drug usage for disease prevention). We demonstrate how we use the KG to generate tailored datasets based on a user's choice of variables and alignment criteria across multiple NHANES datasets. Our use of SDDs enables the combined use of ontologies and data. We further demonstrate that once data is encoded into the KG, the KG can support complex automated data harmonization, a task that, when required in meta-analysis studies based on NHANES, has until now been done manually.
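
To make the manual workflow described above concrete, the following is a minimal Python sketch of the example query ("diseases treated by a specific drug in households with annual income under $20,000"). It is not the HADatAc or SDD implementation, and nothing in it is taken from NHANES: the column names, the codebook values, the sample rows, and the function `diseases_treated` are all hypothetical, chosen only to show why codes must be decoded and why treatment and prevention must be kept apart before tables are joined.

```python
# Hypothetical codebook: raw income code -> upper bound (USD) of the bracket.
INCOME_CODEBOOK = {1: 4999, 2: 9999, 3: 14999, 4: 19999, 5: 24999}

# Hypothetical codebook: raw code -> reason the drug is taken.
USE_CODEBOOK = {1: "treatment", 2: "prevention"}

# Hypothetical demographic dataset (one row per participant).
demographics = [
    {"participant": 101, "household": "H1", "income_code": 2},
    {"participant": 102, "household": "H2", "income_code": 5},
    {"participant": 103, "household": "H3", "income_code": 4},
]

# Hypothetical drug-usage dataset (one row per participant and drug).
drug_usage = [
    {"participant": 101, "drug": "metformin", "disease": "type 2 diabetes", "use_code": 1},
    {"participant": 102, "drug": "metformin", "disease": "type 2 diabetes", "use_code": 1},
    {"participant": 103, "drug": "metformin", "disease": "prediabetes", "use_code": 2},
    {"participant": 103, "drug": "lisinopril", "disease": "hypertension", "use_code": 1},
]


def diseases_treated(drug, max_income):
    """Return the sorted diseases treated by `drug` among participants whose
    household income bracket lies entirely below `max_income`."""
    # Step 1: decode income codes with the codebook and keep matching participants.
    low_income = {
        row["participant"]
        for row in demographics
        if INCOME_CODEBOOK[row["income_code"]] < max_income
    }
    # Step 2: join on participant and keep only rows decoded as "treatment",
    # so that diseases being prevented are not reported as diseases being treated.
    return sorted(
        {
            row["disease"]
            for row in drug_usage
            if row["drug"] == drug
            and row["participant"] in low_income
            and USE_CODEBOOK[row["use_code"]] == "treatment"
        }
    )


print(diseases_treated("metformin", 20000))
```

Dropping the `USE_CODEBOOK` check in this sketch would also return "prediabetes", which is the kind of misinterpretation the abstract attributes to variables that serve multiple purposes.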