A Semi-Automated Approach to Data Harmonization Across Environmental Health Studies

Johnson, Matt
Ravi, Meenu
Pinheiro, Paulo
Stingone, Jeanette
McGuinness, Deborah L.
Human Health Exposure Analysis Repository (HHEAR)
The NIEHS-supported Human Health Exposure Analysis Resource (HHEAR) Data Center maintains a public-use data repository to promote reuse of environmental health data generated by the HHEAR program. Creating and maintaining this repository requires integrating information from a wide variety of epidemiologic studies. We have developed the Human-Aware Data Acquisition Framework to enable this complex integration, supporting harmonization across multiple studies and enabling meaningful search of, and access to, the data deposited in the HHEAR Data Repository. To integrate data from a new study, investigators engage in an initial, time-consuming effort to link study data to the HHEAR ontology, a controlled vocabulary of environmental and public health terms. This is accomplished by generating a semantic data dictionary (SDD) from the data dictionaries and codebooks provided by HHEAR study investigators. Originally, this was done manually by an expert in both epidemiological terminology and ontological modeling. To make these tools more accessible to environmental health scientists who lack formal training in ontologies, we have developed an SDD-Editor that simplifies the ontology modeling process. The SDD-Editor reuses elements common to epidemiologic data dictionaries and spreadsheet software, while integrating features needed to form semantic links between public health concepts and existing ontologies. For each study variable in the SDD, the SDD-Editor suggests potential concept matches, using natural language processing to capture the semantic similarity between data dictionary descriptions and ontology class descriptions. If no suitable suggestion exists, investigators can search for ontology terms using a search engine powered by BioPortal. Once the SDD is complete, a validator checks that it has the correct format and that all of its classes are valid.
By automating parts of the ontology modeling process, the SDD-Editor greatly facilitates the dynamic integration of HHEAR environmental health studies into a single repository, benefiting the scientific community.
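The suggestion step described above can be illustrated with a minimal sketch. The abstract does not say which natural language processing method the SDD-Editor uses, so this example swaps in a simple bag-of-words cosine similarity as a stand-in. The function names, the `ex:` class identifiers, and the descriptions are all hypothetical and are not taken from the HHEAR ontology.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def suggest_classes(variable_description, ontology_classes, top_k=3):
    """Rank ontology classes by similarity to a data dictionary description.

    ontology_classes maps a class identifier to its text description.
    Returns up to top_k (identifier, score) pairs, best match first.
    """
    query = Counter(tokenize(variable_description))
    scored = [
        (cosine_similarity(query, Counter(tokenize(description))), identifier)
        for identifier, description in ontology_classes.items()
    ]
    # Sort by descending score, breaking ties by identifier for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(identifier, score) for score, identifier in scored[:top_k] if score > 0]


# Hypothetical ontology classes, for illustration only.
classes = {
    "ex:MaternalAge": "Age of the mother at the time of delivery",
    "ex:BirthWeight": "Weight of the infant at birth in grams",
    "ex:SmokingStatus": "Whether the participant currently smokes tobacco",
}

suggestions = suggest_classes("Age of the mother at delivery in years", classes)
for identifier, score in suggestions:
    print(f"{identifier}\t{score:.3f}")
```

In this sketch the hypothetical `ex:MaternalAge` class ranks first for the example variable. Raw word overlap is a crude proxy for semantic similarity, since common words such as "of" and "the" inflate the scores; a production tool would more plausibly rely on term weighting or learned text embeddings.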