Improving Tabular Reusability Through Data Dictionary Descriptions

Authors
Johnson, Matthew
Rashid, Sabbir
McGuinness, Deborah L.
Issue Date
2025-02-03
Abstract
Tables have become a ubiquitous standard for capturing, storing, and sharing data on the web. This is primarily due to the semi-structured nature of tables, where relationships between data are often ambiguously encoded using locality. While this format can be easy for humans to interpret in simple cases, interpretation becomes harder as table complexity increases. To bridge this context gap, many data publishers provide a data dictionary that captures the meaning of schema elements through text descriptions. Existing work underscores the need for data dictionaries to improve tabular interoperability, but little of it provides detailed requirements for data dictionary descriptions. This paper identifies and defines three common types of data dictionary descriptions in the biomedical domain. We then compare the effectiveness of each description type by normalizing data dictionary descriptions to a single type using large language models and measuring the performance of each type with a semantic tabular interpretation algorithm. Our experiments show that intensional descriptions, which describe the general properties a column member should have, are most effective for tabular alignment and improve the reusability of data dictionaries.
Publisher
IEEE Computer Society