A data dictionary based approach to semantic tabular mapping
Authors
Johnson, Matthew, John
Issue Date
2024-08
Type
Electronic thesis
Thesis
Language
en_US
Keywords
Computer science
Abstract
Knowledge graphs are an important technology that enables a wide variety of analytics and data visualizations across an enterprise. However, creating knowledge graphs or adding to an existing knowledge graph can be challenging because data is often stored in a semi-structured form within tables that do not capture the full context of the data. To fill this context gap, many data publishers include a data dictionary that aims to capture the meaning of schema elements, typically with text descriptions. These descriptions help humans understand the data for integration tasks but are challenging for machines to interpret. Previous work has focused on integrating tables into an existing knowledge graph using data-level alignments, without the additional context provided by data dictionaries. While these data-level alignment algorithms have proven successful on synthetic datasets, they struggle to make accurate alignments on real-world datasets that exhibit complex structures. Humans overcome these issues by leveraging context information from data dictionaries to understand the groups and relationships among the entities within a table. Recently, data publishers have started using this metadata to create semantic data dictionaries (SDDs) that formally capture alignments between tabular data and ontology terms. These alignments allow data publishers to convert tabular data into Resource Description Framework (RDF) triples and create or integrate data into a knowledge graph. However, SDDs require authors to have domain knowledge and experience in ontology modeling, which creates a barrier to entry for users.
In this thesis, our goal is to improve the field of data integration by exploring algorithms that leverage context information from data dictionaries to align complex tabular data to ontology classes and properties. To achieve this, we address three key research questions: (1) Can algorithms effectively use context from data dictionaries to improve alignment on complex tables? (2) Are alignment algorithms that leverage data dictionaries competitive with data-level alignment algorithms on simple tables? (3) What types of data dictionary descriptions are well suited for alignment algorithms? For the first research question, we developed the Semantic Data Dictionary Generator (SDD-Gen), a tabular alignment algorithm that generates SDDs by leveraging context information from data dictionaries. We show the effectiveness of SDD-Gen by comparing its performance against the current state of the art on complex tables. For the second research question, we developed a methodology for generating representative artificial data dictionaries using large language models. We use this methodology to generate data dictionaries for a popular tabular alignment dataset and show that SDD-Gen is as effective as the data-level algorithms on simple tables. For the final research question, we developed an evaluation framework to determine the type of data dictionary description best suited for tabular alignment. We show that intensional descriptions, which define the conditions needed to be a member of a column, are most effective and improve the reusability of data dictionaries.
Description
August 2024
School of Science
Publisher
Rensselaer Polytechnic Institute, Troy, NY