End-to-end question answering on semi-structured tabular data

Pan, Feifei
Issue Date
Electronic thesis
Computer science
Semi-structured tables are one of the most popular and convenient ways to store and organize data. They appear widely in open-domain digital documents such as PDFs, web pages, and knowledge bases (KBs), e.g., Wikipedia, and in domain-specific documents such as scientific papers, journals, and enterprise reports. Recently, semi-structured tables have been recognized as a rich knowledge source for Question Answering (QA) tasks. Unlike relational database tables, semi-structured tables are challenging for machines to understand automatically and to use in downstream tasks. Despite existing efforts, computational approaches still struggle for three reasons: 1. the abundance of semi-structured tables, 2. the lack of an explicit schema, and 3. complex and flexible table structures. In addition, annotating Table QA datasets is labor- and time-intensive, especially for large, domain-specific table corpora. Because of the complexity of Table QA, existing work has split the problem into two sub-tasks: table retrieval and QA over tables. Given a natural language query or question, table retrieval studies how to locate the table containing the correct answer in a table corpus, while QA over tables focuses on finding the cells of a given table that answer the question. This traditional two-step pipeline is limited in performance, mainly because of error propagation. In this thesis, we aim to fill the gap in research on end-to-end Table QA by leveraging transformer-based models supported by semantic-driven approaches. More specifically, given any natural language question, our goal is to design models that can efficiently search a massive table corpus, retrieve the table containing the correct answer, and finally locate the answer to the question within that table.
This thesis covers a series of supervised Table QA models as well as a brief discussion of unsupervised solutions. We first focus on providing sophisticated solutions to the QA over tables task. While existing models rely heavily on specialized pre-training techniques, we introduce the RCI model, which uses an existing language model to build connections between questions and table components. The RCI model locates the correct cells as the intersection of table rows and columns by capturing row and column semantics with transformer-based architectures. With the RCI model producing state-of-the-art results in QA over tables, we further extend the RCI architecture to an end-to-end Table QA pipeline called Cell Level Table Retrieval (CLTR). This model consists of two components: 1. a retriever model that integrates traditional information retrieval methods with RCI to identify relevant tables; 2. a reader model that identifies the correct table cells as answers to the questions using the RCI architecture. To enhance the accuracy and simplify the training of the end-to-end Table QA model, we investigate Dense Passage Retrieval (DPR) and Retrieval-Augmented Generation (RAG). We propose the T-RAG model, which unifies the traditional [retriever + reader] pipeline in a single training step. The T-RAG model holds the current best scores for the end-to-end Table QA and table retrieval tasks. Finally, we explore unsupervised, neuro-symbolic approaches for Table QA. Lexical- and semantic-driven methods are applied to identify the correct table rows and columns to answer natural language questions without the benefit of supervision and training data.
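The row-column intersection idea described above can be illustrated with a minimal sketch. This is not the RCI model itself: the transformer-based row and column scorers are replaced here by a simple token-overlap score, and the names `tokens`, `overlap`, and `best_cell`, along with the toy table, are hypothetical and chosen only for illustration. Rows are scored against their cell values and columns against their header text, which is a simplifying assumption.

```python
import re

def tokens(text):
    """Lowercase alphanumeric tokens; a crude stand-in for a learned encoder."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap(question, text):
    """Fraction of question tokens that also appear in text."""
    q = tokens(question)
    return len(q & tokens(text)) / len(q) if q else 0.0

def best_cell(question, header, rows):
    """Score rows and columns independently, then return the cell where
    the best row and the best column intersect."""
    # Assumption: a row is represented by its concatenated cell values.
    row_scores = [overlap(question, " ".join(row)) for row in rows]
    # Assumption: a column is represented by its header text only.
    col_scores = [overlap(question, name) for name in header]
    i = max(range(len(rows)), key=row_scores.__getitem__)
    j = max(range(len(header)), key=col_scores.__getitem__)
    return rows[i][j]

# Hypothetical toy table.
header = ["Country", "Capital", "Population"]
rows = [
    ["France", "Paris", "67 million"],
    ["Japan", "Tokyo", "125 million"],
]
answer = best_cell("What is the capital of Japan?", header, rows)
print(answer)  # Tokyo
```

Scoring rows and columns separately keeps the number of scorer calls linear in the number of rows plus columns, rather than requiring one call per cell. A supervised scorer would replace `overlap` in both places.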
May 2022
School of Science
Rensselaer Polytechnic Institute, Troy, NY