Show simple item record

dc.rights.licenseRestricted to current Rensselaer faculty, staff and students in accordance with the Rensselaer Standard license. Access inquiries may be directed to the Rensselaer Libraries.
dc.contributorMcGuinness, Deborah L.
dc.contributorAdali, Sibel
dc.contributorGliozzo, Alfio
dc.contributorJi, Heng
dc.contributor.advisorHendler, James A.
dc.contributor.authorPan, Feifei
dc.date.accessioned2022-09-15T22:09:46Z
dc.date.available2022-09-15T22:09:46Z
dc.date.issued2022-05
dc.identifier.urihttps://hdl.handle.net/20.500.13015/6201
dc.descriptionMay 2022
dc.descriptionSchool of Science
dc.description.abstractSemi-structured tables are among the most popular and convenient ways to store and organize data. They appear widely in open-domain digital documents such as PDFs, web pages, and knowledge bases (KBs), e.g., Wikipedia, as well as in domain-specific documents such as scientific papers, journals, and enterprise reports. Recently, semi-structured tables have been recognized as a rich knowledge source for Question Answering (QA) tasks. Unlike relational database tables, semi-structured tables are challenging for machines to understand automatically and to use in downstream tasks. Despite existing efforts, computational approaches still struggle for the following reasons: 1. the abundance of semi-structured tables, 2. the lack of an explicit schema, and 3. complex and flexible table structures. In addition, annotating Table QA datasets is labor- and time-intensive, especially for large, domain-specific table corpora. Because of the complexity of Table QA, existing work has split the problem into two sub-tasks: table retrieval and QA over tables. Given a natural language query or question, table retrieval studies how to locate the table containing the correct answer in a table corpus, while QA over tables focuses on finding the cells of a given table that answer the question. This traditional two-step pipeline is limited in performance, mainly because of error propagation. In this thesis, we aim to fill the gap in research on end-to-end Table QA by leveraging transformer-based models with the support of semantic-driven approaches. More specifically, given any natural language question, our goal is to design models that can efficiently search a massive table corpus, retrieve the table containing the correct answer, and finally locate the answer to the question within that table.
This thesis covers a series of supervised Table QA models and briefly discusses unsupervised solutions. We first focus on the QA over tables task. While existing models rely heavily on specialized pre-training techniques, we introduce the RCI model, which uses an existing language model to build connections between questions and table components. The RCI model captures row and column semantics with transformer-based architectures and locates the correct cells as the intersection of the selected table rows and columns. Since the RCI model produces state-of-the-art results in QA over tables, we further extend the RCI architecture to an end-to-end Table QA pipeline called Cell Level Table Retrieval (CLTR). This model consists of two components: 1. a retriever model that integrates traditional information retrieval methods with RCI to identify relevant tables; 2. a reader model that uses the RCI architecture to identify the table cells that answer the question. To improve the accuracy and simplify the training of the end-to-end Table QA model, we investigate Dense Passage Retrieval (DPR) and Retrieval-Augmented Generation (RAG). We propose the T-RAG model, which unifies the traditional [retriever + reader] pipeline in a single training step. The T-RAG model holds the current best scores for the end-to-end Table QA and table retrieval tasks. Finally, we explore unsupervised, neuro-symbolic approaches for Table QA. Lexical- and semantic-driven methods are applied to identify the table rows and columns that answer natural language questions without the benefit of supervision or training data.
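The row and column intersection idea described in the abstract can be illustrated with a minimal sketch. This is not the thesis's RCI model: the transformer-based scorers are replaced here by a toy token-overlap score, columns are represented by their header alone, and the names `tokens`, `overlap_score`, and `answer_cell` are hypothetical, introduced only for illustration.

```python
import re


def tokens(text):
    """Lowercase alphanumeric tokens of a string, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(question, text):
    """Toy stand-in for a learned relevance scorer (assumption, not RCI):
    the fraction of question tokens that also occur in `text`."""
    q = tokens(question)
    return len(q & tokens(text)) / len(q) if q else 0.0


def answer_cell(question, header, rows):
    """Score rows and columns independently, then return the cell at the
    intersection of the best row and the best column."""
    # Each row is represented as its "header value" pairs joined into text.
    row_scores = [
        overlap_score(question, " ".join(f"{h} {v}" for h, v in zip(header, row)))
        for row in rows
    ]
    # Simplification: each column is represented by its header only.
    # A learned model could also take the column's cell values into account.
    col_scores = [overlap_score(question, h) for h in header]

    best_row = max(range(len(rows)), key=row_scores.__getitem__)
    best_col = max(range(len(header)), key=col_scores.__getitem__)
    return best_row, best_col, rows[best_row][best_col]


if __name__ == "__main__":
    header = ["Country", "Capital", "Population"]
    rows = [
        ["France", "Paris", "67 million"],
        ["Japan", "Tokyo", "125 million"],
    ]
    print(answer_cell("What is the capital of France?", header, rows))
```

Scoring rows and columns separately keeps the number of scorer calls proportional to rows plus columns rather than to the number of cells, which is one plausible motivation for an intersection-based design; the abstract itself does not state this.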
dc.languageENG
dc.language.isoen_US
dc.publisherRensselaer Polytechnic Institute, Troy, NY
dc.relation.ispartofRensselaer Theses and Dissertations Online Collection
dc.subjectComputer science
dc.titleEnd-to-end question answering on semi-structured tabular data
dc.typeElectronic thesis
dc.typeThesis
dc.date.updated2022-09-15T22:09:49Z
dc.rights.holderThis electronic version is a licensed copy owned by Rensselaer Polytechnic Institute (RPI), Troy, NY. Copyright of original work retained by author.
dc.description.degreePhD
dc.relation.departmentDept. of Computer Science

