Planning and evaluation of federated queries on the web

Authors

Williams, Gregory Todd

Files

167041_Williams_rpi_0185E_10046.pdf (911.44 KB)

Other Contributors

Hendler, James A.
Berners-Lee, Tim
Fox, Peter A.
Adali, Sibel

Issue Date

2013-05

Keywords

Computer science

Degree

PhD

Terms of Use

Attribution-NonCommercial-NoDerivs 3.0 United States
This electronic version is a licensed copy owned by Rensselaer Polytechnic Institute, Troy, NY. Copyright of original work retained by author.

URI

https://hdl.handle.net/20.500.13015/843

Abstract

The Web of Data continues to increase in size and diversity, providing access to large amounts of structured, linked data. However, existing approaches to querying this data often fail to make use of existing database access points and must resort to web crawling to collect data of interest. Furthermore, in order to provide efficient query answering over this data, existing systems are forced to construct centralized database indexes, making it difficult to maintain up-to-date data. For approaches that do utilize existing databases, disregard for fundamental design principles of the Web results in query systems that lack some basic features of their web crawling counterparts. If an efficient query answering system can be provided that does not require centralized indexing, and leverages both existing databases and static web content, users may benefit from up-to-date access to structured, disparate data.
In this dissertation, we develop a federated query planning framework based on the RDF data model and the SPARQL query language. This framework is able to leverage the high performance of existing SPARQL databases while also providing access to linked data available as RDF documents on the web. These two access methods are used to provide a single interface for querying semantic data.
The primary challenge of evaluating queries over both SPARQL databases and linked data is finding an efficient execution plan. Such a plan must perform better than the naive approach of completely decomposing the query and executing each subquery against each data source or traversing linked data by web crawling. Moreover, it must allow metadata discovered during query execution to be incorporated into the existing plan.
Given this, we develop three techniques to increase the performance and flexibility of federated query evaluation: a federated query planning algorithm that prioritizes the execution of subqueries that have high expected value (that is, expected relevant results with low latency); a re-planning algorithm able to augment an existing query plan with newly discovered data sources, together with a mechanism for discovering such sources; and a server-side technique to greatly enhance the web cacheability of SPARQL query results.
Finally, the developed framework is designed using a traditional query planner, allowing it to integrate with and benefit from existing work on query planning and optimization.
To demonstrate the practicality of this federated query planning framework, we present results of empirical evaluation of the framework components over a real-world dataset of bibliographic data. These results show that the federated query planning, evaluation, and caching techniques are able to produce query results quickly and efficiently. The effects of several optimizations on the execution of federated queries are discussed, and their impact on performance is evaluated.

Description

May 2013
School of Science

Department

Dept. of Computer Science

Publisher

Rensselaer Polytechnic Institute, Troy, NY

Relationships

Rensselaer Theses and Dissertations Online Collection

Access

CC BY-NC-ND. Users may download and share copies with attribution in accordance with a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 License. No commercial use or derivatives are permitted without the explicit approval of the author.

Collections

RPI Theses Open Access
RPI Theses Online (Complete)
