Planning and evaluation of federated queries on the web

Authors
Williams, Gregory Todd
ORCID
Loading...
Thumbnail Image
Other Contributors
Hendler, James A.
Berners-Lee, Tim
Fox, Peter A.
Adali, Sibel
Issue Date
2013-05
Keywords
Computer science
Degree
PhD
Terms of Use
Attribution-NonCommercial-NoDerivs 3.0 United States
This electronic version is a licensed copy owned by Rensselaer Polytechnic Institute, Troy, NY. Copyright of original work retained by author.
Full Citation
Abstract
To demonstrate the practicality of this federated query planning framework, we present results of empirical evaluation of the framework components over a real-world dataset of bibliographic data. These results show that the federated query planning, evaluation, and caching techniques are able to produce query results quickly and efficiently. The effects of several optimizations on the execution of federated queries is discussed, and their impact on performance is evaluated.
The Web of Data continues to increase in size and diversity, providing access to large amounts of structured, linked data. However, existing approaches to querying this data often fail to make use of existing database access points and must resort to web crawling to collect data of interest. Furthermore, in order to provide efficient query answering over this data, existing systems are forced to construct centralized database indexes, making it difficult to maintain up-to-date data. For approaches that do utilize existing databases, disregard for fundamental design principles of the Web results in query systems that lack some basic features of their web crawling counterparts. If an efficient query answering system can be provided that does not require centralized indexing, and leverages both existing databases and static web content, users may benefit from up-to-date access to structured, disparate data.
In this dissertation, we develop a federated query planning framework based on the RDF data model and the SPARQL query language. This framework is able to leverage the high performance of existing SPARQL databases while also providing access to linked data available as RDF documents on the web. These two access methods are used to provide a single interface to querying semantic data.
The primary challenge of evaluating queries over both SPARQL databases and linked data is in finding an efficient execution plan. Such a plan must perform better than the naive approach of completely decomposing the query and executing each subquery against each data source or traversing linked data by web crawling. Moreover, it must allow metadata discovered during query execution to be incorporated into the existing plan.
Given this, in this dissertation, we develop three techniques to increase performance and flexibility of federated query evaluation: we develop a federated query planning algorithm that prioritizes the execution of subqueries that have high expected value (that is, expected relevant results with low latency); we develop a re-planning algorithm, able to augment an existing query plan with newly discovered data sources and a mechanism for discovering such sources; and we develop a server-side technique to greatly enhance the web cacheability of SPARQL query results.
Finally, the developed framework is designed using a traditional query planner, allowing it to integrate with and benefit from existing work on query planning and optimization.
Description
May 2013
School of Science
Department
Dept. of Computer Science
Publisher
Rensselaer Polytechnic Institute, Troy, NY
Relationships
Rensselaer Theses and Dissertations Online Collection
Access
CC BY-NC-ND. Users may download and share copies with attribution in accordance with a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 License. No commercial use or derivatives are permitted without the explicit approval of the author.