Randomized algorithms for mining massive matrices: design & implementation at terascale and beyond

Authors
Iyer, Chander Jayaraman
Other Contributors
Carothers, Christopher D.
Drineas, Petros
Zaki, Mohammed J., 1971-
Shephard, Mark S.
Mitchell, John E.
Issue Date
2018-05
Keywords
Computer science
Degree
PhD
Terms of Use
Attribution-NonCommercial-NoDerivs 3.0 United States
This electronic version is a licensed copy owned by Rensselaer Polytechnic Institute, Troy, NY. Copyright of original work retained by author.
Abstract
Modern technological advancements and innovations have led to an explosive growth of data across domains, ranging from physics and the biological sciences to economics and the social sciences. Research on mathematical libraries has been at the leading edge of the high-performance computing (HPC) community's effort to address the imposing set of challenges posed by Big Data. Primary among those challenges are the need for asynchronous communication and the need to bridge the gap between computing power and network bandwidth. This has led to the advent of randomization in math libraries as a means of developing scalable algorithms for large-scale linear algebra problems. In this dissertation, we focus on the design, implementation, and analysis of randomized algorithms for scalable mining of matrices at terabyte scale and beyond. We focus on three fundamental problems that are pervasive throughout large-scale data analytics, where randomized numerical linear algebra (NLA) algorithms have shown significant impact over state-of-the-art approaches: least-squares regression, low-rank approximation, and kernel ridge regression.
This dissertation is divided into three parts. In Part I, we explore the behavior of randomized matrix algorithms based on the Blendenpik algorithm in a distributed-memory setting. We show that a variant of the algorithm that uses a batchwise transformation leads to an implementation that is not only faster than state-of-the-art implementations of baseline least-squares solvers, but also scales to much larger matrix sizes. In particular, we show that a Blendenpik-based algorithm can solve least-squares regression problems on dense terabyte-sized (and larger) input matrices, as well as on sparse ill-conditioned matrices, outperforming state-of-the-art least-squares solvers in performance and scalability while demonstrating comparable numerical stability on established accuracy metrics.
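As a rough illustration of the sketch-and-precondition idea behind Blendenpik (not the dissertation's distributed implementation), the following minimal Python sketch uses a Gaussian sketch in place of the randomized Hadamard/DCT transform of the actual algorithm; the function name, the sketch factor, and the tolerances are illustrative assumptions.

# Minimal sketch-and-precondition least squares, in the spirit of Blendenpik.
# A Gaussian sketch stands in for the randomized trigonometric transform; all
# names and defaults here are illustrative, not the dissertation's code.
import numpy as np
from scipy.linalg import qr, solve_triangular
from scipy.sparse.linalg import LinearOperator, lsqr

def sketched_lstsq(A, b, sketch_factor=4, rng=None):
    """Solve min ||Ax - b||_2 for a tall m x n matrix A (m >> n)."""
    rng = np.random.default_rng(rng)
    m, n = A.shape
    s = sketch_factor * n                          # sketch size (common heuristic)
    S = rng.standard_normal((s, m)) / np.sqrt(s)   # Gaussian sketching matrix
    _, R = qr(S @ A, mode='economic')              # R preconditions A: A R^{-1} is well conditioned
    # Run LSQR on the preconditioned operator A R^{-1}.
    M = LinearOperator(
        (m, n),
        matvec=lambda y: A @ solve_triangular(R, y, lower=False),
        rmatvec=lambda z: solve_triangular(R.T, A.T @ z, lower=True),
    )
    y = lsqr(M, b, atol=1e-12, btol=1e-12)[0]
    return solve_triangular(R, y, lower=False)     # undo the preconditioner

Because the sketched QR makes A R^{-1} close to orthonormal, the iterative solve converges in a small, condition-number-independent number of steps; that is the property the batchwise distributed variant described above exploits at scale.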
In Part II, we explore the behavior of randomized block iterative solvers for computing low-rank approximations of dense terabyte-sized matrices. We are particularly interested in how these solvers behave on matrices with clustered singular values. We analyze the scalability and numerical stability of our block iterative solvers and demonstrate their performance across varying spectral gaps. Experiments with real-world large-scale datasets show high-quality approximations for the kernel PCA problem while achieving significant speedups over state-of-the-art direct solvers.
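A textbook randomized block (subspace) iteration of the kind discussed above can be sketched in a few lines; the oversampling and iteration counts below are illustrative defaults, not the dissertation's tuned settings.

# Minimal randomized block iteration for a rank-k approximation.
# Extra power iterations sharpen the subspace when singular values cluster,
# which is the regime Part II studies.
import numpy as np

def randomized_low_rank(A, k, oversample=10, n_iter=4, rng=None):
    """Return U, s, Vt approximating the top-k SVD of A."""
    rng = np.random.default_rng(rng)
    m, n = A.shape
    Omega = rng.standard_normal((n, k + oversample))  # random block of test vectors
    Q, _ = np.linalg.qr(A @ Omega)                    # initial range estimate
    for _ in range(n_iter):
        Q, _ = np.linalg.qr(A.T @ Q)                  # re-orthogonalize each pass
        Q, _ = np.linalg.qr(A @ Q)                    # for numerical stability
    B = Q.T @ A                                       # small (k + p) x n projection
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]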
In Part III, we explore the behavior of large-scale kernel approximations using the Nyström approach to solve the kernel ridge regression (KRR) problem. We demonstrate the scalability of one such Nyström approximation approach, based on the FALKON algorithm, and contrast it with other state-of-the-art approaches to the KRR problem.
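For orientation, a toy Nyström KRR fit is sketched below. FALKON adds a preconditioned conjugate-gradient solver on top of this formulation, which is omitted here; the uniform landmark sampling, the RBF kernel, and all names are illustrative assumptions.

# Toy Nystrom kernel ridge regression with m uniformly sampled landmarks.
# Solves min (1/n)||K_nm a - y||^2 + lam * a^T K_mm a via its normal equations.
import numpy as np

def rbf(X, Y, gamma=1.0):
    """Dense RBF kernel; fine for a demo, too memory-hungry at terascale."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def nystrom_krr(X, y, m, lam=1e-3, gamma=1.0, rng=None):
    rng = np.random.default_rng(rng)
    idx = rng.choice(len(X), size=m, replace=False)  # landmark points
    Xm = X[idx]
    K_nm = rbf(X, Xm, gamma)                         # n x m cross-kernel
    K_mm = rbf(Xm, Xm, gamma)                        # m x m landmark kernel
    A = K_nm.T @ K_nm + lam * len(X) * K_mm          # m x m normal equations
    alpha = np.linalg.solve(A + 1e-10 * np.eye(m), K_nm.T @ y)
    return alpha, Xm

def predict(alpha, Xm, Xtest, gamma=1.0):
    return rbf(Xtest, Xm, gamma) @ alpha

The payoff of the Nyström approach is that only the m x m landmark system is ever solved, so m, not the full dataset size n, governs the cost of the solve.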
Description
May 2018
School of Science
Department
Dept. of Computer Science
Publisher
Rensselaer Polytechnic Institute, Troy, NY
Relationships
Rensselaer Theses and Dissertations Online Collection
Access
CC BY-NC-ND. Users may download and share copies with attribution in accordance with a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 License. No commercial use or derivatives are permitted without the explicit approval of the author.