Unsupervised learning : evaluation, distributed setting, and privacy

Tsikhanovich, Maksim

Authors

Tsikhanovich, Maksim

Files

178938_Tsikhanovich_rpi_0185E_11252.pdf (1.97 MB)

Other Contributors

Magdon-Ismail, Malik
Ji, Heng
Xia, Lirong
Mitchell, John E.

Issue Date

2018-05

Keywords

Computer science

Degree

PhD

Terms of Use

Attribution-NonCommercial-NoDerivs 3.0 United States
This electronic version is a licensed copy owned by Rensselaer Polytechnic Institute, Troy, NY. Copyright of original work retained by author.

URI

https://hdl.handle.net/20.500.13015/2175

Abstract

Chapter 1 is an overview of topic modeling as a set of unsupervised learning tasks. We present the Latent Dirichlet Allocation (LDA) model, and show how k-means as well as non-negative matrix factorization (NMF) can also be interpreted as topic models. We present a variety of quantitative and qualitative evaluation techniques that aim to capture different properties of the model. Finally, we show how we can leverage evaluation techniques and hyperparameter optimization tools to answer typical parameter selection questions. We hope to facilitate future research on topic modeling by encapsulating each of the above parts as a robust and reusable set of tools, so that a future researcher can focus on one part at a time.
In Chapter 2 we present two algorithms for the data-distributed non-negative matrix factorization (NMF) task, and one for the singular value decomposition (SVD). In the offline setting, M parties have already computed NMF models of their local data. Our algorithm ensembles these into a global model by minimizing an upper bound on the reconstruction error for the original data, expressed in terms of the reconstruction error on the local models. In the online setting, the M parties all participate in a synchronous distributed computation. We present an algorithm that reconstructs the centralized NMF solution exactly if given the same initialization. Finally, we present an online SVD algorithm. We compare these algorithms in terms of how well they initialize NMF.
In Chapter 3 we study empirical measures of Distributional Differential Privacy. We want to measure to what extent one participant in a distributed computation can correctly identify the presence of a single document in another participant’s database. We propose a measure based on the p-value of the Kolmogorov-Smirnov two-sample hypothesis test. We compare our measure to existing measures such as Differential Privacy, and use it to evaluate the privacy of our online algorithms.

Description

May 2018
School of Science

Department

Dept. of Computer Science

Publisher

Rensselaer Polytechnic Institute, Troy, NY

Relationships

Rensselaer Theses and Dissertations Online Collection

Access

CC BY-NC-ND. Users may download and share copies with attribution in accordance with a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 License. No commercial use or derivatives are permitted without the explicit approval of the author.

Collections

RPI Theses Open Access
RPI Theses Online (Complete)
