Unsupervised learning : evaluation, distributed setting, and privacy

Tsikhanovich, Maksim
Other Contributors
Magdon-Ismail, Malik
Ji, Heng
Xia, Lirong
Mitchell, John E.
Issue Date
Computer science
Terms of Use
Attribution-NonCommercial-NoDerivs 3.0 United States
This electronic version is a licensed copy owned by Rensselaer Polytechnic Institute, Troy, NY. Copyright of original work retained by author.
Full Citation
Chapter 1 is an overview of topic modeling as a set of unsupervised learning tasks. We present the Latent Dirichlet Allocation (LDA) model, and show how k-means as well as non-negative matrix factorization (NMF) can also be interpreted as topic models. We present a variety of quantitative and qualitative evaluation techniques that aim to capture different properties of the model. Finally, we show how we can leverage evaluation techniques and hyperparameter optimization tools to answer typical parameter selection questions. We hope to facilitate future research on topic modeling by encapsulating each of the above parts as a robust and re-usable set of tools, so that a future researcher can focus on one part at a time.
In Chapter 3 we study empirical measures of Distributional Differential Privacy. We want to measure to what extent one participant in a distributed computation can correctly identify the presence of a single document in another participant’s database. We propose a measure based on the p-value of the Kolmogorov-Smirnov two-sample hypothesis test. We compare our measure to existing measures such as Differential Privacy, and use it to evaluate the privacy of our online algorithms.
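As a rough illustration of the kind of statistic involved, the sketch below computes the two-sample Kolmogorov-Smirnov statistic and an approximate p-value in plain Python. It is not the thesis's measure: the function name `ks_two_sample`, the samples `without_doc` and `with_doc`, and the idea of observing one scalar per run are all assumptions made for the example, and the p-value uses the standard asymptotic Kolmogorov series, which is only approximate for small samples.

```python
import math
import random
from bisect import bisect_right


def ks_two_sample(a, b):
    """Return (D, p) for the two-sample Kolmogorov-Smirnov test.

    D is the largest gap between the two empirical CDFs. p comes from the
    asymptotic Kolmogorov series with a common small-sample correction,
    so it is an approximation rather than an exact p-value.
    """
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    # Evaluate both empirical CDFs at every observed value.
    d = max(abs(bisect_right(a, x) / n - bisect_right(b, x) / m) for x in a + b)
    ne = n * m / (n + m)  # effective sample size
    lam = (math.sqrt(ne) + 0.12 + 0.11 / math.sqrt(ne)) * d
    if lam < 0.2:
        # The series converges slowly here and its value is 1 to within rounding.
        return d, 1.0
    p = 2.0 * sum((-1) ** (j - 1) * math.exp(-2.0 * (j * lam) ** 2)
                  for j in range(1, 101))
    return d, min(1.0, max(0.0, p))


# Hypothetical data: one scalar observed by another participant per run,
# sampled with and without a particular document in the database.
rng = random.Random(0)
without_doc = [rng.gauss(0.0, 1.0) for _ in range(200)]
with_doc = [rng.gauss(1.0, 1.0) for _ in range(200)]

d_shift, p_shift = ks_two_sample(without_doc, with_doc)
print(f"D = {d_shift:.3f}, p = {p_shift:.3g}")
```

Under this reading, a small p-value means the observer can tell the two situations apart, so the document's presence leaks; a large p-value means the two sets of observations are statistically hard to distinguish.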
In Chapter 2 we present two algorithms for the data-distributed non-negative matrix factorization (NMF) task, and one for the singular value decomposition (SVD). In the offline setting, M parties have already computed NMF models of their local data. Our algorithm ensembles these into a global model by minimizing an upper bound on the reconstruction error for the original data in terms of reconstruction error on the local models. In the online setting, the M parties are all participating in a synchronous distributed computation. We present an algorithm that reconstructs the centralized NMF solution exactly if given the same initialization. Finally, we present an online SVD algorithm. We compare these algorithms in terms of how well they initialize NMF.
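To make the idea of matching a centralized NMF solution concrete, here is a minimal sketch of one way such agreement can arise. It is an assumption for illustration, not the algorithm from the thesis: it uses the standard multiplicative updates for the Frobenius objective, splits the rows of the data matrix V across parties, and lets the parties share only the sums of their local products W_i^T V_i and W_i^T W_i each round. All function and variable names are hypothetical.

```python
import random


def transpose(A):
    return [list(r) for r in zip(*A)]


def matmul(A, B):
    Bt = list(zip(*B))
    return [[sum(x * y for x, y in zip(row, col)) for col in Bt] for row in A]


def add(A, B):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]


def mult_update(X, num, den, eps=1e-12):
    # Elementwise X * num / den; eps guards against division by zero.
    return [[x * n / (d + eps) for x, n, d in zip(rx, rn, rd)]
            for rx, rn, rd in zip(X, num, den)]


def nmf_central(V, W, H, iters):
    for _ in range(iters):
        Wt = transpose(W)
        H = mult_update(H, matmul(Wt, V), matmul(matmul(Wt, W), H))
        Ht = transpose(H)
        W = mult_update(W, matmul(V, Ht), matmul(W, matmul(H, Ht)))
    return W, H


def nmf_distributed(V_parts, W_parts, H, iters):
    for _ in range(iters):
        WtV, WtW = None, None
        for V_i, W_i in zip(V_parts, W_parts):
            Wt = transpose(W_i)
            a, b = matmul(Wt, V_i), matmul(Wt, W_i)  # local k x d and k x k
            WtV = a if WtV is None else add(WtV, a)
            WtW = b if WtW is None else add(WtW, b)
        # Every party applies the same H update from the shared sums.
        H = mult_update(H, WtV, matmul(WtW, H))
        Ht = transpose(H)
        HHt = matmul(H, Ht)
        # Each party updates its own rows of W using only local data and H.
        W_parts = [mult_update(W_i, matmul(V_i, Ht), matmul(W_i, HHt))
                   for V_i, W_i in zip(V_parts, W_parts)]
    return W_parts, H


rng = random.Random(0)
n, d, k, M = 12, 6, 3, 3
V = [[rng.random() for _ in range(d)] for _ in range(n)]
W0 = [[rng.random() for _ in range(k)] for _ in range(n)]
H0 = [[rng.random() for _ in range(d)] for _ in range(k)]
step = n // M
V_parts = [V[i:i + step] for i in range(0, n, step)]
W_parts = [W0[i:i + step] for i in range(0, n, step)]

W_c, H_c = nmf_central(V, W0, H0, 20)
W_d, H_d = nmf_distributed(V_parts, W_parts, H0, 20)
W_d_flat = [row for part in W_d for row in part]
gap_H = max(abs(x - y) for rc, rd in zip(H_c, H_d) for x, y in zip(rc, rd))
gap_W = max(abs(x - y) for rc, rd in zip(W_c, W_d_flat) for x, y in zip(rc, rd))
print(gap_H, gap_W)
```

In exact arithmetic the two runs coincide, because W^T V and W^T W are sums of per-party terms; in floating point they can differ by rounding, since the summation order changes. Note that in this sketch the parties exchange only k x d and k x k matrices, never raw rows of V.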
May 2018
School of Science
Dept. of Computer Science
Rensselaer Polytechnic Institute, Troy, NY
Rensselaer Theses and Dissertations Online Collection
CC BY-NC-ND. Users may download and share copies with attribution in accordance with a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 License. No commercial use or derivatives are permitted without the explicit approval of the author.