Noise reduction in user generated datasets

Authors
Gutierrez, Louis Alberto
Other Contributors
Krishnamoorthy, M. S.
Eglash, Ron, 1958-
Spooner, David
Sawyer, Shayla Maya Louise
Issue Date
2014-08
Keywords
Computer science
Degree
PhD
Terms of Use
Attribution-NonCommercial-NoDerivs 3.0 United States
This electronic version is a licensed copy owned by Rensselaer Polytechnic Institute, Troy, NY. Copyright of original work retained by author.
Full Citation
Gutierrez, Louis Alberto. Noise reduction in user generated datasets. PhD thesis, Rensselaer Polytechnic Institute, Troy, NY, August 2014.
Abstract
The effect of modeling data noise with the Inverse Gaussian Distribution is twofold. First, statistical methods for mining, measuring, and analyzing data that are predicated on a Normal (Gaussian) distribution become less effective, because the noise distribution is non-Gaussian. Second, new statistical methods predicated on the Inverse Gaussian Distribution can be devised to mine, measure, and analyze large user-generated datasets with greater statistical integrity. Specifically, in this research we demonstrate that predictive models can be developed by modeling the historical performance of an evolving dataset on the Inverse Gaussian Distribution, and then used to pre-process the dataset for noise.
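A minimal sketch of the pre-processing idea described above, not the author's implementation: fit an Inverse Gaussian to a per-record noise score and flag records in the fitted distribution's upper tail. The noise_scores array and the 0.95 cutoff are illustrative assumptions.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-in for a per-record noise measure taken from an evolving dataset.
noise_scores = stats.invgauss.rvs(mu=0.5, scale=2.0, size=10_000, random_state=rng)

# Fit the Inverse Gaussian; scipy returns the shape parameter mu plus loc and scale.
mu, loc, scale = stats.invgauss.fit(noise_scores)

# Records whose noise score exceeds the fitted 95th percentile are treated as noise.
threshold = stats.invgauss.ppf(0.95, mu, loc=loc, scale=scale)
clean = noise_scores[noise_scores <= threshold]
print(f"kept {clean.size} of {noise_scores.size} records, threshold={threshold:.3f}")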
The findings of this research have implications for data mining, analysis, and prediction. When employing standard statistical methods, a developer or statistician typically assumes that the data being analyzed is normally distributed; when that assumption fails, the results can be misleading. Because extremely large datasets are impractical to analyze exhaustively, such statistical errors are likely to go unnoticed. Moreover, the results of this research suggest that noise tends toward the adversarial case for Normally distributed statistical methods, and thus can introduce the largest possible margin of error. By modeling noise after the Inverse Gaussian Distribution, statistical methods can be adjusted to focus on the signal and minimize the influence of noise.
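An illustrative sketch of the distributional check implied above: compare how well a Normal fit and an Inverse Gaussian fit describe the same noise sample via the Kolmogorov-Smirnov distance. The sample here is synthetic; in practice it would be a noise measure drawn from the dataset under study.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = stats.invgauss.rvs(mu=0.4, scale=3.0, size=5_000, random_state=rng)

# Kolmogorov-Smirnov distance under each fitted model (smaller means a better fit).
norm_params = stats.norm.fit(sample)
ig_params = stats.invgauss.fit(sample)
d_norm = stats.kstest(sample, 'norm', args=norm_params).statistic
d_ig = stats.kstest(sample, 'invgauss', args=ig_params).statistic
print(f"KS distance: Normal={d_norm:.3f}, Inverse Gaussian={d_ig:.3f}")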
The purpose of this research is to address the issue of noise and its propagation in massively large data sets. The claim is made that larger datasets disproportionately attract data noise, and intuitively, this is because malicious contributors, machine errors, Observer Bias, Groupthink, are synthesized, compelled, incentivised---by sheer numbers---to contribute with higher frequency to larger datasets. We demonstrate this empirically by analyzing large user generated datasets from Stack Exchange, Yelp, Amazon, as well as machine generated datasets from National Energy Technology Lab (NETL). In all these datasets we draw one unifying property, in addition to having exceptionally high levels of noise; the noise distribution is non-Gaussian. Moreover, the noise - which is characterized by an initial surge, a quick decline, and then a slow (slower than exponential) descent towards zero - strictly follows an Inverse Gaussian Distribution.
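For reference, the standard two-parameter Inverse Gaussian density (mean \mu, shape \lambda) exhibits exactly this surge-then-heavy-tail profile; the parameterization used in the thesis itself may differ:

f(x; \mu, \lambda) = \sqrt{\frac{\lambda}{2\pi x^{3}}} \exp\!\left(-\frac{\lambda (x-\mu)^{2}}{2\mu^{2} x}\right), \qquad x > 0.

The x^{-3/2} factor drives the rapid rise and decline near the origin, while the exponential term decays more slowly than a Gaussian tail, giving the slower-than-exponential descent toward zero described above.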
Description
August 2014
School of Science
Department
Dept. of Computer Science
Publisher
Rensselaer Polytechnic Institute, Troy, NY
Relationships
Rensselaer Theses and Dissertations Online Collection
Access
CC BY-NC-ND. Users may download and share copies with attribution in accordance with a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 License. No commercial use or derivatives are permitted without the explicit approval of the author.