
    Noise reduction in user generated datasets

    Author
    Gutierrez, Louis Alberto
    View/Open
    172927_Gutierrez_rpi_0185E_10406.pdf (6.211Mb)
    Other Contributors
    Krishnamoorthy, M. S.; Eglash, Ron, 1958-; Spooner, David; Sawyer, Shayla Maya Louise
    Date Issued
    2014-08
    Subject
    Computer science
    Degree
    PhD
    Terms of Use
    This electronic version is a licensed copy owned by Rensselaer Polytechnic Institute, Troy, NY. Copyright of original work retained by author.
    URI
    https://hdl.handle.net/20.500.13015/1167
    Abstract
    The effect of modeling data noise using the Inverse Gaussian Distribution is twofold. First, statistical methods for mining, measuring and analyzing data that are predicated on Normally distributed (Gaussian) data become less effective, given that the noise distribution is non-Gaussian. Second, new statistical methods predicated on the Inverse Gaussian Distribution can be devised to mine, measure and analyze large user generated datasets with greater statistical integrity. Specifically, in this research, we demonstrate that predictive models can be developed by artificially modeling the historical performance of an evolving dataset on the Inverse Gaussian Distribution, and then used to pre-process the dataset for noise.

    The overall implications of the findings of this research are in the fields of data mining, analysis and prediction. When employing complex statistical methods, a developer or statistician works on the assumption that the data being analyzed is normally distributed; if that is not the case, the results can be misleading. And since extremely large datasets are impractical to analyze thoroughly, the statistical error will likely go unnoticed. Moreover, the results of this research suggest that noise leans towards the adversarial case for statistical methods that assume a Normal distribution, and thus could potentially produce the largest possible margin of error. By modeling noise after the Inverse Gaussian Distribution, statistical methods can be modified to work optimally, with a greater focus on the signal and a minimization of noise.

    The purpose of this research is to address the issue of noise and its propagation in massively large datasets. The claim is made that larger datasets disproportionately attract data noise; intuitively, this is because malicious contributors, machine errors, Observer Bias and Groupthink are synthesized, compelled and incentivised, by sheer numbers, to contribute with higher frequency to larger datasets. We demonstrate this empirically by analyzing large user generated datasets from Stack Exchange, Yelp and Amazon, as well as machine generated datasets from the National Energy Technology Lab (NETL). Across all these datasets we find one unifying property in addition to exceptionally high levels of noise: the noise distribution is non-Gaussian. Moreover, the noise, which is characterized by an initial surge, a quick decline, and then a slow (slower than exponential) descent towards zero, strictly follows an Inverse Gaussian Distribution.
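    As a rough illustration of what modeling noise on the Inverse Gaussian Distribution can involve, the sketch below evaluates the standard Inverse Gaussian density, draws synthetic samples, and recovers the two parameters with the closed-form maximum-likelihood estimates. It is not taken from the thesis: the function names, the parameter values (mu = 1.0, lam = 3.0), and the use of synthetic rather than real noise measurements are all assumptions made for the example.

```python
import math
import random


def invgauss_pdf(x, mu, lam):
    """Density of the Inverse Gaussian distribution IG(mu, lam) at x."""
    if x <= 0:
        return 0.0
    return math.sqrt(lam / (2 * math.pi * x ** 3)) * math.exp(
        -lam * (x - mu) ** 2 / (2 * mu ** 2 * x)
    )


def invgauss_sample(mu, lam, rng):
    """Draw one IG(mu, lam) variate (Michael, Schucany and Haas method)."""
    y = rng.gauss(0.0, 1.0) ** 2
    x = mu + (mu ** 2 * y) / (2 * lam) - (mu / (2 * lam)) * math.sqrt(
        4 * mu * lam * y + mu ** 2 * y ** 2
    )
    # Accept the smaller root with probability mu / (mu + x),
    # otherwise return the larger root mu^2 / x.
    if rng.random() <= mu / (mu + x):
        return x
    return mu ** 2 / x


def invgauss_fit(samples):
    """Closed-form maximum-likelihood estimates (mu_hat, lam_hat)."""
    n = len(samples)
    mu_hat = sum(samples) / n
    lam_hat = n / sum(1.0 / s - 1.0 / mu_hat for s in samples)
    return mu_hat, lam_hat


# Hypothetical stand-in for observed noise: synthetic draws from IG(1.0, 3.0).
rng = random.Random(0)
data = [invgauss_sample(1.0, 3.0, rng) for _ in range(20000)]
mu_hat, lam_hat = invgauss_fit(data)
print(f"mu_hat={mu_hat:.3f}, lam_hat={lam_hat:.3f}")
```

    With real data, `data` would be replaced by positive-valued noise measurements. A fitted density that tracks the observed histogram much better than a Normal fit would be consistent with the non-Gaussian behavior the abstract reports.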
    Description
    August 2014; School of Science
    Department
    Dept. of Computer Science
    Publisher
    Rensselaer Polytechnic Institute, Troy, NY
    Relationships
    Rensselaer Theses and Dissertations Online Collection
    Access
    Users may download and share copies with attribution in accordance with a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 License. No commercial use or derivatives are permitted without the explicit approval of the author.
    Collections
    • RPI Theses Online (Complete)
    • RPI Theses Open Access
