Edits based categorization of crowd sourced document corpora with application to Wikipedia

Authors
Fang, Yue
Other Contributors
Magdon-Ismail, Malik
Gittens, Alex
Ji, Heng
Issue Date
2018-12
Keywords
Computer science
Degree
MS
Terms of Use
This electronic version is a licensed copy owned by Rensselaer Polytechnic Institute, Troy, NY. Copyright of original work retained by author.
Abstract
Currently, most article hierarchies are constructed from document contents, which can be computationally expensive and memory intensive. More importantly, not all contents can be acquired, owing to memory capacity or privacy restrictions. When articles are crowd-created, much more information about article relationships is available in editor behavior. We propose a new method that constructs the article hierarchy using only editors' revision histories, and we apply it to the English Wikipedia article corpus. The accuracy of this method is verified by comparing the revision-history-generated tree to the human-defined Wikipedia category network. We also compare our approach to content-based categorization. The results suggest that revision-history-generated categorization can produce an article structure comparable to that of content-generated categorization, and thus offers a possibility for hierarchical clustering when it is impossible to extract or store article contents.
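Illustration only: the abstract does not state how the hierarchy is derived from revision histories. The minimal Python sketch below assumes one plausible approach, namely Jaccard similarity between the sets of editors who revised each article, followed by greedy agglomerative merging. The function names, the toy editor data, and the choice of similarity measure are all assumptions made for illustration and should not be read as the thesis's actual method.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two sets of editor names."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_hierarchy(editors_by_article):
    """Greedy agglomerative clustering over editor sets (hypothetical sketch).

    Each cluster is a pair (tree, pooled_editor_set). At every step the two
    clusters with the most similar pooled editor sets are merged, until a
    single nested-tuple tree remains. No article text is ever inspected.
    """
    if not editors_by_article:
        raise ValueError("need at least one article")
    # Sort by title so the result is deterministic for a given input.
    clusters = [(title, set(eds)) for title, eds in sorted(editors_by_article.items())]
    while len(clusters) > 1:
        # Pick the pair of clusters whose editor sets overlap the most.
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: jaccard(clusters[p[0]][1], clusters[p[1]][1]),
        )
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] | clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


# Toy revision data (assumed): article title -> editors who revised it.
revisions = {
    "Python": {"ann", "bob", "cat"},
    "Java": {"ann", "bob", "dan"},
    "Mozart": {"eve", "fay"},
    "Beethoven": {"eve", "fay", "gus"},
}

tree = build_hierarchy(revisions)
print(tree)
```

On this toy input the two composer articles share editors and are grouped together, as are the two programming-language articles, before the two groups are joined at the root. A tree of this kind could then be compared against a reference category network, which is the sort of evaluation the abstract describes.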
Description
December 2018
School of Science
Department
Dept. of Computer Science
Publisher
Rensselaer Polytechnic Institute, Troy, NY
Relationships
Rensselaer Theses and Dissertations Online Collection
Access
Restricted to current Rensselaer faculty, staff and students. Access inquiries may be directed to the Rensselaer Libraries.