Edits based categorization of crowd sourced document corpora with application to Wikipedia

Authors
Fang, Yue
Issue Date
2018-12
Type
Electronic thesis
Thesis
Language
ENG
Keywords
Computer science
Abstract
Currently, most article hierarchies are constructed from document contents, which can be computationally challenging and memory intensive. More importantly, not all contents can be acquired, because of memory capacity or privacy restrictions. When articles are crowd-created, much more information on article relationships is available in editor behavior. We propose a new method that constructs the article hierarchy using only editors' revision histories, and we apply it to the English Wikipedia article corpus. The accuracy of this method is verified by comparing the revision-history-generated tree to the human-defined Wikipedia category network. We also compare our approach to content-based categorization. The results suggest that revision-history-generated categorization can produce an article structure comparable to that of content-generated categorization, and thus offers the possibility of hierarchical clustering when it is impossible to extract or store article contents.
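The abstract does not spell out the algorithm, so the following is only an illustrative sketch of the general idea, not the thesis's actual method. It assumes that two articles are related when many of the same editors have revised both, measures that overlap with Jaccard similarity, and builds a tree by greedy agglomerative merging. The function names (jaccard, build_hierarchy), the similarity measure, the merge rule, and the toy data are all assumptions made for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two sets of editor IDs (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_hierarchy(editors_by_article):
    """Greedy agglomerative clustering on editor overlap (illustrative only).

    editors_by_article: dict mapping article title -> set of editor IDs.
    Returns a nested tuple tree whose leaves are article titles.
    No article text is used, only who edited what.
    """
    # Each cluster is a pair: (subtree, pooled set of editors in that subtree).
    clusters = [(title, set(editors))
                for title, editors in sorted(editors_by_article.items())]
    while len(clusters) > 1:
        # Pick the pair of clusters whose pooled editor sets overlap the most.
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: jaccard(clusters[p[0]][1], clusters[p[1]][1]),
        )
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] | clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


# Hypothetical toy revision data: article title -> editors who revised it.
revisions = {
    "Python (language)": {"ann", "bob", "cy"},
    "Java (language)": {"bob", "cy", "dee"},
    "Mozart": {"eve", "fay"},
    "Beethoven": {"eve", "fay", "gus"},
}
tree = build_hierarchy(revisions)
print(tree)
```

On this toy input the two composer articles are grouped together and the two programming-language articles are grouped together, because each pair shares editors and the pairs share none with each other. A real corpus would need a scalable similarity computation and a principled linkage criterion, which this sketch does not attempt.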
Description
December 2018
School of Science
Publisher
Rensselaer Polytechnic Institute, Troy, NY