Neural language models are designed to learn word and sequence representations from large volumes of text. Such volumes of data are typically obtained by merging multiple heterogeneous corpora from the Web.
However, language use is entrenched in the social context in which it appears, and linguistic variation reflects social differentiation along dimensions such as ethnicity, gender, sex, and social class.
Words may have their meanings altered not only by the lexical context but also by the social context in which they emerge, becoming associated with the group or community that uses them.
These changes are the object of study of computational semantic shift methods, most of which are currently designed to handle temporal language change, or linguistic evolution, with little effort devoted to characterizing changes across domains. In this work, we proposed a method to improve current semantic shift techniques in cross-domain tasks and demonstrated its capability in unsupervised feature learning tasks. We focused on addressing the two major challenges of this problem: the assumption of gradual language change used in temporal analysis, and the lack of labeled data for supervised learning.
In particular, we designed a self-supervised learning method to obtain monolingual mappings of words, and showed that it surpasses the performance of state-of-the-art baselines on both temporal and cross-domain semantic shift detection.
Moreover, we designed a framework for explaining semantic shifts based on the learned mappings: it identifies the words that are semantically shifted across input sources and explains each shift through representative words and example sentences. Finally, we confirmed that semantic shift can be used for domain differentiation by applying it in a study of the credibility of scientific news sources. The study showed that by using semantic shift in conjunction with citation and copy behavior as measures of concordance among news sources, we could learn representations that capture relevant information about them, such as credibility and political bias, yielding clusters of sources that share similar traits.
A qualitative analysis of the observed clusters using semantic shift allowed us to characterize clusters of political conspiracy theorists and sources that propagate pseudoscience/health conspiracy theories.
December 2022; School of Science
Dept. of Computer Science
Rensselaer Polytechnic Institute, Troy, NY
Rensselaer Theses and Dissertations Online Collection
Users may download and share copies with attribution in accordance with a Creative Commons
Attribution-Noncommercial-No Derivative Works 3.0 license. No commercial use or derivatives
are permitted without the explicit approval of the author.