Long contextualized document embeddings for financial reports
Author
Rawte, Vipula Dattatrey
Other Contributors
Zaki, Mohammed J., 1971-; Gupta, Aparna; Szymański, Bolesław; Yener, Bülent, 1959-
Date Issued
2021-05
Subject
Computer science
Degree
MS
Terms of Use
This electronic version is a licensed copy owned by Rensselaer Polytechnic Institute, Troy, NY. Copyright of original work retained by author. Attribution-NonCommercial-NoDerivs 3.0 United States
Abstract
With rapid advancements in Natural Language Processing (NLP), Transformer-based techniques have been applied to a variety of downstream tasks such as question answering and text classification. However, the major challenge in applying these models to long texts is the restriction on the maximum number of input tokens, typically 512.

To address this issue, in this thesis we propose three methods to handle documents of more than 512 tokens and to capture the dynamic context of the text. Our first method is a simple way to create a dynamic word list using the BERT model. However, it cannot capture new words missing from the original vocabulary. Hence, in our second method, we represent documents hierarchically by pooling word and sentence embeddings. Since this method incurs substantial computational overhead, our third method uses chunk-based document embeddings with chunk-level attention using BiLSTMs. We evaluate the effectiveness of our methods on 10-K and 8-K company filing reports by predicting six key financial ratios: ROA, EPS, Leverage, Tobin's Q Ratio, Tier 1 Capital, and Z-Score. Our experimental results show that our word list method is comparable with the hand-curated finance sentiment word list. Additionally, the TF-IDF method using only words is a strong competitive baseline and performs better than Transformer-based language models.
Description
May 2021; School of Science
Department
Dept. of Computer Science
Publisher
Rensselaer Polytechnic Institute, Troy, NY
Relationships
Rensselaer Theses and Dissertations Online Collection
Access
CC BY-NC-ND. Users may download and share copies with attribution in accordance with a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 License. No commercial use or derivatives are permitted without the explicit approval of the author.
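Illustrative note on the abstract: the idea of handling documents longer than 512 tokens by working on chunks can be sketched roughly as below. This is a minimal sketch under stated assumptions, not the thesis implementation. The names `split_into_chunks`, `mean_pool`, `toy_encoder`, and `embed_document` are hypothetical, the encoder is a placeholder rather than a Transformer model, and plain mean pooling is swapped in for the chunk-level attention with BiLSTMs that the abstract describes.

```python
from typing import Callable, List, Sequence

MAX_TOKENS = 512  # typical Transformer input limit cited in the abstract


def split_into_chunks(tokens: Sequence[str],
                      max_tokens: int = MAX_TOKENS) -> List[List[str]]:
    """Split a token sequence into consecutive chunks of at most max_tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return [list(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]


def mean_pool(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average equal-length vectors element-wise.

    Assumption: this simple pooling stands in for chunk-level attention.
    """
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def toy_encoder(chunk: Sequence[str]) -> List[float]:
    """Hypothetical placeholder for a chunk encoder.

    Returns [chunk length, number of distinct tokens] instead of a real
    contextual embedding.
    """
    return [float(len(chunk)), float(len(set(chunk)))]


def embed_document(tokens: Sequence[str],
                   encoder: Callable[[Sequence[str]], List[float]] = toy_encoder,
                   max_tokens: int = MAX_TOKENS) -> List[float]:
    """Encode each chunk separately, then pool chunk vectors into one vector."""
    chunks = split_into_chunks(tokens, max_tokens)
    return mean_pool([encoder(chunk) for chunk in chunks])
```

In practice, `toy_encoder` would be replaced by a function that returns a fixed-size embedding for each chunk, and the pooling step by a learned aggregation over the chunk sequence.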