Fetilda : a framework for fin-tuned embeddings of long financial text documents

Xia, Bolun (Namir)
Other Contributors
Adalı, Sibel
Gupta, Aparna
Strzalkowski, Tomek
Zaki, Mohammed J., 1971-
Issue Date
Computer science
Terms of Use
Attribution-NonCommercial-NoDerivs 3.0 United States
This electronic version is a licensed copy owned by Rensselaer Polytechnic Institute (RPI), Troy, NY. Copyright of original work retained by author.
Unstructured data are increasingly being used across many domains, and textual data in particular have grown in importance in recent years. In financial applications, unstructured text keeps accumulating in the documents that companies regularly file with government regulators such as the Securities and Exchange Commission (SEC). These financial documents are typically quite long, but they usually contain soft information that can be valuable in gauging company performance. This soft information is not captured when predictive analysis relies on numerical data alone. It would therefore be very beneficial to train predictive models on these long financial documents in order to forecast metrics that gauge a company's future performance. Much progress has been made in Natural Language Processing (NLP) through pre-trained language models (LMs) that are trained on huge text corpora. However, that progress still falls short when it comes to representing long documents effectively. This is the focus of this thesis: we study how to learn better models that exploit the information contained in long financial text documents and generate more informative features from text, so that the soft information can be used for various regression tasks. To that end, we propose and implement a novel machine learning framework that divides a long document into chunks, passes the chunks through different LMs, both pre-trained and trained from scratch, uses the LM outputs to generate chunk-level vector representations, and feeds those representations into a self-attention bi-LSTM network to generate a document-level representation.
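The chunk-then-aggregate data flow described above can be illustrated with a minimal sketch. This is not the thesis implementation: the function names (`split_into_chunks`, `toy_embed`, `attention_pool`, `embed_document`) are hypothetical, a toy hashed bag-of-words embedder stands in for the language model, and a simple softmax attention pooling with a fixed query vector stands in for the trained self-attention bi-LSTM. It only shows how a long document might be reduced to one fixed-size vector through chunk-level representations.

```python
import math
from typing import List

Vector = List[float]


def split_into_chunks(tokens: List[str], chunk_size: int) -> List[List[str]]:
    """Split a token list into consecutive, non-overlapping chunks."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def toy_embed(chunk: List[str], dim: int) -> Vector:
    """Hypothetical stand-in for a language model.

    Hashes each token into one of `dim` buckets and returns the
    normalized bucket counts, so every chunk vector sums to 1.
    """
    vec = [0.0] * dim
    for tok in chunk:
        vec[sum(ord(c) for c in tok) % dim] += 1.0
    return [v / len(chunk) for v in vec]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(chunk_vecs: List[Vector], query: Vector) -> Vector:
    """Weighted average of chunk vectors.

    Weights are the softmax of each chunk's dot product with `query`.
    In a trained model the query would be learned; here it is fixed.
    """
    scores = [sum(q * x for q, x in zip(query, vec)) for vec in chunk_vecs]
    weights = softmax(scores)
    dim = len(chunk_vecs[0])
    return [sum(w * vec[d] for w, vec in zip(weights, chunk_vecs)) for d in range(dim)]


def embed_document(text: str, chunk_size: int = 4, dim: int = 8) -> Vector:
    """Document -> chunks -> chunk vectors -> one document vector."""
    tokens = text.split()
    if not tokens:
        raise ValueError("document is empty")
    chunks = split_into_chunks(tokens, chunk_size)
    chunk_vecs = [toy_embed(chunk, dim) for chunk in chunks]
    query = [d / dim for d in range(dim)]  # placeholder, not a learned parameter
    return attention_pool(chunk_vecs, query)


if __name__ == "__main__":
    doc = "net income rose while credit losses fell sharply this year"
    print(embed_document(doc))
```

The resulting document vector has the same dimensionality regardless of document length, which is what lets it serve as a feature vector for a downstream regression model.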
To evaluate our deep learning framework, we experiment on two datasets: one of 10-K financial reports published annually by US banks, and another of 10-K reports published by publicly traded US companies. Our experimental results show that our approach outperforms strong baseline approaches to textual modeling as well as a baseline regression approach that uses only quantitative data. Our work shows that using pre-trained, domain-specific, and fine-tuned LMs to represent long texts improves the quality of the generated textual features and the performance of prediction tasks.
May 2022
School of Science
Dept. of Computer Science
Rensselaer Polytechnic Institute, Troy, NY
Rensselaer Theses and Dissertations Online Collection
CC BY-NC-ND. Users may download and share copies with attribution in accordance with a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 license. No commercial use or derivatives are permitted without the explicit approval of the author.