Fetilda : a framework for fin-tuned embeddings of long financial text documents
Authors
Xia, Bolun (Namir)
Issue Date
2022-05
Type
Thesis
Electronic thesis
Language
en_US
Keywords
Computer science
Abstract
Unstructured data are increasingly used across domains, and textual data in particular have grown in importance in recent years. In financial applications, unstructured text keeps accumulating, for example in the documents that companies regularly file with government regulators such as the Securities and Exchange Commission (SEC). These documents are typically quite long, but they usually contain soft information that can be valuable in gauging company performance. This soft information is not taken into consideration when predictive analysis relies only on numerical data. It would therefore be very beneficial to train predictive models on these long financial documents in order to forecast metrics that gauge a company's future performance. Much progress has been made in Natural Language Processing (NLP) with pre-trained language models (LMs) trained on huge text corpora. However, that progress still falls short when it comes to effectively representing long documents. This is the focus of this thesis: we study how to learn better models that exploit the information contained in long financial text documents and generate more informative features from text, so that the soft information can be used for various regression tasks. To that end, we propose and implement a novel machine learning framework that divides a long document into chunks, passes the chunks through different LMs (both pre-trained and trained from scratch), uses the LM outputs to generate chunk-level vector representations, and feeds those representations into a self-attention bi-LSTM network to generate a document-level representation.
To evaluate our deep learning framework, we experiment on two datasets: one of 10-K financial reports published annually by US banks, and another of 10-K reports published by publicly traded US companies. Our experimental results show that our approach outperforms strong baseline approaches to textual modeling, as well as a baseline regression approach that uses only quantitative data. Our work shows that using pre-trained, domain-specific, and fine-tuned LMs to represent long texts improves the quality of the generated textual features and the performance of prediction tasks.
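The chunk-then-aggregate idea described in the abstract can be illustrated with a minimal, standard-library Python sketch. It is not the thesis implementation: the language-model encoder is replaced by a toy deterministic embedding, the bi-LSTM is omitted entirely, and the attention query is a fixed vector rather than a learned one. All function names, the chunk size, and the sample sentence are hypothetical and chosen only for illustration.

```python
import math


def split_into_chunks(tokens, chunk_size):
    """Split a token list into consecutive chunks of at most chunk_size tokens."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def toy_chunk_embedding(chunk, dim=4):
    """Stand-in for an LM encoder (assumption): average a deterministic per-token vector."""
    vec = [0.0] * dim
    for tok in chunk:
        code = sum(map(ord, tok))
        for j in range(dim):
            vec[j] += ((code * (j + 1)) % 7) / 7.0
    return [v / len(chunk) for v in vec]


def attention_pool(chunk_vecs, query):
    """Softmax-weighted average of chunk vectors, scored by dot product with a query."""
    scores = [sum(a * b for a, b in zip(v, query)) for v in chunk_vecs]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(chunk_vecs[0])
    doc_vec = [sum(w * v[j] for w, v in zip(weights, chunk_vecs)) for j in range(dim)]
    return doc_vec, weights


# Hypothetical input: a real 10-K report would contain many thousands of tokens.
tokens = ("the bank reported higher net interest income "
          "and lower credit losses this year").split()
chunks = split_into_chunks(tokens, chunk_size=5)
chunk_vecs = [toy_chunk_embedding(c) for c in chunks]
# Fixed query for illustration; in a trained model this would be learned.
doc_vec, weights = attention_pool(chunk_vecs, query=[1.0, 0.5, 0.25, 0.125])
```

Here `doc_vec` plays the role of the document-level representation that would be passed to a downstream regression model, and `weights` indicates how much each chunk contributes to it.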
Description
May 2022
School of Science
Publisher
Rensselaer Polytechnic Institute, Troy, NY