Long contextualized document embeddings for financial reports

Authors
Rawte, Vipula Dattatrey
Issue Date
2021-05
Type
Electronic thesis
Thesis
Language
ENG
Keywords
Computer science
Abstract
With rapid advancements in Natural Language Processing (NLP), Transformer-based techniques have been applied to a variety of downstream tasks, such as question answering and text classification. However, the major challenge in applying these models to long texts is the restriction on the maximum number of input tokens, typically 512. To address this issue, this thesis proposes three methods that handle documents longer than 512 tokens and capture the dynamic context of the text. The first method is a simple way to create a dynamic word list using the BERT model. However, it cannot capture new words that are missing from the original vocabulary. Hence, the second method represents documents hierarchically by pooling word and sentence embeddings. Since this method incurs substantial computational overhead, the third method uses chunk-based document embeddings with chunk-level attention using BiLSTMs. We evaluate the effectiveness of our methods on 10-K and 8-K company filing reports by predicting six key financial ratios: ROA, EPS, Leverage, Tobin's Q Ratio, Tier 1 Capital, and Z-Score. Our experimental results show that our word list method is comparable to the hand-curated finance sentiment word list. Additionally, the TF-IDF method using only words is a strong competitive baseline and performs better than Transformer-based language models.
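The chunking idea behind the third method can be illustrated with a minimal sketch. It is not the thesis implementation: the function names `split_into_chunks` and `mean_pool`, the optional `overlap` parameter, and the use of plain mean pooling in place of chunk-level attention with BiLSTMs are all assumptions made for illustration. It only shows how a document longer than 512 tokens might be cut into pieces that each fit a Transformer's input limit, and how per-chunk vectors could then be combined into one document vector.

```python
from typing import List, Sequence

MAX_TOKENS = 512  # typical Transformer input limit cited in the abstract


def split_into_chunks(tokens: Sequence[str], chunk_size: int = MAX_TOKENS,
                      overlap: int = 0) -> List[List[str]]:
    """Split a token sequence into consecutive chunks of at most chunk_size tokens.

    `overlap` (hypothetical parameter) repeats that many tokens between
    neighbouring chunks so context is not cut abruptly at chunk borders.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the current chunk already reaches the end of the document
    return chunks


def mean_pool(chunk_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average chunk vectors element-wise into a single document vector.

    Assumption: this simple average stands in for the chunk-level
    attention over BiLSTMs described in the abstract.
    """
    if not chunk_vectors:
        raise ValueError("need at least one chunk vector")
    dim = len(chunk_vectors[0])
    count = len(chunk_vectors)
    return [sum(vec[i] for vec in chunk_vectors) / count for i in range(dim)]
```

In practice each chunk would be encoded by a model such as BERT before pooling; the encoder is left out here so the sketch stays self-contained.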
Description
May 2021
School of Science
Publisher
Rensselaer Polytechnic Institute, Troy, NY