
    Long contextualized document embeddings for financial reports

    Author
    Rawte, Vipula Dattatrey
    View/Open
    180595_Rawte_rpi_0185N_11838.pdf (777.1Kb)
    Other Contributors
    Zaki, Mohammed J., 1971-; Gupta, Aparna; Szymański, Bolesław; Yener, Bülent, 1959-
    Date Issued
    2021-05
    Subject
    Computer science
    Degree
    MS
    Terms of Use
    This electronic version is a licensed copy owned by Rensselaer Polytechnic Institute, Troy, NY. Copyright of original work retained by author.
    URI
    https://hdl.handle.net/20.500.13015/2699
    Abstract
    With rapid advances in Natural Language Processing (NLP), Transformer-based techniques have been applied to a variety of downstream tasks such as question answering and text classification. However, a major challenge in applying these models to long texts is the limit on the maximum number of input tokens, typically 512. To address this issue, this thesis proposes three methods that handle documents longer than 512 tokens and capture the dynamic context of the text. The first method is a simple way to create a dynamic word list using the BERT model; however, it cannot capture new words missing from the original vocabulary. The second method therefore represents documents hierarchically by pooling word and sentence embeddings. Because this method incurs substantial computational overhead, the third method uses chunk-based document embeddings with chunk-level attention using BiLSTMs. We evaluate the methods on 10-K and 8-K company filing reports by predicting six key financial ratios: ROA, EPS, Leverage, Tobin's Q Ratio, Tier 1 Capital, and Z-Score. Our experimental results show that the word list method is comparable to the hand-curated finance sentiment word list. Additionally, the TF-IDF method using only words is a strong, competitive baseline and performs better than Transformer-based language models.
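The chunking idea behind the third method can be illustrated with a minimal sketch. This is not the thesis's implementation: `chunk_tokens`, `mean_pool`, and `toy_embed` are hypothetical names, `toy_embed` stands in for a real Transformer encoder, and plain mean pooling is swapped in for the chunk-level BiLSTM attention the abstract describes. It only shows how a document longer than 512 tokens can be split into chunks and combined into one fixed-size vector.

```python
from typing import List, Sequence

MAX_TOKENS = 512  # token limit mentioned in the abstract


def chunk_tokens(tokens: Sequence[str], max_len: int = MAX_TOKENS) -> List[List[str]]:
    """Split a token sequence into consecutive chunks of at most max_len tokens."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [list(tokens[i:i + max_len]) for i in range(0, len(tokens), max_len)]


def mean_pool(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average equal-length vectors element-wise.

    Assumption: a simple placeholder for learned chunk-level attention.
    """
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def toy_embed(chunk: Sequence[str]) -> List[float]:
    """Hypothetical stand-in for an encoder: two simple numeric features."""
    return [float(len(chunk)), sum(len(t) for t in chunk) / len(chunk)]


# A synthetic 1200-token "document" is split into three chunks.
tokens = [f"tok{i}" for i in range(1200)]
chunks = chunk_tokens(tokens)
doc_vector = mean_pool([toy_embed(c) for c in chunks])
print([len(c) for c in chunks])  # [512, 512, 176]
```

In practice the special tokens a model adds to each input count toward the limit, so the usable chunk length is usually slightly below 512; `max_len` can be lowered to account for that.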
    Description
    May 2021; School of Science
    Department
    Dept. of Computer Science
    Publisher
    Rensselaer Polytechnic Institute, Troy, NY
    Relationships
    Rensselaer Theses and Dissertations Online Collection
    Access
    Users may download and share copies with attribution in accordance with a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 License. No commercial use or derivatives are permitted without the explicit approval of the author.
    Collections
    • RPI Theses Online (Complete)
    • RPI Theses Open Access
