Multimodal machine learning for human conversational behavior analysis

Authors
Zhang, Lingyu
Other Contributors
Wang, Meng
Chen, Tianyi
Golden, Timothy D.
Radke, Richard J., 1974-
Issue Date
2021-08
Keywords
Electrical engineering
Degree
PhD
Terms of Use
This electronic version is a licensed copy owned by Rensselaer Polytechnic Institute (RPI), Troy, NY. Copyright of original work retained by author.
Abstract
Human behavior during interactive communication is complex, consisting of an interplay of data streams in multiple modalities, produced by multiple parties with different orderings and timings. This thesis focuses on multimodal machine learning for interactive human behavior analysis. We develop computational algorithms that combine computer vision, natural language processing, and audio signal features to automatically interpret and reason about social communicative behaviors. Our methodology involves building deep learning architectures and temporal sequential models to capture the intra- and inter-modality dependencies of human behavior cues.

The first part of this thesis predicts emergent leaders and dominant contributors in a group meeting scenario based on the frequency of certain action events, e.g., the percentage of time that a person is looked at by other participants. This requires estimating the visual focus of attention (VFOA) from frontal-facing videos, for which we developed a deep learning model that predicts visual target classes from eye gaze and head pose. This model was also incorporated into an automatic meeting summarization algorithm to provide an importance score for the sentences extracted into the summary.

To better model human interaction behavior in communication scenarios, the second part of the thesis addresses how multi-party co-occurrent visual events (e.g., how a person moves her body while being looked at by others) can be used to predict Big-Five personality traits. For this objective, we correlated the frequency of co-occurrent visual events with each personality class to understand the importance of different visual features, and then applied a machine learning approach to predict the personality traits of participants in a group meeting.

The third part of the thesis further models temporal interdependencies across different feature modalities, including visual, audio, and language. We developed an end-to-end multi-stream recurrent neural network (RNN) to integrate non-synchronized features within a given time window. The algorithm is applied to predict participants’ social roles (e.g., Protagonist, Supporter, Neutral, Gatekeeper, or Attacker), which can change frequently over the course of a group meeting.

Given long, untrimmed videos of human interaction, there are critical moments of particular interest, such as a short segment in which the team reaches a milestone or a time window in which two participants are in conflict. The fourth part of the thesis aims to locate the boundaries of these specific moments in the original video. For this objective, we constructed a multimodal moment localization framework that takes a natural language query and the video content as input and outputs the critical moment that semantically matches the query. We designed a temporal convolution module to explore the relationship between the meaning of the query and the interactions between neighboring video frames.
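As a rough illustration of the query-conditioned temporal convolution described in the fourth part, the following minimal PyTorch sketch fuses pre-extracted frame features with an encoded query and scores candidate start and end frames; the module names, feature dimensions, and fusion details are illustrative assumptions rather than the thesis implementation.

```python
import torch
import torch.nn as nn

class QueryConditionedLocalizer(nn.Module):
    """Sketch of query-conditioned temporal convolution for moment localization.

    Assumes pre-extracted per-frame visual features (B, T, d_vis) and a query
    given as word embeddings (B, L, d_word); all names and sizes are illustrative.
    """
    def __init__(self, d_vis=512, d_word=300, d_model=256, kernel_size=5):
        super().__init__()
        self.query_enc = nn.GRU(d_word, d_model, batch_first=True)
        self.vis_proj = nn.Linear(d_vis, d_model)
        # The temporal convolution mixes each frame with its neighbors so that
        # boundary scores depend on local frame-to-frame interactions.
        self.temporal_conv = nn.Conv1d(d_model, d_model, kernel_size,
                                       padding=kernel_size // 2)
        self.boundary_head = nn.Linear(d_model, 2)  # per-frame start/end scores

    def forward(self, frame_feats, word_embs):
        _, q = self.query_enc(word_embs)          # final hidden state (1, B, d)
        q = q.squeeze(0).unsqueeze(1)             # (B, 1, d)
        v = self.vis_proj(frame_feats)            # (B, T, d)
        fused = torch.relu(v * q)                 # query-conditioned frame features
        fused = self.temporal_conv(fused.transpose(1, 2)).transpose(1, 2)
        return self.boundary_head(fused)          # (B, T, 2) start/end logits

# Usage with random tensors standing in for real features:
model = QueryConditionedLocalizer()
scores = model(torch.randn(2, 120, 512), torch.randn(2, 12, 300))
start_frame = scores[..., 0].argmax(dim=1)
end_frame = scores[..., 1].argmax(dim=1)
```

The symmetric padding keeps the per-frame scores aligned with the original timeline, so the highest-scoring start and end positions can be read off directly as a candidate moment boundary.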
Moving beyond the estimation of specific social signals such as Big-Five personality traits or social-emotional roles, the final part of the thesis takes a more data-driven approach, in which we develop a multimodal deep learning model to automatically reason about human interactive behavior in a natural way. Specifically, instead of classifying manually defined categories of a single social signal, we focus on visual question answering, e.g., automatically answering questions like “How is the man who is not being blamed responding to the situation?” from a multiple-choice list. We developed a temporal attention-based model that highlights the critical moments in the video content and the keywords in the question sentence to better abstract the important information from the given multimodal materials, and we capture the cross-modal dependencies using a consistency-measuring module.
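In the same hedged spirit, the sketch below illustrates temporal attention over video frames and question words for multiple-choice answering, with a cosine similarity standing in for the cross-modal consistency measure; the feature sizes, the bilinear answer head, and the cosine term are assumptions for illustration, not the thesis architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAttentionVQA(nn.Module):
    """Sketch of temporal attention for multiple-choice video question answering.

    Assumes pre-extracted frame features (B, T, d_vis), question word embeddings
    (B, L, d_text), and K candidate answers of W words each (B, K, W, d_text).
    """
    def __init__(self, d_vis=512, d_text=300, d_model=256):
        super().__init__()
        self.vis_proj = nn.Linear(d_vis, d_model)
        self.txt_proj = nn.Linear(d_text, d_model)
        self.vis_attn = nn.Linear(d_model, 1)   # scores each video frame
        self.txt_attn = nn.Linear(d_model, 1)   # scores each question word
        self.answer_score = nn.Bilinear(2 * d_model, d_model, 1)

    @staticmethod
    def _attend(feats, attn_layer):
        # Softmax over the temporal/word axis, then a weighted sum.
        weights = F.softmax(attn_layer(feats), dim=1)   # (B, N, 1)
        return (weights * feats).sum(dim=1)             # (B, d_model)

    def forward(self, frame_feats, question_embs, answer_embs):
        v = self._attend(self.vis_proj(frame_feats), self.vis_attn)    # (B, d)
        q = self._attend(self.txt_proj(question_embs), self.txt_attn)  # (B, d)
        # Simple stand-in for a consistency measure: agreement between the
        # attended video summary and the attended question summary.
        consistency = F.cosine_similarity(v, q, dim=-1)                # (B,)
        context = torch.cat([v, q], dim=-1)                            # (B, 2d)
        a = self.txt_proj(answer_embs).mean(dim=2)                     # (B, K, d)
        B, K, _ = a.shape
        logits = self.answer_score(
            context.unsqueeze(1).expand(B, K, -1).reshape(B * K, -1),
            a.reshape(B * K, -1),
        ).view(B, K)
        return logits, consistency

# Usage: 2 clips, 120 frames, a 12-word question, 5 candidate answers of 8 words.
model = TemporalAttentionVQA()
logits, consistency = model(torch.randn(2, 120, 512), torch.randn(2, 12, 300),
                            torch.randn(2, 5, 8, 300))
predicted_choice = logits.argmax(dim=1)
```

In a full system the consistency score would typically enter the training objective as an auxiliary term rather than being read off at inference time.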
Description
August 2021
School of Engineering
Department
Dept. of Electrical, Computer, and Systems Engineering
Publisher
Rensselaer Polytechnic Institute, Troy, NY
Relationships
Rensselaer Theses and Dissertations Online Collection
Access
Restricted to current Rensselaer faculty, staff and students in accordance with the Rensselaer Standard license. Access inquiries may be directed to the Rensselaer Libraries.