Practical techniques for and applications of deep learning classification of long-term acoustic datasets

Authors
Morgan, Mallory Marie
Issue Date
2021-12
Type
Electronic thesis
Thesis
Language
en_US
Keywords
Architectural sciences
Abstract
Audio recordings are crucial to many fields of study because they can capture, for example, the emotion that plays a central role in human verbal communication, or the nuances of animal species interactions, especially the distinct vocalizations of species that serve as indicators of key environmental health factors. The multidimensionality of acoustic data, which contains both a frequency and a temporal component, makes it well suited to analysis by deep learning, an established tool for classifying large and complex datasets. In this dissertation, a variety of deep learning techniques are used to create a temporal classification record for several large acoustic datasets, and the detailed records of acoustic events that result are analyzed to advance research goals in various disciplines.

To collect the first acoustic dataset, a microphone array is deployed in a suburban area of Albany County, New York, for twelve months, resulting in over 8,000 hours of recordings. Concurrently, a second array is deployed in a rural area of Warren County, New York. As a proof of concept, a 19-class database containing a variety of biophony, geophony, and anthrophony is used to train a convolutional neural network on the suburban dataset, achieving an average class-weighted testing accuracy of 89%. The trained network is then used to generate a record of species-specific calling activity for the entire study period. The average accuracy of this prediction record is estimated at 85% using a temporally dispersed validation dataset, allowing it to be used to visualize the number of acoustic events of each sound class across each day, month, and year.

With a general methodology established, several shortcomings common to long-term acoustic monitoring studies are addressed. The chief limitation is that supervised deep learning techniques require vast quantities of labeled training data. To reduce the labeling effort associated with the rural dataset, semi-supervised and cross-corpus learning are implemented, with the latter reducing labeling times by 30%. A second known issue is that neural networks are rarely designed for the highly realistic case of open set classification, in which examples belonging to the training classes must not only be correctly classified but also separated from any spurious or unknown classes. To combat this reliance on closed set data, where the number of training classes is predefined, several open set classification frameworks are evaluated under multi-, single-, and cross-corpus training scenarios, all of which have been implemented in prominent acoustic scene classification challenges. Two different types of unknown data are also configured, each designed to highlight a challenge inherent to real-world classification tasks. A final shortcoming is that deep learning-focused publications often conclude with the classification itself, without illustrating why such classification is necessary in the first place. Therefore, the prediction record is applied, along with associated metadata, to understand how various abiotic factors (e.g., number of daylight hours, temperature, and weather events) impact acoustic communication. Novel pseudo-species richness and abundance distribution calculations confirm common understandings of ecological community structures through an auditory modality.
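To make the proof-of-concept step above concrete, the following is a minimal PyTorch sketch of a convolutional network classifying spectrogram clips into 19 sound classes. It is not the architecture used in the dissertation; the input size (1 channel x 128 frequency bins x 128 time frames), the layer widths, and the name SoundscapeCNN are illustrative assumptions.

    import torch
    import torch.nn as nn

    class SoundscapeCNN(nn.Module):
        """Small CNN over single-channel spectrogram inputs (illustrative only)."""
        def __init__(self, n_classes: int = 19):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),  # global average pooling -> (batch, 64, 1, 1)
            )
            self.classifier = nn.Linear(64, n_classes)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            h = self.features(x).flatten(1)  # (batch, 64)
            return self.classifier(h)        # raw logits, one per sound class

    model = SoundscapeCNN()
    clips = torch.randn(8, 1, 128, 128)  # dummy batch of 8 spectrogram clips
    logits = model(clips)                # shape (8, 19)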
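The open set evaluation described above can be illustrated with a common baseline, not necessarily one of the frameworks evaluated in the dissertation: reject a clip as unknown whenever its maximum softmax probability falls below a threshold tuned on validation data. The threshold of 0.5 below is an illustrative assumption.

    import torch
    import torch.nn.functional as F

    def open_set_predict(logits: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
        """Return predicted class indices, or -1 where a clip is rejected as unknown."""
        probs = F.softmax(logits, dim=1)
        confidence, predictions = probs.max(dim=1)
        # Low-confidence clips are treated as falling outside the known classes.
        predictions[confidence < threshold] = -1
        return predictions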
With an understanding of which methodologies increase the practicality, flexibility, and applicability of deep learning-based acoustic analysis, their interdisciplinary utility is demonstrated on an unrelated speech emotion audio dataset. Specifically, to study the dynamics of a collaborative decision-making task, a novel multimodal dataset consisting of 14 group meetings and 45 participants is collected. To establish a training database, each participant’s audio is annotated both categorically and along a three-dimensional scale with axes of activation, dominance, and valence. Cross-corpus training is successfully used to boost classification performance by 2%. Several neural network architectures for predicting speech emotion are compared on two tasks: categorical emotion classification and three-dimensional emotion regression, using previously piloted multi-task learning approaches. By regressing the annotated emotions against post-task questionnaire variables for each participant, the emotional speech content of a meeting is shown to predict perceived group leaders in 71% of meetings and major contributors in 86% of meetings. This work builds upon initial efforts that used low-level audiovisual metrics to predict leadership with 64% accuracy, indicating that a more semantically relevant metric, emotion, offers greater predictive power. The acoustic prediction record is also applied to derive correlations between speech emotion and gender, emotional intelligence, and the Big Five personality traits.
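One way to picture the multi-task setup described above is a shared encoder feeding two heads: one producing categorical emotion logits and one regressing the continuous activation, dominance, and valence dimensions. This is a hedged sketch, not the dissertation’s model; the feature dimension, class count, and loss weight alpha are illustrative assumptions.

    import torch
    import torch.nn as nn

    class MultiTaskEmotionNet(nn.Module):
        def __init__(self, n_features: int = 40, n_emotions: int = 6):
            super().__init__()
            self.encoder = nn.Sequential(  # shared representation for both tasks
                nn.Linear(n_features, 128), nn.ReLU(),
                nn.Linear(128, 64), nn.ReLU(),
            )
            self.emotion_head = nn.Linear(64, n_emotions)  # categorical logits
            self.adv_head = nn.Linear(64, 3)               # activation, dominance, valence

        def forward(self, x: torch.Tensor):
            h = self.encoder(x)
            return self.emotion_head(h), self.adv_head(h)

    def multitask_loss(logits, adv_pred, labels, adv_true, alpha: float = 0.5):
        # Joint objective: cross-entropy for the categorical task plus
        # mean-squared error for the dimensional task, weighted by alpha.
        ce = nn.functional.cross_entropy(logits, labels)
        mse = nn.functional.mse_loss(adv_pred, adv_true)
        return ce + alpha * mse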
Description
December 2021
School of Architecture
Publisher
Rensselaer Polytechnic Institute, Troy, NY