Authors
Andreas Triantafyllopoulos, Gil Keren, Johannes Wagner, Ingmar Steiner, Björn Schuller
Publication date
2019
Description
The use of deep learning (DL) architectures for speech enhancement has recently improved the robustness of voice applications under diverse noise conditions. These improvements are usually evaluated based on the perceptual quality of the enhanced audio or on the performance of automatic speech recognition (ASR) systems. We are interested instead in the usefulness of these algorithms in the field of speech emotion recognition (SER), and specifically in whether an enhancement architecture can effectively remove noise while preserving enough information for an SER algorithm to accurately identify emotion in speech. We first show how a scalable DL architecture can be trained to enhance audio signals in a large number of unseen environments, and go on to show how that can benefit common SER pipelines in terms of noise robustness. Our results show that incorporating a speech enhancement architecture is beneficial, especially for low signal-to-noise ratio (SNR) conditions.
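To make the "low SNR" condition concrete, the following minimal sketch shows one common way to mix a clean signal with noise at a chosen signal-to-noise ratio in decibels. It is an illustration only and is not taken from the paper: the function names `power`, `snr_db`, and `mix_at_snr` are hypothetical, and the two sinusoids are toy stand-ins for real speech and noise recordings.

```python
import math


def power(signal):
    """Return the mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def snr_db(speech, noise):
    """Return the speech-to-noise power ratio in decibels."""
    return 10.0 * math.log10(power(speech) / power(noise))


def mix_at_snr(speech, noise, target_snr_db):
    """Scale the noise so the mixture has the target SNR, then add it to the speech.

    Returns the noisy mixture and the scaled noise (hypothetical helper,
    not the authors' data pipeline).
    """
    # Solve power(speech) / (gain^2 * power(noise)) = 10^(target / 10) for gain.
    ratio = 10.0 ** (target_snr_db / 10.0)
    gain = math.sqrt(power(speech) / (power(noise) * ratio))
    scaled_noise = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled_noise)]
    return mixture, scaled_noise


# Toy signals: one second at an assumed 16 kHz sampling rate.
RATE = 16000
speech = [math.sin(2 * math.pi * 220 * t / RATE) for t in range(RATE)]
noise = [0.5 * math.sin(2 * math.pi * 1733 * t / RATE + 0.3) for t in range(RATE)]

# A 0 dB mixture has equal speech and noise power; -5 dB has more noise than speech.
mixture_0db, noise_0db = mix_at_snr(speech, noise, 0.0)
mixture_m5db, noise_m5db = mix_at_snr(speech, noise, -5.0)
```

Lower target values produce harder conditions, which is the regime where the abstract reports the largest benefit from enhancement.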
Total citations
[Chart: citations per year, 2019–2024]