Authors
Nicholas Cummins, Shahin Amiriparian, Gerhard Hagerer, Anton Batliner, Stefan Steidl, Björn W Schuller
Publication date
2017/10/19
Book
Proceedings of the 25th ACM international conference on Multimedia
Pages
478-484
Description
The outputs of the higher layers of deep pre-trained convolutional neural networks (CNNs) have consistently been shown to provide a rich representation of an image for use in recognition tasks. This study explores the suitability of such an approach for speech-based emotion recognition tasks. First, we detail a new acoustic feature representation, denoted as deep spectrum features, derived from feeding spectrograms through a very deep image classification CNN and forming a feature vector from the activations of the last fully connected layer. We then compare the performance of our novel features with standardised brute-force and bag-of-audio-words (BoAW) acoustic feature representations for 2- and 5-class speech-based emotion recognition in clean, noisy and denoised conditions. The presented results show that image-based approaches are a promising avenue of research for speech-based recognition …
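The description above outlines the deep spectrum pipeline: render a spectrogram of the speech signal, pass it through a pre-trained image classification CNN, and use the activations of the last fully connected layer as the acoustic feature vector. The following is a minimal sketch of that idea, assuming a mel-spectrogram rendered with librosa/matplotlib and an ImageNet-pretrained VGG16 from torchvision; the network choice, spectrogram settings, and helper names are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch: "deep spectrum"-style features from a pre-trained image CNN.
# All settings below (VGG16, 128 mel bands, 224x224 input) are assumptions
# for illustration, not the configuration reported in the paper.
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

def spectrogram_image(wav_path, sr=16000):
    """Render a mel-spectrogram of one utterance as an RGB image."""
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    mel_db = librosa.power_to_db(mel, ref=np.max)
    fig, ax = plt.subplots(figsize=(2.24, 2.24), dpi=100)
    ax.axis("off")
    librosa.display.specshow(mel_db, sr=sr, ax=ax)
    fig.canvas.draw()
    img = np.asarray(fig.canvas.buffer_rgba())[..., :3]  # drop alpha channel
    plt.close(fig)
    return Image.fromarray(img)

# Pre-trained image CNN; the 1000-way classification layer is removed so the
# forward pass returns the activations of the last hidden fully connected
# layer (4096-dimensional in VGG16).
cnn = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
cnn.classifier = torch.nn.Sequential(*list(cnn.classifier.children())[:-1])
cnn.eval()

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def deep_spectrum_features(wav_path):
    """Return a fixed-length feature vector for one utterance."""
    x = preprocess(spectrogram_image(wav_path)).unsqueeze(0)
    with torch.no_grad():
        feats = cnn(x)
    return feats.squeeze(0).numpy()  # feed this vector to an emotion classifier
```

In this reading, the utterance-level vector produced by deep_spectrum_features would be used in place of brute-force or BoAW acoustic features as input to a conventional classifier (e.g. an SVM) for the 2- and 5-class emotion tasks.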
Total citations
[Citations-per-year chart, 2017–2024]
Scholar articles
N Cummins, S Amiriparian, G Hagerer, A Batliner… - Proceedings of the 25th ACM international conference …, 2017