Authors
Saurabh Sahu, Vikramjit Mitra, Nadee Seneviratne, Carol Y Espy-Wilson
Publication date
2019/9
Conference
Interspeech
Pages
3302-3306
Description
In this paper, we leverage multi-modal learning and automated speech recognition (ASR) systems toward building a speech-only emotion recognition model. Previous studies have shown that emotion recognition models using only acoustic features do not perform satisfactorily in detecting valence level. Text analysis has been shown to be helpful for sentiment classification. We compare the classification accuracies obtained from an audio-only model, a text-only model, and a multi-modal system leveraging both, by performing a cross-validation analysis on the IEMOCAP dataset. Confusion matrices show that it is valence-level detection that improves when textual information is incorporated. In the second stage of experiments, we use two ASR application programming interfaces (APIs) to obtain transcriptions. We compare the performance of multi-modal systems using the ASR transcriptions with each other and with that of a system using ground-truth transcriptions, and we analyze the confusion matrices to determine the effect of using ASR transcriptions instead of ground-truth ones on class-wise accuracies. Finally, we investigate the generalisability of such a model by performing a cross-corpus study.
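As a rough illustration of the comparison described in the abstract, the hypothetical Python sketch below trains an audio-only, a text-only, and a simple feature-concatenation (fused) classifier and reports class-wise recall from their confusion matrices. The synthetic data, feature dimensions, and logistic-regression classifier are illustrative assumptions only, not the paper's actual pipeline or features.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d_audio, d_text = 600, 40, 100            # synthetic stand-ins for IEMOCAP-style data
X_audio = rng.normal(size=(n, d_audio))      # placeholder acoustic features
X_text = rng.normal(size=(n, d_text))        # placeholder transcript features (e.g., embeddings)
y = rng.integers(0, 4, size=n)               # 4 emotion classes, e.g., angry/happy/neutral/sad

def class_wise_recall(X, y):
    # Train a simple classifier and return per-class recall from its confusion matrix.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    cm = confusion_matrix(y_te, clf.predict(X_te))
    return cm.diagonal() / cm.sum(axis=1)

print("audio-only:", np.round(class_wise_recall(X_audio, y), 2))
print("text-only: ", np.round(class_wise_recall(X_text, y), 2))
print("fused:     ", np.round(class_wise_recall(np.hstack([X_audio, X_text]), y), 2))

Comparing the per-class recall vectors in this way is one simple means of checking which classes (e.g., those differing mainly in valence) benefit from adding textual information.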