Best Paper Award

Speech Emotion Recognition by Combining Multiple Discriminators with Multiple Temporal Resolution Features [IEICE TRANS. INF. & SYST., VOL. J105–D, NO. 3, MARCH 2022]

Hiroshi FUJIMURA

Speech emotion recognition is a technique for identifying human emotion from input speech. In call center operations, it is applied to analyzing customer satisfaction and the retention risk of operators. Recently, the performance of speech recognition has improved dramatically with the development of deep learning, and it has become important not only to transcribe speech but also to identify nonverbal information such as emotion.

This paper proposes a novel emotion recognition method that extracts spectral features with multiple window lengths and integrates them using boosting. The proposed method can adaptively select window lengths appropriate to each emotion. The extracted spectral features are converted into hidden representations by deep learning models and then integrated via boosting to produce the emotion label. As the boosting algorithm, the authors use a gradient boosting decision tree to cope with the limited amount of training data. To integrate the features effectively, the authors also propose a median-value feature that represents, for each dimension, a comparison with the median value. Through comprehensive experiments on the EmoDB and RAVDESS databases, the authors demonstrated the effectiveness of the proposed method, which achieved state-of-the-art performance on RAVDESS. The experiments also suggest that the appropriate window length depends on the language. Furthermore, the authors present a detailed analysis for each emotion label, which is a valuable contribution to the community.
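As a rough illustration of two ideas mentioned above, the following sketch extracts log-magnitude spectra of one signal at several window lengths and then builds a median comparison feature. It is not the authors' implementation: the function names, the 50% frame overlap, the Hann window, and the reading of the median-value feature as a binary "above the per-dimension median" indicator are all assumptions made for this example. The deep learning models and the gradient boosting decision tree are left out.

```python
import cmath
import math
from statistics import median


def frame_signal(signal, win_len, hop):
    """Split a 1-D signal into overlapping frames of win_len samples."""
    return [signal[i:i + win_len]
            for i in range(0, len(signal) - win_len + 1, hop)]


def log_magnitude_spectrum(frame):
    """Hann-windowed log-magnitude spectrum via a naive DFT (illustrative only)."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(windowed))
        spectrum.append(math.log(abs(coeff) + 1e-10))  # small offset avoids log(0)
    return spectrum


def multi_resolution_features(signal, win_lens):
    """Return {win_len: per-frame spectra}; hop of win_len // 2 is an assumption."""
    return {w: [log_magnitude_spectrum(f)
                for f in frame_signal(signal, w, w // 2)]
            for w in win_lens}


def median_comparison(vectors):
    """Assumed form of a median-value feature: 1 if a value exceeds the
    median of its dimension across all vectors, else 0."""
    medians = [median(col) for col in zip(*vectors)]
    return [[1 if v > m else 0 for v, m in zip(vec, medians)]
            for vec in vectors]
```

Calling `multi_resolution_features(signal, win_lens=(32, 64))` on a 256-sample signal gives one set of spectra per window length; shorter windows yield more frames with fewer frequency bins, which is the time and frequency trade-off that a multi-resolution approach exploits.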

As the authors note in the paper, this method can be combined with convolutional neural network-based methods for further performance improvement. For these reasons, this paper is worthy of the IEICE Best Paper Award.