(英) |
In this paper, we propose a multimodal modeling framework to detect important utterances
in human-robot interview dialogue.
An important utterance is defined as (1) an utterance spoken more actively and positively than the other utterances, or (2) an utterance that contains key content for summarizing the whole interview.
Multimodal features, including spoken words, prosody, gesture, and posture, are effective for capturing a participant's active and positive attitude during the interview. In many cases, such important utterances are observed consecutively over a certain duration, because
participants tend to maintain an active attitude while answering questions about a topic they are interested in.
Therefore, time-series features are also effective for recognizing important utterances.
The multimodal and time-series features are fused using a linear SVM.
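As a rough illustration of this fusion step (a minimal sketch with placeholder feature dimensions and random data, not the authors' implementation), per-utterance unimodal vectors can be concatenated together with a simple previous-utterance context vector and classified with scikit-learn's LinearSVC:

    import numpy as np
    from sklearn.svm import LinearSVC

    # Placeholder data: one feature vector per modality for each utterance.
    rng = np.random.default_rng(0)
    n_utt = 200
    lexical = rng.normal(size=(n_utt, 50))   # e.g. word-based features
    prosody = rng.normal(size=(n_utt, 10))   # e.g. F0 / power statistics
    gesture = rng.normal(size=(n_utt, 8))    # e.g. hand/head motion statistics
    posture = rng.normal(size=(n_utt, 6))    # e.g. body-lean features
    labels = rng.integers(0, 2, size=n_utt)  # 1 = important utterance

    # Fuse modalities per utterance, then append the previous utterance's
    # fused vector as a simple time-series (context) feature.
    fused = np.concatenate([lexical, prosody, gesture, posture], axis=1)
    context = np.vstack([np.zeros_like(fused[:1]), fused[:-1]])
    X = np.concatenate([fused, context], axis=1)

    clf = LinearSVC(C=1.0)
    clf.fit(X, labels)
    print("training accuracy:", clf.score(X, labels))
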
Experimental results show that the recognition accuracy of the proposed model with multimodal and time-series features was 68%, an improvement of 11 points over the best unimodal model (57%). |