Presentation 2017-06-22
Comparisons on Transplant Emotional Expressions in DNN-based TTS Synthesis
Katsuki Inoue, Sunao Hara, Masanobu Abe, Nobukatsu Hojo, Yusuke Ijima
Abstract(in Japanese) (See Japanese page)
Abstract(in English) Recent studies have shown that DNN-based speech synthesis can generate more natural synthesized speech than conventional HMM-based speech synthesis. Several studies have proposed emotion transplantation methods to diversify synthesized speech in HMM-based speech synthesis. However, whether emotion can be transplanted in DNN-based speech synthesis has not yet been shown. In this paper, we compare DNN architectures for transplanting emotional expressions in order to improve the expressiveness of DNN-based TTS synthesis. The following three DNN architectures are examined. (1) Parallel Model: an output layer consisting of both speaker-dependent layers and emotion-dependent layers. (2) Serial Model: an output layer consisting of emotion-dependent layers preceded by speaker-dependent layers. (3) Auxiliary Input Model: an input layer consisting of a speaker ID and an emotion ID as well as linguistic feature vectors. The DNNs were trained on neutral speech uttered by 24 speakers, and on joyful and sad speech uttered by 3 of the 24 speakers. The DNNs were compared by objective and subjective evaluations. When synthesizing an unseen emotion, the evaluation results showed that the Parallel Model is much better than the Serial Model and slightly better than the Auxiliary Input Model. The tests also showed that the Serial Model is the best of the three when synthesizing a seen emotion.
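The wiring of the three architectures described in the abstract can be illustrated schematically. The sketch below is not the authors' code; all function names are hypothetical, the shared hidden layers are a placeholder, and the "combination" of the parallel branches is assumed to be an element-wise sum. It only shows where the speaker-dependent and emotion-dependent parameters sit in each topology.

```python
# Hypothetical sketch of the three DNN topologies compared in the paper.
# Layers are modeled as plain callables; real models would use trained
# weight matrices and nonlinearities.

def shared_hidden(x):
    """Shared hidden layers common to all speakers/emotions (placeholder)."""
    return x

def parallel_model(x, spk, emo, spk_layers, emo_layers):
    """Parallel Model: speaker-dependent and emotion-dependent output
    layers sit side by side; their outputs are combined (assumed: sum)."""
    h = shared_hidden(x)
    return [a + b for a, b in zip(spk_layers[spk](h), emo_layers[emo](h))]

def serial_model(x, spk, emo, spk_layers, emo_layers):
    """Serial Model: the speaker-dependent layer feeds into the
    emotion-dependent layer at the output."""
    h = shared_hidden(x)
    return emo_layers[emo](spk_layers[spk](h))

def one_hot(i, n):
    """One-hot code for a speaker or emotion ID."""
    return [1.0 if j == i else 0.0 for j in range(n)]

def aux_input_model(x, spk, emo, net, n_spk=24, n_emo=3):
    """Auxiliary Input Model: a single shared network; speaker and emotion
    IDs are appended to the linguistic feature vector at the input."""
    return net(x + one_hot(spk, n_spk) + one_hot(emo, n_emo))
```

Under this sketch, the Parallel Model keeps the emotion branch independent of any one speaker's output layer, which is consistent with its advantage when synthesizing an emotion unseen for the target speaker; the Serial Model specializes the output stack jointly, matching its strength on seen speaker-emotion pairs.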
Keyword(in Japanese) (See Japanese page)
Keyword(in English) speech synthesis / deep neural network / emotional transplantation / multi-task learning
Paper # PRMU2017-29, SP2017-5
Date of Issue 2017-06-15 (PRMU, SP)

Conference Information
Committee PRMU / SP
Conference Date 2017/6/22 (2 days)
Place (in Japanese) (See Japanese page)
Place (in English)
Topics (in Japanese) (See Japanese page)
Topics (in English)
Chair Shinichi Sato(NII) / Yoichi Yamashita(Ritsumeikan Univ.)
Vice Chair Hironobu Fujiyoshi(Chubu Univ.) / Yoshihisa Ijiri(Omron) / Hiroki Mori(Utsunomiya Univ.)
Secretary Hironobu Fujiyoshi(AIST) / Yoshihisa Ijiri(NAIST) / Hiroki Mori(Shizuoka Univ.)
Assistant Masato Ishii(NEC) / Yusuke Sugano(Osaka Univ.) / Kei Hashimoto(Nagoya Inst. of Tech.) / Satoshi Kobashikawa(NTT)

Paper Information
Registration To Technical Committee on Pattern Recognition and Media Understanding / Technical Committee on Speech
Language JPN
Title (in Japanese) (See Japanese page)
Sub Title (in Japanese) (See Japanese page)
Title (in English) Comparisons on Transplant Emotional Expressions in DNN-based TTS Synthesis
Sub Title (in English)
Keyword(1) speech synthesis
Keyword(2) deep neural network
Keyword(3) emotional transplantation
Keyword(4) multi-task learning
1st Author's Name Katsuki Inoue
1st Author's Affiliation Okayama University (Okayama Univ.)
2nd Author's Name Sunao Hara
2nd Author's Affiliation Okayama University (Okayama Univ.)
3rd Author's Name Masanobu Abe
3rd Author's Affiliation Okayama University (Okayama Univ.)
4th Author's Name Nobukatsu Hojo
4th Author's Affiliation Nippon Telegraph and Telephone Corporation (NTT)
5th Author's Name Yusuke Ijima
5th Author's Affiliation Nippon Telegraph and Telephone Corporation (NTT)
Date 2017-06-22
Paper # PRMU2017-29, SP2017-5
Volume (vol) vol.117
Number (no) no.105 (PRMU), no.106 (SP)
Page pp.23-28 (PRMU), pp.23-28 (SP)
#Pages 6
Date of Issue 2017-06-15 (PRMU, SP)