Multi-stream FC-HiFi-GAN: Fast Neural Vocoder Model Using Learnable Lightweight Upsampling

Presentation	2023-02-28 Multi-stream FC-HiFi-GAN: Fast Neural Vocoder Model Using Learnable Lightweight Upsampling Haruki Yamashita, Takuma Okamoto, Ryoichi Takashima, Tetsuya Takiguchi, Tomoki Toda, Hisashi Kawai
Abstract(in Japanese)	(See Japanese page)
Abstract(in English)	In recent years, text-to-speech (TTS) synthesis has been required to improve inference speed while maintaining quality. Multi-stream (MS) iSTFT-HiFiGAN was proposed as a faster variant of HiFi-GAN, a vocoder capable of inferring waveforms on a single CPU. In a TTS task using VITS, it increased speed by about 4 times, at the cost of some deterioration in sound quality. In this paper, we propose MS-FC-HiFi-GAN, in which the inverse short-time Fourier transform (iSTFT) part is replaced with a trainable fully connected layer in order to improve the synthesis quality of MS-iSTFT-HiFiGAN. The inference speed, measured as real-time factor (RTF), was 0.15 on 1 CPU, the same as MS-iSTFT-HiFiGAN. Synthesis quality was inferior to that of MS-iSTFT-HiFiGAN in the TTS task but superior in the analysis/synthesis task.
Keyword(in Japanese)	(See Japanese page)
Keyword(in English)	speech synthesis / Neural Vocoder / HiFi-GAN / Text-to-Speech / Analysis Synthesis
Paper #	EA2022-76,SIP2022-120,SP2022-40
Date of Issue	2023-02-21 (EA, SIP, SP)

Conference Information
Committee	SP / IPSJ-SLP / EA / SIP
Conference Date	2023/2/28(2days)
Place (in Japanese)	(See Japanese page)
Place (in English)
Topics (in Japanese)	(See Japanese page)
Topics (in English)
Chair	Tomoki Toda(Nagoya Univ.) / Tomoki Toda(Nagoya Univ.) / Kenichi Furuya(Oita Univ.) / Toshihisa Tanaka(Tokyo Univ. Agri.&Tech.)
Vice Chair	/ / Tatsuya Kako(NTT) / Junki Ono(Tokyo Metropolitan Univ.) / Koichi Ichige(Yokohama National Univ.) / Takayuki Nakachi(Ryukyu Univ.)
Secretary	(NTT) / (Univ. of Electro-Comm.) / Tatsuya Kako(NTT) / Junki Ono(Univ. of Electro-Comm.) / Koichi Ichige(NTT) / Takayuki Nakachi(Ritsumeikan Univ.)
Assistant	Ryo Aihara(Mitsubishi Electric) / Daisuke Saito(Univ. of Tokyo) / Ryo Aihara(Mitsubishi Electric) / Daisuke Saito(Univ. of Tokyo) / Masato Nakayama(Osaka Sangyo Univ.) / Kouhei Yatabe(TUAT) / Taichi Yoshida(UEC) / Shoko Imaizumi(Chiba Univ.)

Paper Information
Registration To	Technical Committee on Speech / Special Interest Group on Spoken Language Processing / Technical Committee on Engineering Acoustics / Technical Committee on Signal Processing
Language	JPN
Title (in Japanese)	(See Japanese page)
Sub Title (in Japanese)	(See Japanese page)
Title (in English)	Multi-stream FC-HiFi-GAN: Fast Neural Vocoder Model Using Learnable Lightweight Upsampling
Sub Title (in English)
Keyword(1)	speech synthesis
Keyword(2)	Neural Vocoder
Keyword(3)	HiFi-GAN
Keyword(4)	Text-to-Speech
Keyword(5)	Analysis Synthesis
1st Author's Name	Haruki Yamashita
1st Author's Affiliation	Kobe University/National Institute of Information and Communications Technology(Kobe Univ/NICT)
2nd Author's Name	Takuma Okamoto
2nd Author's Affiliation	National Institute of Information and Communications Technology(NICT)
3rd Author's Name	Ryoichi Takashima
3rd Author's Affiliation	Kobe University(Kobe Univ)
4th Author's Name	Tetsuya Takiguchi
4th Author's Affiliation	Kobe University(Kobe Univ)
5th Author's Name	Tomoki Toda
5th Author's Affiliation	Nagoya University/National Institute of Information and Communications Technology(Nagoya Univ/NICT)
6th Author's Name	Hisashi Kawai
6th Author's Affiliation	National Institute of Information and Communications Technology(NICT)
Date	2023-02-28
Paper #	EA2022-76,SIP2022-120,SP2022-40
Volume (vol)	vol.122
Number (no)	EA-387,SIP-388,SP-389
Page	pp.7-12(EA), pp.7-12(SIP), pp.7-12(SP)
#Pages	6
Date of Issue	2023-02-21 (EA, SIP, SP)