Presentation 2023-02-28
Multi-stream FC-HiFi-GAN: Fast Neural Vocoder Model Using Learnable Lightweight Upsampling
Haruki Yamashita, Takuma Okamoto, Ryoichi Takashima, Tetsuya Takiguchi, Tomoki Toda, Hisashi Kawai
Abstract(in Japanese) (See Japanese page)
Abstract(in English) In recent years, text-to-speech (TTS) synthesis has been required to improve inference speed while maintaining quality. Multi-stream (MS) iSTFT-HiFiGAN was proposed as a high-speed variant of HiFi-GAN, a vocoder capable of inferring waveforms on a single CPU. In a TTS task using VITS, it achieved a speedup of about 4 times, although with some deterioration in sound quality. In this paper, we propose MS-FC-HiFi-GAN, in which the inverse short-time Fourier transform (iSTFT) part is replaced with a trainable fully connected layer, with the aim of improving the synthesis quality of MS-iSTFT-HiFiGAN. As for inference speed, the real-time factor (RTF) was 0.15 on one CPU, the same as that of MS-iSTFT-HiFiGAN. The synthesis quality was inferior to that of MS-iSTFT-HiFiGAN in the TTS task, but superior to it in the analysis/synthesis task.
Keyword(in Japanese) (See Japanese page)
Keyword(in English) speech synthesis / Neural Vocoder / HiFi-GAN / Text-to-Speech / Analysis Synthesis
Paper # EA2022-76,SIP2022-120,SP2022-40
Date of Issue 2023-02-21 (EA, SIP, SP)

Conference Information
Committee SP / IPSJ-SLP / EA / SIP
Conference Date 2023/2/28(2days)
Place (in Japanese) (See Japanese page)
Place (in English)
Topics (in Japanese) (See Japanese page)
Topics (in English)
Chair Tomoki Toda(Nagoya Univ.) / Tomoki Toda(Nagoya Univ.) / Kenichi Furuya(Oita Univ.) / Toshihisa Tanaka(Tokyo Univ. Agri.&Tech.)
Vice Chair / / Tatsuya Kako(NTT) / Junki Ono(Tokyo Metropolitan Univ.) / Koichi Ichige(Yokohama National Univ.) / Takayuki Nakachi(Ryukyu Univ.)
Secretary (NTT) / (Univ. of Electro-Comm.) / Tatsuya Kako(NTT) / Junki Ono(Univ. of Electro-Comm.) / Koichi Ichige(NTT) / Takayuki Nakachi(Ritsumeikan Univ.)
Assistant Ryo Aihara(Mitsubishi Electric) / Daisuke Saito(Univ. of Tokyo) / Ryo Aihara(Mitsubishi Electric) / Daisuke Saito(Univ. of Tokyo) / Masato Nakayama(Osaka Sangyo Univ.) / Kouhei Yatabe(TUAT) / Taichi Yoshida(UEC) / Shoko Imaizumi(Chiba Univ.)

Paper Information
Registration To Technical Committee on Speech / Special Interest Group on Spoken Language Processing / Technical Committee on Engineering Acoustics / Technical Committee on Signal Processing
Language JPN
Title (in Japanese) (See Japanese page)
Sub Title (in Japanese) (See Japanese page)
Title (in English) Multi-stream FC-HiFi-GAN: Fast Neural Vocoder Model Using Learnable Lightweight Upsampling
Sub Title (in English)
Keyword(1) speech synthesis
Keyword(2) Neural Vocoder
Keyword(3) HiFi-GAN
Keyword(4) Text-to-Speech
Keyword(5) Analysis Synthesis
1st Author's Name Haruki Yamashita
1st Author's Affiliation Kobe University/National Institute of Information and Communications Technology(Kobe Univ/NICT)
2nd Author's Name Takuma Okamoto
2nd Author's Affiliation National Institute of Information and Communications Technology(NICT)
3rd Author's Name Ryoichi Takashima
3rd Author's Affiliation Kobe University(Kobe Univ)
4th Author's Name Tetsuya Takiguchi
4th Author's Affiliation Kobe University(Kobe Univ)
5th Author's Name Tomoki Toda
5th Author's Affiliation Nagoya University/National Institute of Information and Communications Technology(Nagoya Univ/NICT)
6th Author's Name Hisashi Kawai
6th Author's Affiliation National Institute of Information and Communications Technology(NICT)
Date 2023-02-28
Paper # EA2022-76,SIP2022-120,SP2022-40
Volume (vol) vol.122
Number (no) EA-387,SIP-388,SP-389
Page pp.7-12(EA), pp.7-12(SIP), pp.7-12(SP)
#Pages 6
Date of Issue 2023-02-21 (EA, SIP, SP)