Presentation 2022-11-30
Semi-supervised joint training of text to speech and automatic speech recognition using unpaired text data
Naoki Makishima, Satoshi Suzuki, Atsushi Ando, Ryo Masumura
Abstract(in Japanese) (See Japanese page)
Abstract(in English) This paper presents a novel joint training method for text to speech (TTS) and automatic speech recognition (ASR) that uses a small amount of speech-text paired data and a large amount of unpaired text data. In conventional cycle-consistency-based methods, the TTS model and the ASR model are trained so that the text obtained by synthesizing speech from text data and then recognizing the synthesized speech matches the original text. However, this approach overfits the synthesized speech to the ASR model, yielding synthesized speech that (1) lacks speaker characteristics and (2) is easily recognizable by the ASR model. This problem not only degrades the quality of the synthesized speech but also limits the improvement in speech recognition performance. To solve this problem, we propose a learning method based on (1) a speaker consistency loss and (2) step-wise optimization. Experimental results demonstrate the efficacy of the proposed method.
Keyword(in Japanese) (See Japanese page)
Keyword(in English) automatic speech recognition / text to speech / semi-supervised learning
Paper # NLC2022-14,SP2022-34
Date of Issue 2022-11-22 (NLC, SP)

Conference Information
Committee NLC / IPSJ-NL / SP / IPSJ-SLP
Conference Date 2022/11/29(3days)
Place (in Japanese) (See Japanese page)
Place (in English)
Topics (in Japanese) (See Japanese page)
Topics (in English)
Chair Mitsuo Yoshida(Univ. of Tsukuba) / Katsuhito Sudoh(Nara Institute of Science and Technology) / Tomoki Toda(Nagoya Univ.) / Tomoki Toda(Nagoya Univ.)
Vice Chair Hiroki Sakaji(Univ. of Tokyo) / Takeshi Kobayakawa(NHK)
Secretary Hiroki Sakaji(NTT) / Takeshi Kobayakawa(Hiroshima Univ. of Economics) / (Denso IT Laboratory, Inc.) / (Hokkai-Gakuen Univ.) / (Tokyo Univ. of Agriculture and Technology)
Assistant Kanjin Takahashi(Sansan) / Yasuhiro Ogawa(Nagoya Univ.) / / Ryo Aihara(Mitsubishi Electric) / Daisuke Saito(Univ. of Tokyo)

Paper Information
Registration To Technical Committee on Natural Language Understanding and Models of Communication / Special Interest Group on Natural Language / Technical Committee on Speech / Special Interest Group on Spoken Language Processing
Language JPN
Title (in Japanese) (See Japanese page)
Sub Title (in Japanese) (See Japanese page)
Title (in English) Semi-supervised joint training of text to speech and automatic speech recognition using unpaired text data
Sub Title (in English)
Keyword(1) automatic speech recognition
Keyword(2) text to speech
Keyword(3) semi-supervised learning
1st Author's Name Naoki Makishima
1st Author's Affiliation NIPPON TELEGRAPH AND TELEPHONE CORPORATION(NTT)
2nd Author's Name Satoshi Suzuki
2nd Author's Affiliation NIPPON TELEGRAPH AND TELEPHONE CORPORATION(NTT)
3rd Author's Name Atsushi Ando
3rd Author's Affiliation NIPPON TELEGRAPH AND TELEPHONE CORPORATION(NTT)
4th Author's Name Ryo Masumura
4th Author's Affiliation NIPPON TELEGRAPH AND TELEPHONE CORPORATION(NTT)
Date 2022-11-30
Paper # NLC2022-14,SP2022-34
Volume (vol) vol.122
Number (no) NLC-287,SP-288
Page pp.27-32(NLC), pp.27-32(SP)
#Pages 6
Date of Issue 2022-11-22 (NLC, SP)