Presentation | 2022-11-30 Semi-supervised joint training of text to speech and automatic speech recognition using unpaired text data. Naoki Makishima, Satoshi Suzuki, Atsushi Ando, Ryo Masumura |
---|---|
Abstract(in Japanese) | (See Japanese page) |
Abstract(in English) | This paper presents a novel method for jointly training text to speech (TTS) and automatic speech recognition (ASR) models using a small amount of paired speech-text data and a large amount of unpaired text data. In conventional cycle-consistency-based methods, the TTS model and the ASR model are trained so that the text obtained by synthesizing speech from text data and then recognizing that synthesized speech matches the original text. However, this approach causes the synthesized speech to overfit to the ASR model, yielding synthesized speech that (1) lacks speaker characteristics and (2) is easily recognizable. This problem not only degrades the quality of the synthesized speech but also limits the improvement in speech recognition performance. To solve this problem, we propose a learning method based on (1) a speaker consistency loss and (2) step-wise optimization. Experimental results demonstrate the efficacy of the proposed method. |
Keyword(in Japanese) | (See Japanese page) |
Keyword(in English) | automatic speech recognition / text to speech / semi-supervised learning |
Paper # | NLC2022-14,SP2022-34 |
Date of Issue | 2022-11-22 (NLC, SP) |
Conference Information | |
Committee | NLC / IPSJ-NL / SP / IPSJ-SLP |
---|---|
Conference Date | 2022/11/29 (3 days) |
Place (in Japanese) | (See Japanese page) |
Place (in English) | |
Topics (in Japanese) | (See Japanese page) |
Topics (in English) | |
Chair | Mitsuo Yoshida(Univ. of Tsukuba) / Katsuhito Sudoh(Nara Institute of Science and Technology) / Tomoki Toda(Nagoya Univ.) / Tomoki Toda(Nagoya Univ.) |
Vice Chair | Hiroki Sakaji(Univ. of Tokyo) / Takeshi Kobayakawa(NHK) |
Secretary | Hiroki Sakaji(NTT) / Takeshi Kobayakawa(Hiroshima Univ. of Economics) / (Denso IT Laboratory, Inc.) / (Hokkai-Gakuen Univ.) / (Tokyo Univ. of Agriculture and Technology) |
Assistant | Kanjin Takahashi(Sansan) / Yasuhiro Ogawa(Nagoya Univ.) / / Ryo Aihara(Mitsubishi Electric) / Daisuke Saito(Univ. of Tokyo) |
Paper Information | |
Registration To | Technical Committee on Natural Language Understanding and Models of Communication / Special Interest Group on Natural Language / Technical Committee on Speech / Special Interest Group on Spoken Language Processing |
---|---|
Language | JPN |
Title (in Japanese) | (See Japanese page) |
Sub Title (in Japanese) | (See Japanese page) |
Title (in English) | Semi-supervised joint training of text to speech and automatic speech recognition using unpaired text data |
Sub Title (in English) | |
Keyword(1) | automatic speech recognition |
Keyword(2) | text to speech |
Keyword(3) | semi-supervised learning |
1st Author's Name | Naoki Makishima |
1st Author's Affiliation | NIPPON TELEGRAPH AND TELEPHONE CORPORATION(NTT) |
2nd Author's Name | Satoshi Suzuki |
2nd Author's Affiliation | NIPPON TELEGRAPH AND TELEPHONE CORPORATION(NTT) |
3rd Author's Name | Atsushi Ando |
3rd Author's Affiliation | NIPPON TELEGRAPH AND TELEPHONE CORPORATION(NTT) |
4th Author's Name | Ryo Masumura |
4th Author's Affiliation | NIPPON TELEGRAPH AND TELEPHONE CORPORATION(NTT) |
Date | 2022-11-30 |
Paper # | NLC2022-14,SP2022-34 |
Volume (vol) | vol.122 |
Number (no) | NLC-287,SP-288 |
Page | pp.27-32(NLC), pp.27-32(SP) |
#Pages | 6 |
Date of Issue | 2022-11-22 (NLC, SP) |