Presentation | 2020-12-02. Fast End-to-End Speech Recognition with CTC and Mask Predict. Yosuke Higuchi, Hirofumi Inaguma, Shinji Watanabe, Tetsuji Ogawa, Tetsunori Kobayashi |
---|---|
Abstract(in Japanese) | (See Japanese page) |
Abstract(in English) | We present a fast non-autoregressive (NAR) end-to-end automatic speech recognition (E2E-ASR) framework, which generates a sequence by refining the outputs of connectionist temporal classification (CTC) via mask prediction. Many previous studies on E2E-ASR focus on an autoregressive (AR) model: each output token is generated by conditioning on previously generated tokens, at the cost of requiring as many iterations as the output length. NAR models, on the other hand, can generate tokens simultaneously within a constant number of iterations, which significantly reduces inference time and makes end-to-end ASR models better suited to real-world scenarios. In this work, we train an E2E-ASR model with the joint objectives of CTC and mask prediction. During inference, the greedy CTC output is refined by mask prediction, where errors in the CTC output are recovered by taking into account the conditional dependence between output tokens. Experimental results on different speech recognition tasks show that the proposed model achieves fast inference (<0.1 RTF using CPU), outperforming a standard CTC model and achieving results competitive with AR models. |
Keyword(in Japanese) | (See Japanese page) |
Keyword(in English) | end-to-end speech recognition / connectionist temporal classification / non-autoregressive sequence generation |
Paper # | NLC2020-13,SP2020-16 |
Date of Issue | 2020-11-25 (NLC, SP) |
Conference Information | |
Committee | NLC / IPSJ-NL / SP / IPSJ-SLP |
---|---|
Conference Date | 2020/12/2(2days) |
Place (in Japanese) | (See Japanese page) |
Place (in English) | Online |
Topics (in Japanese) | (See Japanese page) |
Topics (in English) | |
Chair | Kazutaka Shimada(Kyushu Inst. of Tech.) / Satoshi Sekine(RIKEN) / Hisashi Kawai(NICT) / Norihide Kitaoka(Toyohashi Univ. of Tech.) |
Vice Chair | Mitsuo Yoshida(Toyohashi Univ. of Tech.) / Takeshi Kobayakawa(NHK) |
Secretary | Mitsuo Yoshida(Univ. of Tokyo) / Takeshi Kobayakawa(Hiroshima Univ. of Economics) / (Denso IT Laboratory) / (Otaru Univ. of Commerce) / (Ibaraki Univ.) |
Assistant | Kanjin Takahashi(Sansan) / Ko Mitsuda(NTT) / / Yusuke Ijima(NTT) |
Paper Information | |
Registration To | Technical Committee on Natural Language Understanding and Models of Communication / Special Interest Group on Natural Language / Technical Committee on Speech / Special Interest Group on Spoken Language Processing |
---|---|
Language | JPN |
Title (in Japanese) | (See Japanese page) |
Sub Title (in Japanese) | (See Japanese page) |
Title (in English) | Fast End-to-End Speech Recognition with CTC and Mask Predict |
Sub Title (in English) | |
Keyword(1) | end-to-end speech recognition |
Keyword(2) | connectionist temporal classification |
Keyword(3) | non-autoregressive sequence generation |
1st Author's Name | Yosuke Higuchi |
1st Author's Affiliation | Waseda University(Waseda Univ.) |
2nd Author's Name | Hirofumi Inaguma |
2nd Author's Affiliation | Kyoto University(Kyoto Univ.) |
3rd Author's Name | Shinji Watanabe |
3rd Author's Affiliation | Johns Hopkins University(JHU) |
4th Author's Name | Tetsuji Ogawa |
4th Author's Affiliation | Waseda University(Waseda Univ.) |
5th Author's Name | Tetsunori Kobayashi |
5th Author's Affiliation | Waseda University(Waseda Univ.) |
Date | 2020-12-02 |
Paper # | NLC2020-13,SP2020-16 |
Volume (vol) | vol.120 |
Number (no) | NLC-270,SP-271 |
Page | pp.1-6(NLC), pp.1-6(SP) |
#Pages | 6 |
Date of Issue | 2020-11-25 (NLC, SP) |
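
The inference step summarized in the abstract (a greedy CTC output in which uncertain tokens are masked and then re-predicted) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the blank symbol `_`, the `<mask>` token, the threshold of 0.9, and the choice of the maximum frame probability as a token's confidence are all hypothetical, and the mask-predicting decoder that would fill in the masked positions is omitted.

```python
from itertools import groupby

BLANK = "_"      # assumed CTC blank symbol (hypothetical)
MASK = "<mask>"  # assumed mask token (hypothetical)


def ctc_greedy_collapse(frame_tokens, frame_probs):
    """Collapse per-frame argmax tokens into an output sequence.

    Consecutive repeats are merged and blanks are dropped. Each output
    token gets a confidence equal to the highest frame probability in
    its run (an assumption made for this sketch).
    """
    tokens, confs = [], []
    start = 0
    for tok, run in groupby(frame_tokens):
        length = len(list(run))
        if tok != BLANK:
            tokens.append(tok)
            confs.append(max(frame_probs[start:start + length]))
        start += length
    return tokens, confs


def mask_low_confidence(tokens, confs, threshold=0.9):
    """Replace tokens whose confidence is below the threshold with MASK."""
    return [t if c >= threshold else MASK for t, c in zip(tokens, confs)]


# Toy input: per-frame argmax tokens and their probabilities.
frames = ["_", "c", "c", "_", "a", "a", "_", "t"]
probs = [0.9, 0.8, 0.95, 0.9, 0.4, 0.5, 0.9, 0.99]

tokens, confs = ctc_greedy_collapse(frames, probs)
masked = mask_low_confidence(tokens, confs)
print(masked)  # ['c', '<mask>', 't']
```

In a full system, the masked positions would then be predicted by a decoder conditioned on the unmasked tokens and the acoustic encoder output, over a small fixed number of iterations.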