Presentation 2020-12-02
Fast End-to-End Speech Recognition with CTC and Mask Predict
Yosuke Higuchi, Hirofumi Inaguma, Shinji Watanabe, Tetsuji Ogawa, Tetsunori Kobayashi
Abstract(in Japanese) (See Japanese page)
Abstract(in English) We present a fast non-autoregressive (NAR) end-to-end automatic speech recognition (E2E-ASR) framework, which generates a sequence by refining the outputs of connectionist temporal classification (CTC) via mask prediction. Many previous studies on E2E-ASR focus on autoregressive (AR) models, in which each output token is generated conditioned on previously generated tokens, at the cost of requiring as many iterations as the output length. In contrast, NAR models can generate tokens simultaneously within a constant number of iterations, which significantly reduces inference time and makes E2E-ASR models better suited to real-world scenarios. In this work, we train an E2E-ASR model with the joint objectives of CTC and mask prediction. During inference, the greedy CTC output is refined by mask prediction, where errors in the CTC output are recovered by taking into account the conditional dependence between output tokens. Experimental results on different speech recognition tasks show that the proposed model achieves fast inference (<0.1 RTF on a CPU), outperforming a standard CTC model and achieving results competitive with AR models.
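The inference flow described in the abstract (greedy CTC decoding, then masking tokens that are likely to be wrong) might be sketched as follows. This is a minimal illustration and not the authors' implementation: the blank index, the `MASK` id, the use of the maximum frame probability as a token confidence, and the `threshold` parameter are all assumptions made for the example. The trained decoder that would predict the masked positions is not shown.

```python
from typing import List, Tuple

BLANK = 0   # assumed index of the CTC blank symbol
MASK = -1   # hypothetical id standing in for a <mask> token


def ctc_greedy_decode(frame_probs: List[List[float]]) -> Tuple[List[int], List[float]]:
    """Collapse the frame-wise argmax path: merge repeats, then drop blanks.

    Returns the token ids and, for each token, the highest frame probability
    among its merged frames (used here as an assumed confidence score).
    """
    tokens: List[int] = []
    confidences: List[float] = []
    prev = BLANK
    for probs in frame_probs:
        best = max(range(len(probs)), key=probs.__getitem__)
        p = probs[best]
        if best != BLANK:
            if best == prev:
                # Same token on consecutive frames: keep the larger probability.
                confidences[-1] = max(confidences[-1], p)
            else:
                tokens.append(best)
                confidences.append(p)
        prev = best
    return tokens, confidences


def mask_low_confidence(tokens: List[int], confidences: List[float],
                        threshold: float = 0.9) -> List[int]:
    """Replace tokens whose confidence is below the threshold with MASK."""
    return [t if c >= threshold else MASK for t, c in zip(tokens, confidences)]


# Toy posteriors over {blank, token 1, token 2} for five frames.
frame_probs = [
    [0.10, 0.80, 0.10],  # token 1
    [0.20, 0.70, 0.10],  # token 1 again (merged with the previous frame)
    [0.90, 0.05, 0.05],  # blank
    [0.30, 0.30, 0.40],  # token 2, low confidence
    [0.05, 0.95, 0.00],  # token 1
]
tokens, confidences = ctc_greedy_decode(frame_probs)
masked = mask_low_confidence(tokens, confidences, threshold=0.5)
print(tokens, masked)  # [1, 2, 1] [1, -1, 1]
```

In a full system, the masked positions would then be filled in by a decoder conditioned on the unmasked tokens and the audio, within a fixed number of iterations rather than one iteration per output token.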
Keyword(in Japanese) (See Japanese page)
Keyword(in English) end-to-end speech recognition / connectionist temporal classification / non-autoregressive sequence generation
Paper # NLC2020-13,SP2020-16
Date of Issue 2020-11-25 (NLC, SP)

Conference Information
Committee NLC / IPSJ-NL / SP / IPSJ-SLP
Conference Date 2020/12/2(2days)
Place (in Japanese) (See Japanese page)
Place (in English) Online
Topics (in Japanese) (See Japanese page)
Topics (in English)
Chair Kazutaka Shimada(Kyushu Inst. of Tech.) / Satoshi Sekine(RIKEN) / Hisashi Kawai(NICT) / Norihide Kitaoka(Toyohashi Univ. of Tech.)
Vice Chair Mitsuo Yoshida(Toyohashi Univ. of Tech.) / Takeshi Kobayakawa(NHK)
Secretary Mitsuo Yoshida(Univ. of Tokyo) / Takeshi Kobayakawa(Hiroshima Univ. of Economics) / (Denso IT Laboratory) / (Otaru Univ. of Commerce) / (Ibaraki Univ.)
Assistant Kanjin Takahashi(Sansan) / Ko Mitsuda(NTT) / / Yusuke Ijima(NTT)

Paper Information
Registration To Technical Committee on Natural Language Understanding and Models of Communication / Special Interest Group on Natural Language / Technical Committee on Speech / Special Interest Group on Spoken Language Processing
Language JPN
Title (in Japanese) (See Japanese page)
Sub Title (in Japanese) (See Japanese page)
Title (in English) Fast End-to-End Speech Recognition with CTC and Mask Predict
Sub Title (in English)
Keyword(1) end-to-end speech recognition
Keyword(2) connectionist temporal classification
Keyword(3) non-autoregressive sequence generation
1st Author's Name Yosuke Higuchi
1st Author's Affiliation Waseda University(Waseda Univ.)
2nd Author's Name Hirofumi Inaguma
2nd Author's Affiliation Kyoto University(Kyoto Univ.)
3rd Author's Name Shinji Watanabe
3rd Author's Affiliation Johns Hopkins University(JHU)
4th Author's Name Tetsuji Ogawa
4th Author's Affiliation Waseda University(Waseda Univ.)
5th Author's Name Tetsunori Kobayashi
5th Author's Affiliation Waseda University(Waseda Univ.)
Date 2020-12-02
Paper # NLC2020-13,SP2020-16
Volume (vol) vol.120
Number (no) NLC-270,SP-271
Page pp.1-6(NLC), pp.1-6(SP)
#Pages 6
Date of Issue 2020-11-25 (NLC, SP)