Fast End-to-End Speech Recognition with CTC and Mask Predict

Yosuke Higuchi; Hirofumi Inaguma; Shinji Watanabe; Tetsuji Ogawa; Tetsunori Kobayashi

Presentation	2020-12-02 Fast End-to-End Speech Recognition with CTC and Mask Predict Yosuke Higuchi, Hirofumi Inaguma, Shinji Watanabe, Tetsuji Ogawa, Tetsunori Kobayashi
Abstract(in Japanese)	(See Japanese page)
Abstract(in English)	We present a fast non-autoregressive (NAR) end-to-end automatic speech recognition (E2E-ASR) framework that generates a sequence by refining the outputs of connectionist temporal classification (CTC) via mask prediction. Many previous studies on E2E-ASR focus on autoregressive (AR) models, in which each output token is generated conditioned on previously generated tokens, at the cost of requiring as many iterations as the output length. NAR models, on the other hand, can generate tokens simultaneously within a constant number of iterations, which significantly reduces inference time and makes end-to-end ASR models better suited to real-world scenarios. In this work, we train an E2E-ASR model with joint objectives of CTC and mask prediction. During inference, the greedy CTC output is refined by mask prediction, where errors in the CTC output are recovered by taking into account the conditional dependence between output tokens. Experimental results on different speech recognition tasks show that the proposed model achieves fast inference (<0.1 RTF on a CPU), outperforming a standard CTC model and achieving results competitive with AR models.
Keyword(in Japanese)	(See Japanese page)
Keyword(in English)	end-to-end speech recognition / connectionist temporal classification / non-autoregressive sequence generation
Paper #	NLC2020-13,SP2020-16
Date of Issue	2020-11-25 (NLC, SP)
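
The inference step described in the abstract (greedy CTC output refined by mask prediction) can be illustrated with a minimal sketch. This is not the authors' implementation. The blank and mask symbols, the use of the highest frame probability as a token confidence, the 0.9 threshold, and the `predictor` callable that stands in for the trained mask-prediction decoder are all assumptions made here for illustration.

```python
from itertools import groupby

BLANK = "_"       # hypothetical CTC blank symbol
MASK = "<mask>"   # hypothetical mask symbol


def ctc_greedy_collapse(frame_tokens, frame_probs):
    """Collapse frame-wise greedy CTC output: merge repeats, then drop blanks.

    Returns the collapsed tokens and, for each token, the highest frame
    probability in its run (an assumed confidence measure).
    """
    tokens, confidences = [], []
    pairs = zip(frame_tokens, frame_probs)
    for token, run in groupby(pairs, key=lambda pair: pair[0]):
        if token == BLANK:
            continue
        tokens.append(token)
        confidences.append(max(prob for _, prob in run))
    return tokens, confidences


def mask_low_confidence(tokens, confidences, threshold=0.9):
    """Replace tokens whose confidence is below the (assumed) threshold."""
    return [tok if conf >= threshold else MASK
            for tok, conf in zip(tokens, confidences)]


def fill_masks(masked_tokens, predictor):
    """Fill each masked position using a stand-in predictor.

    `predictor(tokens, index)` is a hypothetical callable representing a
    model that predicts a token conditioned on the unmasked tokens.
    """
    return [predictor(masked_tokens, i) if tok == MASK else tok
            for i, tok in enumerate(masked_tokens)]


# Toy frame-level output: the token "a" is recognized with low confidence.
frames = ["_", "c", "c", "_", "a", "a", "_", "t"]
probs = [0.9, 0.95, 0.99, 0.8, 0.6, 0.7, 0.9, 0.97]

tokens, confidences = ctc_greedy_collapse(frames, probs)
masked = mask_low_confidence(tokens, confidences)
refined = fill_masks(masked, lambda toks, i: "a")  # dummy predictor
```

In this toy run, only the low-confidence position is masked and re-predicted, so the number of refinement steps does not grow with the output length in the way AR decoding does.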

Conference Information
Committee	NLC / IPSJ-NL / SP / IPSJ-SLP
Conference Date	2020/12/2 (2 days)
Place (in Japanese)	(See Japanese page)
Place (in English)	Online
Topics (in Japanese)	(See Japanese page)
Topics (in English)
Chair	Kazutaka Shimada(Kyushu Inst. of Tech.) / Satoshi Sekine(RIKEN) / Hisashi Kawai(NICT) / Norihide Kitaoka(Toyohashi Univ. of Tech.)
Vice Chair	Mitsuo Yoshida(Toyohashi Univ. of Tech.) / Takeshi Kobayakawa(NHK)
Secretary	Mitsuo Yoshida(Univ. of Tokyo) / Takeshi Kobayakawa(Hiroshima Univ. of Economics) / (Denso IT Laboratory) / (Otaru Univ. of Commerce) / (Ibaraki Univ.)
Assistant	Kanjin Takahashi(Sansan) / Ko Mitsuda(NTT) / / Yusuke Ijima(NTT)

Paper Information
Registration To	Technical Committee on Natural Language Understanding and Models of Communication / Special Interest Group on Natural Language / Technical Committee on Speech / Special Interest Group on Spoken Language Processing
Language	JPN
Title (in Japanese)	(See Japanese page)
Sub Title (in Japanese)	(See Japanese page)
Title (in English)	Fast End-to-End Speech Recognition with CTC and Mask Predict
Sub Title (in English)
Keyword(1)	end-to-end speech recognition
Keyword(2)	connectionist temporal classification
Keyword(3)	non-autoregressive sequence generation
1st Author's Name	Yosuke Higuchi
1st Author's Affiliation	Waseda University(Waseda Univ.)
2nd Author's Name	Hirofumi Inaguma
2nd Author's Affiliation	Kyoto University(Kyoto Univ.)
3rd Author's Name	Shinji Watanabe
3rd Author's Affiliation	Johns Hopkins University(JHU)
4th Author's Name	Tetsuji Ogawa
4th Author's Affiliation	Waseda University(Waseda Univ.)
5th Author's Name	Tetsunori Kobayashi
5th Author's Affiliation	Waseda University(Waseda Univ.)
Date	2020-12-02
Paper #	NLC2020-13,SP2020-16
Volume (vol)	vol.120
Number (no)	NLC-270,SP-271
Page	pp.1-6(NLC), pp.1-6(SP)
#Pages	6
Date of Issue	2020-11-25 (NLC, SP)