Conformer based early fusion model for audio-visual speech recognition

Presentation	2022-10-22 Conformer based early fusion model for audio-visual speech recognition Nobukazu Aoki, Shun Sawada, Hidefumi Ohmura, Kouichi Katsurada
Abstract(in Japanese)	(See Japanese page)
Abstract(in English)	Previous late fusion models with conformer encoders use independent encoders for visual and audio information, which may prevent the encoders from capturing low-level relations between the two modalities. In this study, we investigate an end-to-end audio-visual speech recognition model that applies early fusion with a conformer encoder, aiming to use information from both modalities in the low-level stages of feature extraction and thereby improve performance. Experimental results show that, under low SNR, early fusion achieves a higher recognition rate than the late fusion proposed in previous studies. We also confirmed that introducing early fusion reduces the total number of parameters in the model.
Keyword(in Japanese)	(See Japanese page)
Keyword(in English)	Audio-visual speech recognition / Conformer model / Early fusion
Paper #	SP2022-28,WIT2022-3
Date of Issue	2022-10-15 (SP, WIT)

Conference Information
Committee	SP / WIT / IPSJ-SLP
Conference Date	2022/10/22 (1 day)
Place (in Japanese)	(See Japanese page)
Place (in English)	Kyoto University
Topics (in Japanese)	(See Japanese page)
Topics (in English)
Chair	Tomoki Toda(Nagoya Univ.) / Shinji Sakou(Nagoya Inst. of Tech.) / Tomoki Toda(Nagoya Univ.)
Vice Chair	/ Tomohiro Amemiya(Univ. of Tokyo)
Secretary	(NTT) / Tomohiro Amemiya(Univ. of Electro-Comm.) / (Saitama Industrial Tech. Center)
Assistant	Ryo Aihara(Mitsubishi Electric) / Daisuke Saito(Univ. of Tokyo) / Minako Hosono(AIST) / Aki Sugano(Nagoya Univ.) / Tomoyasu Komori(NHK)

Paper Information
Registration To	Technical Committee on Speech / Technical Committee on Well-being Information Technology / Special Interest Group on Spoken Language Processing
Language	JPN
Title (in Japanese)	(See Japanese page)
Sub Title (in Japanese)	(See Japanese page)
Title (in English)	Conformer based early fusion model for audio-visual speech recognition
Sub Title (in English)
Keyword(1)	Audio-visual speech recognition
Keyword(2)	Conformer model
Keyword(3)	Early fusion
1st Author's Name	Nobukazu Aoki
1st Author's Affiliation	Tokyo University of Science(Tokyo Univ. of Sci.)
2nd Author's Name	Shun Sawada
2nd Author's Affiliation	Tokyo University of Science(Tokyo Univ. of Sci.)
3rd Author's Name	Hidefumi Ohmura
3rd Author's Affiliation	Tokyo University of Science(Tokyo Univ. of Sci.)
4th Author's Name	Kouichi Katsurada
4th Author's Affiliation	Tokyo University of Science(Tokyo Univ. of Sci.)
Date	2022-10-22
Paper #	SP2022-28,WIT2022-3
Volume (vol)	vol.122
Number (no)	SP-221,WIT-222
Page	pp.8-13(SP), pp.8-13(WIT)
#Pages	6
Date of Issue	2022-10-15 (SP, WIT)