Presentation 2022-10-22
Conformer based early fusion model for audio-visual speech recognition
Nobukazu Aoki, Shun Sawada, Hidefumi Ohmura, Kouichi Katsurada
Abstract(in Japanese) (See Japanese page)
Abstract(in English) Previous late fusion models with conformer encoders use independent encoders for the visual and audio streams, which may prevent the encoders from capturing low-level relations between the two modalities. In this study, we investigate an end-to-end audio-visual speech recognition model that applies early fusion with a conformer encoder, aiming to exploit information from both modalities during low-level feature extraction. The experimental results show that, under low SNR conditions, early fusion achieves a higher recognition rate than the late fusion proposed in previous studies. We also confirmed that the total number of parameters in the model can be reduced by introducing early fusion.
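The contrast between early and late fusion described in the abstract can be illustrated with a minimal sketch. This is not the authors' model: the dense-layer "encoder", the function names, and the feature dimensions used in the usage note are all hypothetical stand-ins chosen only to show where the fusion happens and why sharing one encoder may lower the parameter count.

```python
def linear_params(d_in, d_out):
    """Number of weights plus biases in one dense layer."""
    return d_in * d_out + d_out


def encoder_params(d_in, d_hidden, n_layers):
    """Toy encoder (a stand-in for a conformer stack, which is an assumption):
    one input projection followed by n_layers hidden dense layers."""
    return linear_params(d_in, d_hidden) + n_layers * linear_params(d_hidden, d_hidden)


def early_fusion_frames(audio, video):
    """Early fusion: concatenate time-aligned audio and visual feature
    vectors frame by frame, before any encoder sees them."""
    if len(audio) != len(video):
        raise ValueError("streams must be aligned to the same number of frames")
    return [a + v for a, v in zip(audio, video)]


def early_fusion_params(d_audio, d_video, d_hidden, n_layers):
    """One shared encoder over the concatenated features."""
    return encoder_params(d_audio + d_video, d_hidden, n_layers)


def late_fusion_params(d_audio, d_video, d_hidden, n_layers):
    """Two independent encoders, then a dense layer that merges their outputs."""
    return (
        encoder_params(d_audio, d_hidden, n_layers)
        + encoder_params(d_video, d_hidden, n_layers)
        + linear_params(2 * d_hidden, d_hidden)
    )
```

For example, with hypothetical sizes `d_audio=80`, `d_video=512`, `d_hidden=256`, and `n_layers=12`, `early_fusion_params` returns a smaller count than `late_fusion_params`, because the hidden stack is instantiated once rather than twice. The actual parameter counts reported in the paper depend on its specific architecture and are not reproduced here.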
Keyword(in Japanese) (See Japanese page)
Keyword(in English) Audio-visual speech recognition / Conformer model / Early fusion
Paper # SP2022-28,WIT2022-3
Date of Issue 2022-10-15 (SP, WIT)

Conference Information
Committee SP / WIT / IPSJ-SLP
Conference Date 2022/10/22 (1 day)
Place (in Japanese) (See Japanese page)
Place (in English) Kyoto University
Topics (in Japanese) (See Japanese page)
Topics (in English)
Chair Tomoki Toda(Nagoya Univ.) / Shinji Sakou(Nagoya Inst. of Tech.) / Tomoki Toda(Nagoya Univ.)
Vice Chair / Tomohiro Amemiya(Univ. of Tokyo)
Secretary (NTT) / Tomohiro Amemiya(Univ. of Electro-Comm.) / (Saitama Industrial Tech. Center)
Assistant Ryo Aihara(Mitsubishi Electric) / Daisuke Saito(Univ. of Tokyo) / Minako Hosono(AIST) / Aki Sugano(Nagoya Univ.) / Tomoyasu Komori(NHK)

Paper Information
Registration To Technical Committee on Speech / Technical Committee on Well-being Information Technology / Special Interest Group on Spoken Language Processing
Language JPN
Title (in Japanese) (See Japanese page)
Sub Title (in Japanese) (See Japanese page)
Title (in English) Conformer based early fusion model for audio-visual speech recognition
Sub Title (in English)
Keyword(1) Audio-visual speech recognition
Keyword(2) Conformer model
Keyword(3) Early fusion
1st Author's Name Nobukazu Aoki
1st Author's Affiliation Tokyo University of Science(Tokyo Univ. of Sci.)
2nd Author's Name Shun Sawada
2nd Author's Affiliation Tokyo University of Science(Tokyo Univ. of Sci.)
3rd Author's Name Hidefumi Ohmura
3rd Author's Affiliation Tokyo University of Science(Tokyo Univ. of Sci.)
4th Author's Name Kouichi Katsurada
4th Author's Affiliation Tokyo University of Science(Tokyo Univ. of Sci.)
Date 2022-10-22
Paper # SP2022-28,WIT2022-3
Volume (vol) vol.122
Number (no) SP-221,WIT-222
Page pp.8-13(SP), pp.8-13(WIT)
#Pages 6
Date of Issue 2022-10-15 (SP, WIT)