Presentation | 2022-10-22 Conformer based early fusion model for audio-visual speech recognition Nobukazu Aoki, Shun Sawada, Hidefumi Ohmura, Kouichi Katsurada |
---|---|
Abstract(in Japanese) | (See Japanese page) |
Abstract(in English) | Previous late fusion models with conformer encoders use separate encoders for visual and audio information, which may prevent the encoders from capturing low-level relationships between the two modalities. In this study, we investigate an end-to-end audio-visual speech recognition model that applies early fusion with a conformer encoder, aiming to use information from both modalities during low-level feature extraction and thereby improve recognition performance. Experimental results show that, under low SNR, early fusion achieves a higher recognition rate than the late fusion proposed in previous studies. We also confirmed that the total number of parameters in the model can be reduced by introducing early fusion. |
Keyword(in Japanese) | (See Japanese page) |
Keyword(in English) | Audio-visual speech recognition / Conformer model / Early fusion |
Paper # | SP2022-28,WIT2022-3 |
Date of Issue | 2022-10-15 (SP, WIT) |
Conference Information | |
Committee | SP / WIT / IPSJ-SLP |
---|---|
Conference Date | 2022/10/22 (1 day) |
Place (in Japanese) | (See Japanese page) |
Place (in English) | Kyoto University |
Topics (in Japanese) | (See Japanese page) |
Topics (in English) | |
Chair | Tomoki Toda(Nagoya Univ.) / Shinji Sakou(Nagoya Inst. of Tech.) / Tomoki Toda(Nagoya Univ.) |
Vice Chair | / Tomohiro Amemiya(Univ. of Tokyo) |
Secretary | (NTT) / Tomohiro Amemiya(Univ. of Electro-Comm.) / (Saitama Industrial Tech. Center) |
Assistant | Ryo Aihara(Mitsubishi Electric) / Daisuke Saito(Univ. of Tokyo) / Minako Hosono(AIST) / Aki Sugano(Nagoya Univ.) / Tomoyasu Komori(NHK) |
Paper Information | |
Registration To | Technical Committee on Speech / Technical Committee on Well-being Information Technology / Special Interest Group on Spoken Language Processing |
---|---|
Language | JPN |
Title (in Japanese) | (See Japanese page) |
Sub Title (in Japanese) | (See Japanese page) |
Title (in English) | Conformer based early fusion model for audio-visual speech recognition |
Sub Title (in English) | |
Keyword(1) | Audio-visual speech recognition |
Keyword(2) | Conformer model |
Keyword(3) | Early fusion |
1st Author's Name | Nobukazu Aoki |
1st Author's Affiliation | Tokyo University of Science(Tokyo Univ. of Sci.) |
2nd Author's Name | Shun Sawada |
2nd Author's Affiliation | Tokyo University of Science(Tokyo Univ. of Sci.) |
3rd Author's Name | Hidefumi Ohmura |
3rd Author's Affiliation | Tokyo University of Science(Tokyo Univ. of Sci.) |
4th Author's Name | Kouichi Katsurada |
4th Author's Affiliation | Tokyo University of Science(Tokyo Univ. of Sci.) |
Date | 2022-10-22 |
Paper # | SP2022-28,WIT2022-3 |
Volume (vol) | vol.122 |
Number (no) | SP-221,WIT-222 |
Page | pp.8-13(SP), pp.8-13(WIT) |
#Pages | 6 |
Date of Issue | 2022-10-15 (SP, WIT) |