Features and Deep Learning Models Suitable for Speech Source Discrimination Method in Plural Voice User Interfaces Environment

Kengo Maeda; Takahiro Yoshida

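Presentation	2022-10-21 Features and Deep Learning Models Suitable for Speech Source Discrimination Method in Plural Voice User Interfaces Environment. Kengo Maeda (Tokyo University of Science), Takahiro Yoshida (Tokyo University of Science)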
Abstract (Japanese)	今後，ユーザの利用環境において音声対話型のユーザインタフェース（UI）を搭載した機器が複数存在するようになると，それぞれの機器が正しく動作するためには，人が直接発した音声と機器から再生された音声を判別する技術が必要となる．そのため当研究室の先行研究では，Convolutional Neural Network（CNN）とMel-Frequency Cepstral Coefficients（MFCC）を用いた音声発生源判別法を提案した．しかし，先行研究では，音声発生源判別に適した特徴量や深層学習のモデルに関しては検討が不足していた．そこで本研究では，音声発生源判別法に適した特徴量と深層学習のモデルについて調査するために，複数種の特徴量と深層学習モデルについて，音声発生源判別実験により精度を比較した．その結果，MFCCは次元数を増やすほど判別精度が向上することから，スペクトルの微細構造も音声発生源判別法においては重要であることを確認した．また，事前学習済みモデルに対して深層学習の手法の一つであるファインチューニングを行ったモデルが音声発生源判別法においても有効であることを確認した．
Abstract (English)	In the near future, once multiple devices equipped with a voice user interface coexist in a user's environment, each device will need to discriminate between speech uttered directly by a person and speech played back from a device in order to operate correctly. Our previous study therefore proposed a speech source discrimination method using a Convolutional Neural Network (CNN) and Mel-Frequency Cepstral Coefficients (MFCC). However, that study did not sufficiently examine which features and deep learning models are suited to speech source discrimination. In this study, we therefore compared the accuracy of several features and deep learning models in speech source discrimination experiments. The results show that discrimination accuracy improves as the number of MFCC dimensions increases, which confirms that the fine structure of the spectrum is also important for speech source discrimination. We also confirmed that fine-tuning a pre-trained model is effective for speech source discrimination.
Keywords (Japanese)	音声対話型UI / 音源判別 / 畳み込みニューラルネットワーク / ファインチューニング
Keywords (English)	Voice user interface / Speech source discrimination / Convolutional neural network / Fine-tuning
Report number	PRMU2022-27
Publication date	2022-10-14 (PRMU)
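
The abstract's observation that higher-dimensional MFCCs retain the fine structure of the spectrum can be illustrated with a minimal sketch. MFCCs are commonly obtained by applying a DCT to a log mel spectrum and keeping only the lowest-order coefficients, so truncation discards the rapidly varying part of the spectrum. The code below is not the authors' pipeline: the 40-band log mel frame is synthetic (a smooth tilt plus a ripple), and the names `dct2_ortho`, `idct2_ortho`, `truncation_error`, and `N_BANDS` are hypothetical, introduced only for illustration.

```python
import math

def dct2_ortho(x):
    """Orthonormal DCT-II of a real sequence."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def idct2_ortho(c):
    """Inverse of dct2_ortho (the transpose of the orthonormal DCT-II)."""
    n = len(c)
    out = []
    for i in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(
            c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(1, n)
        )
        out.append(s)
    return out

def truncation_error(log_mel, n_coeffs):
    """RMS error after keeping only the first n_coeffs cepstral coefficients."""
    coeffs = dct2_ortho(log_mel)
    kept = coeffs[:n_coeffs] + [0.0] * (len(coeffs) - n_coeffs)
    approx = idct2_ortho(kept)
    sq = sum((a - b) ** 2 for a, b in zip(log_mel, approx))
    return math.sqrt(sq / len(log_mel))

# Synthetic log mel frame (assumption): a smooth spectral tilt plus a
# fast ripple that stands in for spectral fine structure.
N_BANDS = 40
log_mel = [-0.05 * b + 0.3 * math.sin(2 * math.pi * b / 5.0) for b in range(N_BANDS)]

# Reconstruction error for three illustrative coefficient counts.
errors = {k: truncation_error(log_mel, k) for k in (13, 20, 40)}
for k, err in errors.items():
    print(f"{k:2d} coefficients: RMS error = {err:.6f}")
```

Because the transform is orthonormal, the reconstruction error can only shrink as more coefficients are kept, and it vanishes when all 40 are retained. The coefficient counts 13, 20, and 40 are arbitrary choices for this sketch and are not taken from the report.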

Technical committee information
Technical committee	PRMU
Conference dates	2022/10/21 (2 days)
Venue (Japanese)	日本科学未来館
Venue (English)	Miraikan - The National Museum of Emerging Science and Innovation
Theme (Japanese)	人に関わる認識・理解
Theme (English)	Recognition and understanding related to people
Chair (Japanese)	内田誠一(九大)
Chair (English)	Seiichi Uchida(Kyushu Univ.)
Vice chairs (Japanese)	舩冨卓哉(奈良先端大) / 安倍満(デンソーアイティーラボラトリ)
Vice chairs (English)	Takuya Funatomi(NAIST) / Mitsuru Anpai(Denso IT Lab.)
Secretaries (Japanese)	山口光太(サイバーエージェント) / 松井勇佑(東大)
Secretaries (English)	Kouta Yamaguchi(CyberAgent) / Yusuke Matsui(Univ. of Tokyo)
Assistant secretaries (Japanese)	井上中順(東工大) / 川西康友(理研)
Assistant secretaries (English)	Nakamasa Inoue(Tokyo Inst. of Tech.) / Yasutomo Kawanishi(Riken)

Paper details
Submitting committee	Technical Committee on Pattern Recognition and Media Understanding
Language of text	JPN
Title (Japanese)	音声対話型UI間の協調動作のための音声発生源判別法に適した特徴量と深層学習モデル
Subtitle (Japanese)
Title (English)	Features and Deep Learning Models Suitable for Speech Source Discrimination Method in Plural Voice User Interfaces Environment
Subtitle (English)
Keyword (1) (Japanese/English)	音声対話型UI / Voice user interface
Keyword (2) (Japanese/English)	音源判別 / Speech source discrimination
Keyword (3) (Japanese/English)	畳み込みニューラルネットワーク / Convolutional neural network
Keyword (4) (Japanese/English)	ファインチューニング / Fine-tuning
First author name (Japanese/English)	前田健吾 / Kengo Maeda
First author affiliation (Japanese/English)	東京理科大学 (abbr. 東京理科大) / Tokyo University of Science (abbr. TUS)
Second author name (Japanese/English)	吉田孝博 / Takahiro Yoshida
Second author affiliation (Japanese/English)	東京理科大学 (abbr. 東京理科大) / Tokyo University of Science (abbr. TUS)
Presentation date	2022-10-21
Report number	PRMU2022-27
Volume (vol)	vol.122
Number (no)	PRMU-223
Page range	pp.29-34 (PRMU)
Number of pages	6
Publication date	2022-10-14 (PRMU)