Multimodal speech recognition using audio and visual confusion networks

Tai KAMISAWA; Masato ISHIKAWA; Satoshi TAMURA; Satoru HAYAMIZU

Presentation title	2007/11/21 Multimodal speech recognition using audio and visual confusion networks. Tai KAMISAWA, Masato ISHIKAWA, Satoshi TAMURA, Satoru HAYAMIZU
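Abstract (translated from Japanese)	In multimodal speech recognition based on result-level integration of audio and visual information, it is important that the intermediate representation, produced once the audio and visual streams have each been recognized separately, accurately contains the information needed for integration. Recently, confusion networks (CNs), obtained by clustering the arcs of a word graph, have come into use as an intermediate representation in speech recognition. In addition, confusion network combination (CNC), which merges the CNs output by individual recognizers, has been proposed as a method for integrating the results of multiple speech recognizers. The more the error tendencies of the recognizers differ, the greater the benefit expected from CNC, so it is also considered an effective integration method for multimodal speech recognition using audio and lip video. In this study, we therefore integrated audio and visual information by CNC in an attempt to improve speech recognition performance in noise, and evaluated the resulting performance. We also examined two methods for combining CNs and the relationship with confidence scores.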
Abstract (English)	In multimodal speech recognition, hypotheses from the audio and visual recognizers are usually integrated after both recognition processes have finished. Because audio and visual recognition are performed separately, the intermediate representation of the hypotheses from each modality is a very important issue. Recently, confusion networks (CNs) have been used as an intermediate representation of hypotheses in speech recognition. In addition, confusion network combination (CNC), which integrates multiple confusion networks, has been proposed as a method for integrating hypotheses derived from multiple recognition processes. Integration by CNC yields better recognition performance when the recognition processes differ in their error tendencies. Because multimodal speech recognition integrates audio and visual recognition processes, CNC is expected to improve recognition performance. In this paper, therefore, audio and visual recognition results were integrated by CNC and applied to multimodal speech recognition in noisy environments, with the aim of improving recognition performance. Two methods for combining CNs are described, and the relationship between confidence scores and recognition correctness is discussed.
Keywords (Japanese)	マルチモーダル音声認識 / コンフュージョンネットワーク / コンフュージョンネットワークコンビネーション
Keywords (English)	Multimodal Speech Recognition / Confusion Network (CN) / Confusion Network Combination (CNC)
Report number	SP2007-92
Publication date

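Technical committee information
Technical committee	SP
Dates	2007/11/21 (one-day meeting)
Venue (Japanese)
Venue (English)
Theme (Japanese)
Theme (English)
Chair (Japanese)
Chair (English)
Vice chair (Japanese)
Vice chair (English)
Secretary (Japanese)
Secretary (English)
Assistant secretary (Japanese)
Assistant secretary (English)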

Paper details
Submitted to technical committee	Speech (SP)
Language of text	JPN
Title (Japanese)	音声と画像のconfusion networkを用いたマルチモーダル音声認識
Subtitle (Japanese)
Title (English)	Multimodal speech recognition using audio and visual confusion networks
Subtitle (English)
Keyword 1 (Japanese/English)	マルチモーダル音声認識 / Multimodal Speech Recognition
Keyword 2 (Japanese/English)	コンフュージョンネットワーク / Confusion Network (CN)
Keyword 3 (Japanese/English)	コンフュージョンネットワークコンビネーション / Confusion Network Combination (CNC)
First author (Japanese/English)	上澤泰 / Tai KAMISAWA
First author affiliation (Japanese/English)	岐阜大学大学院工学研究科 / Graduate School of Engineering, Gifu University
Second author (Japanese/English)	石川雅人 / Masato ISHIKAWA
Second author affiliation (Japanese/English)	岐阜大学工学部 / Faculty of Engineering, Gifu University
Third author (Japanese/English)	田村哲嗣 / Satoshi TAMURA
Third author affiliation (Japanese/English)	岐阜大学工学部 / Faculty of Engineering, Gifu University
Fourth author (Japanese/English)	速水悟 / Satoru HAYAMIZU
Fourth author affiliation (Japanese/English)	岐阜大学工学部 / Faculty of Engineering, Gifu University
Presentation date	2007/11/21
Report number	SP2007-92
Volume (vol)	vol.107
Issue (no)	356
Page range	pp.-
Number of pages	6
Publication date