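Efficient Generation of Weighted Finite State Transducers Integrating High-Order Context-Dependent Phone Models (Large-Vocabulary Speech Recognition) (6th Spoken Language Symposium)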

Mike Schuster; Takaaki Hori

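Presentation title	2004/12/15 Efficient Generation of Weighted Finite State Transducers Integrating High-Order Context-Dependent Phone Models (Large-Vocabulary Speech Recognition) (6th Spoken Language Symposium) Mike Schuster, Takaaki Hori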
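Abstract (Japanese)	This paper describes an algorithm for efficiently generating weighted finite state transducers (WFSTs) that integrate high-order context-dependent phone models beyond triphones. Conventional methods for building WFSTs for speech recognition run into several problems when handling high-order context-dependent models, and in some cases construction itself becomes impossible because of computation or memory constraints. We first discuss the inefficiencies of the conventional construction method, then propose an efficient algorithm that generates, directly from the phonetic decision trees, the WFST that converts HMM state sequences into phone sequences, which is required when building a WFST for speech recognition. We show that the algorithm runs very fast with little memory and also reduces the size of the final WFST. We evaluated WFSTs built with the proposed method on the Corpus of Spontaneous Japanese in terms of size, recognition accuracy, and recognition speed. We show that a one-pass time-synchronous search that accounts for within-word and cross-word phonetic context with high-order context-dependent models is easy to realize and runs with only slight overhead compared with the triphone case. Finally, we confirmed that real-time speech recognition with within-word and cross-word quinphones runs in 125MB of memory with 9% search error.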
Abstract (English)	This paper describes an algorithm for efficiently building Weighted Finite State Transducers (WFSTs) for speech recognition when high-order context-dependent models with tied states, of order K > 3 (that is, beyond triphones), are used. After discussing some inefficiencies of the standard compilation method, which make the use of high-order context-dependent models cumbersome and sometimes even impossible because of memory constraints, we show how an algorithm that builds part of the needed composed transducers directly from the decision trees, combined with an improved compilation process, can lead to much faster, simpler, and more memory-efficient compilation. In our case it also resulted in substantially smaller final networks. With the described algorithm it is simple to use high-order full cross-word models with little overhead directly within a one-pass time-synchronous search. We test this by comparing final network sizes, recognition rates, and speed on a large database of spontaneous Japanese speech. Using the proposed algorithm, real-time recognition with full cross-word quinphones and a large acoustic model is possible in about 125MB of memory at about 9% search error.
Keywords (Japanese)	Speech recognition / weighted finite state transducers / context-dependent phone models
Keywords (English)	Speech recognition / search / weighted finite state transducers
Report number	NLC2004-83, SP2004-123

Technical committee information
Technical committee	NLC
Conference dates	2004/12/15 (1 day)

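Paper details
Submitting committee	Natural Language Understanding and Models of Communication (NLC)
Paper language	ENG
Title (Japanese)	Efficient Generation of Weighted Finite State Transducers Integrating High-Order Context-Dependent Phone Models (Large-Vocabulary Speech Recognition) (6th Spoken Language Symposium)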
Title (English)	Efficient Generation of high-order context-dependent Weighted Finite State Transducers for Speech Recognition
Keyword (1) (Japanese/English)	Speech recognition / Speech recognition
Keyword (2) (Japanese/English)	weighted finite state transducers / search
Keyword (3) (Japanese/English)	context-dependent phone models / weighted finite state transducers
Author 1 name	Mike SCHUSTER
Author 1 affiliation	Nippon Telegraph and Telephone Corporation, NTT Communication Science Laboratories
Author 2 name	Takaaki HORI
Author 2 affiliation	Nippon Telegraph and Telephone Corporation, NTT Communication Science Laboratories
Presentation date	2004/12/15
Report number	NLC2004-83, SP2004-123
Volume	vol.104
Issue number	540
Page range	pp.-
Number of pages	5