Statistical sequence-to-frame mapping techniques for voice conversion (General Session, Cross-Modal)

Yu QIAO; Daisuke SAITO; Nobuaki MINEMATSU

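Presentation title	2010-01-22 Statistical sequence-to-frame mapping techniques for voice conversion (General Session, Cross-Modal) Yu QIAO, Daisuke SAITO, Nobuaki MINEMATSU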
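Abstract (translated from Japanese)	The goal of speaker conversion is to convert one speaker's voice into another speaker's voice. This can be viewed as finding a mapping function for speech time series between the voice spaces of two speakers. GMM-based statistical mapping methods [1], [2] are widely used for this task. However, because GMM-based conversion uses a frame-to-frame mapping function, the contextual information in the speech time series is not fully exploited. HMMs are effective models of speech time series and are widely used in speech recognition and speech synthesis. This work studies HMM-based voice conversion. We derive an HMM-based regression function that maps a sequence to a frame. Previous HMM-based voice conversion methods [3]-[5] segment the speech by forced alignment and convert each segment separately. In contrast, our conversion function is derived as a weighted sum of linear transformations, where the weights are the HMM posterior probabilities of each frame. To estimate the conversion parameters, we propose a least squares error criterion and a maximum likelihood criterion. Experimental results demonstrate the effectiveness of the proposed method.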
Abstract (English)	Voice conversion, the task of transforming one speaker's voice into another's, can be regarded as the problem of finding a mapping function between the voice spaces of two speakers. GMM-based statistical mapping methods [1], [2] have been widely used for voice conversion. However, classical GMM-based techniques use a frame-to-frame mapping function, which largely ignores the contextual information present over a speech sequence and usually causes over-smoothing of the converted speech. It is well known that the HMM provides an efficient way to model the density of a whole speech sequence and has been successful in speech recognition and synthesis. Inspired by this fact, this paper studies how to use HMMs for voice conversion. We derive an HMM-based sequence-to-frame mapping function through statistical analysis. Unlike previous HMM-based voice conversion methods [3]-[5], which use forced alignment for segmentation and transform the frames aligned to a state with that state's associated linear transformation, our method has a soft mapping function in the form of a weighted sum of linear transformations. The weights are calculated as the HMM posterior probabilities of the frames. We also propose and compare two methods for learning the parameters of our mapping function, namely least squares error estimation and maximum likelihood estimation. We carried out experiments to examine the proposed HMM-based method for voice conversion.
Keywords (Japanese)	音声変換 / 線形回帰 / シーケンスからフレームへ変換 / HMM
Keywords (English)	Voice conversion / linear regression / sequence-to-frame mapping / HMM
Report number	CQ2009-98,PRMU2009-197,SP2009-138,MVE2009-120
Issue date
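
The soft mapping described in the abstract (a weighted sum of per-state linear transformations, weighted by HMM state posteriors) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the function names `state_posteriors` and `convert`, the single diagonal-covariance Gaussian per state, and all toy parameter values are assumptions made for the example. Estimating the transformations by least squares or maximum likelihood is not shown.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a diagonal-covariance Gaussian at vector x (assumed emission model)."""
    p = 1.0
    for xi, mi, vi in zip(x, mean, var):
        p *= math.exp(-0.5 * (xi - mi) ** 2 / vi) / math.sqrt(2.0 * math.pi * vi)
    return p

def state_posteriors(frames, init, trans, means, variances):
    """Scaled forward-backward: gamma[t][k] = P(state k at frame t | whole sequence)."""
    T, K = len(frames), len(init)
    emit = [[gaussian_pdf(x, means[k], variances[k]) for k in range(K)] for x in frames]
    alpha, scale = [], []
    for t in range(T):
        row = []
        for k in range(K):
            prior = init[k] if t == 0 else sum(alpha[t - 1][j] * trans[j][k] for j in range(K))
            row.append(prior * emit[t][k])
        c = sum(row)  # per-frame normalizer, reused to scale the backward pass
        scale.append(c)
        alpha.append([a / c for a in row])
    beta = [[1.0] * K for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for k in range(K):
            beta[t][k] = sum(trans[k][j] * emit[t + 1][j] * beta[t + 1][j]
                             for j in range(K)) / scale[t + 1]
    return [[alpha[t][k] * beta[t][k] for k in range(K)] for t in range(T)]

def convert(frames, gamma, A, b):
    """Soft mapping: y_t = sum_k gamma[t][k] * (A[k] x_t + b[k])."""
    out = []
    for x, g in zip(frames, gamma):
        D = len(b[0])
        y = [0.0] * D
        for k, w in enumerate(g):
            for i in range(D):
                y[i] += w * (sum(A[k][i][j] * x[j] for j in range(len(x))) + b[k][i])
        out.append(y)
    return out

# Toy 2-state, 1-dimensional example (all values hypothetical).
source = [[0.0], [0.1], [2.9], [3.0]]
gamma = state_posteriors(source,
                         init=[0.5, 0.5],
                         trans=[[0.9, 0.1], [0.1, 0.9]],
                         means=[[0.0], [3.0]],
                         variances=[[1.0], [1.0]])
converted = convert(source, gamma, A=[[[1.0]], [[2.0]]], b=[[0.5], [-1.0]])
```

Because each posterior depends on the whole input sequence, every converted frame is a function of the full sequence rather than of a single source frame, which is the sense in which the mapping is sequence-to-frame. A forced-alignment approach would correspond to replacing each row of `gamma` with a one-hot vector.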

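Technical committee information
Technical committee	PRMU
Meeting dates	2010/1/14 (1-day meeting)
Venue (Japanese)
Venue (English)
Theme (Japanese)
Theme (English)
Chair (Japanese)
Chair (English)
Vice chair (Japanese)
Vice chair (English)
Secretary (Japanese)
Secretary (English)
Assistant secretary (Japanese)
Assistant secretary (English)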

Paper details
Registered committee	Pattern Recognition and Media Understanding (PRMU)
Language of text	ENG
Title (Japanese)	統計的系列-フレーム写像に基づく音声変換(一般セッション,クロスモーダル)
Subtitle (Japanese)
Title (English)	Statistical sequence-to-frame mapping techniques for voice conversion
Subtitle (English)
Keyword (1) (Japanese/English)	音声変換 / Voice conversion
Keyword (2) (Japanese/English)	線形回帰 / linear regression
Keyword (3) (Japanese/English)	シーケンスからフレームへ変換 / sequence-to-frame mapping
Keyword (4) (Japanese/English)	HMM / HMM
First author (Japanese/English)	喬宇 / Yu QIAO
First author affiliation (Japanese/English)	東京大学大学院情報理工学系研究科 / Grad. School of Info. Sci. and Tech., Univ. of Tokyo
Second author (Japanese/English)	齋藤大輔 / Daisuke SAITO
Second author affiliation (Japanese/English)	東京大学大学院工学系研究科 / Grad. School of Engineering, Univ. of Tokyo
Third author (Japanese/English)	峯松信明 / Nobuaki MINEMATSU
Third author affiliation (Japanese/English)	東京大学大学院情報理工学系研究科 / Grad. School of Info. Sci. and Tech., Univ. of Tokyo
Presentation date	2010-01-22
Report number	CQ2009-98,PRMU2009-197,SP2009-138,MVE2009-120
Volume (vol)	vol.109
Issue number (no)	374
Page range	pp.-
Number of pages	6
Issue date