Evaluation of the Similarity between Multiple Sentences using Sampling Techniques (Document Classification and Translation)

Ichiro Yamada; Yohei Nakada; Atsushi Matsui; Takashi Matsumoto; Kikuka Miura; Hideki Sumiyoshi; Nobuyuki Yagi

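Presentation	2007/7/17 Evaluation of the Similarity between Multiple Sentences using Sampling Techniques (Document Classification and Translation) Ichiro Yamada, Yohei Nakada, Atsushi Matsui, Takashi Matsumoto, Kikuka Miura, Hideki Sumiyoshi, Nobuyuki Yagi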
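Abstract (translated from Japanese)	In the narration of TV programs, similar phrasings are used repeatedly to express particular things such as "introducing a place" or "introducing a person." If the text sections containing such phrasings can be extracted, metadata such as place introduction or person introduction can be attached to the corresponding video sections of the program. This paper proposes a method for evaluating the similarity between passages in order to extract, from a program's closed captions, passages similar to those that express a particular thing. The proposed method uses as features the subtrees of the tree structures obtained by syntactically parsing the passages, and evaluates the similarity between passages with the Gibbs Boost algorithm, which samples these features for learning. In an experiment on the closed captions of travel programs, we evaluated similarity to passages in sections of formulaic expressions that explain a place together with video, and confirmed the effectiveness of the proposed method.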
Abstract (English)	Closed captions contain many typical expressions that convey specific things, for example the first introduction of a guest in a talk show or the explanation of a place in a travel program. Such information helps us attach metadata to the corresponding scenes. This paper proposes a method to evaluate the similarity between multiple sentences in order to extract sections whose sentences are similar to typical expressions conveying specific things. The first step generates tree structures from an input section of sentences and extracts subtrees from these tree structures. We then use the GibbsBoost algorithm, which samples these subtrees as features and learns from them, to evaluate the similarity. In an experiment on closed captions of travel-related TV programs, judging whether a section of sentences is similar to sections that explain a place together with video, we show the effectiveness of our method.
Keywords	Metadata generation / Typical expression extraction / Tree structure analysis / GibbsBoost algorithm / Sampling
Report number	NLC2007-22

Technical committee information
Technical committee	NLC
Conference date	2007/7/17 (1 day)

Paper information details
Submitted to committee	Natural Language Understanding and Models of Communication (NLC)
Language of text	JPN
Title	Evaluation of the Similarity between Multiple Sentences using Sampling Techniques (Document Classification and Translation)
Keyword (1)	Metadata generation
Keyword (2)	Typical expression extraction
Keyword (3)	Tree structure analysis
Keyword (4)	GibbsBoost algorithm
Keyword (5)	Sampling
1st author's name	Ichiro YAMADA
1st author's affiliation	NHK Science & Technical Research Laboratories
2nd author's name	Yohei NAKADA
2nd author's affiliation	Dept. of Electrical Engineering and Bioscience, Waseda University
3rd author's name	Atsushi MATSUI
3rd author's affiliation	NHK Science & Technical Research Laboratories; Dept. of Electrical Engineering and Bioscience, Waseda University
4th author's name	Takashi MATSUMOTO
4th author's affiliation	Dept. of Electrical Engineering and Bioscience, Waseda University
5th author's name	Kikuka MIURA
5th author's affiliation	NHK Science & Technical Research Laboratories
6th author's name	Hideki SUMIYOSHI
6th author's affiliation	NHK Science & Technical Research Laboratories
7th author's name	Nobuyuki YAGI
7th author's affiliation	NHK Science & Technical Research Laboratories
Presentation date	2007/7/17
Report number	NLC2007-22
Volume (vol)	vol.107
Issue (no)	158
Page range	pp.-
Number of pages	6