Estimating Distributed Expressions of Unknown Compound Word Using Distributed Expressions of Known Words

Ryota Takagi; Kazuhiro Kazama; Takeshi Sakaki

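Presentation	2019-09-28, Estimating Distributed Expressions of Unknown Compound Word Using Distributed Expressions of Known Words, Ryota Takagi (Wakayama Univ), Kazuhiro Kazama (Wakayama Univ), Takeshi Sakaki (Hotto Link)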
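Abstract (translated from Japanese)	In recent years, distributed representations, which treat the meaning of a word as a low-dimensional vector, have come into wide use. There are various methods for obtaining them; for example, word2vec, proposed by Mikolov et al., trains a neural network on each word and its surrounding words and uses the resulting hidden-layer weight vectors as the distributed representations. However, word2vec yields representations only for the words used in training, so even a compound word made up of known words requires retraining, which incurs substantial additional cost. To avoid such retraining, this paper proposes a method that estimates the distributed representation of an unknown compound word with relatively high accuracy, using pretrained distributed representation data and word statistics. Specifically, the method focuses on the modification relations between the nouns that make up a Japanese compound word and assigns weights based on the concatenation frequencies of single-noun 2-grams. The proposed method is compared, in terms of similarity and MRR, with methods used to derive sentence representations from word representations, such as the simple vector average and the method of Arora et al., to demonstrate its effectiveness.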
Abstract (English)	In recent years, distributed representations, which treat the meaning of a word as a low-dimensional vector, have been widely used. There are various methods for obtaining the distributed representation of a word. For example, Mikolov et al. proposed word2vec, which trains a neural network on a word and its surrounding words and outputs the weight vectors of the hidden layer as distributed representations. However, word2vec provides distributed representations only for the words used in training, so an unknown compound word must be retrained even if it is a sequence of known words, which requires considerable additional cost. We propose a method that estimates the distributed representations of unknown compound words with relatively high accuracy, using pretrained distributed representation data and a simple statistical indicator. Specifically, we focus on the modification relation between the nouns that constitute a Japanese compound word and weight the distributed representation vectors by the compound noun frequency of noun 2-grams. Additionally, we compare the similarity and MRR of the proposed method with those of other methods used to obtain the distributed representation of a sentence from those of its words, such as the simple average method and the method proposed by Arora et al. The results show the effectiveness of the proposed method.
Keywords (Japanese)	複合語 / 分散表現 / word2vec / 名詞連接頻度 / 修飾関係
Keywords (English)	compound words / distributed representation / word2vec / compound noun frequency / modification relation
Paper No.	NLC2019-27
Date of Issue	2019-09-20 (NLC)
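
The composition and evaluation steps named in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not state how the weights are derived from noun 2-gram frequencies, so the weights are simply passed in as hypothetical values. The function names (`simple_average`, `weighted_average`, `reciprocal_rank`, `mean_reciprocal_rank`) and the toy 2-dimensional vectors are assumptions made for this example, and the method of Arora et al. is not reproduced here.

```python
import math


def simple_average(vectors):
    """Baseline: unweighted mean of the component word vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def weighted_average(vectors, weights):
    """Weighted mean of the component word vectors.

    `weights` are hypothetical per-noun weights; the paper derives its
    weights from noun 2-gram concatenation frequencies, but the exact
    formula is not given in the abstract.
    """
    total = sum(weights)
    if total == 0:
        return simple_average(vectors)
    dim = len(vectors[0])
    return [sum(w * v[i] for v, w in zip(vectors, weights)) / total
            for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def reciprocal_rank(estimate, target, vocab):
    """1 / rank of `target` when `vocab` is sorted by similarity to `estimate`."""
    ranked = sorted(vocab, key=lambda w: cosine(estimate, vocab[w]), reverse=True)
    return 1.0 / (ranked.index(target) + 1)


def mean_reciprocal_rank(cases, vocab):
    """Mean of the reciprocal ranks over (estimate, target) pairs."""
    return sum(reciprocal_rank(e, t, vocab) for e, t in cases) / len(cases)


# Toy data: two component nouns and a tiny reference vocabulary (all made up).
modifier_vec = [1.0, 0.0]
head_vec = [0.0, 1.0]
vocab = {"x": [1.0, 0.0], "y": [0.0, 1.0], "z": [1.0, 1.0]}

baseline = simple_average([modifier_vec, head_vec])
estimate = weighted_average([modifier_vec, head_vec], [1.0, 3.0])
```

Here the head noun is given three times the weight of the modifier purely for illustration. In a real setting the component vectors would come from a pretrained word2vec model, and the reference vector for each compound would come from a model in which that compound was actually trained.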

Conference Information
Committee	NLC / IPSJ-DC
Conference Date	From 2019/9/27 (2 days)
Venue (Japanese)	フューチャー株式会社
Venue (English)	Future Corporation
Theme (Japanese)	第15回テキストアナリティクス・シンポジウム
Theme (English)	The Fifteenth Text Analytics Symposium
Chair (Japanese)	榊剛史(ホットリンク) / 秋元良仁(凸版印刷)
Chair (English)	Takeshi Sakaki(Hottolink) / Ryoji Akimoto(Toppan Printing)
Vice Chair (Japanese)	吉田光男(豊橋技科大) / 嶋田和孝(九工大)
Vice Chair (English)	Mitsuo Yoshida(Toyohashi Univ. of Tech.) / Kazutaka Shimada(Kyushu Inst. of Tech.)
Secretary (Japanese)	渡辺靖彦(龍谷大) / 東中竜一郎(NTT) / 大場みち子(はこだて未来大) / 高橋慈子(ハーティネス) / 中挾知延子(東洋大) / 野々山秀文(セコム)
Secretary (English)	Yasuhiko Watanabe(Ryukoku Univ.) / Ryuichiro Higashinaka(NTT) / Michiko Oba(Future Univ. Hakodate) / Shigeko Takahashi(Heartiness) / Chieko Nakabasami(Toyo Univ.) / Hidefumi Nonoyama(Secom)
Assistant Secretary (Japanese)	小早川健(NHK) / 坂地泰紀(東大)
Assistant Secretary (English)	Takeshi Kobayakawa(NHK) / Hiroki Sakaji(Univ. of Tokyo)

Paper Information
Registered To	Technical Committee on Natural Language Understanding and Models of Communication / Special Interest Group on Document Communication
Language	JPN
Title (Japanese)	既知の単語の分散表現を用いた未知の複合語の分散表現の推定法
Subtitle (Japanese)
Title (English)	Estimating Distributed Expressions of Unknown Compound Word Using Distributed Expressions of Known Words
Subtitle (English)
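Keyword (1) (Japanese/English)	複合語 / compound words
Keyword (2) (Japanese/English)	分散表現 / distributed representation
Keyword (3) (Japanese/English)	word2vec / word2vec
Keyword (4) (Japanese/English)	名詞連接頻度 / compound noun frequency
Keyword (5) (Japanese/English)	修飾関係 / modification relation
1st Author's Name (Japanese/English)	高木涼太 / Ryota Takagi
1st Author's Affiliation (Japanese/English)	和歌山大学 (abbr. 和歌山大) / Wakayama University (abbr. Wakayama Univ)
2nd Author's Name (Japanese/English)	風間一洋 / Kazuhiro Kazama
2nd Author's Affiliation (Japanese/English)	和歌山大学 (abbr. 和歌山大) / Wakayama University (abbr. Wakayama Univ)
3rd Author's Name (Japanese/English)	榊剛史 / Takeshi Sakaki
3rd Author's Affiliation (Japanese/English)	株式会社ホットリンク (abbr. ホットリンク) / Hotto Link Inc. (abbr. Hotto Link)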
Presentation Date	2019-09-28
Paper No.	NLC2019-27
Volume (vol)	vol.119
Number (no)	NLC-212
Page Range	pp.103-108 (NLC)
Number of Pages	6
Date of Issue	2019-09-20 (NLC)