Presentation 2012-11-07
Perplexity on Reduced Corpora
Hayato KOBAYASHI
Abstract(in Japanese) (See Japanese page)
Abstract(in English) This paper studies the relationship between perplexity and vocabulary size on a corpus (or documents) that is reduced to improve computational performance. We prove that the perplexity of N-gram models and topic models approximately follows a power law with respect to the reduced vocabulary size under certain conditions, when the corpus follows Zipf's law. This provides theoretical evidence for the intuition that low-frequency words may not contribute much to the performance of statistical models. We verify the correctness of our theory on synthetic corpora and examine the gap between theory and practice on real corpora.
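For intuition only (this is not the paper's derivation), the claimed relationship can be checked numerically: sample a synthetic corpus whose word frequencies follow Zipf's law, reduce the vocabulary to the top-V ranks, and compute unigram perplexity on each reduced corpus; the perplexity then grows roughly as a power of V. The constants below (full vocabulary size, Zipf exponent, corpus length) are arbitrary illustrative choices, not values from the paper.

    # Minimal sketch: unigram perplexity on a Zipfian corpus reduced to the top-V words.
    import numpy as np

    rng = np.random.default_rng(0)

    FULL_VOCAB = 50_000      # word ranks 1..FULL_VOCAB (illustrative value)
    ZIPF_EXPONENT = 1.0      # Zipf's law: P(rank r) proportional to r^(-s)
    NUM_TOKENS = 1_000_000   # corpus length (illustrative value)

    # Sample token ranks from a truncated Zipf distribution.
    ranks = np.arange(1, FULL_VOCAB + 1)
    probs = ranks ** (-ZIPF_EXPONENT)
    probs /= probs.sum()
    corpus = rng.choice(ranks, size=NUM_TOKENS, p=probs)

    def unigram_perplexity(tokens):
        """MLE unigram perplexity of a corpus evaluated on itself."""
        _, counts = np.unique(tokens, return_counts=True)
        p = counts / counts.sum()
        entropy = -(p * np.log2(p)).sum()   # bits per token
        return 2.0 ** entropy

    for vocab_size in (100, 1_000, 10_000, 50_000):
        reduced = corpus[corpus <= vocab_size]   # keep only the top-V ranks
        print(f"V={vocab_size:>6}  perplexity={unigram_perplexity(reduced):9.2f}")

Plotting log perplexity against log V makes the approximate power-law relationship visible as a nearly straight line; the paper's contribution is the formal statement and proof of this behavior for N-gram and topic models.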
Keyword(in Japanese) (See Japanese page)
Keyword(in English) corpus reduction / N-gram model / topic model / Zipf's law / power law
Paper # IBISML2012-38
Date of Issue

Conference Information
Committee IBISML
Conference Date 2012/10/31 (1 day)
Place (in Japanese) (See Japanese page)
Place (in English)
Topics (in Japanese) (See Japanese page)
Topics (in English)
Chair
Vice Chair
Secretary
Assistant

Paper Information
Registration To Information-Based Induction Sciences and Machine Learning (IBISML)
Language ENG
Title (in Japanese) (See Japanese page)
Sub Title (in Japanese) (See Japanese page)
Title (in English) Perplexity on Reduced Corpora
Sub Title (in English)
Keyword(1) corpus reduction
Keyword(2) N-gram model
Keyword(3) topic model
Keyword(4) Zipf's law
Keyword(5) power law
1st Author's Name Hayato KOBAYASHI
1st Author's Affiliation Corporate Research and Development Center, Toshiba Corporation
Date 2012-11-07
Paper # IBISML2012-38
Volume (vol) vol.112
Number (no) 279
Page pp.-
#Pages 8
Date of Issue