Presentation 2012-11-07
Perplexity on Reduced Corpora
Hayato KOBAYASHI
Abstract(in Japanese) (See Japanese page)
Abstract(in English) This paper studies the relationship between perplexity and vocabulary size on a corpus (or documents) that is reduced to improve computational performance. We prove that the perplexity of N-gram models and topic models approximately follows a power law with respect to the reduced vocabulary size under certain conditions, when the corpus follows Zipf's law. This provides theoretical evidence for the intuition that low-frequency words may not contribute much to the performance of statistical models. We verify the correctness of our theory on synthetic corpora and examine the gap between theory and practice on real corpora.
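For intuition only (this is not the paper's derivation), the claimed relationship can be checked numerically: sample a synthetic corpus whose word frequencies follow Zipf's law, reduce the vocabulary to the top-V ranks, and compute unigram perplexity on each reduced corpus; the perplexity then grows roughly as a power of V. The constants below (full vocabulary size, Zipf exponent, corpus length) are arbitrary illustrative choices, not values from the paper.

    # Minimal sketch: unigram perplexity on a Zipfian corpus reduced to the top-V words.
    import numpy as np

    rng = np.random.default_rng(0)

    FULL_VOCAB = 50_000      # word ranks 1..FULL_VOCAB (illustrative value)
    ZIPF_EXPONENT = 1.0      # Zipf's law: P(rank r) proportional to r^(-s)
    NUM_TOKENS = 1_000_000   # corpus length (illustrative value)

    # Sample token ranks from a truncated Zipf distribution.
    ranks = np.arange(1, FULL_VOCAB + 1)
    probs = ranks ** (-ZIPF_EXPONENT)
    probs /= probs.sum()
    corpus = rng.choice(ranks, size=NUM_TOKENS, p=probs)

    def unigram_perplexity(tokens):
        """MLE unigram perplexity of a corpus evaluated on itself."""
        _, counts = np.unique(tokens, return_counts=True)
        p = counts / counts.sum()
        entropy = -(p * np.log2(p)).sum()   # bits per token
        return 2.0 ** entropy

    for vocab_size in (100, 1_000, 10_000, 50_000):
        reduced = corpus[corpus <= vocab_size]   # keep only the top-V ranks
        print(f"V={vocab_size:>6}  perplexity={unigram_perplexity(reduced):9.2f}")

Plotting log perplexity against log V makes the approximate power-law relationship visible as a nearly straight line; the paper's contribution is the formal statement and proof of this behavior for N-gram and topic models.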
Keyword(in Japanese) (See Japanese page)
Keyword(in English) corpus reduction / N-gram model / topic model / Zipf's law / power law
Paper # IBISML2012-38
Date of Issue

Conference Information
Committee IBISML
Conference Date 2012/10/31 (1 day)
Place (in Japanese) (See Japanese page)
Place (in English)
Topics (in Japanese) (See Japanese page)
Topics (in English)
Chair
Vice Chair
Secretary
Assistant

Paper Information
Registration To Information-Based Induction Sciences and Machine Learning (IBISML)
Language ENG
Title (in Japanese) (See Japanese page)
Sub Title (in Japanese) (See Japanese page)
Title (in English) Perplexity on Reduced Corpora
Sub Title (in English)
Keyword(1) corpus reduction
Keyword(2) N-gram model
Keyword(3) topic model
Keyword(4) Zipf's law
Keyword(5) power law
1st Author's Name Hayato KOBAYASHI
1st Author's Affiliation Corporate Research and Development Center, Toshiba Corporation
Date 2012-11-07
Paper # IBISML2012-38
Volume (vol) vol.112
Number (no) 279
Page pp.-
#Pages 8
Date of Issue