Presentation | 2012-11-07 Perplexity on Reduced Corpora, Hayato KOBAYASHI |
---|---|
Abstract(in Japanese) | (See Japanese page) |
Abstract(in English) | This paper studies the relationship between perplexity and vocabulary size on a corpus (or set of documents) that is reduced to improve computational performance. We prove that the perplexity of n-gram models and topic models approximately follows a power law with respect to the reduced vocabulary size under certain conditions, when the corpus follows Zipf's law. This provides theoretical evidence for the intuition that low-frequency words may not contribute much to the performance of statistical models. We verify the correctness of our theory on synthetic corpora and examine the gap between theory and practice on real corpora. |
Keyword(in Japanese) | (See Japanese page) |
Keyword(in English) | corpus reduction / N-gram model / topic model / Zipf's law / power law |
Paper # | IBISML2012-38 |
Date of Issue |
Conference Information | |
Committee | IBISML |
---|---|
Conference Date | 2012/10/31 (1 day) |
Place (in Japanese) | (See Japanese page) |
Place (in English) | |
Topics (in Japanese) | (See Japanese page) |
Topics (in English) | |
Chair | |
Vice Chair | |
Secretary | |
Assistant |
Paper Information | |
Registration To | Information-Based Induction Sciences and Machine Learning (IBISML) |
---|---|
Language | ENG |
Title (in Japanese) | (See Japanese page) |
Sub Title (in Japanese) | (See Japanese page) |
Title (in English) | Perplexity on Reduced Corpora |
Sub Title (in English) | |
Keyword(1) | corpus reduction |
Keyword(2) | N-gram model |
Keyword(3) | topic model |
Keyword(4) | Zipf's law |
Keyword(5) | power law |
1st Author's Name | Hayato KOBAYASHI |
1st Author's Affiliation | Corporate Research and Development Center, Toshiba Corporation |
Date | 2012-11-07 |
Paper # | IBISML2012-38 |
Volume (vol) | vol.112 |
Number (no) | 279 |
Page | pp.- |
#Pages | 8 |
Date of Issue |
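The abstract states that, when a corpus follows Zipf's law, perplexity approximately follows a power law in the reduced vocabulary size. A minimal sketch of that setup on a synthetic corpus, assuming a maximum-likelihood unigram model and illustrative parameter values not taken from the paper:

```python
import math
import random

random.seed(0)

# Synthetic Zipfian corpus: the frequency of the rank-r word type is
# proportional to 1/r, as in Zipf's law.
V = 5000
weights = [1.0 / r for r in range(1, V + 1)]
corpus = random.choices(range(1, V + 1), weights=weights, k=200_000)

def unigram_perplexity(tokens):
    """Perplexity of a maximum-likelihood unigram model on its own data."""
    counts = {}
    for t in tokens:
        counts[t] = counts.get(t, 0) + 1
    n = len(tokens)
    log_likelihood = sum(c * math.log(c / n) for c in counts.values())
    return math.exp(-log_likelihood / n)

# Reduce the corpus by dropping low-frequency (high-rank) word types,
# then compare perplexity across retained vocabulary sizes; under
# Zipf's law the decrease is expected to look roughly power-law.
for keep in (5000, 2500, 1250, 625):
    reduced = [t for t in corpus if t <= keep]
    print(keep, round(unigram_perplexity(reduced), 1))
```

This is only an illustration of the phenomenon the abstract describes; the paper's actual analysis covers n-gram and topic models, not just the unigram case sketched here.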