Presentation 2005/7/16
Language Model Adaptation with a Word List and a Raw Corpus
Shinsuke MORI
Abstract(in Japanese) (See Japanese page)
Abstract(in English) In this paper, we discuss methods for adapting a stochastic language model given a word list and a raw corpus. In this situation, the usual method is to segment the raw corpus with a word segmenter equipped with the word list, correct the word boundaries in the output sentences by hand, and build a model from the segmented corpus. In this sentence-by-sentence error correction method, however, the annotator encounters points that are difficult to judge, which decreases productivity. In addition, it is not clear that correcting errors sentence by sentence from the beginning is the best way to allocate a limited workforce. In this paper, we propose taking the word as the unit of correction and concentrating correction on the positions at which words in the list appear. This method allows us to avoid the above difficulty and directly capture the statistical behavior of specific words in the application field. In the experiments, we prepared segmented corpora by a variety of methods and compared the language models built from them in terms of predictive power and Kana-Kanji conversion accuracy. The results showed that by concentrating error correction around the words in the list, we can build a better language model with less effort.
Keyword(in Japanese) (See Japanese page)
Keyword(in English) Kana-Kanji Convertor / Speech Recognition / Language Model / Corpus
Paper # NLC2005-23
Date of Issue

Conference Information
Committee NLC
Conference Date 2005/7/16 (1 day)
Place (in Japanese) (See Japanese page)
Place (in English)
Topics (in Japanese) (See Japanese page)
Topics (in English)
Vice Chair

Paper Information
Registration To Natural Language Understanding and Models of Communication (NLC)
Language JPN
Title (in Japanese) (See Japanese page)
Sub Title (in Japanese) (See Japanese page)
Title (in English) Language Model Adaptation with a Word List and a Raw Corpus
Sub Title (in English)
Keyword(1) Kana-Kanji Convertor
Keyword(2) Speech Recognition
Keyword(3) Language Model
Keyword(4) Corpus
1st Author's Name Shinsuke MORI
1st Author's Affiliation IBM Research, Tokyo Research Laboratory, IBM Japan, Ltd.
Date 2005/7/16
Paper # NLC2005-23
Volume (vol) vol.105
Number (no) 204
Page pp.-
#Pages 7
Date of Issue