Presentation | 2005/7/16 Language Model Adaptation with a Word List and a Raw Corpus Shinsuke MORI |
---|---|
Abstract(in Japanese) | (See Japanese page) |
Abstract(in English) | This paper discusses methods for adapting a stochastic language model given a word list and a raw corpus. The usual approach in this setting is to segment the raw corpus with a word segmenter equipped with the word list, correct the word boundaries in the output sentences by hand, and build a model from the resulting segmented corpus. In this sentence-by-sentence error correction, however, the annotator encounters points that are difficult to judge, which lowers productivity. Moreover, it is not clear that correcting sentence by sentence from the beginning of the corpus is the best way to allocate a limited workforce. We propose instead to take the word as the unit of correction and to concentrate corrections on the positions where words in the list appear. This avoids the above difficulty and directly captures the statistical behavior of specific words in the application field. In the experiments, we prepared segmented corpora by a variety of methods and compared the language models built from them in terms of predictive power and Kana-kanji conversion accuracy. The results show that, by concentrating error correction around the words in the list, we can build a better language model with less effort. |
Keyword(in Japanese) | (See Japanese page) |
Keyword(in English) | Kana-Kanji Convertor / Speech Recognition / Language Model / Corpus |
Paper # | NLC2005-23 |
Date of Issue |
Conference Information | |
Committee | NLC |
---|---|
Conference Date | 2005/7/16 (1 day) |
Place (in Japanese) | (See Japanese page) |
Place (in English) | |
Topics (in Japanese) | (See Japanese page) |
Topics (in English) | |
Chair | |
Vice Chair | |
Secretary | |
Assistant |
Paper Information | |
Registration To | Natural Language Understanding and Models of Communication (NLC) |
---|---|
Language | JPN |
Title (in Japanese) | (See Japanese page) |
Sub Title (in Japanese) | (See Japanese page) |
Title (in English) | Language Model Adaptation with a Word List and a Raw Corpus |
Sub Title (in English) | |
Keyword(1) | Kana-Kanji Convertor |
Keyword(2) | Speech Recognition |
Keyword(3) | Language Model |
Keyword(4) | Corpus |
1st Author's Name | Shinsuke MORI |
1st Author's Affiliation | IBM Research, Tokyo Research Laboratory, IBM Japan, Ltd. |
Date | 2005/7/16 |
Paper # | NLC2005-23 |
Volume (vol) | vol.105 |
Number (no) | 204 |
Page | pp.- |
#Pages | 7 |
Date of Issue |