Domain Adaptation of Stochastic Language Models Using a Word List and a Raw Corpus (Statistical Models of Language)

森 信介

Presentation	2005/7/16 Language Model Adaptation with a Word List and a Raw Corpus, Shinsuke MORI
Abstract(in Japanese)	(See Japanese page)
Abstract(in English)	In this paper, we discuss methods for adapting a stochastic language model to a target domain given a word list and a raw corpus. The usual approach in this situation is to segment the raw corpus with a word segmenter equipped with the word list, correct the resulting word boundaries by hand, and build a model from the segmented corpus. In this sentence-by-sentence error correction, however, the annotator encounters points that are difficult to judge, which lowers productivity. Moreover, it is not clear that correcting errors sentence by sentence from the beginning of the corpus is the best way to spend a limited annotation effort. We propose instead to take the word as the unit of correction and to concentrate corrections on the positions where words in the list appear. This avoids the above difficulty and directly captures the statistical behavior of specific words in the application domain. In the experiments, we prepared segmented corpora by a variety of methods and compared the resulting language models in predictive power and Kana-Kanji conversion accuracy. The results show that concentrating error correction around the words in the list yields a better language model with less effort.
Keyword(in Japanese)	(See Japanese page)
Keyword(in English)	Kana-Kanji Convertor / Speech Recognition / Language Model / Corpus
Paper #	NLC2005-23
Date of Issue

Conference Information
Committee	NLC
Conference Date	2005/7/16 (1 day)
Place (in Japanese)	(See Japanese page)
Place (in English)
Topics (in Japanese)	(See Japanese page)
Topics (in English)
Chair
Vice Chair
Secretary
Assistant

Paper Information
Registration To	Natural Language Understanding and Models of Communication (NLC)
Language	JPN
Title (in Japanese)	(See Japanese page)
Sub Title (in Japanese)	(See Japanese page)
Title (in English)	Language Model Adaptation with a Word List and a Raw Corpus
Sub Title (in English)
Keyword(1)	Kana-Kanji Convertor
Keyword(2)	Speech Recognition
Keyword(3)	Language Model
Keyword(4)	Corpus
1st Author's Name	Shinsuke MORI
1st Author's Affiliation	IBM Research, Tokyo Research Laboratory, IBM Japan, Ltd.
Date	2005/7/16
Paper #	NLC2005-23
Volume (vol)	vol.105
Number (no)	204
Page	pp.-
#Pages	7
Date of Issue