A Rule-based Approach for Khmer Word Extraction

Van Channa; Kameyama Wataru

Conference name
Forum on Information Technology 2010 (FIT)
Conference code
F
Year held
2010
Publication date
2010/8/20
Session number
1G
Session name
Language Analysis
Presentation date
2010/09/07
Presentation venue (room, etc.)
Hall G (General Learning Plaza 1F, Lecture Room 11)
Presentation number
E-007
Title
A Rule-based Approach for Khmer Word Extraction
Authors
Van Channa, Kameyama Wataru
Keywords
Khmer, Word Extraction, Rule-based Approach
Abstract
This paper presents a trainable rule-based approach to extracting Khmer words from text. A rule set is created by a rule-training process based on a Khmer text corpus. The longest word matching algorithm and the SEQUITUR algorithm are applied to detect and extract rules for the frequently co-occurring strings found in the corpus. The entropy of the rules and the mutual information of each string in the rules are calculated and used to determine how strongly each rule indicates a word. The obtained rule set is then used to extract words from text. The precision and recall of the proposed approach are 89.37% and 95.50%, respectively.
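
The abstract names longest matching and mutual information without giving their exact formulation, so the following is only a generic sketch of the two ideas, not the authors' implementation. The function names, the single-character fallback for unknown text, the probability estimates, and the Latin-letter toy data are all assumptions made for illustration.

```python
import math
from collections import Counter


def longest_match_segment(text, lexicon, max_len=None):
    """Greedy left-to-right segmentation (assumed variant of longest matching).

    At each position the longest lexicon entry is taken; if nothing matches,
    a single character is emitted so the scan always advances.
    """
    if max_len is None:
        max_len = max((len(w) for w in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += length
                break
    return tokens


def pointwise_mutual_information(tokens, left, right):
    """PMI (base 2) of the adjacent pair (left, right) in a token sequence.

    Probabilities are simple relative frequencies; this estimate is an
    assumption, since the abstract does not specify one.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    pair_count = bigrams[(left, right)]
    if pair_count == 0:
        return float("-inf")  # the pair never occurs adjacently
    p_xy = pair_count / max(len(tokens) - 1, 1)
    p_x = unigrams[left] / len(tokens)
    p_y = unigrams[right] / len(tokens)
    return math.log2(p_xy / (p_x * p_y))


# Toy data (hypothetical, Latin letters stand in for Khmer strings).
toy_lexicon = {"ab", "abc", "d"}
toy_tokens = longest_match_segment("abcdx", toy_lexicon)
toy_pmi = pointwise_mutual_information(["a", "b", "a", "b", "c"], "a", "b")
```

A high PMI for a pair of adjacent strings suggests they co-occur more often than chance, which is the kind of evidence the abstract describes using to judge whether a rule corresponds to a word.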