Presentation 2007/7/17
Extracting Low Frequency Terms Using Substring Perplexities
Yasuhide MIURA, Hiroshi MASUICHI
Abstract(in Japanese) (See Japanese page)
Abstract(in English) This paper describes a method for extracting low-frequency domain-specific terms using substring perplexities. Given a string, the character n-grams that compose it are extracted, and their perplexities in a given corpus are calculated. Similarly, the character n-grams that appear adjacent to the string are extracted, and their perplexities are calculated. The ratio of these two kinds of perplexity is used as a score representing how well the string fits as a word. In an experiment, n-grams that compose entries in a disease dictionary and an anatomy dictionary and that appear 5 times or fewer in a corpus of about 67,000 medical texts were scored with the proposed method. For comparison, the same n-grams were scored with TermExtract. The proposed method achieved an average accuracy of 70.4% with 1-gram scoring and 83.5% with 2-gram scoring, against 70.6% for TermExtract; the 2-gram scoring thus outperformed TermExtract.
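The scoring idea in the abstract can be illustrated with a minimal Python sketch. The abstract does not state how the perplexities are estimated or which way the ratio is taken, so everything below is an assumption made for illustration: the add-one-smoothed relative-frequency model, the choice of adjacent-over-inner as the ratio, and the function names (`char_ngrams`, `build_counts`, `perplexity`, `adjacent_ngrams`, `fitness_score`) are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return all character n-grams of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_counts(corpus, n):
    """Count character n-grams over every document in the corpus."""
    counts = Counter()
    for doc in corpus:
        counts.update(char_ngrams(doc, n))
    return counts


def perplexity(ngrams, counts):
    """Perplexity of a list of n-grams under an add-one-smoothed
    relative-frequency model (an assumed model, not the paper's)."""
    if not ngrams:
        raise ValueError("no n-grams to score")
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot shared by all unseen n-grams
    log_sum = sum(math.log((counts[g] + 1) / (total + vocab)) for g in ngrams)
    return math.exp(-log_sum / len(ngrams))


def adjacent_ngrams(term, corpus, n):
    """Collect the n characters immediately before and after each
    occurrence of term in the corpus."""
    found = []
    for doc in corpus:
        start = doc.find(term)
        while start != -1:
            end = start + len(term)
            if start >= n:
                found.append(doc[start - n:start])
            if end + n <= len(doc):
                found.append(doc[end:end + n])
            start = doc.find(term, start + 1)
    return found


def fitness_score(term, corpus, n):
    """Assumed score: perplexity of adjacent n-grams divided by the
    perplexity of the n-grams inside the term."""
    counts = build_counts(corpus, n)
    inner = perplexity(char_ngrams(term, n), counts)
    outer = perplexity(adjacent_ngrams(term, corpus, n), counts)
    return outer / inner


# Toy English corpus for illustration; the paper's corpus is medical text.
corpus = ["the patient has gastritis today", "acute gastritis was noted"]
score_1gram = fitness_score("gastritis", corpus, 1)
score_2gram = fitness_score("gastritis", corpus, 2)
```

Because the sketch works on raw characters, it needs no word segmentation, which is why the same functions could in principle be applied to unsegmented text. How the resulting scores relate to the accuracies reported in the abstract cannot be inferred from this toy example.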
Keyword(in Japanese) (See Japanese page)
Keyword(in English) Perplexity / Term Extraction / Named Entity Extraction
Paper # NLC2007-24
Date of Issue

Conference Information
Committee NLC
Conference Date 2007/7/17 (1 day)
Place (in Japanese) (See Japanese page)
Place (in English)
Topics (in Japanese) (See Japanese page)
Topics (in English)
Chair
Vice Chair
Secretary
Assistant

Paper Information
Registration To Natural Language Understanding and Models of Communication (NLC)
Language JPN
Title (in Japanese) (See Japanese page)
Sub Title (in Japanese) (See Japanese page)
Title (in English) Extracting Low Frequency Terms Using Substring Perplexities
Sub Title (in English)
Keyword(1) Perplexity
Keyword(2) Term Extraction
Keyword(3) Named Entity Extraction
1st Author's Name Yasuhide MIURA
1st Author's Affiliation Corporate Research Group, Fuji Xerox Co., Ltd.
2nd Author's Name Hiroshi MASUICHI
2nd Author's Affiliation Corporate Research Group, Fuji Xerox Co., Ltd.
Date 2007/7/17
Paper # NLC2007-24
Volume (vol) vol.107
Number (no) 158
Page pp.-
#Pages 6
Date of Issue