Presentation 1995/7/20
Unknown Word Extraction from Corpora Using n-gram Statistics
Shinsuke Mori, Makoto Nagao
Abstract(in Japanese) (See Japanese page)
Abstract(in English) Dictionaries are indispensable to NLP as a source of information on the grammatical functions and meanings of words, and much effort goes into expanding their vocabulary. Because new words and technical terms keep appearing, building a dictionary takes vast effort, and unknown words are inevitable at every step of analysis, which causes a serious problem. To solve this problem, we propose a method that uses n-gram statistics to extract words from a corpus and simultaneously estimate the parts of speech (POSs) to which they belong, based on the supposition that the distributions of strings preceding or following words of the same POS are similar. Experiments have shown that this method is effective for inferring the POS of unknown words and for building a dictionary.
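The supposition stated in the abstract (words of the same POS share similar distributions of preceding and following strings) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes single-character contexts and cosine similarity, neither of which is specified in the abstract, and the function names, toy corpus, and toy lexicon are all hypothetical.

```python
from collections import Counter
from math import sqrt


def context_counts(corpus: str, target: str) -> Counter:
    """Count the characters immediately before ("L") and after ("R") each occurrence of target."""
    counts = Counter()
    start = corpus.find(target)
    while start != -1:
        end = start + len(target)
        if start > 0:
            counts[("L", corpus[start - 1])] += 1
        if end < len(corpus):
            counts[("R", corpus[end])] += 1
        start = corpus.find(target, start + 1)
    return counts


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors (an assumed measure)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def pos_profile(corpus: str, known_words: list) -> Counter:
    """Pool the context counts of all known words of one POS."""
    profile = Counter()
    for word in known_words:
        profile.update(context_counts(corpus, word))
    return profile


def guess_pos(corpus: str, candidate: str, lexicon: dict):
    """Return the POS whose pooled context profile is most similar to the candidate's."""
    cand = context_counts(corpus, candidate)
    scores = {pos: cosine(cand, pos_profile(corpus, words)) for pos, words in lexicon.items()}
    return max(scores, key=scores.get), scores


# Hypothetical toy data: a tiny corpus and a two-class lexicon of known words.
corpus = "the cat runs. the dog runs. the cat sleeps. the dog sleeps. the fox jumps. a cat jumps."
lexicon = {"noun": ["cat", "dog"], "verb": ["runs", "sleeps"]}

for unknown in ("fox", "jumps"):
    best, _ = guess_pos(corpus, unknown, lexicon)
    print(unknown, best)
```

In this toy setting the unknown string "fox" is followed by a space, like the known nouns, while "jumps" is followed by a period, like the known verbs, so each is assigned to the matching class. A realistic version would use longer n-gram contexts and would also have to decide which strings are word candidates at all, which this sketch does not attempt.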
Keyword(in Japanese) (See Japanese page)
Keyword(in English) Unknown Word / Part-of-speech / Dictionary / Corpus / n-gram statistics
Paper #
Date of Issue

Conference Information
Committee NLC
Conference Date 1995/7/20 (1 day)
Place (in Japanese) (See Japanese page)
Place (in English)
Topics (in Japanese) (See Japanese page)
Topics (in English)
Chair
Vice Chair
Secretary
Assistant

Paper Information
Registration To Natural Language Understanding and Models of Communication (NLC)
Language JPN
Title (in Japanese) (See Japanese page)
Sub Title (in Japanese) (See Japanese page)
Title (in English) Unknown Word Extraction from Corpora Using n-gram Statistics
Sub Title (in English)
Keyword(1) Unknown Word
Keyword(2) Part-of-speech
Keyword(3) Dictionary
Keyword(4) Corpus
Keyword(5) n-gram statistics
1st Author's Name Shinsuke Mori
1st Author's Affiliation Department of Electrical Engineering, Kyoto University
2nd Author's Name Makoto Nagao
2nd Author's Affiliation Department of Electrical Engineering, Kyoto University
Date 1995/7/20
Paper #
Volume (vol) vol.95
Number (no) 168
Page pp.-
#Pages 6
Date of Issue