Presentation | 1995/5/12 Automatic Extraction of Uninterrupted and Interrupted Collocations from Very Large Japanese Corpora using N-gram Statistics Satoru Ikehara, Satoshi Shirai, Tsukasa Kawaoka |
---|---|
PDF Download Page | PDF download Page Link |
Abstract(in Japanese) | (See Japanese page) |
Abstract(in English) | To extract rigid expressions with a high frequency of use, this paper proposes new algorithms that efficiently extract both uninterrupted and interrupted collocations from very large Japanese corpora. A technique that applies n-gram statistics to uninterrupted collocations has recently been proposed; it enables collocations to be extracted in order of string length and frequency of use. However, the output of that method includes a large volume of fragmentary and unnecessary expressions. To solve this problem, this paper proposes a new algorithm that suppresses the extraction of unnecessary expressions. It then proposes a method that extracts interrupted collocations by combining the uninterrupted collocations thus obtained. The new methods were applied to newspaper articles containing 8.92 million characters. For uninterrupted collocations with a string length of 2 or more characters and a frequency of 2 or more, the conventional method extracted 4.4 million expressions (total frequency 31.2 million). In contrast, the new method reduced this to 0.97 million types (total frequency 2.6 million), a substantial reduction in fragmentary and unnecessary expressions. For interrupted collocations, combining the substrings with a frequency of 10 or more extracted by the first method yielded 6.5 thousand types of substring pairs with a total frequency of 21.8 thousand. |
Keyword(in Japanese) | (See Japanese page) |
Keyword(in English) | Collocation / Corpora / Automatic Extraction / N-gram / Sentence Pattern |
Paper # | |
Date of Issue |
Conference Information | |
Committee | NLC |
---|---|
Conference Date | 1995/5/12 (1 day) |
Place (in Japanese) | (See Japanese page) |
Place (in English) | |
Topics (in Japanese) | (See Japanese page) |
Topics (in English) | |
Chair | |
Vice Chair | |
Secretary | |
Assistant |
Paper Information | |
Registration To | Natural Language Understanding and Models of Communication (NLC) |
---|---|
Language | JPN |
Title (in Japanese) | (See Japanese page) |
Sub Title (in Japanese) | (See Japanese page) |
Title (in English) | Automatic Extraction of Uninterrupted and Interrupted Collocations from Very Large Japanese Corpora using N-gram Statistics |
Sub Title (in English) | |
Keyword(1) | Collocation |
Keyword(2) | Corpora |
Keyword(3) | Automatic Extraction |
Keyword(4) | N-gram |
Keyword(5) | Sentence Pattern |
1st Author's Name | Satoru Ikehara |
1st Author's Affiliation | NTT Communication Science Laboratories |
2nd Author's Name | Satoshi Shirai |
2nd Author's Affiliation | NTT Communication Science Laboratories |
3rd Author's Name | Tsukasa Kawaoka |
3rd Author's Affiliation | Faculty of Engineering, Doshisha University |
Date | 1995/5/12 |
Paper # | |
Volume (vol) | vol.95 |
Number (no) | 29 |
Page | pp.- |
#Pages | 8 |
Date of Issue |
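
The abstract describes two steps for uninterrupted collocations: counting every substring of 2 or more characters that occurs 2 or more times, then suppressing fragmentary substrings. The following is only a minimal sketch of that idea, not the authors' algorithm. The function names `count_ngrams` and `drop_fragments`, the `max_len` cap, and the suppression rule (discard a string when a longer retained string contains it and has the same frequency) are all assumptions made for illustration; the paper's actual criterion is not given in this record.

```python
from collections import Counter


def count_ngrams(text, min_len=2, max_len=6, min_freq=2):
    """Count every character n-gram of length min_len..max_len and keep
    those that occur at least min_freq times (the 'conventional' step)."""
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return {s: c for s, c in counts.items() if c >= min_freq}


def drop_fragments(freq):
    """Assumed suppression rule (hypothetical, not taken from the paper):
    discard a string if some longer counted string contains it and has
    exactly the same frequency, i.e. it looks like a mere fragment."""
    kept = {}
    for s, c in freq.items():
        is_fragment = any(
            len(t) > len(s) and s in t and freq[t] == c
            for t in freq
        )
        if not is_fragment:
            kept[s] = c
    return kept


# Toy input: "abcd" repeats three times, separated by unique characters.
freq = count_ngrams("abcdXabcdYabcd", min_len=2, max_len=4)
print(len(freq))             # 6 candidate strings before suppression
print(drop_fragments(freq))  # {'abcd': 3}
```

On this toy string, the plain count keeps six candidates ("ab", "bc", "cd", "abc", "bcd", "abcd"), while the assumed suppression rule keeps only the longest one. This mirrors, at a very small scale, the kind of reduction the abstract reports.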