Presentation | 2001/7/10 Deleting Useless Parts in Semi-structured Documents using Alternation Counts Yasuhiro Yamada, Daisuke Ikeda, Sachio Hirokawa |
---|---|
PDF Download Page | PDF download Page Link |
Abstract(in Japanese) | (See Japanese page) |
Abstract(in English) | We propose a new text mining technique for huge semi-structured data on the WWW. We assume that substrings which appear frequently in semi-structured data are useless. We delete such useless parts according to a new statistical measure, the alternation count, which we introduce in this paper. The novelty of our approach is that it finds non-useless parts rather than useful parts. This technique does not need any background knowledge, does not depend on the language, and is robust to noise. We applied our approach to news articles on the WWW and succeeded in deleting useless parts from the input data with high accuracy, without using any background knowledge. |
Keyword(in Japanese) | (See Japanese page) |
Keyword(in English) | text mining / alternation counts / semi-structured document / record extraction |
Paper # | NLC2001-29 |
Date of Issue |
Conference Information | |
Committee | NLC |
---|---|
Conference Date | 2001/7/10 (1 day) |
Place (in Japanese) | (See Japanese page) |
Place (in English) | |
Topics (in Japanese) | (See Japanese page) |
Topics (in English) | |
Chair | |
Vice Chair | |
Secretary | |
Assistant |
Paper Information | |
Registration To | Natural Language Understanding and Models of Communication (NLC) |
---|---|
Language | JPN |
Title (in Japanese) | (See Japanese page) |
Sub Title (in Japanese) | (See Japanese page) |
Title (in English) | Deleting Useless Parts in Semi-structured Documents using Alternation Counts |
Sub Title (in English) | |
Keyword(1) | text mining |
Keyword(2) | alternation counts |
Keyword(3) | semi-structured document |
Keyword(4) | record extraction |
1st Author's Name | Yasuhiro Yamada |
1st Author's Affiliation | Graduate School of Information Science and Electrical Engineering, Kyushu University |
2nd Author's Name | Daisuke Ikeda |
2nd Author's Affiliation | Computing and Communications Center, Kyushu University |
3rd Author's Name | Sachio Hirokawa |
3rd Author's Affiliation | Computing and Communications Center, Kyushu University |
Date | 2001/7/10 |
Paper # | NLC2001-29 |
Volume (vol) | vol.101 |
Number (no) | 190 |
Page | pp.- |
#Pages | 8 |
Date of Issue |