Presentation 2001/7/10
Deleting Useless Parts in Semi-structured Documents using Alternation Counts
Yasuhiro Yamada, Daisuke Ikeda, Sachio Hirokawa
Abstract(in Japanese) (See Japanese page)
Abstract(in English) We propose a new text mining technique for huge semi-structured data on the WWW. We assume that substrings that appear frequently in semi-structured data are useless. We delete such useless parts according to a new statistical measure, the alternation count, which we introduce in this paper. The novelty of our approach is that it finds non-useless parts rather than useful parts. The technique needs no background knowledge, does not depend on the language, and is robust to noise. We applied our approach to news articles on the WWW and succeeded in deleting useless parts from the input data with high accuracy.
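The abstract's working assumption, that text repeated across many pages is useless, can be illustrated with a small sketch. The abstract does not define the alternation count, so the code below does not implement it; as a labeled stand-in it uses a plain document-frequency threshold over whole lines. The function name `strip_frequent_lines`, the `max_share` parameter, and the sample pages are all hypothetical.

```python
from collections import Counter


def strip_frequent_lines(pages, max_share=0.5):
    """Drop lines that occur in more than max_share of the pages.

    Hypothetical stand-in: a document-frequency threshold, NOT the
    alternation count measure, which the abstract does not define.
    """
    # Count each distinct line once per page (document frequency).
    doc_freq = Counter()
    for page in pages:
        doc_freq.update(set(page.splitlines()))

    limit = max_share * len(pages)
    # Keep only lines whose document frequency is at or below the limit.
    return [
        "\n".join(line for line in page.splitlines() if doc_freq[line] <= limit)
        for page in pages
    ]


# Made-up sample pages that share a template around one unique headline.
pages = [
    "<html><body>\nSite Menu\nStorm hits coast\n</body></html>",
    "<html><body>\nSite Menu\nElection results announced\n</body></html>",
    "<html><body>\nSite Menu\nNew museum opens\n</body></html>",
]
cleaned = strip_frequent_lines(pages)
```

Like the approach described in the abstract, this stand-in needs no background knowledge of the markup or the language of the text; it only compares how often each line recurs across the input pages.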
Keyword(in Japanese) (See Japanese page)
Keyword(in English) text mining / alternation counts / semi-structured document / record extraction
Paper # NLC2001-29
Date of Issue

Conference Information
Committee NLC
Conference Date 2001/7/10 (1 day)
Place (in Japanese) (See Japanese page)
Place (in English)
Topics (in Japanese) (See Japanese page)
Topics (in English)
Chair
Vice Chair
Secretary
Assistant

Paper Information
Registration To Natural Language Understanding and Models of Communication (NLC)
Language JPN
Title (in Japanese) (See Japanese page)
Sub Title (in Japanese) (See Japanese page)
Title (in English) Deleting Useless Parts in Semi-structured Documents using Alternation Counts
Sub Title (in English)
Keyword(1) text mining
Keyword(2) alternation counts
Keyword(3) semi-structured document
Keyword(4) record extraction
1st Author's Name Yasuhiro Yamada
1st Author's Affiliation Graduate School of Information Science and Electrical Engineering, Kyushu University
2nd Author's Name Daisuke Ikeda
2nd Author's Affiliation Computing and Communications Center, Kyushu University
3rd Author's Name Sachio Hirokawa
3rd Author's Affiliation Computing and Communications Center, Kyushu University
Date 2001/7/10
Paper # NLC2001-29
Volume (vol) vol.101
Number (no) 190
Page pp.-
#Pages 8
Date of Issue