Presentation 2004-10-19
Information Extraction from Web Pages Using a Tree Edit Distance Measure
Tetsuji KUBOYAMA, Tetsuhiro MIYAHARA
Abstract(in Japanese) (See Japanese page)
Abstract(in English) Recent research on extracting information from Web pages has mainly focused on semi-automatic and automatic approaches to generating Web wrappers. This paper aims to establish a structure-based approach to generating Web wrappers: it finds a common structured pattern in semistructured data, such as HTML and XML documents, through approximate tree matching based on a tree edit distance measure. The common structured pattern is generated by measuring the similarity among the parse trees of Web pages and merging these trees through tree alignment. Each node of the pattern tree is weighted according to its frequency of occurrence in the tree. We present a method for generating Web wrappers from manually edited Web pages that contain grammatical mistakes in HTML as well as redundant or missing fragments.
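As a rough illustration of the tree edit distance idea named in the abstract, the following is a minimal sketch and not the authors' method. It assumes a simplified top-down (Selkow-style) variant with unit costs, in which a node may be relabeled and only whole subtrees may be inserted or deleted. The `Node` class and the `size` and `top_down_distance` functions are hypothetical names introduced for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A labeled ordered tree node, e.g. one element of a parsed HTML page."""
    label: str
    children: List["Node"] = field(default_factory=list)


def size(t: Node) -> int:
    """Number of nodes in the subtree rooted at t (unit cost per node)."""
    return 1 + sum(size(c) for c in t.children)


def top_down_distance(a: Node, b: Node) -> int:
    """Top-down edit distance between ordered trees a and b.

    Assumed cost model (an illustration only): relabeling a node costs 1,
    and inserting or deleting a whole subtree costs its number of nodes.
    """
    relabel = 0 if a.label == b.label else 1
    m, n = len(a.children), len(b.children)

    # d[i][j]: cost of editing the first i child subtrees of a
    # into the first j child subtrees of b.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(a.children[i - 1])  # delete subtree
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(b.children[j - 1])  # insert subtree

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(a.children[i - 1]),   # delete subtree
                d[i][j - 1] + size(b.children[j - 1]),   # insert subtree
                d[i - 1][j - 1]
                + top_down_distance(a.children[i - 1], b.children[j - 1]),
            )
    return relabel + d[m][n]


# Two toy "pages" that differ by one list item.
page1 = Node("html", [Node("body", [Node("ul", [Node("li"), Node("li")])])])
page2 = Node("html", [Node("body", [Node("ul", [Node("li"), Node("li"), Node("li")])])])
distance = top_down_distance(page1, page2)
```

A small distance between two pages suggests that they share most of their structure, which is the kind of similarity a wrapper generator could exploit. General tree edit distance also allows single-node insertions and deletions, which this simplified variant does not model.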
Keyword(in Japanese) (See Japanese page)
Keyword(in English) Web wrapper / information extraction / tree edit distance
Paper # DE2004-117,DC2004-32
Date of Issue

Conference Information
Committee DE
Conference Date 2004/10/12 (1 day)
Place (in Japanese) (See Japanese page)
Place (in English)
Topics (in Japanese) (See Japanese page)
Topics (in English)
Chair
Vice Chair
Secretary
Assistant

Paper Information
Registration To Data Engineering (DE)
Language ENG
Title (in Japanese) (See Japanese page)
Sub Title (in Japanese) (See Japanese page)
Title (in English) Information Extraction from Web Pages Using a Tree Edit Distance Measure
Sub Title (in English)
Keyword(1) Web wrapper
Keyword(2) information extraction
Keyword(3) tree edit distance
1st Author's Name Tetsuji KUBOYAMA
1st Author's Affiliation Center for Collaborative Research, The University of Tokyo
2nd Author's Name Tetsuhiro MIYAHARA
2nd Author's Affiliation Faculty of Information Sciences, Hiroshima City University
Date 2004-10-19
Paper # DE2004-117,DC2004-32
Volume (vol) vol.104
Number (no) 345
Page pp.-
#Pages 6
Date of Issue