Presentation | 2004-10-19 Information Extraction from Web Pages Using a Tree Edit Distance Measure Tetsuji KUBOYAMA, Tetsuhiro MIYAHARA |
---|---|
Abstract(in Japanese) | (See Japanese page) |
Abstract(in English) | Recent research on extracting information from Web pages has mainly focused on semi-automatic and automatic approaches to generating Web wrappers. This paper aims to establish a structure-based approach to finding a common structured pattern in semistructured data, such as HTML and XML documents, through approximate tree matching with a tree edit distance measure, for the purpose of generating Web wrappers. The common structured pattern is generated by finding similarities among the parse trees of Web pages and merging these trees by tree alignment. Each node of the pattern tree is weighted according to its frequency of occurrence in the tree. We present a method for generating Web wrappers from manually edited Web pages that contain a number of HTML grammatical mistakes and redundant or missing fragments. |
Keyword(in Japanese) | (See Japanese page) |
Keyword(in English) | Web wrapper / information extraction / tree edit distance |
Paper # | DE2004-117,DC2004-32 |
Date of Issue |
Conference Information | |
Committee | DE |
---|---|
Conference Date | 2004/10/12 (1 day) |
Place (in Japanese) | (See Japanese page) |
Place (in English) | |
Topics (in Japanese) | (See Japanese page) |
Topics (in English) | |
Chair | |
Vice Chair | |
Secretary | |
Assistant |
Paper Information | |
Registration To | Data Engineering (DE) |
---|---|
Language | ENG |
Title (in Japanese) | (See Japanese page) |
Sub Title (in Japanese) | (See Japanese page) |
Title (in English) | Information Extraction from Web Pages Using a Tree Edit Distance Measure |
Sub Title (in English) | |
Keyword(1) | Web wrapper |
Keyword(2) | information extraction |
Keyword(3) | tree edit distance |
1st Author's Name | Tetsuji KUBOYAMA |
1st Author's Affiliation | Center for Collaborative Research, The University of Tokyo |
2nd Author's Name | Tetsuhiro MIYAHARA |
2nd Author's Affiliation | Faculty of Information Sciences, Hiroshima City University |
Date | 2004-10-19 |
Paper # | DE2004-117,DC2004-32 |
Volume (vol) | vol.104 |
Number (no) | 345 |
Page | pp.- |
#Pages | 6 |
Date of Issue |