Model-based Information Extraction Method Tolerant of OCR Errors for Document Images

Yasuto ISHITANI; Toshihiro NAKAMURA

Presentation	2002/3/7 Model-based Information Extraction Method Tolerant of OCR Errors for Document Images. Yasuto ISHITANI, Toshihiro NAKAMURA
Abstract(in English)	This paper proposes a new method for extracting information from document images, intended as the basis for a document reader that extracts required keywords and their logical relationships from various printed documents. The method consists of three stages: robust keyword matching, global document matching, and postprocessing of keyword matching errors. First, robust keyword matching is carried out between the text lines extracted from an input image and the keywords defined in a keyword dictionary. To cope with OCR errors, the dictionary also contains misrecognized forms of words that reflect typical OCR errors, as well as word segments. Next, document matching is performed between the keyword matching results for the input document and the word models defined in each document model. Each document model consists of a set of word models whose logical relationships are described as a tree structure. This model matching extracts the required keywords and their logical relationships from the input document and determines the most suitable model for it. Finally, postprocessing applies heuristic rules defined in the model to the keyword matching results in order to recover from matching errors and revise the results. This comprehensive approach solves word segmentation problems accurately even when a document contains unknown words, compound words, or words corrupted by OCR errors. Experimental results on 100 documents show that the method is robust and effective for various document structures.
Keyword(in English)	Information extraction / Document image analysis / Model-matching / Association graph / Maximal Clique
Paper #	NLC2001-95
Date of Issue
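
The first stage described in the abstract, keyword matching against a dictionary that also lists misrecognized forms, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the dictionary entries, the `KeywordHit` type, the `match_keywords` function, and the longest-match rule for overlapping candidates are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical keyword dictionary: each canonical keyword maps to surface
# forms that include typical OCR confusions (l/1, A/4, m/rn) and a segment.
KEYWORD_DICTIONARY = {
    "TEL": ["TEL", "TE1", "7EL"],
    "FAX": ["FAX", "FAK", "F4X"],
    "E-mail": ["E-mail", "E-mai1", "E-rnail", "mail"],
}


@dataclass(frozen=True)
class KeywordHit:
    keyword: str   # canonical keyword from the dictionary
    surface: str   # string actually found in the OCR text
    line_no: int   # index of the text line
    start: int     # start offset within the line
    end: int       # end offset (exclusive) within the line


def match_keywords(text_lines, dictionary=KEYWORD_DICTIONARY):
    """Find dictionary surface forms in OCR text lines by substring search."""
    candidates = []
    for line_no, line in enumerate(text_lines):
        for keyword, surfaces in dictionary.items():
            for surface in surfaces:
                start = line.find(surface)
                while start != -1:
                    end = start + len(surface)
                    candidates.append(
                        KeywordHit(keyword, surface, line_no, start, end))
                    start = line.find(surface, start + 1)

    # Assumed tie-breaking rule: prefer longer surface forms and drop any
    # candidate that overlaps an already accepted hit on the same line.
    candidates.sort(key=lambda h: h.end - h.start, reverse=True)
    accepted = []
    for cand in candidates:
        overlaps = any(
            a.line_no == cand.line_no
            and cand.start < a.end and a.start < cand.end
            for a in accepted)
        if not overlaps:
            accepted.append(cand)
    return sorted(accepted, key=lambda h: (h.line_no, h.start))
```

Substring search is used here instead of tokenization because, as the abstract notes, word segmentation itself is unreliable when the text contains unknown words, compound words, or OCR errors.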

Conference Information
Committee	NLC
Conference Date	2002/3/7 (1 day)

Paper Information
Registration To	Natural Language Understanding and Models of Communication (NLC)
Language	JPN
Title (in English)	Model-based Information Extraction Method Tolerant of OCR Errors for Document Images
Keyword(1)	Information extraction
Keyword(2)	Document image analysis
Keyword(3)	Model-matching
Keyword(4)	Association graph
Keyword(5)	Maximal Clique
1st Author's Name	Yasuto ISHITANI
1st Author's Affiliation	Toshiba Corporation
2nd Author's Name	Toshihiro NAKAMURA
2nd Author's Affiliation	Toshiba Corporation
Date	2002/3/7
Paper #	NLC2001-95
Volume (vol)	vol.101
Number (no)	711
Page	pp.-
#Pages	8
Date of Issue
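
The keywords list "Association graph" and "Maximal Clique", but this record does not say how the paper builds or searches the graph. The following Python sketch shows one common way such model matching can be set up, under stated assumptions: a document model is simplified to a list of keywords in expected reading order (the paper describes a tree structure instead), a graph node pairs a model word with a keyword hit of the same label, and two nodes are connected when their ordering agrees. The functions `build_association_graph`, `maximum_clique`, and `select_model` are hypothetical names, and the clique search is a standard Bron-Kerbosch procedure with pivoting.

```python
from itertools import combinations


def build_association_graph(model_words, hits):
    """Build an association graph as an adjacency dict.

    model_words: keyword labels in the order the model expects them
                 (a flat simplification assumed for this sketch).
    hits: list of (keyword, position) tuples from keyword matching.
    """
    nodes = [(i, j)
             for i, word in enumerate(model_words)
             for j, (keyword, _) in enumerate(hits)
             if word == keyword]
    adj = {node: set() for node in nodes}
    for a, b in combinations(nodes, 2):
        (i1, j1), (i2, j2) = a, b
        if i1 == i2 or j1 == j2:
            continue  # one model word per hit, and one hit per model word
        # Compatible when model order and document order agree.
        if (i1 < i2) == (hits[j1][1] < hits[j2][1]):
            adj[a].add(b)
            adj[b].add(a)
    return adj


def maximum_clique(adj):
    """Return a largest clique, found with Bron-Kerbosch and pivoting."""
    best = []

    def expand(r, p, x):
        nonlocal best
        if not p and not x:
            if len(r) > len(best):
                best = sorted(r)
            return
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return best


def select_model(models, hits):
    """Pick the model whose association graph yields the largest clique."""
    scored = {name: maximum_clique(build_association_graph(words, hits))
              for name, words in models.items()}
    return max(scored.items(), key=lambda item: len(item[1]))
```

In this sketch each pair `(i, j)` in the returned clique assigns hit `j` to model word `i`, so a spurious hit that contradicts the expected order is left out of the match, and the clique size serves as a simple score for choosing among candidate models.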