Presentation | 2002/3/7: Model-based Information Extraction Method Tolerant of OCR Errors for Document Images. Yasuto ISHITANI, Toshihiro NAKAMURA |
---|---|
Abstract(in Japanese) | (See Japanese page) |
Abstract(in English) | A new method for extracting information from document images is proposed as the basis for a document reader that extracts required keywords and their logical relationships from various printed documents. The method consists of three stages: robust keyword matching, global document matching, and postprocessing of keyword matching errors. First, robust keyword matching is carried out between the text lines extracted from an input image and the keywords defined in a keyword dictionary. To cope with OCR errors, the dictionary also includes word segments and incorrect word forms containing typical OCR errors. Next, document matching is performed between the keyword matching results for the input document and the word models defined in each document model. Each document model consists of a set of word models whose logical relationships are described as a tree structure. This model matching extracts the required keywords and their logical relationships from the input document and determines the most suitable model for it. Finally, postprocessing applies heuristic rules defined in the model to the keyword matching results, recovering matching errors and modifying the results. This comprehensive approach solves word segmentation problems accurately even when a document contains unknown words, compound words, or words corrupted by OCR errors. Experimental results for 100 documents show that the method is robust and effective for various document structures. |
Keyword(in Japanese) | (See Japanese page) |
Keyword(in English) | Information extraction / Document image analysis / Model-matching / Association graph / Maximal Clique |
Paper # | NLC2001-95 |
Date of Issue |
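
The keyword matching stage described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the dictionary contents, the English sample words, and the names `KEYWORD_DICT` and `match_keywords` are hypothetical, and plain substring search stands in for whatever matching procedure the paper actually uses. The sketch only shows the idea of storing OCR-corrupted word forms alongside each keyword so that a misrecognized text line still maps back to the intended keyword.

```python
from typing import Dict, List, Tuple

# Hypothetical dictionary: canonical keyword -> surface forms, including
# forms with typical OCR confusions (e.g. "I" read as "l", "l" read as "1").
KEYWORD_DICT: Dict[str, List[str]] = {
    "Invoice": ["Invoice", "lnvoice", "Inv0ice"],
    "Total": ["Total", "Tota1", "T0tal"],
}


def match_keywords(
    lines: List[str], dictionary: Dict[str, List[str]]
) -> List[Tuple[int, int, str, str]]:
    """Return (line_index, column, canonical_keyword, matched_form) per hit."""
    hits = []
    for i, line in enumerate(lines):
        for canonical, forms in dictionary.items():
            for form in forms:
                # Collect every occurrence of this surface form in the line.
                start = line.find(form)
                while start != -1:
                    hits.append((i, start, canonical, form))
                    start = line.find(form, start + 1)
    return sorted(hits)


# Text lines as an OCR engine might return them, with recognition errors.
lines = ["lnvoice No. 123", "Tota1: 4,500"]
hits = match_keywords(lines, KEYWORD_DICT)
for line_index, column, canonical, form in hits:
    print(line_index, column, canonical, form)
```

Each hit records where the keyword was found and which canonical keyword it corresponds to, which is the kind of result a later document matching stage could compare against the word models of each document model.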
Conference Information | |
Committee | NLC |
---|---|
Conference Date | 2002/3/7 (1 day) |
Place (in Japanese) | (See Japanese page) |
Place (in English) | |
Topics (in Japanese) | (See Japanese page) |
Topics (in English) | |
Chair | |
Vice Chair | |
Secretary | |
Assistant |
Paper Information | |
Registration To | Natural Language Understanding and Models of Communication (NLC) |
---|---|
Language | JPN |
Title (in Japanese) | (See Japanese page) |
Sub Title (in Japanese) | (See Japanese page) |
Title (in English) | Model-based Information Extraction Method Tolerant of OCR Errors for Document Images |
Sub Title (in English) | |
Keyword(1) | Information extraction |
Keyword(2) | Document image analysis |
Keyword(3) | Model-matching |
Keyword(4) | Association graph |
Keyword(5) | Maximal Clique |
1st Author's Name | Yasuto ISHITANI |
1st Author's Affiliation | Toshiba Corporation |
2nd Author's Name | Toshihiro NAKAMURA |
2nd Author's Affiliation | Toshiba Corporation |
Date | 2002/3/7 |
Paper # | NLC2001-95 |
Volume (vol) | vol.101 |
Number (no) | 711 |
Page | pp.- |
#Pages | 8 |
Date of Issue |