Character Image Clustering for Analyzing Machine-Unreadable Historical Document Images

Sora Ito; Kengo Terasawa

Presentation	2018-09-20 Character Image Clustering for Analyzing Machine-Unreadable Historical Document Images. Sora Ito, Kengo Terasawa
Abstract(in English)	Digital archives store and publish large numbers of historical document images, and we believe that indexes or tagged keywords make these archives easier to use. Our laboratory is therefore developing a system that extracts keywords from machine-unreadable historical document images without character recognition. The system first discretizes feature vectors by clustering the character images that those vectors represent. It then expresses each sentence as a sequence of discretized feature vectors and analyzes these sequences, which makes keyword extraction possible without character recognition. Two problems arise during clustering. First, if "separation of clusters" occurs, that is, if one character class is split across several clusters, the accuracy of keyword extraction decreases. Second, when too many character images are extracted from the historical document images, clustering them all at once is difficult because of the computational cost. To address both problems, this study proposes a clustering method that restrains the separation of clusters and can be applied even when the number of character images is very large.
Keyword(in English)	Historical document / Clustering / Document analysis
Paper #	PRMU2018-46,IBISML2018-23
Date of Issue	2018-09-13 (PRMU, IBISML)
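
The pipeline outlined in the abstract (cluster character feature vectors, replace each character with its cluster ID, then analyze the resulting ID sequences) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' method: the 2-D toy feature vectors, the plain k-means with a deterministic start, the function names, and the use of repeated bigrams as a stand-in for keyword candidates. It does not address the cluster-separation or large-scale problems that the paper targets.

```python
from collections import Counter

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(vector, centroids):
    """Index of the centroid closest to the given vector."""
    return min(range(len(centroids)), key=lambda i: sq_dist(vector, centroids[i]))

def kmeans(vectors, k, iterations=20):
    """Plain Lloyd's k-means; centroids start at the first k vectors (assumption)."""
    centroids = [list(v) for v in vectors[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in vectors:
            groups[nearest(v, centroids)].append(v)
        updated = []
        for group, old in zip(groups, centroids):
            if group:
                updated.append([sum(col) / len(group) for col in zip(*group)])
            else:
                updated.append(old)  # an empty cluster keeps its previous centroid
        if updated == centroids:
            break  # assignments are stable
        centroids = updated
    return centroids

def discretize(line, centroids):
    """Replace each character's feature vector with its cluster ID."""
    return [nearest(v, centroids) for v in line]

def frequent_ngrams(sequences, n=2, min_count=2):
    """Count ID n-grams and keep those seen at least min_count times."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            counts[tuple(seq[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}

# Hypothetical toy data: two text lines, each a list of character feature vectors.
line1 = [(0.0, 0.1), (5.0, 5.1), (10.0, 0.0)]
line2 = [(5.1, 4.9), (9.9, 0.2), (0.1, 0.0)]

centroids = kmeans(line1 + line2, k=3)
sequences = [discretize(line, centroids) for line in (line1, line2)]
keywords = frequent_ngrams(sequences)

print(sequences)  # [[0, 1, 2], [1, 2, 0]]
print(keywords)   # {(1, 2): 2}
```

In this toy run the bigram (1, 2) recurs in both lines, so it surfaces as a keyword candidate even though no character was ever recognized. If one character class were split across two clusters, its occurrences would receive different IDs and such repeats would be missed, which is why the abstract treats cluster separation as harmful to extraction accuracy.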

Conference Information
Committee	PRMU / IBISML / IPSJ-CVIM
Conference Date	2018/9/20(2days)
Chair	Shinichi Sato(NII) / Hisashi Kashima(Kyoto Univ.)
Vice Chair	Yoshihisa Ijiri(Omron) / Toru Tamaki(Hiroshima Univ.) / Masashi Sugiyama(Univ. of Tokyo) / Koji Tsuda(Univ. of Tokyo)
Secretary	Yoshihisa Ijiri(NEC) / Toru Tamaki(Osaka Univ.) / Masashi Sugiyama(Nagoya Inst. of Tech.) / Koji Tsuda(AIST)
Assistant	Go Irie(NTT) / Yoshitaka Ushiku(Univ. of Tokyo) / Tomoharu Iwata(NTT) / Shigeyuki Oba(Kyoto Univ.)

Paper Information
Registration To	Technical Committee on Pattern Recognition and Media Understanding / Technical Committee on Information-Based Induction Sciences and Machine Learning / Special Interest Group on Computer Vision and Image Media
Language	JPN
Title (in English)	Character Image Clustering for Analyzing Machine-Unreadable Historical Document Images
Keyword(1)	Historical document
Keyword(2)	Clustering
Keyword(3)	Document analysis
1st Author's Name	Sora Ito
1st Author's Affiliation	Future University Hakodate(FUN)
2nd Author's Name	Kengo Terasawa
2nd Author's Affiliation	Future University Hakodate(FUN)
Date	2018-09-20
Volume (vol)	vol.118
Number (no)	PRMU-219,IBISML-220
Page	pp.67-72(PRMU), pp.67-72(IBISML)
#Pages	6