Presentation 2008/7/10
Document Clustering for Social Problem Detection and Cluster Evaluation Measures
Taiichi HASHIMOTO, Koji MURAKAMI, Takashi INUI, Kazuo UTSUMI, Masamichi ISHIKAWA
Abstract(in Japanese) (See Japanese page)
Abstract(in English) Document clustering, a core technology of text mining, is useful for macro-level analysis of large document collections. However, it is difficult for an analyst to tell efficiently which clusters in a clustering result contain important information. This paper presents a method to support the detection of social problems using newspaper articles. The proposed method is based on a hierarchical clustering algorithm, which generates a dendrogram of clusters according to the similarity of document vectors. Each document vector is calculated from the length and position of terms in the document. We also define two new measures for detecting important clusters in the dendrogram. The first, density, measures the relevancy of the documents within a cluster and is calculated from the rate of terms shared by the documents in the cluster. The second, centrality, measures the relevancy of clusters and is calculated from the depth of the shared ancestor of clusters in the dendrogram and the number of documents in the cluster. We conducted experiments to evaluate the proposed method using NIKKEI newspaper articles that describe organizational hazards caused by Japanese industries, and found that the proposed method is able to detect important clusters in the dendrogram generated by the hierarchical clustering.
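The abstract does not give the formula for density, so the following is only a minimal sketch of one plausible reading of "the rate of terms shared by the documents in the cluster": the fraction of the cluster's vocabulary that occurs in every document. The function name `cluster_density`, the set-based document representation, and the sample terms are all assumptions made for illustration and are not taken from the paper.

```python
def cluster_density(docs):
    """Fraction of the cluster's vocabulary that appears in every document.

    `docs` is a list of documents, each given as an iterable of terms.
    This ratio is an assumed stand-in for the paper's density measure.
    """
    term_sets = [set(d) for d in docs]
    if not term_sets:
        return 0.0
    vocabulary = set.union(*term_sets)       # terms used by any document
    if not vocabulary:
        return 0.0
    shared = set.intersection(*term_sets)    # terms used by all documents
    return len(shared) / len(vocabulary)


# Hypothetical clusters: one topically tight, one unrelated.
tight = [["recall", "defect", "automaker"], ["recall", "defect", "supplier"]]
loose = [["recall", "defect"], ["merger", "bank"]]

print(cluster_density(tight))  # 2 shared terms out of 4 in total
print(cluster_density(loose))  # no shared terms
```

Under this reading, a cluster whose documents share most of their terms scores close to 1, and a cluster of unrelated documents scores close to 0.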
Keyword(in Japanese) (See Japanese page)
Keyword(in English) document clustering / hierarchical clustering algorithm / cluster evaluation
Paper # NLC2008-8
Date of Issue

Conference Information
Committee NLC
Conference Date 2008/7/10 (1 day)
Place (in Japanese) (See Japanese page)
Place (in English)
Topics (in Japanese) (See Japanese page)
Topics (in English)
Chair
Vice Chair
Secretary
Assistant

Paper Information
Registration To Natural Language Understanding and Models of Communication (NLC)
Language JPN
Title (in Japanese) (See Japanese page)
Sub Title (in Japanese) (See Japanese page)
Title (in English) Document Clustering for Social Problem Detection and Cluster Evaluation Measures
Sub Title (in English)
Keyword(1) document clustering
Keyword(2) hierarchical clustering algorithm
Keyword(3) cluster evaluation
1st Author's Name Taiichi HASHIMOTO
1st Author's Affiliation Tokyo Institute of Technology
2nd Author's Name Koji MURAKAMI
2nd Author's Affiliation Nara Institute of Science and Technology
3rd Author's Name Takashi INUI
3rd Author's Affiliation Tokyo Institute of Technology
4th Author's Name Kazuo UTSUMI
4th Author's Affiliation Tokyo Institute of Technology
5th Author's Name Masamichi ISHIKAWA
5th Author's Affiliation Tokyo Institute of Technology
Date 2008/7/10
Paper # NLC2008-8
Volume (vol) vol.108
Number (no) 141
Page pp.-
#Pages 6
Date of Issue