Memory Efficient K-means Clustering using HDD (Theme Session: Large-Scale Databases and Pattern Recognition)

Hiroshi Oike; Kazuyoshi Kishi; Toshikazu Wada

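Presentation title	2013-02-21 Memory Efficient K-means Clustering using HDD (Theme Session: Large-Scale Databases and Pattern Recognition), Hiroshi Oike, Kazuyoshi Kishi, Toshikazu Wada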
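Abstract (Japanese, translated)	This report proposes a K-means clustering method that does not require a large amount of memory even when applied to large-scale data. For example, creating a codebook of the kind used in generic object recognition requires K-means clustering over a large number of local feature vectors. Ordinary K-means clustering, which loads all of the data into memory, cannot be applied to such large-scale problems. We propose a way of computing K-means clustering that loads data sequentially from auxiliary storage (HDD) into memory while staying within an allowed memory budget. The method makes multiple passes over the data. In the first pass, it performs clustering while reading the data in order from the HDD, which realizes an asymptotic movement of the cluster centers in accordance with the law of large numbers. During this pass it records the sufficient statistics needed to determine the cluster centers, together with the cluster to which each data point was assigned. In the second and later passes, it updates the cluster centers when a point's assigned cluster changes. Adjusting the timing of this update reduces the number of iterations and improves processing speed.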
Abstract (English)	This report presents an "external" k-means clustering method that works on an HDD. K-means clustering is widely used in many applications. For example, codebook creation for Bag of Visual Words requires k-means clustering on a huge number of local feature vectors to obtain the visual words (codebook entries). Standard "internal" k-means clustering loads all of the vector data into main memory and then performs clustering, so its working memory can explode for huge data sets. As a solution to this problem, we propose an "external" clustering algorithm on an HDD. It is a multi-pass algorithm that scans the whole data set in each pass. In the first pass, the cluster centroids are updated gradually as the data are provided sequentially. During this pass, the count and the sum of the data are recorded for each cluster, and the assigned cluster is recorded for each data point. In the following passes, each data point is provided again and the cluster centers are updated for those points whose assigned cluster has changed. By adjusting the frequency of this update, the number of distance computations can be reduced and the performance improved.
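The multi-pass procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The following are assumptions made only for illustration: the file layout (raw native-endian float32 vectors stored row by row), seeding with the first `k` vectors, keeping the per-point labels in memory, and the names `read_vectors`, `nearest`, `external_kmeans`, `chunk_rows`, `max_passes`, and `update_every`.

```python
from array import array


def read_vectors(path, dim, chunk_rows=1024):
    """Yield one vector at a time, loading at most chunk_rows rows per read."""
    with open(path, "rb") as f:
        while True:
            buf = array("f")
            try:
                buf.fromfile(f, chunk_rows * dim)
            except EOFError:
                pass  # short final chunk: items already read stay in buf
            if len(buf) == 0:
                return
            for r in range(len(buf) // dim):
                yield list(buf[r * dim:(r + 1) * dim])


def nearest(x, centers):
    """Index of the center with the smallest squared Euclidean distance."""
    best, best_d = 0, float("inf")
    for j, c in enumerate(centers):
        d = sum((a - b) ** 2 for a, b in zip(x, c))
        if d < best_d:
            best, best_d = j, d
    return best


def external_kmeans(path, dim, k, chunk_rows=1024, max_passes=20, update_every=1):
    # Per-cluster sufficient statistics: vector sum and point count.
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    centers = [[0.0] * dim for _ in range(k)]
    labels = array("i")  # assumption: one int per point fits in memory

    def move(x, src, dst):
        """Move x's contribution from cluster src (or nowhere) to cluster dst."""
        if src is not None:
            counts[src] -= 1
            for d in range(dim):
                sums[src][d] -= x[d]
        counts[dst] += 1
        for d in range(dim):
            sums[dst][d] += x[d]

    def refresh():
        """Recompute every non-empty cluster center from its statistics."""
        for j in range(k):
            if counts[j] > 0:
                centers[j] = [s / counts[j] for s in sums[j]]

    # Pass 1: sequential read; each center is the running mean of its members.
    for i, x in enumerate(read_vectors(path, dim, chunk_rows)):
        j = i if i < k else nearest(x, centers)  # assumption: first k points seed
        move(x, None, j)
        centers[j] = [s / counts[j] for s in sums[j]]
        labels.append(j)

    # Later passes: touch the statistics only when a point changes cluster.
    for _ in range(max_passes - 1):
        changed = pending = 0
        for i, x in enumerate(read_vectors(path, dim, chunk_rows)):
            j = nearest(x, centers)
            if j != labels[i]:
                move(x, labels[i], j)
                labels[i] = j
                changed += 1
                pending += 1
                if pending >= update_every:  # adjustable update timing
                    refresh()
                    pending = 0
        refresh()
        if changed == 0:
            break
    return centers, list(labels)
```

In this sketch only one chunk of vectors, the `k` sums and counts, and the label array are resident at any time. `update_every` controls how many reassignments accumulate before the centers are recomputed, which is one possible reading of the "update timing" mentioned in the abstract.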
Keywords (Japanese, translated)	K-means clustering / memory saving / large-scale data
Keywords (English)	K-means Clustering / memory-efficient / large-scale database
Report number	PRMU2012-140
Publication date

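Meeting information
Technical committee	PRMU
Dates	2013/2/14 (1-day meeting)
Venue (Japanese)
Venue (English)
Theme (Japanese)
Theme (English)
Chair (Japanese)
Chair (English)
Vice chair (Japanese)
Vice chair (English)
Secretary (Japanese)
Secretary (English)
Assistant secretary (Japanese)
Assistant secretary (English)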

Paper details
Submitted to	Pattern Recognition and Media Understanding (PRMU)
Language of text	JPN
Title (Japanese)	HDDを用いた省メモリK-meansクラスタリング(テーマセッション,大規模データベースとパターン認識)
Subtitle (Japanese)
Title (English)	Memory Efficient K-means Clustering using HDD
Subtitle (English)
Keyword (1) (Japanese/English)	K-meansクラスタリング / K-means Clustering
Keyword (2) (Japanese/English)	省メモリ / memory-efficient
Keyword (3) (Japanese/English)	大規模データ / large-scale database
First author name (Japanese/English)	大池洋史 / Hiroshi OIKE
First author affiliation (Japanese/English)	和歌山大学システム工学部 / Faculty of System Engineering, Wakayama University
Second author name (Japanese/English)	岸和芳 / Kazuyoshi KISHI
Second author affiliation (Japanese/English)	和歌山大学システム工学部 / Faculty of System Engineering, Wakayama University
Third author name (Japanese/English)	和田俊和 / Toshikazu Wada
Third author affiliation (Japanese/English)	和歌山大学システム工学部 / Faculty of System Engineering, Wakayama University
Presentation date	2013-02-21
Report number	PRMU2012-140
Volume (vol)	vol.112
Issue (no)	441
Page range	pp.-
Number of pages	6
Publication date