A data-clustering algorithm on distributed memory multiprocessors

Authors

Inderjit S Dhillon, Dharmendra S Modha

Publication date

2002/5/17

Book

Large-scale parallel data mining

Pages

245-260

Publisher

Springer Berlin Heidelberg

Description

To cluster increasingly massive data sets that are common today in data and text mining, we propose a parallel implementation of the k-means clustering algorithm based on the message passing model. The proposed algorithm exploits the inherent data-parallelism in the k-means algorithm. We analytically show that the speedup and the scaleup of our algorithm approach the optimal as the number of data points increases. We implemented our algorithm on an IBM POWERparallel SP2 with a maximum of 16 nodes. On typical test data sets, we observe nearly linear relative speedups, for example, 15.62 on 16 nodes, and essentially linear scaleup in the size of the data set and in the number of clusters desired. For a 2 gigabyte test data set, our implementation drives the 16 node SP2 at more than 1.8 gigaflops.
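The data-parallel structure described above can be illustrated with a small single-process sketch. This is not the authors' implementation. It assumes each worker holds one partition of the data, computes per-cluster sums and counts locally, and then combines them with a global reduction so that every worker ends up with the same new centroids. The names `local_stats`, `allreduce_sum`, and `parallel_kmeans` are hypothetical, and `allreduce_sum` is only a stand-in for a message-passing reduction such as `MPI_Allreduce`.

```python
import random


def local_stats(points, centroids):
    """One worker's step: assign each local point to its nearest centroid and
    return per-cluster coordinate sums, per-cluster counts, and local squared error."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    sse = 0.0
    for p in points:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        j = min(range(k), key=dists.__getitem__)
        counts[j] += 1
        sse += dists[j]
        for d in range(dim):
            sums[j][d] += p[d]
    return sums, counts, sse


def allreduce_sum(partials):
    """Stand-in for a message-passing all-reduce (assumption): element-wise sum
    of every worker's partial results."""
    k, dim = len(partials[0][0]), len(partials[0][0][0])
    sums = [[sum(p[0][j][d] for p in partials) for d in range(dim)] for j in range(k)]
    counts = [sum(p[1][j] for p in partials) for j in range(k)]
    sse = sum(p[2] for p in partials)
    return sums, counts, sse


def parallel_kmeans(partitions, centroids, max_iter=100, tol=1e-9):
    """Iterate local assignment plus global reduction until the squared error
    stops decreasing. The returned error is measured before the final update."""
    k = len(centroids)
    prev_sse = float("inf")
    sse = prev_sse
    for _ in range(max_iter):
        # In a real message-passing program these calls would run concurrently,
        # one per node; here they run one after another.
        partials = [local_stats(part, centroids) for part in partitions]
        sums, counts, sse = allreduce_sum(partials)
        # Empty clusters keep their previous centroid.
        centroids = [
            [s / counts[j] for s in sums[j]] if counts[j] else list(centroids[j])
            for j in range(k)
        ]
        if prev_sse - sse <= tol:
            break
        prev_sse = sse
    return centroids, sse


# Synthetic data (assumption): two well-separated 2-D blobs of 100 points each.
rng = random.Random(0)
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]
data += [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(100)]
init = [list(data[0]), list(data[-1])]

# Same data split across 4 simulated workers, and on a single worker.
centroids_4, sse_4 = parallel_kmeans([data[i::4] for i in range(4)], init)
centroids_1, sse_1 = parallel_kmeans([data], init)
```

In this sketch the only data exchanged per iteration is the per-cluster sums, counts, and error, whose size depends on the number of clusters and dimensions and not on the number of data points. The four-partition and single-partition runs should agree up to floating-point rounding, since the reduction computes the same totals either way.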

Total citations

Cited by 641

Citations per year: 2001: 11, 2002: 17, 2003: 31, 2004: 30, 2005: 25, 2006: 29, 2007: 30, 2008: 38, 2009: 37, 2010: 39, 2011: 25, 2012: 41, 2013: 39, 2014: 38, 2015: 25, 2016: 26, 2017: 21, 2018: 26, 2019: 19, 2020: 23, 2021: 18, 2022: 14, 2023: 14, 2024: 9
