Scalable k-means++

Authors

Bahman Bahmani, Benjamin Moseley, Andrea Vattani, Ravi Kumar, Sergei Vassilvitskii

Publication date

2012/3/29

Journal

arXiv preprint arXiv:1203.6402

Description

Over half a century old and showing no signs of aging, k-means remains one of the most popular data processing algorithms. As is well known, a proper initialization of k-means is crucial for obtaining a good final solution. The recently proposed k-means++ initialization algorithm achieves this, obtaining an initial set of centers that is provably close to the optimum solution. A major downside of k-means++ is its inherently sequential nature, which limits its applicability to massive data: one must make k passes over the data to find a good initial set of centers. In this work we show how to drastically reduce the number of passes needed to obtain, in parallel, a good initialization. This is unlike prevailing efforts on parallelizing k-means, which have mostly focused on the post-initialization phases of k-means. We prove that our proposed initialization algorithm k-means|| obtains a nearly optimal solution after a logarithmic number of passes, and then show that in practice a constant number of passes suffices. Experimental evaluation on real-world large-scale data demonstrates that k-means|| outperforms k-means++ in both sequential and parallel settings.
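
The contrast between the two initializations can be illustrated with a minimal sketch in pure Python. It is not the authors' implementation. It follows the commonly described form of both procedures: k-means++ picks one center per pass with probability proportional to squared distance from the current centers, while k-means|| samples many candidates per pass independently and then reduces the weighted candidates to k centers. The function names, the oversampling factor `ell`, and the fixed `rounds` count are assumptions made for illustration.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest_sq_dist(p, centers):
    """Squared distance from p to its closest center."""
    return min(sq_dist(p, c) for c in centers)


def kmeanspp_seed(points, k, rng, weights=None):
    """k-means++ seeding: one new center per pass over the data."""
    if weights is None:
        weights = [1.0] * len(points)
    centers = [rng.choices(points, weights=weights, k=1)[0]]
    while len(centers) < k:
        # Each pass depends on the centers chosen so far, hence sequential.
        d2 = [w * nearest_sq_dist(p, centers) for p, w in zip(points, weights)]
        if sum(d2) == 0:  # every weighted point already coincides with a center
            break
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


def kmeans_parallel_seed(points, k, ell, rounds, rng):
    """k-means||-style seeding (sketch): few passes, many candidates per pass.

    `ell` (oversampling factor) and `rounds` are illustrative parameters.
    """
    centers = [rng.choice(points)]
    for _ in range(rounds):
        d2 = [nearest_sq_dist(p, centers) for p in points]
        cost = sum(d2)
        if cost == 0:
            break
        # Points are sampled independently of one another, so this step
        # could be split across workers in a parallel setting.
        centers.extend(
            p for p, d in zip(points, d2)
            if rng.random() < min(1.0, ell * d / cost)
        )
    # Weight each candidate by the number of points closest to it.
    counts = [0] * len(centers)
    for p in points:
        j = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
        counts[j] += 1
    # Reduce the small weighted candidate set to k centers sequentially.
    return kmeanspp_seed(centers, k, rng, weights=counts)


if __name__ == "__main__":
    rng = random.Random(0)
    offsets = [(0, 0), (100, 0), (0, 100)]
    data = [(ox + i, oy + j) for ox, oy in offsets
            for i in range(5) for j in range(4)]
    print(kmeanspp_seed(data, 3, rng))
    print(kmeans_parallel_seed(data, 3, ell=6, rounds=5, rng=rng))
```

In this sketch the number of passes over the full data is `k` for `kmeanspp_seed` and `rounds` for `kmeans_parallel_seed`; only the final reduction step, which runs on the much smaller candidate set, remains sequential. If too few candidates are sampled, the sketch may return fewer than `k` centers, which a real implementation would need to handle.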

Total citations

Cited by 950

Citations per year:
2011: 4, 2012: 7, 2013: 27, 2014: 49, 2015: 54, 2016: 94, 2017: 93, 2018: 105, 2019: 112, 2020: 102, 2021: 88, 2022: 66, 2023: 85, 2024: 50
