Authors
Yue Wang, Vivek Narasayya, Yeye He, Surajit Chaudhuri
Publication date
2022/2/1
Journal
Proceedings of the VLDB Endowment
Volume
15
Issue
6
Pages
1132-1145
Publisher
VLDB Endowment
Description
The Agglomerative Hierarchical Clustering (AHC) algorithm is widely used in real-world applications. As data volumes continue to grow, efficient scale-out techniques for AHC are becoming increasingly important. In this paper, we propose a Partition-based distributed Agglomerative Hierarchical Clustering (PACk) algorithm using novel distance-based partitioning and distance-aware merging techniques. We have developed an efficient implementation of PACk on Spark. Compared to the state-of-the-art distributed AHC algorithm, PACk achieves 2X to 19X (median=9X) speedup across a variety of synthetic and real-world datasets.