Concept decompositions for large sparse text data using clustering

Authors

Inderjit S Dhillon, Dharmendra S Modha

Publication date

2001/1

Journal

Machine learning

Volume

Pages

143-175

Publisher

Kluwer Academic Publishers

Description

Unlabeled document collections are becoming increasingly common and available; mining such data sets represents a major contemporary challenge. Using words as features, text documents are often represented as high-dimensional and sparse vectors: a few thousand dimensions and a sparsity of 95 to 99% are typical. In this paper, we study a certain spherical k-means algorithm for clustering such document vectors. The algorithm outputs k disjoint clusters, each with a concept vector that is the centroid of the cluster normalized to have unit Euclidean norm. As our first contribution, we empirically demonstrate that, owing to the high dimensionality and sparsity of the text data, the clusters produced by the algorithm have a certain “fractal-like” and “self-similar” behavior. As our second contribution, we introduce concept decompositions to approximate the matrix of document vectors; these decompositions are …
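The description above only summarizes the algorithm, so the following is a minimal sketch rather than the authors' implementation. It assumes documents arrive as dense lists of term weights, that initial concept vectors are drawn at random from the documents, and that a fixed iteration cap is acceptable; the names `normalize`, `dot`, and `spherical_kmeans` are hypothetical. It illustrates the two ideas the description does state: documents are grouped into k disjoint clusters, and each concept vector is the cluster centroid rescaled to unit Euclidean norm.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit Euclidean norm (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    """Inner product; for unit vectors this equals the cosine similarity."""
    return sum(x * y for x, y in zip(a, b))


def spherical_kmeans(docs, k, iterations=20, seed=0):
    """Sketch of spherical k-means (assumed details: random init, iteration cap).

    Returns (labels, concepts): labels[i] is the cluster index of document i,
    and concepts[j] is the unit-norm concept vector of cluster j.
    """
    rng = random.Random(seed)
    docs = [normalize(d) for d in docs]
    # Assumption: seed the concept vectors with k distinct documents.
    concepts = [list(d) for d in rng.sample(docs, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each document joins the most similar concept vector.
        new_labels = [max(range(k), key=lambda j: dot(d, concepts[j])) for d in docs]
        if new_labels == labels:
            break  # the partition did not change, so the concepts are stable
        labels = new_labels
        # Update step: concept vector = cluster centroid normalized to unit norm.
        for j in range(k):
            members = [d for d, lab in zip(docs, labels) if lab == j]
            if members:  # an empty cluster keeps its previous concept vector
                centroid = [sum(col) / len(members) for col in zip(*members)]
                concepts[j] = normalize(centroid)
    return labels, concepts
```

On a toy collection such as `[[3, 1, 0, 0], [2, 1, 0, 0], [0, 0, 1, 2], [0, 0, 2, 5]]` with `k=2`, this sketch should place the first two vectors in one cluster and the last two in the other. Real document vectors are sparse, so a practical version would use a sparse matrix representation instead of dense lists.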

Total citations

Cited by 1969

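Citations per year: 2000 (9), 2001 (16), 2002 (36), 2003 (68), 2004 (79), 2005 (79), 2006 (68), 2007 (73), 2008 (83), 2009 (80), 2010 (97), 2011 (91), 2012 (115), 2013 (119), 2014 (101), 2015 (113), 2016 (108), 2017 (98), 2018 (93), 2019 (90), 2020 (87), 2021 (86), 2022 (64), 2023 (58), 2024 (28)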
