Authors
Inderjit S Dhillon, Dharmendra S Modha
Publication date
2001/1
Journal
Machine learning
Volume
42
Pages
143-175
Publisher
Kluwer Academic Publishers
Description
Unlabeled document collections are becoming increasingly common and available; mining such data sets represents a major contemporary challenge. Using words as features, text documents are often represented as high-dimensional and sparse vectors; a few thousand dimensions and a sparsity of 95 to 99% are typical. In this paper, we study a certain spherical k-means algorithm for clustering such document vectors. The algorithm outputs k disjoint clusters, each with a concept vector that is the centroid of the cluster normalized to have unit Euclidean norm. As our first contribution, we empirically demonstrate that, owing to the high dimensionality and sparsity of the text data, the clusters produced by the algorithm have a certain “fractal-like” and “self-similar” behavior. As our second contribution, we introduce concept decompositions to approximate the matrix of document vectors; these decompositions are …
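The clustering step described above can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes dense Python lists instead of sparse vectors, random initialization from the documents, and a fixed iteration cap. The names `spherical_kmeans`, `normalize`, and `dot` are hypothetical helpers introduced only for this example.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit Euclidean norm (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    """Inner product; for unit vectors this equals the cosine similarity."""
    return sum(x * y for x, y in zip(a, b))


def spherical_kmeans(docs, k, iters=20, seed=0):
    """Partition document vectors into k disjoint clusters by cosine similarity.

    Returns (labels, concepts). Each concept vector is the centroid of its
    cluster, normalized to unit Euclidean norm.
    """
    rng = random.Random(seed)
    docs = [normalize(d) for d in docs]
    # Assumption: start from k distinct documents chosen at random.
    concepts = [list(d) for d in rng.sample(docs, k)]
    labels = [0] * len(docs)
    for _ in range(iters):
        # Assignment step: each document goes to the most similar concept vector.
        new_labels = [max(range(k), key=lambda j: dot(d, concepts[j])) for d in docs]
        # Update step: recompute each centroid, then normalize it.
        for j in range(k):
            members = [d for d, lab in zip(docs, new_labels) if lab == j]
            if members:  # an empty cluster keeps its previous concept vector
                centroid = [sum(col) / len(members) for col in zip(*members)]
                concepts[j] = normalize(centroid)
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return labels, concepts


# Toy term-count vectors: two documents use terms 0-1, two use terms 2-3.
toy_docs = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 1, 3], [0, 0, 2, 1]]
toy_labels, toy_concepts = spherical_kmeans(toy_docs, k=2)
```

Because every document and every concept vector has unit norm, the inner product in the assignment step is the cosine similarity, which is why the centroid is renormalized after each update.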
Total citations
[Bar chart: citations per year, 2000 to 2024]