Authors

Alexander Strehl, Joydeep Ghosh, Raymond Mooney

Publication date

2000/7/30

Journal

Workshop on artificial intelligence for web search (AAAI 2000)

Description

Clustering of web documents enables (semi-)automated categorization and facilitates certain types of search. Any clustering method has to embed the documents in a suitable similarity space. While several clustering methods and the associated similarity measures have been proposed in the past, there is no systematic comparative study of the impact of similarity metrics on cluster quality, possibly because the popular cost criteria do not readily translate across qualitatively different metrics. We observe that in domains such as YAHOO that provide a categorization by human experts, a useful criterion for comparisons across similarity metrics is indeed available. We then compare four popular similarity measures (Euclidean, cosine, Pearson correlation, and extended Jaccard) in conjunction with several clustering techniques (random, self-organizing feature map, hyper-graph partitioning, generalized k-means, weighted graph partitioning) on high-dimensional sparse data representing web documents. Performance is measured against a human-imposed classification into news categories and industry categories. We conduct a number of experiments and use t-tests to assure statistical significance of results. Cosine and extended Jaccard similarities emerge as the best measures to capture human categorization behavior, while Euclidean performs poorest. Also, weighted graph partitioning approaches are clearly superior to all others.
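
The four similarity measures named in the description can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the mapping from Euclidean distance to a similarity (`exp(-d^2)`) is one common choice assumed here because the description does not specify it, and the inputs are assumed to be equal-length, non-zero, non-constant term-count vectors.

```python
import math


def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def euclidean_similarity(x, y):
    # Assumption: squared distance is turned into a similarity via exp(-d^2).
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2)


def cosine_similarity(x, y):
    # Cosine of the angle between the vectors; invariant to vector length.
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))


def pearson_similarity(x, y):
    # Pearson correlation is the cosine of the mean-centered vectors.
    mx, my = sum(x) / len(x), sum(y) / len(y)
    xc = [a - mx for a in x]
    yc = [b - my for b in y]
    return cosine_similarity(xc, yc)


def extended_jaccard_similarity(x, y):
    # Extended (Tanimoto) Jaccard: x.y / (|x|^2 + |y|^2 - x.y).
    xy = dot(x, y)
    return xy / (dot(x, x) + dot(y, y) - xy)


# Two toy "documents" where doc_b is doc_a with every term count doubled.
doc_a = [2.0, 0.0, 1.0, 0.0]
doc_b = [4.0, 0.0, 2.0, 0.0]

for name, fn in [
    ("euclidean", euclidean_similarity),
    ("cosine", cosine_similarity),
    ("pearson", pearson_similarity),
    ("extended jaccard", extended_jaccard_similarity),
]:
    print(f"{name}: {fn(doc_a, doc_b):.4f}")
```

In this toy pair, cosine and Pearson treat the two vectors as identical in direction, while the Euclidean-based and extended Jaccard measures also penalize the difference in length.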

Total citations

Cited by 1135

Citations per year:
2000: 4, 2001: 9, 2002: 29, 2003: 32, 2004: 34, 2005: 52, 2006: 50, 2007: 51, 2008: 42, 2009: 48, 2010: 51, 2011: 56, 2012: 72, 2013: 62, 2014: 73, 2015: 88, 2016: 71, 2017: 70, 2018: 37, 2019: 45, 2020: 51, 2021: 26, 2022: 28, 2023: 28, 2024: 11
