Adaptive duplicate detection using learnable string similarity measures

Authors

Mikhail Bilenko, Raymond J Mooney

Publication date

2003/8/24

Book

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining

Pages

39-48

Description

The problem of identifying approximately duplicate records in databases is an essential step for data cleaning and data integration processes. Most existing approaches have relied on generic or manually tuned distance metrics for estimating the similarity of potential duplicates. In this paper, we present a framework for improving duplicate detection using trainable measures of textual similarity. We propose to employ learnable text distance functions for each database field, and show that such measures are capable of adapting to the specific notion of similarity that is appropriate for the field's domain. We present two learnable text similarity measures suitable for this task: an extended variant of learnable string edit distance, and a novel vector-space based measure that employs a Support Vector Machine (SVM) for training. Experimental results on a range of datasets show that our framework can improve duplicate …
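As a rough illustration of what a trainable edit distance could look like, the sketch below computes a Levenshtein-style distance whose operation costs are parameters rather than fixed constants. It is a simplified sketch under stated assumptions, not the authors' model: the names `EditCosts`, `pair_overrides`, and `weighted_edit_distance` are hypothetical, the cost values are set by hand, and no training procedure is shown. In a learnable setting, such costs would be estimated per database field from labeled duplicate and non-duplicate pairs.

```python
from dataclasses import dataclass, field


@dataclass
class EditCosts:
    """Hypothetical container for edit-operation costs (not from the paper)."""
    insert: float = 1.0
    delete: float = 1.0
    substitute: float = 1.0
    # Optional per-pair substitution costs, e.g. {("-", " "): 0.1}.
    pair_overrides: dict = field(default_factory=dict)

    def sub_cost(self, a: str, b: str) -> float:
        # Matching characters are free; otherwise use a pair-specific
        # cost if one is given, else the default substitution cost.
        if a == b:
            return 0.0
        return self.pair_overrides.get((a, b), self.substitute)


def weighted_edit_distance(s: str, t: str, costs: EditCosts = None) -> float:
    """Dynamic-programming edit distance with parameterized costs."""
    costs = costs or EditCosts()
    # prev[j] is the cost of turning the processed prefix of s into t[:j].
    prev = [j * costs.insert for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        curr = [i * costs.delete]
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + costs.delete,            # drop a from s
                curr[j - 1] + costs.insert,        # add b from t
                prev[j - 1] + costs.sub_cost(a, b) # match or substitute
            ))
        prev = curr
    return prev[-1]
```

With default costs this reduces to ordinary Levenshtein distance. Lowering the cost of a substitution that is harmless in a given field (for example, a hyphen versus a space in a street address) makes strings that differ only in that way score as much closer, which is the kind of field-specific adaptation the description refers to.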

Total citations

Cited by 1391

Citations per year:
2003: 4, 2004: 29, 2005: 59, 2006: 69, 2007: 61, 2008: 77, 2009: 62, 2010: 77,
2011: 74, 2012: 106, 2013: 95, 2014: 84, 2015: 76, 2016: 77, 2017: 90, 2018: 64,
2019: 66, 2020: 55, 2021: 48, 2022: 46, 2023: 31, 2024: 22
