Authors
Ziawasch Abedjan, Xu Chu, Dong Deng, Raul Castro Fernandez, Ihab F Ilyas, Mourad Ouzzani, Paolo Papotti, Michael Stonebraker, Nan Tang
Publication date
2016/8/1
Journal
Proceedings of the VLDB Endowment
Volume
9
Issue
12
Pages
993-1004
Publisher
VLDB Endowment
Description
Data cleaning has played a critical role in ensuring data quality for enterprise applications. Naturally, there has been extensive research in this area, and many data cleaning algorithms have been translated into tools to detect and to possibly repair certain classes of errors such as outliers, duplicates, missing values, and violations of integrity constraints. Since different types of errors may coexist in the same data set, we often need to run more than one kind of tool. In this paper, we investigate two pragmatic questions: (1) are these tools robust enough to capture most errors in real-world data sets? and (2) what is the best strategy to holistically run multiple tools to optimize the detection effort? To answer these two questions, we obtained multiple data cleaning tools that utilize a variety of error detection techniques. We also collected five real-world data sets, for which we could obtain both the raw data and the ground …
Total citations
[Chart: citations per year, 2016-2024]