Authors
Gustavo EAPA Batista, Maria Carolina Monard
Publication date
2002/12/30
Conference
HIS
Volume
87
Issue
251-260
Pages
48
Description
Data quality is a major concern in Machine Learning and other related areas such as Knowledge Discovery from Databases (KDD). As most Machine Learning algorithms induce knowledge strictly from data, the quality of the knowledge extracted is largely determined by the quality of the underlying data. One relevant problem in data quality is the presence of missing data. Despite the frequent occurrence of missing data, many Machine Learning algorithms handle missing data in a rather naive way. Missing data treatment should be carefully thought out, otherwise bias might be introduced into the knowledge induced. In this work, we analyse the use of the k-nearest neighbour algorithm as an imputation method. Imputation is a term that denotes a procedure that replaces the missing values in a data set with plausible values. Our analysis indicates that missing data imputation based on the k-nearest neighbour algorithm can outperform the internal methods used by C4.5 and CN2 to treat missing data.
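The idea of k-nearest neighbour imputation described above can be illustrated with a minimal sketch. It is not the authors' implementation. The function name `knn_impute`, the use of `None` to mark missing values, the Euclidean distance over attributes observed in both rows, and the use of the neighbours' mean as the plausible value are all assumptions made for illustration, and the sketch handles numeric attributes only.

```python
import math


def knn_impute(rows, k=3):
    """Return a copy of rows where each None is replaced by the mean of that
    attribute over the k nearest rows that have the attribute observed.

    Illustrative sketch only: numeric attributes, None marks a missing value.
    """
    completed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                # A donor must be a different row with attribute j observed.
                if m == i or other[j] is None:
                    continue
                # Compare only attributes observed in both rows.
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                # Normalise by the number of shared attributes so rows with
                # fewer comparable attributes are not favoured.
                dist = math.sqrt(
                    sum((a - b) ** 2 for a, b in shared) / len(shared)
                )
                candidates.append((dist, other[j]))
            if candidates:
                candidates.sort(key=lambda c: c[0])
                nearest = candidates[:k]
                completed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return completed


data = [
    [1.0, 2.0],
    [1.1, 2.2],
    [5.0, 10.0],
    [1.05, None],  # second attribute is missing
]
filled = knn_impute(data, k=2)
print(filled[3])
```

With `k=2`, the two closest rows to the incomplete one are the first two, so the missing value is filled with the mean of 2.0 and 2.2, about 2.1, and the distant third row has no influence. In practice attributes would usually be scaled before computing distances, and a qualitative attribute would need a different aggregate such as the most frequent value among the neighbours.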
Total citations
[Chart: citations per year, 2003 to 2024]