Authors
Feifei Tu, Jiaxin Zhu, Qimu Zheng, Minghui Zhou
Publication date
2018/10/26
Conference
Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering
Pages
307-318
Publisher
ACM
Description
Issue tracking data have been used extensively to aid in predicting or recommending software development practices. Issue attributes typically change over time, but users may take attribute values as they stood at data collection time rather than as they stood at the time of the application scenario. We therefore investigate data leakage, which results from ignoring the chronological order in which the data were produced. Information leaked from the "future" makes prediction models misleadingly optimistic. We examine the existing literature to confirm the existence of data leakage and reproduce three typical studies (detecting duplicate issues, localizing issues, and predicting issue-fix time), adjusted to use appropriate data, to quantify the impact of the data leakage. We confirm that 11 out of 58 studies have a leakage problem, while 44 are suspected. We observe biased results caused by data leakage, although the extent is not striking. Attributes of …
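The kind of leakage described above can be illustrated with a minimal sketch. It is not the authors' method; the change log, attribute names, and the helper `attributes_as_of` are hypothetical and chosen only for illustration. The idea is to replay an issue's attribute changes up to the moment a prediction would be made, and compare that with the values seen at data collection time.

```python
from datetime import datetime

# Hypothetical change log for one issue: (timestamp, attribute, new value).
history = [
    (datetime(2018, 1, 1), "priority", "P3"),
    (datetime(2018, 1, 1), "component", "UI"),
    (datetime(2018, 1, 5), "component", "Core"),
    (datetime(2018, 2, 1), "priority", "P1"),
]

def attributes_as_of(history, when):
    """Replay changes up to and including `when`; later changes are ignored."""
    snapshot = {}
    for timestamp, attribute, value in sorted(history, key=lambda c: c[0]):
        if timestamp > when:
            break
        snapshot[attribute] = value
    return snapshot

# Assumed scenario: the model is applied shortly after the issue is reported,
# but the dataset was crawled months later.
prediction_time = datetime(2018, 1, 2)
collection_time = datetime(2018, 3, 1)

at_prediction = attributes_as_of(history, prediction_time)
at_collection = attributes_as_of(history, collection_time)

# Attributes whose collected value differs from what was known at prediction time.
leaked = {k for k in at_collection if at_collection[k] != at_prediction.get(k)}
```

Under these assumptions, training or evaluating on `at_collection` would give the model values that did not yet exist at `prediction_time`; using `at_prediction` keeps the features consistent with the application scenario.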
Total citations
[Bar chart: citations per year, 2019 to 2024]