Authors
Yi-Heng Zhu, Jun Hu, Xiao-Ning Song, Dong-Jun Yu
Publication date
2019/4/3
Journal
Journal of Chemical Information and Modeling
Volume
59
Issue
6
Pages
3057-3071
Publisher
American Chemical Society
Description
Accurate identification of protein–DNA binding sites is significant for both understanding protein function and drug design. Machine-learning-based methods have been extensively used for the prediction of protein–DNA binding sites. However, the data imbalance problem, in which the number of nonbinding residues (negative-class samples) is far larger than that of binding residues (positive-class samples), seriously restricts the performance improvements of machine-learning-based predictors. In this work, we designed a two-stage imbalanced learning algorithm, called ensembled hyperplane-distance-based support vector machines (E-HDSVM), to improve the prediction performance of protein–DNA binding sites. The first stage of E-HDSVM designs a new iterative sampling algorithm, called hyperplane-distance-based under-sampling (HD-US), to extract multiple subsets from the original imbalanced data set …
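The description is cut off before it gives the selection rule used by HD-US, so that rule is not reproduced here. As a rough illustration of the general idea of extracting multiple subsets from an imbalanced data set, the sketch below uses plain random under-sampling, which is a stand-in technique and not the authors' method. The function name `balanced_subsets`, the toy residue labels, and the 5-versus-95 class split are all hypothetical.

```python
import random


def balanced_subsets(positives, negatives, n_subsets, seed=0):
    """Draw n_subsets balanced training sets by random under-sampling.

    Each subset keeps every positive (binding) sample and pairs it with an
    equally sized random draw, without replacement, from the negative
    (nonbinding) samples. This is NOT HD-US; it only illustrates the
    "multiple subsets from an imbalanced data set" idea.
    """
    if len(negatives) < len(positives):
        raise ValueError("expected negatives to be the majority class")
    rng = random.Random(seed)
    subsets = []
    for _ in range(n_subsets):
        sampled_neg = rng.sample(negatives, len(positives))
        subsets.append((list(positives), sampled_neg))
    return subsets


# Toy data (hypothetical): 5 binding residues vs. 95 nonbinding residues.
positives = [f"pos{i}" for i in range(5)]
negatives = [f"neg{i}" for i in range(95)]
subsets = balanced_subsets(positives, negatives, n_subsets=3)
```

In an ensemble setting, one classifier would typically be trained on each balanced subset and their outputs combined; how E-HDSVM does this is not stated in the visible part of the description.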
Total citations
[Chart: citations per year, 2019 to 2024]