Authors
Thomas G Dietterich
Publication date
1998/10/1
Source
Neural computation
Volume
10
Issue
7
Pages
1895-1923
Publisher
MIT Press
Description
This article reviews five approximate statistical tests for determining whether one learning algorithm outperforms another on a particular learning task. These tests are compared experimentally to determine their probability of incorrectly detecting a difference when no difference exists (type I error). Two widely used statistical tests are shown to have high probability of type I error in certain situations and should never be used: a test for the difference of two proportions and a paired-differences t test based on taking several random train-test splits. A third test, a paired-differences t test based on 10-fold cross-validation, exhibits somewhat elevated probability of type I error. A fourth test, McNemar's test, is shown to have low type I error. The fifth test is a new test, 5 × 2 cv, based on five iterations of twofold cross-validation. Experiments show that this test also has acceptable type I error. The article also measures the …
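The 5 × 2 cv statistic described above can be sketched as follows. This is a minimal illustration, assuming the per-fold error-rate differences between the two algorithms have already been computed; the function name and the synthetic difference values are illustrative, not from the article.

```python
import math

def five_by_two_cv_t(diffs):
    """Compute the 5x2cv paired t statistic.

    diffs: five (p1, p2) pairs, where p1 and p2 are the error-rate
    differences (algorithm A minus algorithm B) on the two folds of
    each of the five twofold cross-validation replications.
    """
    s2 = []
    for p1, p2 in diffs:
        pbar = (p1 + p2) / 2.0                      # mean difference in this replication
        s2.append((p1 - pbar) ** 2 + (p2 - pbar) ** 2)  # variance estimate
    denom = math.sqrt(sum(s2) / 5.0)
    # Numerator is the difference from the first fold of the first
    # replication; the statistic is compared against a t distribution
    # with 5 degrees of freedom.
    return diffs[0][0] / denom

# Illustrative synthetic differences (not real experimental results).
diffs = [(0.02, 0.03), (0.01, 0.04), (0.03, 0.02), (0.02, 0.02), (0.04, 0.01)]
t_stat = five_by_two_cv_t(diffs)
```

Under the null hypothesis of no difference between the algorithms, a |t| exceeding the two-sided critical value of the t distribution with 5 degrees of freedom (about 2.571 at the 0.05 level) would indicate a significant difference.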
Total citations
[Per-year citation histogram, 1998–2024 (chart residue; exact yearly counts not recoverable)]