Authors
Eduardo Berrocal, Leonardo Bautista-Gomez, Sheng Di, Zhiling Lan, Franck Cappello
Publication date
2015/6/15
Book
Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing
Pages
275-278
Description
Next-generation supercomputers are expected to have more components and, at the same time, consume several times less energy per operation. Consequently, the number of soft errors is expected to increase dramatically in the coming years. In this respect, techniques that leverage certain properties of iterative HPC applications (such as the smoothness of the evolution of a particular dataset) can be used to detect silent errors at the application level. In this paper, we present a pointwise detection model with two phases: one involving the prediction of the next expected value in the time series for each data point, and another determining a range (i.e., normal value interval) surrounding the predicted next-step value. We show that dataset correlation can be used to detect corruptions indirectly and limit the size of the dataset to monitor, taking advantage of the underlying physics of the simulation. Our results show …
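The two-phase idea in the description (predict each point's next value, then accept only values inside a normal interval around that prediction) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the predictor is plain linear extrapolation from the two previous time steps, the interval radius is a hypothetical `slack` factor times a recent prediction-error bound, and the names `predict_next`, `normal_interval`, and `detect` are invented.

```python
import numpy as np

def predict_next(prev2, prev1):
    """Phase 1 (assumed predictor): linear extrapolation per data point."""
    return 2.0 * prev1 - prev2

def normal_interval(predicted, max_recent_error, slack=2.0):
    """Phase 2 (assumed rule): interval of radius slack * recent error bound."""
    radius = slack * max_recent_error
    return predicted - radius, predicted + radius

def detect(prev2, prev1, observed, max_recent_error, slack=2.0):
    """Return a boolean mask marking points outside their normal interval."""
    predicted = predict_next(prev2, prev1)
    low, high = normal_interval(predicted, max_recent_error, slack)
    return (observed < low) | (observed > high)

# Three data points evolving smoothly; the last observed value is corrupted.
prev2 = np.array([1.0, 2.0, 3.0])
prev1 = np.array([1.1, 2.1, 3.1])
observed = np.array([1.2, 2.2, 40.0])

flags = detect(prev2, prev1, observed, max_recent_error=0.05)
print(flags)  # only the third point is flagged
```

The sketch relies on the smoothness property the description mentions: when a dataset evolves smoothly between iterations, even a simple predictor lands close to the true next value, so a narrow interval is enough to separate normal evolution from a large corruption.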
Total citations
[Chart: citations per year, 2015-2024]