A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems

Authors

Ifeanyi P Egwutuoha, David Levy, Bran Selic, Shiping Chen

Publication date

2013/9

Journal

The Journal of Supercomputing

Volume

Pages

1302-1326

Publisher

Springer US

Description

In recent years, High Performance Computing (HPC) systems have been shifting from expensive massively parallel architectures to clusters of commodity PCs to take advantage of cost and performance benefits. Fault tolerance in such systems is a growing concern for long-running applications. In this paper, we briefly review the failure rates of HPC systems and survey the fault tolerance approaches for these systems, along with the issues these approaches raise. Rollback-recovery techniques are discussed in detail because they are the techniques most widely used for long-running applications on HPC clusters. Specifically, the feature requirements of rollback-recovery are discussed and a taxonomy is developed for over twenty popular checkpoint/restart solutions. The intent of this paper is to aid researchers in the domain as well as to facilitate development of new …

Total citations

Cited by 342

Citations per year: 2013: 4, 2014: 19, 2015: 33, 2016: 36, 2017: 45, 2018: 56, 2019: 34, 2020: 33, 2021: 19, 2022: 21, 2023: 18, 2024: 16

Scholar articles

A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems

IP Egwutuoha, D Levy, B Selic, S Chen - The Journal of Supercomputing, 2013