Authors
Ifeanyi P Egwutuoha, Shiping Chen, David Levy, Bran Selic
Publication date
2012/5/13
Conference
2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
Pages
709-710
Publisher
IEEE
Description
Cloud computing offers new capacity and flexibility to high performance computing (HPC) applications by provisioning a large number of virtual machines for computationally intensive applications. Fault tolerance allows HPC systems in the cloud with multiple nodes to complete execution of computationally intensive applications in the presence of faults. The most commonly used fault tolerance technique for HPC is checkpoint/restart. However, checkpoint/restart increases the wall clock time of application execution, which increases the execution cost. In this paper we present a fault tolerance framework for high performance computing in the cloud. This framework proposes using process level redundancy (PLR) techniques to reduce the wall clock time of the execution of computationally intensive applications.
Total citations
[Bar chart: citations per year, 2012-2024]
Scholar articles
IP Egwutuoha, S Chen, D Levy, B Selic - 2012 12th IEEE/ACM International Symposium on …, 2012