Optimization of cloud task processing with checkpoint-restart mechanism

Authors

Sheng Di, Yves Robert, Frédéric Vivien, Derrick Kondo, Cho-Li Wang, Franck Cappello

Publication date

2013/11/17

Book

Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

Pages

1-12

Description

In this paper, we aim at optimizing fault-tolerance techniques based on a checkpointing/restart mechanism, in the context of cloud computing. Our contribution is three-fold. (1) We derive a fresh formula to compute the optimal number of checkpoints for cloud jobs with varied distributions of failure events. Our analysis is not only generic with no assumption on failure probability distribution, but also attractively simple to apply in practice. (2) We design an adaptive algorithm to optimize the impact of checkpointing regarding various costs like checkpointing/restart overhead. (3) We evaluate our optimized solution in a real cluster environment with hundreds of virtual machines and Berkeley Lab Checkpoint/Restart tool. Task failure events are emulated via a production trace produced on a large-scale Google data center. Experiments confirm that our solution is fairly suitable for Google systems. Our optimized formula …

Total citations

Cited by 105

Citations per year: 2012: 1, 2013: 1, 2014: 8, 2015: 15, 2016: 18, 2017: 11, 2018: 14, 2019: 7, 2020: 7, 2021: 2, 2022: 6, 2023: 7, 2024: 1
