Authors
Sven Köhler, Sean Riddle, Daniel Zinn, Timothy McPhillips, Bertram Ludäscher
Publication date
2011
Conference
Scientific and Statistical Database Management: 23rd International Conference, SSDBM 2011, Portland, OR, USA, July 20-22, 2011. Proceedings 23
Pages
207-224
Publisher
Springer Berlin Heidelberg
Description
Scientific workflow systems frequently are used to execute a variety of long-running computational pipelines prone to premature termination due to network failures, server outages, and other faults. Researchers have presented approaches for providing fault tolerance for portions of specific workflows, but no solution handles faults that terminate the workflow engine itself when executing a mix of stateless and stateful workflow components. Here we present a general framework for efficiently resuming workflow execution using information commonly captured by workflow systems to record data provenance. Our approach facilitates fast workflow replay using only such commonly recorded provenance data. We also propose a checkpoint extension to standard provenance models to significantly reduce the computation needed to reset the workflow to a consistent state, thus resulting in much shorter re-execution …
Total citations
201120122013201420152016201720182019143231312
Scholar articles
S Köhler, S Riddle, D Zinn, T McPhillips, B Ludäscher - Scientific and Statistical Database Management: 23rd …, 2011