Authors
Tom van der Weide, Dimitris Papadopoulos, Oleg Smirnov, Michal Zielinski, Tim van Kasteren
Publication date
2017/5/14
Conference
Proceedings of the 1st Workshop on Data Management for End-to-End Machine Learning
Pages
9
Publisher
ACM / SIGMOD
Description
End-to-end machine learning pipelines that run in shared environments are challenging to implement. Production pipelines typically consist of multiple interdependent processing stages. Between stages, intermediate results are persisted to reduce redundant computation and to improve robustness. Those results might come in the form of datasets for data processing pipelines or in the form of model coefficients in the case of model training pipelines. Reusing persisted results improves efficiency but at the same time creates complicated dependencies. Every time one of the processing stages is changed, whether due to a code change or a parameter change, it becomes difficult to determine which datasets can be reused and which should be recomputed.
In this paper we build upon previous work to produce derivations of datasets to ensure that multiple versions of a pipeline can run in parallel while minimizing the …
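The paper's own derivation scheme is not reproduced in this excerpt; as a hedged illustration of the general idea it describes (deciding reuse vs. recomputation by deriving a key from a stage's code version, parameters, and upstream inputs), a minimal sketch might look like the following. All names (`stage_key`, `run_stage`, the in-memory `cache`) are hypothetical and not from the paper.

```python
import hashlib
import json

def stage_key(code_version: str, params: dict, upstream_keys: list) -> str:
    """Derive a deterministic key for a pipeline stage from its code version,
    its parameters, and the keys of its upstream stages. Any change to one of
    these inputs yields a new key, so stale intermediate results are never
    silently reused."""
    payload = json.dumps(
        {"code": code_version, "params": params, "upstream": sorted(upstream_keys)},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Persisted intermediate results, keyed by derivation (illustrative in-memory stand-in).
cache = {}

def run_stage(code_version, params, upstream_keys, compute):
    """Reuse a cached result when the derivation key matches; recompute otherwise."""
    key = stage_key(code_version, params, upstream_keys)
    if key not in cache:
        cache[key] = compute()
    return key, cache[key]
```

Under this sketch, two versions of a pipeline running in parallel share any stage whose code, parameters, and inputs are unchanged, while a changed stage (and everything downstream of it, via the propagated keys) gets a fresh key and is recomputed.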