Authors
Tom van der Weide, Dimitris Papadopoulos, Oleg Smirnov, Michal Zielinski, Tim van Kasteren
Publication date
2017/5/14
Conference
Proceedings of the 1st Workshop on Data Management for End-to-End Machine Learning
Pages
9
Publisher
ACM / SIGMOD
Description
End-to-end machine learning pipelines that run in shared environments are challenging to implement. Production pipelines typically consist of multiple interdependent processing stages. Between stages, intermediate results are persisted to reduce redundant computation and to improve robustness. Those results might come in the form of datasets for data processing pipelines or in the form of model coefficients in the case of model training pipelines. Reusing persisted results improves efficiency but at the same time creates complicated dependencies. Every time one of the processing stages is changed, whether due to a code change or a parameter change, it becomes difficult to determine which datasets can be reused and which should be recomputed.
In this paper we build upon previous work to produce derivations of datasets to ensure that multiple versions of a pipeline can run in parallel while minimizing the …
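The paper's own derivation scheme is not reproduced in this excerpt; as a hedged illustration of the general idea it describes (deciding reuse vs. recomputation by deriving a key from a stage's code version, parameters, and upstream inputs), a minimal sketch might look like the following. All names (`stage_key`, `run_stage`, the in-memory `cache`) are hypothetical and not from the paper.

```python
import hashlib
import json

def stage_key(code_version: str, params: dict, upstream_keys: list) -> str:
    """Derive a deterministic key for a pipeline stage from its code version,
    its parameters, and the keys of its upstream stages. Any change to one of
    these inputs yields a new key, so stale intermediate results are never
    silently reused."""
    payload = json.dumps(
        {"code": code_version, "params": params, "upstream": sorted(upstream_keys)},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Persisted intermediate results, keyed by derivation (illustrative in-memory stand-in).
cache = {}

def run_stage(code_version, params, upstream_keys, compute):
    """Reuse a cached result when the derivation key matches; recompute otherwise."""
    key = stage_key(code_version, params, upstream_keys)
    if key not in cache:
        cache[key] = compute()
    return key, cache[key]
```

Under this sketch, two versions of a pipeline running in parallel share any stage whose code, parameters, and inputs are unchanged, while a changed stage (and everything downstream of it, via the propagated keys) gets a fresh key and is recomputed.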