Authors
Tom Van der Weide, Oleg Smirnov, Michal Zielinski, Dimitris Papadopoulos, Tim van Kasteren
Publication date
2016
Journal
Machine Learning Systems workshop at NIPS
Description
Real-world machine learning pipelines that run in shared environments are challenging to implement. Production pipelines typically consist of multiple interdependent processing stages, and between stages the intermediate results are persisted to reduce redundant computation and to improve robustness. Reusing persisted datasets improves efficiency, but it also creates complicated dependencies: every time a processing stage changes, whether through a code change or a parameter change, it becomes difficult to determine which datasets can be reused and which must be recomputed.
In this paper we propose a way to produce derivations of datasets that ensures multiple versions of a stage can run in parallel while minimizing redundant computation. Furthermore, we show how such versioning reduces the time spent in experimentation on setting up data and keeping track of results.
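The core idea described in the abstract can be illustrated with a minimal sketch: derive a version identifier for each stage's output from its code, its parameters, and the versions of its inputs, then reuse a persisted dataset only when that identifier matches. This is not the paper's exact scheme; the function and class names below are hypothetical, chosen for illustration.

```python
import hashlib
import json


def stage_version(code: str, params: dict, input_versions: list) -> str:
    """Derive a version id for a stage's output dataset from the stage's
    code, its parameters, and the version ids of its input datasets.
    Any change to code or parameters yields a new id, so old and new
    variants of a stage can coexist without clobbering each other."""
    payload = json.dumps(
        {"code": code, "params": params, "inputs": sorted(input_versions)},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


class DatasetCache:
    """Persist stage outputs keyed by version id, so an unchanged stage
    reuses its stored result instead of recomputing it."""

    def __init__(self):
        self._store = {}

    def get_or_compute(self, version: str, compute):
        # Only run the (possibly expensive) stage if this exact
        # version of its output has never been persisted before.
        if version not in self._store:
            self._store[version] = compute()
        return self._store[version]
```

With this scheme, rerunning a pipeline with unchanged code and parameters reproduces the same version ids and hits the cache, while an experiment with different parameters gets fresh ids and recomputes only the affected stages.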