Authors
Tom Van der Weide, Oleg Smirnov, Michal Zielinski, Dimitris Papadopoulos, Tim van Kasteren
Publication date
2016
Journal
Machine Learning Systems workshop at NIPS
Description
Real-world machine learning pipelines that run in shared environments are challenging to implement. Production pipelines typically consist of multiple interdependent processing stages, and between stages the intermediate results are persisted to reduce redundant computation and to improve robustness. Reusing persisted datasets improves efficiency, but it also creates complicated dependencies: every time a processing stage changes, whether through a code change or a parameter change, it becomes difficult to determine which datasets can be reused and which must be recomputed.
In this paper we propose a way to produce derivations of datasets that ensures multiple versions of a stage can run in parallel while minimizing redundant computation. Furthermore, we show how such versioning reduces the time spent in experimentation on setting up data and keeping track of results.
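The core idea described in the abstract can be illustrated with a minimal sketch: derive a version identifier for each stage's output from its code, its parameters, and the versions of its inputs, then reuse a persisted dataset only when that identifier matches. This is not the paper's exact scheme; the function and class names below are hypothetical, chosen for illustration.

```python
import hashlib
import json


def stage_version(code: str, params: dict, input_versions: list) -> str:
    """Derive a version id for a stage's output dataset from the stage's
    code, its parameters, and the version ids of its input datasets.
    Any change to code or parameters yields a new id, so old and new
    variants of a stage can coexist without clobbering each other."""
    payload = json.dumps(
        {"code": code, "params": params, "inputs": sorted(input_versions)},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


class DatasetCache:
    """Persist stage outputs keyed by version id, so an unchanged stage
    reuses its stored result instead of recomputing it."""

    def __init__(self):
        self._store = {}

    def get_or_compute(self, version: str, compute):
        # Only run the (possibly expensive) stage if this exact
        # version of its output has never been persisted before.
        if version not in self._store:
            self._store[version] = compute()
        return self._store[version]
```

With this scheme, rerunning a pipeline with unchanged code and parameters reproduces the same version ids and hits the cache, while an experiment with different parameters gets fresh ids and recomputes only the affected stages.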