Using version control systems (VCS) such as Git is an established software engineering practice, yet it is challenging for machine learning (ML) projects. Artifacts produced by ML pipelines, such as datasets, pre-processed data, and trained models, are often large. Once generated, they have to be stored on disk, since reproducing them repeatedly is expensive. Unfortunately, traditional VCSs handle such large artifacts poorly, while forgoing version control altogether makes results hard to reproduce.
DVC (Data Version Control) not only version-controls large artifacts but also keeps track of the commands that are run to produce them. It detects changes to the input data and knows which steps in the pipeline have to be rerun to keep the final result up to date. By adopting DVC, the machine learning community can take a big step towards reproducible research.