Any machine learning project involves continuously evolving code, datasets, and models. This process requires data versioning to track and manage the changes made to the dataset. ArtiV is a smart version control system for large files, especially suited to machine learning projects with large amounts of data and metadata. This article presents a modern approach to data versioning for machine learning projects using ArtiV.
This article was originally published by InfuseAI.
ArtiV works with your own storage, such as local, NFS, or S3. It is open source, ready to use, and actively maintained. You can find out more on GitHub, and install it with Homebrew:
brew tap infuseai/artiv
brew install artiv
The Data-Centric Approach
The development of a machine learning application is based around the constant evolution of code, datasets, and models. During development, where you choose to focus your attention can have a big influence on the results you achieve. For instance, in Kaggle competitions everyone works on the same dataset and focuses on selecting the best model framework and tuning parameters to achieve the best performance. According to Andrew Ng, this is the “model-centric” method. However, Ng believes that the “data-centric” approach is the better one: rather than changing the model framework, you should focus on cleaning, adjusting, and strengthening the dataset. This will lead to better results. As the saying goes, “garbage in, garbage out”, or “from big data to good data”.
The process of strengthening your dataset requires tracking and managing the changes you make. In traditional software development, tools such as Git, a distributed version control system (VCS), manage the versions of your software and data. However, because a distributed VCS downloads the complete repository when it is cloned, such tools are unsuitable for machine learning applications that deal with huge amounts of data and metadata.
Data is Everywhere
In machine learning, data is everywhere. For instance, the dataset used for training needs to be tracked. Then there is the data produced by that training, such as experiment parameters, metrics, and logs. Even the models count as data. Then, when the ‘golden model’ is finally selected and released, model versioning comes into play, producing more data. Add your training code and application code, and it’s clear that everything is data. That’s why version control for a machine learning application is an extremely difficult problem to solve.
Solving the data versioning problem for machine learning requires the following important considerations:
- Huge datasets: Git isn’t suitable for these, so you might want to store your data on S3, NFS, or your own MinIO server.
- Multi-repo in nature: The data produced from datasets, models, experiments, and code is all valuable, and so is tracking its versions. A single repository can’t track all of this, and tagging is not a viable solution.
- Lineage tracing: The dependencies at each stage also need to be tracked. When your product’s environment produces an undesirable output, any number of inputs could have caused the issue. How do you go about debugging it? Which model version is your deployment, or batch prediction, using? Which dataset, training code, base model, or random seeds were used to create the model? Where did the dataset come from?
- Automated testing and error debugging: Without versioning, it’s next to impossible to perform automated testing on datasets and models. When your tests produce differing results, you need to be able to find the root cause of the change.
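The lineage questions above can be made concrete with a small sketch. This is only an illustration of the idea, not ArtiV functionality: the `LineageRecord` class, the `trace` helper, and every artifact name and version below are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LineageRecord:
    """Hypothetical record: one artifact version and the inputs that produced it."""
    artifact: str
    version: str
    inputs: dict = field(default_factory=dict)  # upstream artifact name -> version
    params: dict = field(default_factory=dict)  # e.g. the random seed used


def trace(records, artifact, version):
    """Return every (artifact, version) pair upstream of the given one, nearest first."""
    index = {(r.artifact, r.version): r for r in records}
    seen, order, queue = set(), [], [(artifact, version)]
    while queue:
        key = queue.pop(0)
        if key in seen:
            continue
        seen.add(key)
        order.append(key)
        record = index.get(key)
        if record is not None:
            # Follow each recorded input one level further upstream.
            queue.extend(record.inputs.items())
    return order[1:]  # drop the starting artifact itself


# Made-up example data: a model built from a dataset, training code, and a base model.
records = [
    LineageRecord("dataset", "v3", inputs={"raw-data": "2021-12"}),
    LineageRecord(
        "model", "v7",
        inputs={"dataset": "v3", "training-code": "a1b2c3", "base-model": "v1"},
        params={"seed": 42},
    ),
]
upstream = trace(records, "model", "v7")
print(upstream)
```

With records like these, answering "which dataset and code produced this model?" becomes a lookup rather than guesswork, but only if every stage is versioned in the first place.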
Managing Large Datasets
All of this helps explain the importance of data versioning. Right now there aren’t many solutions. Let’s consider a couple:
- Using S3 or NFS folders for version control. Does that bring back memories of early software development? It’s actually not as bad as it sounds: probably 99% of developers still do this in some form, then zip or tar the folders for faster transfer. The problem is that there’s no real version control. There’s no convenient way to add a commit message as with Git, or to view a commit log. Furthermore, there’s no way to compare versions. When it comes time to make some small changes, maybe just adding a few images or moving some files around, you need to save a whole new version. This method might be a convenient way to access your data, but when it comes to managing the evolution of your dataset, it’s anything but.
- Another possibility is Git LFS or DVC. To use Git LFS, though, the Git server must support it. If you’re using GitHub or GitLab, then LFS is a convenient choice, but if you want to use your own S3 storage, it’s not that easy. DVC does things differently. On the client side you use DVC commands to manage your data files, and the metadata is stored in a Git repo. Keeping track of things can be a little confusing, as you need to understand what is stored in the Git repo and what is stored elsewhere. Switching between DVC and Git commands can also be a headache.
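To illustrate what plain versioned folders are missing, here is a rough sketch of comparing two folder snapshots by content hash. It is not how ArtiV, Git LFS, or DVC implement diffing; the `manifest` and `diff_manifests` helpers and the sample file names are assumptions made for this example.

```python
import hashlib
import tempfile
from pathlib import Path


def manifest(root):
    """Map each file's relative path to the SHA-256 digest of its contents."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_manifests(old, new):
    """Classify paths as added, removed, or modified between two manifests."""
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "modified": sorted(p for p in set(old) & set(new) if old[p] != new[p]),
    }


# Two made-up snapshots of a dataset folder: one file changed, one file added.
with tempfile.TemporaryDirectory() as tmp:
    v1, v2 = Path(tmp, "v1"), Path(tmp, "v2")
    v1.mkdir()
    v2.mkdir()
    (v1 / "a.png").write_bytes(b"cat")
    (v1 / "b.png").write_bytes(b"dog")
    (v2 / "a.png").write_bytes(b"cat")
    (v2 / "b.png").write_bytes(b"dog, relabeled")
    (v2 / "c.png").write_bytes(b"bird")
    changes = diff_manifests(manifest(v1), manifest(v2))

print(changes)
```

Note that the second snapshot still had to duplicate the unchanged `a.png` on disk, which is exactly the overhead of the folder-per-version approach.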
Data ❤ Versioning
This made us think — why not store the version control information with the data itself? If you store the dataset in NFS, then put the versioning data there, too. If you store your data in S3, then just use S3 as your version control repo. We can just take the distributed out of ‘Git distributed VCS’, and make a centralized VCS — your storage, wherever that may be. Then use one command set to manage it. Simple, and more intuitive.
This is why we created ArtiV. The idea for ArtiV grew out of the solutions outlined above.
ArtiV is a command line tool that supports using either local or S3 storage for your version control repository. To use S3 with ArtiV, all you need to do is prepare a credentials file, as you would for the AWS CLI, and then you can start using ArtiV to manage your versioned data. Behind the scenes, ArtiV stores data in a way similar to Git, except that it’s stored on S3. Locally, all you need to do is define a ‘workspace’, and on top of this you can organize your data with push, pull, log, diff, list, and tag. Even more conveniently, in a model training environment you can use commands similar to wget and scp to pull data from and push data to the repo. Simple and intuitive is our philosophy.
ArtiV is completely open source. We’re currently hard at work developing it. We hope that machine learning practitioners will try it out and find it useful. For the future, we’ve got lots of machine learning-related features planned.
ArtiV is easy to grasp in around 5 minutes — we’re aiming for a balance between ease-of-use and feature richness.
The main features of ArtiV:
- Use your own storage: If you store data in NFS or S3, you can use your existing storage.
- No additional server required: ArtiV is a CLI tool. There is no server or gateway to install or operate.
- Multiple backend support: Currently we support local, NFS (by local repo), and S3. More coming soon.
- Reproducible: A commit is stored in a single file and cannot be changed. There is no way to add/remove/modify a single file in a commit.
- Expose your data publicly: Expose your repository with an HTTP endpoint, then download your data from it.
- Smart storage and transfer: For duplicate files, there is only one instance stored in the artifact repository. If a file has been uploaded by other commits, no upload is required because we know the file is already there in the repository. Under the hood, we use content-addressable storage to put the objects.
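The last two ideas in this list, immutable single-file commits and content-addressable storage, can be sketched in a few lines. This is a toy model under stated assumptions: the `ToyRepo` class, its `objects`/`commits` directory layout, and the JSON commit format are invented for illustration and are not ArtiV's actual on-disk format.

```python
import hashlib
import json
import tempfile
from pathlib import Path


class ToyRepo:
    """Toy content-addressable repository (hypothetical layout, not ArtiV's)."""

    def __init__(self, root):
        self.objects = Path(root) / "objects"
        self.commits = Path(root) / "commits"
        self.objects.mkdir(parents=True, exist_ok=True)
        self.commits.mkdir(parents=True, exist_ok=True)
        self.uploads = 0  # number of blobs actually written

    def _put(self, data):
        """Store a blob under the hash of its contents, skipping known blobs."""
        digest = hashlib.sha256(data).hexdigest()
        target = self.objects / digest
        if not target.exists():  # an earlier commit may already have stored it
            target.write_bytes(data)
            self.uploads += 1
        return digest

    def commit(self, files, message):
        """Store all file contents, then write one commit file named by its own hash."""
        entries = {path: self._put(data) for path, data in sorted(files.items())}
        body = json.dumps({"message": message, "files": entries}, sort_keys=True).encode()
        commit_id = hashlib.sha256(body).hexdigest()
        (self.commits / commit_id).write_bytes(body)
        return commit_id

    def checkout(self, commit_id):
        """Rebuild the path -> contents mapping recorded by a commit."""
        meta = json.loads((self.commits / commit_id).read_bytes())
        return {path: (self.objects / h).read_bytes() for path, h in meta["files"].items()}


with tempfile.TemporaryDirectory() as tmp:
    repo = ToyRepo(tmp)
    first = repo.commit({"a.png": b"cat", "b.png": b"dog"}, "initial dataset")
    after_first = repo.uploads
    second = repo.commit(
        {"a.png": b"cat", "b.png": b"dog", "c.png": b"bird"}, "add one image"
    )
    after_second = repo.uploads
    restored = repo.checkout(first)

print(after_first, after_second)
```

In this sketch the second commit writes only the one new blob, because the other two are already addressed by their hashes. Naming the commit file by the hash of its contents is one way to get immutability: any change to the file list would produce a different commit ID rather than altering the existing one.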
Watch ArtiV in action to see for yourself: