Data is the fuel that drives machine learning, and the key to successful ML is accurately labeled data that machines can decipher. A data pipeline is essentially a series of actions or steps, usually automated, that move and merge data from various sources for analysis and visualization. Data pipelines are crucial for quick data analysis and insights: consolidating all your data in one central location allows you to extract meaningful information from it.
This article was originally published by Dataloop.
From Data Pipeline To Data Loop Phases
The entire data pipeline exists to serve the model, and the quality of the data loop has a direct effect on the model’s performance. Each model has its own data loop: a continuously learning, intelligent machine. But just as with school exams, good teaching requires constant feedback. At Dataloop, we’ve worked with millions of data items, and we’re well aware of the negative outcomes of using “bad” data. “Bad data” can mean data that hasn’t been properly annotated, data there simply isn’t enough of, or other issues that directly affect the model. To continuously improve a model, it is essential to constantly feed it accurate data so that its confidence level is maintained and increased. High-quality data reduces the dataset volumes required to reach top model accuracy, saving significant labeling resources. Poor data, on the other hand, leads to poor ML models with poor accuracy rates. Proper QA is therefore the first step in building a powerful computer vision application.
At Dataloop, we recommend implementing the following phases:
Phase 1: Data Acquisition
In this phase we gather new data sources for the model, changing the data variance in a meaningful way. This usually happens when more users are added, new customers are acquired, locations change, new web pages are crawled, etc.
In many cases, there are also solution-specific reasons for variance changes such as:
- Sensor changes for hardware companies
- Weather conditions for outdoor applications
- Content type for web analysis models
Phase 2: Sub-Sample Data
When new data is acquired from various sources, the volume received is usually far more than the resources at hand can handle, so sub-sampling is required: choosing the subset of the data that is actually going to be worked on.
Let’s take an example: a cat detector. If we were to take 100 videos of a neighbor’s cat, each one with 10,000 frames, we’d have a million images of that cat. Labeling a million images would be time-consuming and wasteful, since the data has very limited variance (same cat, same environment, and same camera).
This is where sub-sampling comes in. It is the most basic way to reduce information redundancy, and it proves very effective even when a simple random subsample is used. Sub-sampling can also take far more complex forms, especially when we need to sub-sample in stages from the edge to the cloud. Take a look at the following device topology, a popular topology in the “internet of things”:
Since the collection, transmission, management, and storage of data are expensive, collecting the entire network’s data isn’t realistic. Instead, you can sub-sample the network at different nodes:
- The device sub-samples the data to be sent to the base
- The base sub-samples the data to be sent to the cloud
- The cloud sub-samples the data to be sent to labeling
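The three-node flow above can be sketched with a simple random subsample at each hop. This is a toy illustration, not Dataloop’s implementation; the keep rates are invented for the example and would in practice be tuned to bandwidth, storage cost, and labeling budget.

```python
import random

# Hypothetical keep rates for each node in the topology (made up for
# the example); each node forwards only a random fraction of its data.
KEEP_RATES = {"device": 0.10, "base": 0.25, "cloud": 0.05}

def sub_sample(items, keep_rate, seed=None):
    """Keep a random fraction of items -- the simplest redundancy filter."""
    rng = random.Random(seed)
    return [item for item in items if rng.random() < keep_rate]

# Simulate one million frames flowing through the three-node topology.
frames = list(range(1_000_000))
at_base = sub_sample(frames, KEEP_RATES["device"], seed=0)   # device -> base
at_cloud = sub_sample(at_base, KEEP_RATES["base"], seed=1)   # base -> cloud
to_label = sub_sample(at_cloud, KEEP_RATES["cloud"], seed=2) # cloud -> labeling

print(len(frames), len(at_base), len(at_cloud), len(to_label))
```

Even this naive random scheme cuts a million frames down to roughly a thousand labeling candidates; smarter schemes would prioritize frames with high variance or model uncertainty.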
Phase 3: Label New Samples
In the context of the data loop, labeling refers to any human labeling activity. The labeling process often combines semi-automatic capabilities, pairing an algorithm with human work that accelerates it, in order to achieve both efficiency and quality.
Phase 4: Train the New Model
Training is the process of creating a model that answers a question from the data. You are essentially creating a machine that can produce annotations (answers) on unlabeled data, mimicking the past examples it has seen.
It is reasonable to assume that in many cases, the training phase will become a click of a button in the coming years. To elaborate: from now on, the focus is going to be on the data loop, data preparation, and data manipulation. Training and model architecture will just be “a click away.” You will no longer have to expend so much effort training a model or choosing an architecture, as all your effort will be focused on the data phase.
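To make the training phase concrete, here is a toy, dependency-free sketch: a perceptron that learns to mimic the labeled examples it has seen. It stands in for any real training procedure and is not tied to any particular framework; the “features” are invented for the example.

```python
# A minimal perceptron: the model learns to reproduce the annotations
# (answers) it was shown, which is the essence of the training phase.
def train(examples, labels, epochs=20, lr=0.1):
    """examples: list of feature vectors; labels: +1/-1 annotations."""
    w = [0.0] * len(examples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
            if pred != y:  # update weights only on mistakes
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def predict(model, x):
    w, b = model
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Toy "cat detector" features (hypothetical): [has_whiskers, has_wings]
X = [[1, 0], [1, 0], [0, 1], [0, 1]]
y = [1, 1, -1, -1]  # +1 = cat, -1 = not a cat
model = train(X, y)
print([predict(model, x) for x in X])  # → [1, 1, -1, -1]
```

The point of the sketch is that the trained machine now produces the same answers a human labeler gave, which it can then apply to unlabeled data.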
The Data Loop’s Life Cycle Adaptations
The basic data loop is a bit different along the lifecycle (maturity) stages of an AI model:
The Research Phase
The research phase is about stabilizing the recipe and ontology. In this phase, experiments and scoring are the most significant activities.
The main characteristics of this phase are:
- Mostly using existing data or collection mechanisms.
- Fast iterations with new labeling recipes on existing data.
- Many training sessions.
- Stabilization of model scoring.
- Getting satisfying lab results.
The research loop has three levels of loop nesting:
The Scaling Phase
In this phase, data collection and variance are the most significant activities. This phase is about increasing dataset coverage (information collection) and generalizing our lab results on a large number of users, cases, and customers.
The main characteristics of this phase:
- Most of the data is collected from our users/customers.
- Modeling is mainly about minor tweaks and re-training.
- The recipe is stable, but ontology might extend slightly.
- Edge cases and anomalies get significant focus.
- Real-time labeling for strict monitoring might be involved.
Please note: real-time here refers to labeling results that are sent to the customer as business value, so the turnaround cycle varies from seconds to days.
The scaling loop has two levels of loop nesting if monitoring is applied (one for speed performance and one for accuracy):
Weak vs. Full Labeling
Scaling in production means we work with live customer data, and our deployments can be highly sensitive to errors on the customer side. To overcome this, we often run two parallel operations:
- Latency optimized: This operation is about supplying fast feedback to customers. Since human labeling is slow (compared to machines), fast action is needed in order to share clean reports/alerts with customers.
- Precision optimized: Once a sample has passed through the production recipe it will usually have a single, weak label – for example, is there a cat in the image? For training recipes, we usually need a much more accurate label, like a bounding box or a polygon around the cat.
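The difference between the two label granularities can be illustrated with two hypothetical label records; the field names and the small helper are made up for the example and do not reflect any particular platform’s format.

```python
# A weak label: one fast, image-level answer ("is there a cat?"),
# produced by the latency-optimized path for customer-facing feedback.
weak_label = {
    "item": "frame_000123.jpg",
    "label": "cat",
    "source": "production",
}

# A full label: a localized annotation (bounding box around the cat),
# produced by the slower, precision-optimized path for training.
full_label = {
    "item": "frame_000123.jpg",
    "annotations": [
        {
            "label": "cat",
            "type": "box",
            # pixel coordinates of the bounding box (invented values)
            "coordinates": {"x": 140, "y": 60, "w": 220, "h": 180},
        }
    ],
    "source": "training",
}

def is_trainable(record):
    """A detector can only train on records with localized annotations."""
    return bool(record.get("annotations"))

print(is_trainable(weak_label), is_trainable(full_label))  # → False True
```

The weak label is enough to alert a customer, but only the full label can teach a detection model where the cat actually is.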
The Monitoring Phase
Once the model’s results are sufficient, there is no need for full-training labeling cycles. This is where edge cases and corner cases are introduced.
In monitoring, a small random sample is usually scored with a “yes/no score,” making sure the model’s performance is not degrading over time.
The monitoring loop has two goals: to identify changes in the real-world data distribution, and to catch target signals that are not reflected in the training data used to generate our model. The labeling used in this phase is usually a simple validation of the model’s result.
Once a drift has been detected, fixing its root cause is likely to change the ontology or recipe, causing a temporary fallback to the scaling or research phase.
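The monitoring check described above can be sketched in a few lines of Python. This is a toy illustration, not Dataloop’s implementation: the sample rate, baseline accuracy, and tolerance are invented parameters, and the validator functions stand in for a human “yes/no” check.

```python
import random

def monitor(is_correct, stream, sample_rate=0.01, baseline=0.95,
            tolerance=0.05, seed=0):
    """Score a small random sample with a yes/no check and flag drift
    when accuracy falls meaningfully below the baseline."""
    rng = random.Random(seed)
    sample = [item for item in stream if rng.random() < sample_rate]
    if not sample:
        return {"accuracy": None, "drift": False}
    accuracy = sum(is_correct(item) for item in sample) / len(sample)
    return {"accuracy": accuracy, "drift": accuracy < baseline - tolerance}

# Simulated validators: one model that degraded, one that stayed healthy.
degraded = monitor(lambda item: item % 2 == 0, range(100_000))   # ~50% correct
healthy = monitor(lambda item: item % 100 != 0, range(100_000))  # ~99% correct
print(degraded["drift"], healthy["drift"])
```

When the flag fires, the loop falls back to the scaling or research phase, as described above.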
During the solution lifecycle transition, it is very common for the model to fall back to the previous phase, either due to new requirements from the product or due to a meaningful accuracy degradation (model drift). We differentiate between two fallback types: major and minor changes.
Minor Change – Model Architecture Is Compatible With the Change
A minor change is a change that requires a modification in our ontology but leaves our model architecture in place. For example, we want to add a new type of cat to our well-working cat detection app.
The minor change represents a small change in the signal features’ distribution properties and its signal-to-noise ratio.
Major Change – New Modeling Architecture is Required
A major change takes place when we need to change the entire recipe, which usually also means re-labeling a significant part of our datasets. For example, if we want to separate lynx from bobcat, the signal separation is now focused on ears, tail, and legs, and requires new labeling instructions and datasets.
The major change represents a new signal definition, in many cases leading to a new distribution and new signal-to-noise ratios.
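The contrast between the two fallback types can be sketched with hypothetical recipe records; the field names and helper functions are illustrative only, not a real platform API.

```python
# A hypothetical labeling recipe for the cat-detection example.
recipe = {
    "task": "detection",
    "labels": ["cat"],
    "instructions": "Draw a box around every cat.",
}

def minor_change(recipe, new_label):
    """Compatible change: extend the ontology, keep the architecture and
    existing labels; nothing already labeled needs re-work."""
    updated = dict(recipe, labels=recipe["labels"] + [new_label])
    return updated, []  # empty re-label queue

def major_change(recipe, new_labels, new_instructions, dataset):
    """New signal definition: labels and instructions are replaced, so a
    significant part of the dataset must be re-labeled."""
    updated = dict(recipe, labels=new_labels, instructions=new_instructions)
    stale = [item for item in dataset if item["label"] not in new_labels]
    return updated, stale  # items queued for re-labeling

dataset = [{"id": 1, "label": "cat"}, {"id": 2, "label": "cat"}]
r1, relabel1 = minor_change(recipe, "kitten")
r2, relabel2 = major_change(
    recipe, ["lynx", "bobcat"],
    "Separate lynx from bobcat by ears, tail, and legs.", dataset)
print(len(relabel1), len(relabel2))  # → 0 2
```

The minor change leaves the existing dataset usable, while the major change sends every old annotation back through the labeling phase.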
Let’s take a step back and just recap the data loop phases. We talked about data acquisition, sub-sampling data, labeling new samples, and training the new model. What’s next?
Here at Dataloop
At Dataloop, we offer an end-to-end platform for everything from labeling to data pipelines and beyond. Automation is at the forefront of what we do, allowing you to reduce error rates, improve repeatability, and speed up your ROI. Our ‘human in the loop’ function is key to providing the validation needed for business growth and success.
With continuous validation from your production environment, your organization gets a machine learning model that is constantly learning from its own behavior, assuring it behaves the same in the wild as it does in the lab.