Five Big Problems With Labeled Data
The widespread adoption of artificial intelligence and computer vision across industries, from manufacturing and retail to security, agriculture, and healthcare, has sharply increased the demand for labeled data. Labeled data is the key to training the models that will transform the way we work, so it makes perfect sense that it, along with data annotation, would suddenly be in such high demand.
But at Zumo Labs, many of our incoming customers share a common pain point: labeled training data has become a significant bottleneck. By some estimates (per Cognilytica), data wrangling, meaning sourcing labeled data and managing the training data pipeline, can take up to 80% of AI project time. Even if that figure is exaggerated by a factor of two, perhaps by exasperated engineers who would rather be making headway on their projects, it’s still far too much time. In our work, we’ve identified five big problems with labeled data.
PROBLEM 1: Labeled Data Must Be Sourced or Produced
The first issue is perhaps the most glaring—labeled data must be sourced or produced. Where are you getting your labeled data? This will depend on your specific use case, but there are only so many options. If you’re lucky, you may be able to find a publicly available labeled dataset that’s “good enough” for the problem you’re trying to solve. But if you’re like most folks, your model is going to require a custom solution; you’ll need to collect raw data from your own cameras and then label it.
And what are your options for labeling it? You can label the data in-house, which requires you to build out a team of subject matter experts, or you can turn to a third-party vendor such as Amazon Mechanical Turk or a data annotation service. These services are usually limited to simple annotations, such as bounding boxes and basic categories, that are easy for a human labeler to produce at scale. If your problem requires nuanced or proprietary labels (a CAD model of an engine with 75+ subcomponents, for example), you will have no choice but to build and train a team in-house. That’s what Tesla has done.
An added challenge here is that you can only source (and subsequently label) data that already exists. If your cameras are destined for a piece of hardware that has yet to be manufactured, or you want to detect failures so rare that you simply don’t have enough examples to train a robust detection model, you’re out of luck.
PROBLEM 2: Quality of Data Labeling Is Lacking
Assuming you’re able to source a representative and sufficiently balanced batch of images, the next big problem is the quality of the data labeling. Precise labels are critical to the performance of a model, but publicly available datasets have repeatedly been shown to contain questionable, and sometimes alarming, labeling issues, as documented in reporting on ImageNet containing slurs. While labeling services often have internal quality assurance checks in place, a customer is still dependent on the domain knowledge of the labeler and the person responsible for QA.
Quality goes beyond the accuracy of the class labels, though. The precision of a bounding box matters in training as well. Human labelers may draw a bounding box to the best of their ability, but they can’t always fairly assess images that present obstacles such as occlusion. Likewise, without the full context of the images contained within the dataset, or even knowledge of how the dataset will be used, the quality of their labeling may suffer.
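To make the point about box precision concrete, here is a minimal sketch that scores a loosely drawn box against a tight reference box using intersection over union (IoU), a common overlap measure in object detection. The function name `iou`, the corner-coordinate box format, and the example coordinates are all assumptions chosen for illustration; none of them come from a specific labeling tool.

```python
def iou(box_a, box_b):
    """Intersection over union for two (x_min, y_min, x_max, y_max) boxes."""
    # Corners of the overlapping rectangle, if there is one.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Clamp at zero so boxes that do not overlap contribute no area.
    inter = max(0, ix_max - ix_min) * max(0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical example: a 100 x 100 pixel object, and a hand-drawn box
# that is 10 pixels too generous on every side.
tight_box = (100, 100, 200, 200)
loose_box = (90, 90, 210, 210)

print(f"IoU of loose box vs. tight box: {iou(tight_box, loose_box):.2f}")
```

In this made-up example, a box drawn just 10 pixels too wide on each side overlaps the tight box by only about 69%, which suggests how quickly small amounts of slack can add up.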
PROBLEMS 3 & 4: Data Labeling Is Slow and Costly
Because data labeling requires a human in the loop, the twin issues of price and speed become immediately apparent. Start with price: data labeling is not cheap. Spending a computer vision engineer’s time on labeling rather than on machine learning is an inefficient use of resources. But building out a dedicated labeling team in-house is also a huge gamble, especially if your future volume of labeling needs is hard to predict. Meanwhile, the cost of using a third-party labeling service often maps closely to the volume of images you need labeled, because each additional image means a little more work for a labeler.
As for speed, training can only happen as fast as you’re able to acquire labeled data, which depends on both your ability to capture and share the data in question and the turnaround time of your chosen labeler. If for whatever reason you need to increase the size of the dataset, say to introduce new edge cases, you must once again run through the full cycle. This introduces a challenging bottleneck.
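A back-of-the-envelope sketch shows how both cost and turnaround grow with volume. It assumes a purely linear model, and every number in it (the per-image price, the daily throughput, the dataset sizes) is a hypothetical placeholder rather than a quote from any real labeling service.

```python
import math


def labeling_estimate(num_images, price_per_image, images_per_day):
    """Return (total_cost, turnaround_days) under a simple linear model."""
    total_cost = num_images * price_per_image
    # A partial day of labeling still costs a full day of waiting.
    turnaround_days = math.ceil(num_images / images_per_day)
    return total_cost, turnaround_days


# Hypothetical figures, for illustration only.
PRICE_PER_IMAGE = 0.25   # dollars per labeled image
IMAGES_PER_DAY = 5_000   # labeling throughput

initial_cost, initial_days = labeling_estimate(50_000, PRICE_PER_IMAGE, IMAGES_PER_DAY)
extra_cost, extra_days = labeling_estimate(2_000, PRICE_PER_IMAGE, IMAGES_PER_DAY)

print(f"Initial dataset: ${initial_cost:,.2f}, {initial_days} days")
print(f"Edge-case top-up: ${extra_cost:,.2f}, {extra_days} day(s)")
```

Even in this simplified model, every new batch of edge cases means paying again and waiting again, and the estimate leaves out the time spent capturing and sharing the raw images in the first place.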
PROBLEM 5: Privacy Is Not Guaranteed
Finally, where datasets that include humans are concerned, storing labeled data introduces unnecessary privacy considerations. If you’re working with a publicly available labeled dataset such as MegaFace, you might assume you’re in the clear. But since that dataset consists primarily of images scraped from the internet, there are real liabilities. Facebook was sued for violating the Illinois Biometric Information Privacy Act (“BIPA”) over its use of users’ photos to power its Tag Suggestions feature. BIPA, notoriously one of the most restrictive state laws around biometric privacy, requires written consent before collecting someone’s fingerprints, retina or iris scan, voiceprint, scan of hand, or face geometry. Facebook agreed in 2020 to pay $550 million to settle the suit.
Progress on better artificial intelligence and higher quality computer vision models is hampered by a dependency on sourced and labeled data. We believe that there’s a better way.
SOLUTION: Synthetic Data
Synthetic data, in this case computer-generated imagery that simulates the specifics of sourced data, is the solution to nearly all of the specific failings of labeled data. Since the data is simulated and will never contain real humans, compliance with privacy laws such as BIPA, GDPR, or CCPA is guaranteed. It is, over time, both cheaper and faster than sourcing and labeling data. The quality is unparalleled, thanks to pixel-perfect annotations and ground truth (which solves for occlusion, creates perfect depth maps, etc.). And perhaps best of all, synthetic data addresses the issue of data scarcity: you do not need to wait to collect a critical volume of real data to get started on a problem.
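As a rough illustration of what pixel-perfect ground truth makes possible, the sketch below derives a bounding box and an occlusion measure directly from per-object masks. It assumes a renderer can export, for each object, one mask of its full silhouette and one of the pixels actually visible; the function names, the nested-list mask format, and the tiny example masks are hypothetical and not taken from any particular tool.

```python
def bbox_from_mask(mask):
    """Return (x_min, y_min, x_max, y_max), inclusive, or None if the mask is empty."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [x for x in range(len(mask[0])) if any(row[x] for row in mask)]
    return (cols[0], rows[0], cols[-1], rows[-1])


def visible_fraction(full_mask, visible_mask):
    """Share of the object's full silhouette that is not occluded."""
    full = sum(1 for row in full_mask for pixel in row if pixel)
    visible = sum(1 for row in visible_mask for pixel in row if pixel)
    return visible / full if full else 0.0


# Hypothetical 5 x 4 pixel render: the object's full silhouette, and the
# part of it still visible once another object covers its right edge.
full_mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
visible_mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
]

print("Full-extent box:", bbox_from_mask(full_mask))
print("Visible box:", bbox_from_mask(visible_mask))
print(f"Visible fraction: {visible_fraction(full_mask, visible_mask):.2f}")
```

Because the simulation knows where the whole object is, the box around the full silhouette is available even when part of the object is hidden, which is exactly the case a human labeler has to guess at.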
They say to control your controllables. For a long time, dataset creation and annotation were not realistically controllable for most businesses. Synthetic data changes that. Generate your own synthetic data in-house, and take control of your entire machine learning training stack. Or reach out to us at Zumo Labs if you’d like a guided tour of just what synthetic data is capable of.