What Is Synthetic Data? Or, The Treachery of Images
For Christmas I purchased my daughter a copy of Neural Networks for Babies. I thought it would be funny and cute, but also practical because I work in synthetic data at Zumo Labs, and (without a formal machine learning education) I occasionally feel as though my own understanding is that of a 9-month-old. Ultimately, I was disappointed that the book didn’t explain how neural networks are trained. But perhaps that’s an unfair expectation when a book’s first page is, “This is a ball.”
Since I’ve had to learn how to discuss training computer vision models (in a way that even I can comprehend), I thought I’d share here. While I’m at it, I’ll also explain what exactly synthetic training data is and when you might want to train on synthetic datasets. Disclaimer: I am not a machine learning engineer or data scientist, so please understand that, like Neural Networks for Babies, this high-level overview may be imperfect or incomplete.
René Magritte’s most famous work is The Treachery of Images. It is a painting of a pipe with a caption beneath saying “Ceci n’est pas une pipe,” or “This is not a pipe.” The surrealist work drew the ire of folks who thought, actually pal, that looks like a pipe to me. Magritte, a pedantic troll ahead of his time, defended himself, saying “How people reproached me for it! And yet, could you stuff my pipe? No, it’s just a representation, is it not? So if I had written on my picture ‘This is a pipe,’ I’d have been lying!”
This story illustrates how the human computer — our big old brain — tends to process information. If you are sighted, you glance at something, compare it against a database of things you’ve seen before (and their accompanying labels), and gain an understanding of what exactly it is you’re looking at. It all happens in an instant, and it’s how we’re able to recognize a pipe when we see one.
But unlike humans, computers won’t organically accumulate a database of labeled reference images as they grow. If you want a computer to be able to perform object detection tasks — say identifying the location of an object within an image, or predicting an object’s name — you first need to teach that computer what those things look like and what they’re called.
This is where I will artfully gloss over the deep learning part. Suffice it to say, it is no longer necessary to build your own object detection model. There are many different publicly available algorithms that can accomplish that task, featuring exciting acronyms like SSD, R-CNN, and YOLO. All of these have been patiently explained to me, and I now have, at best, a Neural Networks for Babies understanding of each.
The TL;DR is that these models generally work pretty well. But in order for them to work, they first need to be trained on data. In some cases you can find models, such as ResNet-101 or MobileNet, which have already been trained on a publicly available dataset like ImageNet. That popular dataset has over a million images and can teach a model to recognize 1,000 object categories. But if you’re trying to use computer vision to solve a specific problem and recognize something in particular, chances are you’re going to have to train the model yourself. And for that, you’ll need a custom dataset.
Do you know what object is not included in the ImageNet dataset? A pipe. If you wanted to build a pipe detector, and had to train your model to recognize a pipe, you’d need a dataset chock full of images of pipes. When you can’t find an existing dataset that meets your needs, you’ve got roughly three options: scrape it, make it, or fake it.
Scraping a dataset together is exactly what it sounds like. You’re combing Flickr or Google image search for the images you need. At best it’s time-consuming, and at worst it’s ethically murky when you consider image or even likeness rights. (You’d better hope you don’t get an Illinoisan in there.) While you may get a good range of images this way, whether or not the resulting dataset will work for your purposes is another question altogether.
Some ambitious teams may choose to make their datasets in-house. For a pipe, this would be relatively easy. Buy a pipe, and take a ton of photos of it. Wait, actually, you’d need to buy a bunch of pipes or else the computer will only recognize that one specific pipe (that’s called overfitting your model). And you’d probably need to take photos of those pipes in a bunch of different environments and contexts, in hands, hanging from mouths, etc., lest the computer come to believe pipes can only exist on tables. Okay, see, now this is why people scrape their datasets.
But wait, why not fake it? Humans can tell the difference between a physical pipe and an image of a pipe — like, we get it Magritte! — but computers will only ever process images. That means a computer doesn’t really know or care whether that image is a real photo you took or a still that you generated using CGI. This sort of training data, custom 3D modeled and generated for a given computer vision problem, is known as synthetic data.
Synthetic data has a ton of upsides. It is flexible and abundant. It solves data scarcity, the situation where pictures of the thing you’re trying to detect are just super hard to come by. There are no privacy issues since there are never real people in synthetic datasets. And you don’t need to painstakingly label the images you scraped together from Google (or, more likely, pay someone else to do it), since synthetic images are generated with your desired labels at the outset. (I’ve actually written about the downsides of labeled data before.)
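To make the "labels at the outset" point concrete, here is a toy sketch in Python. It is not how a real synthetic data pipeline works (those render 3D scenes), and everything in it is my own illustrative assumption: the function names, the 32x32 grayscale "image," and the bright rectangle standing in for a pipe. The point is only that when you are the one placing the object, you already know its bounding box, so nobody has to draw it by hand afterward.

```python
import random


def make_synthetic_example(width=32, height=32, rng=None):
    """Render one tiny grayscale 'image' (a list of pixel rows) plus its label."""
    if rng is None:
        rng = random.Random()
    # Dark, noisy background: pixel values between 0 and 80.
    image = [[rng.randint(0, 80) for _ in range(width)] for _ in range(height)]
    # We pick the object's size and position ourselves...
    box_w = rng.randint(4, width // 2)
    box_h = rng.randint(4, height // 2)
    x0 = rng.randint(0, width - box_w)
    y0 = rng.randint(0, height - box_h)
    # ...paint it in as a bright rectangle (our stand-in "pipe")...
    for y in range(y0, y0 + box_h):
        for x in range(x0, x0 + box_w):
            image[y][x] = 255
    # ...so the label is known for free: (x_min, y_min, x_max, y_max), max exclusive.
    label = {"name": "pipe", "bbox": (x0, y0, x0 + box_w, y0 + box_h)}
    return image, label


def make_synthetic_dataset(n, seed=0):
    """Generate n labeled examples; the seed makes the dataset reproducible."""
    rng = random.Random(seed)
    return [make_synthetic_example(rng=rng) for _ in range(n)]


dataset = make_synthetic_dataset(100)
```

Need ten thousand examples instead of a hundred? Change one number. That is the "flexible and abundant" part.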
Also, once again, the computer couldn’t care less that you’re feeding it the Impossible Burger equivalent of a hamburger. The computer just needs to learn the characteristics and contexts of a pipe so that it can identify pipes when it sees them in the future. Synthetic data works great for that. While you can train exclusively on synthetic data, current research suggests the best model performance actually comes from a hybrid approach (using both sourced and synthetic images). Incidentally, that’s kind of how we learn, right? My daughter can say ball now (dad brag), and I have to believe her ability to recognize both the real ball and an illustrated ball is because she was trained on both.
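For the curious, the hybrid approach can be sketched in a few lines of Python. This is a minimal illustration under my own assumptions: the function name mix_datasets, the placeholder tuples standing in for images, and especially the 75% synthetic share are all made up for the example. I am not claiming any particular ratio is the right one; in practice that would be something to tune.

```python
import random


def mix_datasets(real, synthetic, synthetic_fraction=0.5, seed=0):
    """Keep every real example and add enough synthetic ones to hit the target share."""
    if not 0 <= synthetic_fraction < 1:
        raise ValueError("synthetic_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_syn / (n_real + n_syn) = synthetic_fraction for n_syn.
    n_syn = round(len(real) * synthetic_fraction / (1 - synthetic_fraction))
    # Never ask for more synthetic examples than we actually generated.
    n_syn = min(n_syn, len(synthetic))
    mixed = list(real) + rng.sample(list(synthetic), n_syn)
    # Shuffle so the model does not see all real images first, then all synthetic.
    rng.shuffle(mixed)
    return mixed


# Placeholder examples: 20 scarce real photos, 200 cheap synthetic renders.
real = [("real", i) for i in range(20)]
synthetic = [("synthetic", i) for i in range(200)]
training_set = mix_datasets(real, synthetic, synthetic_fraction=0.75)
```

With these made-up numbers, the 20 real photos are joined by 60 synthetic images, for a training set of 80.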
As for Magritte, as persnickety as he was about it, I think he’s right. It’s not a pipe. It’s just synthetic data.