Synthetic Data In A Nutshell
This is the Three Data Point Thursday, making your business smarter with data & AI.
Want to share anything with me? Hit me up on Twitter @sbalnojan or Linkedin.
Let’s dive in!
Synthetic data is… fake. That’s it. It’s made-up stuff, not worth a penny. And yet, it looks like synthetic data is on the brink of going from “theoretically useful” to practical reality.
It’s a lot like VR headsets; everyone sees how valuable they can be - in theory. However, there are obvious practical limitations right now: the software isn’t there, and neither is the hardware. We all know it’s going to come at some point, but not yet.
Everyone would love to have a cheap VR headset; it just doesn’t exist yet.
For synthetic data, two years ago, the situation was very similar; everyone knew it was helpful, but only a handful of professionals could get it right. But now…
Changing Currents
Two forces are pushing new currents for synthetic data. One is generative AI (yeah, that one again), which provides high-quality synthetic data quickly. The second is the growing gap between the data we have readily available and our demand for data, machine learning models, and products built on top of it.
Both currents make one thing clear: synthetic data is becoming useful, just like VR headsets. If you want to jump on early, now is the time.
What is synthetic data?
It's fake data, period.
By fake, I mean it is generated by a machine - meant to imitate another type of data that is usually created in a different way.
The real beef, however, is in knowing how it is created, because slight variations in recipes result in vastly different dishes.
Generation using a probability distribution: The idea is pretty simple - say you have a dataset of a few thousand throws of a six-faced die. You might as well use the probability distribution you know (or model one after the data you have) and generate any number of dice throws from it. The same principle applies to every single dataset: model a probability distribution after it, then use that to generate random new occurrences that conform to the same data measurements (e.g., the variance).
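Here’s a minimal sketch of that idea in Python - the “observed” throws are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Pretend these are a few thousand observed throws of a six-faced die.
observed = rng.integers(1, 7, size=5_000)

# Model the probability distribution after the data we have...
faces, counts = np.unique(observed, return_counts=True)
probabilities = counts / counts.sum()

# ...then generate any number of new throws that conform to it.
synthetic = rng.choice(faces, size=100_000, p=probabilities)

# Sanity check: mean and variance should roughly match the original.
print(observed.mean(), observed.var())
print(synthetic.mean(), synthetic.var())
```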
Generation through deep learning (or any ML, really): Using machine learning systems to generate synthetic data is very similar to the probability distribution approach, with one key difference: you don’t need to know the probability distribution. No specific modeling is needed - just some meta-modeling, i.e., choosing a proper ML model and its hyperparameters.
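A sketch of that route, using a kernel density estimate from scikit-learn (the “real” data here is fabricated for illustration):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(seed=0)

# Some 2-D data whose true distribution we pretend not to know.
real_data = rng.normal(loc=[10.0, 50.0], scale=[2.0, 5.0], size=(1_000, 2))

# The only "modeling" is meta-modeling: picking the model and its
# hyperparameters (here, the kernel and its bandwidth).
kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(real_data)

# Draw new synthetic rows that follow the learned distribution.
synthetic = kde.sample(n_samples=10_000, random_state=0)
print(synthetic.mean(axis=0), real_data.mean(axis=0))
```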
GANs: GANs (generative adversarial networks) are trained specifically to mimic a certain dataset, so you’ll need a pretty large dataset to start. But the GAN objective is as close as you can get to the definition of generating synthetic data: its task is to produce new data points that look, as convincingly as possible, like they belong to the original dataset.
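To make that concrete, here is a heavily simplified GAN sketch in PyTorch that learns to mimic a one-dimensional Gaussian “dataset” - real image GANs are far bigger, but the adversarial objective is the same:

```python
import torch
from torch import nn

torch.manual_seed(0)

def real_sampler(n):
    # Our stand-in "dataset": samples from a Gaussian (mean 4.0, std 1.5).
    return torch.randn(n, 1) * 1.5 + 4.0

generator = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
discriminator = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))

g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(5_000):
    real = real_sampler(64)
    fake = generator(torch.randn(64, 8))

    # Discriminator step: learn to tell real points from generated ones.
    d_loss = (loss_fn(discriminator(real), torch.ones(64, 1))
              + loss_fn(discriminator(fake.detach()), torch.zeros(64, 1)))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator step: produce points the discriminator labels as real.
    g_loss = loss_fn(discriminator(fake), torch.ones(64, 1))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()

# The trained generator now emits synthetic points resembling the dataset.
with torch.no_grad():
    samples = generator(torch.randn(10_000, 8))
print(samples.mean().item(), samples.std().item())  # should drift toward ~4.0 and ~1.5
```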
Generative AI: With generative AI, you’re leveraging the power of a model pre-trained on a huge dataset, so you can quickly tackle zero-shot problems - situations where you have little or no data of your own.
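A hedged sketch of what that can look like with the OpenAI Python client - the model name and prompt are illustrative assumptions, not a recommendation:

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Zero-shot: no training data needed, just describe the records you want.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumption: any capable chat model works here
    messages=[{
        "role": "user",
        "content": "Generate 20 synthetic customer-support tickets as "
                   "JSON lines with the fields: subject, body, urgency.",
    }],
)
print(response.choices[0].message.content)
```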
Both generative AI and explicit modeling with probability distributions work even without any data to start from - while GANs and other ML approaches need a significant dataset to begin with.
Nothing is stopping you, though, from combining all these approaches. For instance, you can use GANs to generate 10x a dataset you already have 100,000 samples of, and generative AI to generate one specific attribute you only have for a couple hundred of these samples.
What do you do with synthetic data?
The true question is not what you can do with it currently, but what the potential is; because this is technology right on the verge of becoming useful, you need to look just a bit beyond the horizon.
We know there is potential for synthetic data; Amazon used synthetic data to train its palm scanner because, as you might imagine, images of the human palm are not as plentiful as machine learners might wish. In particular, generative AI was used to create variations, subtle changes, and edge cases for palm imagery.
OpenAI also used synthetic textual data to improve the skills of an image generation model.
Snowflake has now started to list synthetic data sets in its data marketplace.
The key idea is straightforward: Everything is better with lots of data! Synthetic data is easy to generate, so all use cases with relatively little data are good candidates for synthetic data supplements.
When you could profit from synthetic data
I believe synthetic data can be used in all fields: data science, data engineering, data-heavy software development, and machine learning & AI. I also think it is a mistake to think of synthetic data only in terms of machine learning; most opportunities lie outside this domain.
Here are a bunch of examples of where synthetic data can be useful across different fields:
These examples are just to spark your imagination; there’s plenty more once you start thinking it through!
But how?
“But how?” That’s the catch and, in my opinion, the business opportunity. While, as you’ve seen above, there are many technical options for generating synthetic data, they still require quite a bit of data science knowledge and particular frameworks or DIY tooling.
Towards AI lists a bunch of frameworks you can use to start generating synthetic data, but I’m personally still waiting for a few (product) breakthroughs in this area.
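As one low-barrier example (my pick, not necessarily from that list), the Faker library generates realistic-looking records without any input data at all:

```python
from faker import Faker

fake = Faker()
Faker.seed(0)  # make the output reproducible

# A few synthetic customer rows, ready for testing a data pipeline.
customers = [
    {"name": fake.name(), "email": fake.email(), "address": fake.address()}
    for _ in range(5)
]
for row in customers:
    print(row)
```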
So what?
There is an opportunity here, both to improve how you build and test products and from an entrepreneurial perspective. I’d love to chat if you have some thoughts on synth data to share with me!
Here are some special goodies for my readers:
👉 The Data-Heavy Product Idea Checklist - Got a new product idea for a data-heavy product? Then, use this 26-point checklist to see whether it’s good!
The Practical Alternative Data Brief - The executive summary on (our perspective on) alternative data - filled with exercises to play around with.
The Alt Data Inspiration List - Dozens of ideas on alternative data sources to get you inspired!