Embrace the Dark Side of Data

How to turn unwanted data into magical products people will love.

Sven Balnojan PhD

May 09, 2024

This is the Three Data Point Thursday, making your business smarter with data & AI.

Actionable Insights

If you only have a few minutes, here’s how you can build better data products with proprietary data.

10x data products have a strategy. 10x data products aren’t so much better because they iteratively evolved slowly over time, driven by discovery calls. In my experience, they are 10x better because they have one thing: a deliberately planted good data strategy.
The right proprietary data might appear worthless. While the strategy behind these products might be deliberate, when it comes to proprietary data, there’s no knowing in advance when your data might become valuable. Adobe collected behavioral data for almost 10 years before turning it into a magical product with the help of technology advances.
Collect everything and close it off. When it comes to business, you need to protect your assets. Data is just bits; it is and always will be cheap to store, so collect everything (within the legal limits)! Keep it closed off, and don’t give data away for free.
Watch out for unique data with a cross-functional team. The true power of proprietary data lies in unique proprietary data. The level of uniqueness provides a lever for your company. With all the unique data you have, always try to find a way to make it useful, using technological advances or combining it with more (non-unique or procured) data sets. To make this work, you’ll need a cross-functional team that knows your data deeply, knows the tech, and knows the customer.

Like this? Then, pre-order the booklet!

I’m preparing a short e-book on common strategies for data-heavy products. If you like this article, head over to “The 10 Data Product Strategies” and pre-order it so I know I should hurry up and write it!

We still suck at building data-powered product experiences. In January 2023, ChatGPT wasn’t even three months old, yet with over 100 million users, it had become one of the fastest-growing online applications in the world. In February 2022, the company dbt Labs received a 4 billion USD valuation for an open-source project providing mostly templated SQL (in my words, definitely not theirs).

And yet, after spending over a decade in the data space, I can confidently say we don’t yet know how to repeat such successes. I still have to do my taxes myself, mostly. I need to draw my weird comics myself, even though a 6-year-old could do that, but not a machine, apparently. We’ve been complaining about loading dashboards in analytics for over a decade, and next to nothing has changed.

We know how to build iPhone-like 10x products but not how to build ChatGPT-like 10x data products.

I think there’s one single reason for that: What separates good data products from great data products is a good data product strategy, one that uses leverage!

What makes 10x data products different? It is a good data strategy. 10x data products aren’t so much better because they iteratively evolved slowly over time, driven by discovery calls. In my experience, they are 10x better because they have one thing: a deliberately planted good data strategy.

Good data strategies consist of two core ingredients:

A source of pure power, like the scientific advances behind ChatGPT (the transformer architecture plus reinforcement learning from human feedback) and the data of the whole internet that the first GPT models were trained on.
A guiding policy that helps channel this power, like the structure of OpenAI and the millions in funding that allowed for years of deep, proprietary research, coupled with the belief that the transformer architecture is the thing and a belief in the new power of good old industrial research labs.

Only a few products truly have such a good data strategy; most miss one of the two ingredients or the link between them. 90% of product builders don’t really think about power, leverage, or how to execute on it. They get stuck with incremental value. Nothing wrong with that, but it’s not enough to build 10x data products.

Now, I think one of the most fundamental and most overlooked strategies is using your own unique data: proprietary data.

So today, I want to spend some time discussing how you can build an amazing product around your own proprietary data.

Like magic

“Almost like magic” is how one of the newer Photoshop features is described. The software Adobe Photoshop is, without a doubt, the market leader in professional digital art. It comes complemented with dozens of other tools like Adobe Illustrator, Adobe Premiere, and many others to cover everything inside the digital visuals space, from photo editing to creating animations to drawing. However, Photoshop likely has the largest penetration among image editing use cases, such as photo post-editing, hence the name.

Adobe’s generative fill supercharges photo editing, making any amateur’s work look like an expert’s. It lets you outpaint, which means enlarging a photo: a close-up picture of a cup on a table turns into a photo of a complete room with the cup on the table. It also lets you merge images with different backgrounds into one coherent scene, or add parts to an image that you come up with on the fly.

Especially powerful and often used in photo-editing is inpainting; this process refers to selecting a specific part of the image, and then letting Photoshop redo this part based on a prompt you provide. You can, for instance, select the neckline of a woman in a photo and prompt Photoshop to place a beautiful necklace there.

Interpretation

All of us know the technology behind generative fill: it’s generative AI, a technique that had its breakthroughs in the early 2020s and is capable of generating text and, importantly, images based on inputs (although the word “generative” in generative AI doesn’t refer to this property, it’s what these models are famous for).

Adobe is the only company with generative photo features so advanced they would be called “magic.” That might not be surprising, given that Adobe is the market leader. But there’s more to this story. While the technology that makes generative image features possible is accessible via APIs to most market participants in the photo editing space, it still has to be fine-tuned for specific use cases like inpainting or outpainting.

And that’s where Adobe struck gold. As the market leader, used by over 90% of digital art professionals, Adobe has amassed data. Indeed, it has been collecting data at scale since early 2011, when Adobe switched to a cloud-based software model. Of course, this subscription model includes file hosting and, as such, opened up a war chest of data for Adobe. To top it off, Adobe has offered one of the largest stock image catalogs in the world since 2015. Needless to say, Adobe now has detailed logs of how photo-editing professionals manually outpaint and inpaint and how they carefully replace parts of an image, plus (likely) the most extensive professional image collection in the world. In other words, they have proprietary data.

Understand: Adobe might have gotten early access to the technology; what matters is that six months later, the technology was accessible to every business in the market. However, what every other business lacked was the level of data needed to fine-tune the models to create magical features. If your proprietary data is unique, there is likely a way to leverage it into magical features for a certain group of people.

Keys to execution

Proprietary literally means having ownership. That’s been the game of the business world forever; after all, capitalism is about individuals owning wealth. Companies know amassing data must be somehow valuable, but most of them drown in data. There’s even the term “dark data” to describe data that is collected and potentially valuable but usually ignored. In fact, most companies ignore most of the proprietary data they already have, for tons of reasons.

One is that only actions based on data are valuable, not the data in itself. You can’t trade data like oil (not yet). The value of data also depends on context. Think, for instance, about logging data: almost every company collects log data for two purposes, audit requirements and developers who want to hunt down bugs.

And yet it’s that exact data, the dark logging data, that gives rise to Adobe's magical feature.

Collect all data. Adobe likely never set out to collect data with the idea of creating these features in the future. They moved to a cloud-based model to save their business, not to collect data at all. One of the lessons this teaches us is that the power of proprietary data can be hidden deep inside logs, inside dark data we don’t even consider worthy. So what do we do about that? Simple, we collect data, lots of data. If in doubt, we opt to collect it. I’m not talking about personal data; the features described above never needed records of individuals, and personal records seldom matter for unleashing the power of proprietary data. If you build something unique, collect the data behind it. If you have access to a unique set of users, collect the data. Collect and never throw away your data (within the laws of your jurisdiction); it’s as simple as that.
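To make the "collect behavior, not individuals" idea concrete, here is a minimal sketch in Python. It assumes product events arrive as plain dictionaries and that an append-only JSON-lines file is good enough as storage; the field names in PERSONAL_FIELDS and the functions strip_personal and record_event are hypothetical, made up for illustration, and not taken from Adobe or any real system.

```python
import json
import time
from pathlib import Path

# Hypothetical field names: anything here identifies a person and is dropped before storage.
PERSONAL_FIELDS = {"user_id", "email", "ip_address", "name"}


def strip_personal(event: dict) -> dict:
    """Return a copy of the event without fields that identify an individual."""
    return {k: v for k, v in event.items() if k not in PERSONAL_FIELDS}


def record_event(event: dict, log_path: Path) -> dict:
    """Append one timestamped event, minus personal fields, as a JSON line."""
    stored = {"ts": time.time(), **strip_personal(event)}
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(stored) + "\n")
    return stored
```

A call like record_event({"action": "inpaint", "tool": "lasso", "email": "a@example.com"}, Path("events.jsonl")) would keep the action and the tool and drop the email. Which fields count as personal is a legal question for your jurisdiction, not something this sketch can decide.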

*https://imgflip.com/i/8ode41: Gollum knows surprisingly much about protecting unique things like data. Every company should be a little bit more like Gollum.*

Close off your unique data. Certain kinds of proprietary data will never carry leverage: those that are not unique to your company. That includes all data sets you can buy, APIs you can access, and openly available data sets. So the question is: What are the proprietary data sets that carry the power to propel your data product into a 10x data product? It’s the ones unique to your business! Those are either already collected by you, not yet collected but collectable by you, or a combination of several datasets. What matters is that no one else on earth should be able to get your dataset with reasonable effort or cost. So, while I’m a fan of openness, this also means you need to not just collect but protect your data to death. Don’t release datasets for free (or paid, if in doubt); keep them safe.

Watch out for all unique datasets. Beyond protecting your datasets, regularly be on the lookout for new potential in them. I’m not suggesting you catalog them all, but I do propose a quarterly workshop to brainstorm through all of them, identifying their unique properties and potential opportunities for additional collection or combination.

Create a cross-functional squad for data discovery. Another secret to powerful proprietary datasets is that, by definition, the ones that are important to you will be completely unknown to the rest of the world. No one will talk about them because they are proprietary, so there’s next to no use in googling or asking around at industry conferences to find your unique datasets. What is useful, however, is to look for blank spots on the map: the use cases people are yearning to fill but lack the data for (data you might have or might be able to collect). To keep an eye out for those possibilities, you need cross-functional collaboration between the most data-savvy people, the engineers inside your company, and the people closest to customers. Get them into a room on a quarterly basis and brainstorm for potential.

Be in discovery mode for use cases in adjacent fields. Adobe has been on the cutting edge of machine learning for years. Adobe Premiere, its market-leading video editing program, has had machine learning-based speech transcription for years (a great feature we will discuss later!). Adobe probably knew the time was coming when technology worked its way across from speech-to-text into text-to-image. To facilitate a similar awareness, you need to stay on the cutting edge of two things: (1) technology in the data space but, more importantly, (2) use cases inside your industry or adjacent industries that rely heavily on data. It requires a lot of attention and focus, but it’s worth it in the end.

When the time comes, like it did for Adobe in 2023, you want to ensure this foundation is in place:

You have (collected) the data! And you have a team that knows how to find it.
You know how to use it because you saw the use case coming for some time and were only waiting for the data or the technology.

Exercise to get started right now!

There’s a simple way of coming up with lots of potential new product ideas using proprietary data: Ignore all product ideas! That’s right, ignore product ideas, and watch out only for data, no matter how crappy it might seem to you.

Here’s what I constantly ask myself to identify potential.

What data are we collecting that’s completely worthless? You surely have log data, right? Think about all the data you have that you don’t like to store, all the random byproducts of what you’re doing.
What unique data do you have? You have a website; you have a product, and you have your company internals. Maybe you have machines that produce products. All of these components collect unique data. What else do you have?
What parts of your data would be valuable if only there were this one magic technology? Discard all of your assumptions about cost.
What data would you pay a lot to procure, if only that were possible?
What data would your customers pay dearly to get their hands on, if only that were possible? Benchmarks? Competitive data? How about information on doing things the right way? Spell it out for your market.
What’s the hottest data in your industry? Don’t discard hot stuff just because it’s hot. Include it in your list and see later whether it truly doesn’t compare to the rest.
How would you describe the data that the top player in your industry uses to make its products great? Now, think about the equivalent for your own business. Is it really impossible to get that kind of data?
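If you want to bring the answers to these questions into a workshop, a tiny inventory can help rank what to discuss first. The sketch below is only one way to do it: the Dataset fields, the 1-to-5 scales, and the dark_data_score formula are assumptions made for illustration, not an established method, and the three example datasets are invented.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    name: str
    uniqueness: int   # assumed scale: 1 = anyone can buy it, 5 = only we can collect it
    current_use: int  # assumed scale: 1 = ignored "dark data", 5 = already powers a product


def dark_data_score(d: Dataset) -> int:
    """Higher means more unique and less used, so more worth discussing."""
    return d.uniqueness * (6 - d.current_use)


def rank_candidates(datasets: list[Dataset]) -> list[Dataset]:
    """Sort datasets from most to least promising under the assumed score."""
    return sorted(datasets, key=dark_data_score, reverse=True)


# Invented example inventory.
inventory = [
    Dataset("editing action logs", uniqueness=5, current_use=1),
    Dataset("purchased market report", uniqueness=1, current_use=3),
    Dataset("support tickets", uniqueness=4, current_use=2),
]

for d in rank_candidates(inventory):
    print(f"{dark_data_score(d):>2}  {d.name}")
```

With these made-up numbers, the ignored but unique editing logs come out on top and the purchased report at the bottom, which matches the argument above: data anyone can buy carries little leverage.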

Reach out: If you’re happy with your responses, feel free to message me. I’m always up for helping people make more out of data.

Related Writing

There are things you have to watch out for when designing your strategy! I wrote about a bunch of common mistakes people fall into when devising their data strategy.
If you’re looking for more ways to find a good strategy for your product, take a look at 3 actionable tactics inspired by Netflix.
In terms of books, I can’t recommend Good Strategy Bad Strategy: The Difference and Why It Matters by Rumelt enough.
