Thoughtful Friday #18: Winning in Snowflake-like Markets

Sep 23, 2022

I’m Sven, and this is Thoughtful Friday. We’re talking about how to build data companies, how to build great data-heavy products & tactics for high-performance data teams. I’ve also co-authored a book about the data mesh part of that.

Let’s dive in!

Time to Read: 9 minutes

To win in the data markets, network effects are necessary.
Open source is not the only way to generate them.
Databricks consolidates like crazy to create network effects.
dbt builds on an open standard, SQL, to create network effects.
And Airbyte uses partial open source to generate them.
No one is using the strategy that works elsewhere: building a marketplace.

🔮🔮🔮🔮🔮🔮🔮🔮🔮🔮🔮🔮🔮🔮🔮🔮🔮

This comment on my last Thoughtful Friday about different types of open source sparked my desire to finish this draft, so thanks for that!

“Illuminating reflection on the pros of open-source. Thanks! I think a great complementary post would be to highlight the risks of going all open. At the end of this post, I was wondering "what are the reasons for *not* going fully open source?".” - a founder of a data company.

Yes! Let’s talk about that.

(remember this pic? If not, consider reading “How to become the next 30 billion $$$ company”)

Are there risks to going for an “all-open” OS strategy? Honestly, I’m not sure. All-open OS means trying to build a huge ecosystem via open source.

What I do know is, there might be better options to build an ecosystem, given the circumstances.

Since this is a data newsletter, I’ll dig into the data-specific markets and the three main business strategies companies are using to build out platforms & ecosystems.

So let’s dive in:

1. My perspective on 2021 and why it needs updating (I was wrong)

2. The snowflake problem is real

3. Where the snowflake problem comes from

4. Why it needs to be solved in the data markets

5. The Databricks way of solving the snowflake problem

6. The dbt way of solving the snowflake problem

7. The Airbyte way of solving the snowflake problem

Disclaimer: I updated my beliefs on this topic

In 2021, I published an article that was very pro open source models in the data space. Since then, I have updated my beliefs on the underlying hypothesis. Back then, I argued that the diversity of data sources & data targets leads companies to adopt open source to solve this issue, and that, as a result, the data space will be dominated by companies leaning heavily on open source.

But since then I have seen companies shine in this space without relying on open source as heavily as I imagined. They still had to deal with this problem, but they did so in a very different way. So now, I think I need to update my fundamental hypothesis on the data markets.

Hypothesis 2021: “Every company has to solve the problem of the diversity of data sources & targets. So it has to adopt open source. In OS, only the biggest ones win, so going all-open is necessary, by peer pressure.”

Hypothesis 2022: “The data markets are massive network effect markets. The one with the biggest network-effect engine will win. Since solving the data source & target problem is essential, some companies will achieve this through an all-open strategy. Others will find a different solution to this problem and generate the network effects by building up a very different kind of network-effect engine (almost all of the time through building platforms).”

It didn’t get simpler, and I feel it rarely does. So, given this disclaimer, let’s dive deeper into this other half: the thriving companies that do not go through open source.

The “Snowflake problem”

Pick two people. What are the chances they wear the same set of clothes? Pretty slim, right? Yet if you look at a larger group of people, it turns out that for any specific outfit, it’s pretty easy to find a large base of people wearing it.

This is why clothing stores make sense: the preferences of the underlying consumers are well clustered, even though the clusters might be far apart. Two people might like two very different kinds of clothes, so we need more than one clothing store, but overall, each clothing store will have a large audience to cater to.

Not so in the data markets. The data markets have a property I like to call “the snowflake problem”. It’s a problem every company has to overcome inside the data space. And yet I see very different ways of dealing with it, some more successful than others.

The snowflake problem is a simple idea: If we take your data stack, meaning all the sources you want to tap, the targets you might want to shove data into, and the tools you want to use along the way, it’s quite likely you’re almost alone with that exact combination.

Certainly there are not enough companies like you to run a business on just the stack you have.

At its founding stage, the company Airbyte ran a survey of 200 data teams (and probably many more since then) that validated this: 80+% of companies need to write custom code inside the data stack because their setup is so unique that no one else bothers to serve it.
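To get a feel for how quickly stacks become unique, here is a back-of-the-envelope sketch. All numbers are made up for illustration (they are not from the Airbyte survey), and the model is deliberately crude: it assumes a stack is just a set of sources, exactly one target, and a set of tools, all picked independently.

```python
from math import comb


def distinct_stacks(n_sources, k_sources, n_targets, n_tools, k_tools):
    """Count the stacks built from k_sources of n_sources sources,
    exactly one of n_targets targets, and k_tools of n_tools tools."""
    return comb(n_sources, k_sources) * n_targets * comb(n_tools, k_tools)


# Hypothetical market: 50 sources, 5 targets, 10 tools to choose from.
# Hypothetical team: uses 6 sources, 1 target, and 3 tools.
total = distinct_stacks(50, 6, 5, 10, 3)
print(f"{total:,} possible stacks")
```

Even with these modest, invented numbers the count lands in the billions, so two teams sharing exactly the same stack is the exception rather than the rule.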

Where does the snowflake problem come from?

In the data markets, the snowflake problem IMHO has a simple cause: We’re lacking protocols & standards, and I don’t see any (mature ones) emerging as of now. As a result, every single data source and every data target is a unique little thing, which makes every company’s data stack unique.

Why do data companies need to solve this snowflake problem?

Data wants to be integrated, whether we are talking about your cool new shiny business intelligence tool, your machine learning framework, or your data integration tooling. You will need to get data from somewhere and likely move it to some other place. You need to integrate.

Key idea: If you’re a data company that wants to get a new data tool out, you will need to get it integrated.

If you start to build these integrations yourself, you will likely quickly come to a point where your integrations are way over your head, already degrading in quality, and yet you still haven’t reached a sufficiently large market.

Key idea: The implication of the snowflake problem is that you won’t be able to just “provide the integrations” yourself.

So let’s look at a few companies that are successfully solving these issues in different ways. You will notice that some products & businesses are naturally more likely than others to hit upon this problem.

Consolidate & Abstract with Databricks

The company Databricks tries to provide a platform for all your data. Data has to come from somewhere though: it has to be ingested, and it has to be analyzed somewhere, utilized, and turned into analytical or operational applications.

The strategy Databricks has chosen to solve this problem is simple and two-fold: First, they are trying to consolidate the complete lifecycle of data work into Databricks. For that purpose they acquired the company Redash in 2020 and repackaged it into “Databricks SQL Analytics”, making it possible to build reports & dashboards inside the Databricks ecosystem.

Second, Databricks builds on high levels of abstraction, the highest level being the cloud providers as well as Spark as a framework. To ingest data into Databricks, all someone needs to do is to be able to shovel data into the standard object store of AWS, GCP, or Azure.

Notice, both of these strategies are hard! It takes as much work to get to that level of abstraction as it does to consolidate like crazy, but Databricks is pulling it off right now.

dbt Labs & Snowflake

Snowflake started out as a data warehouse, dbt as a SQL-based data transformation tool. Both companies behind them are growing like crazy, even though both offer tools that rely on heavy integration.

After all, why would you want a data warehouse if you have no way of pushing data into it or taking it out? Why would you want to transform data, if you cannot connect these transformations to your personal data store?

Both companies had to build integrations themselves, and they succeeded. But today most integrations are built by other companies. Why?

Because dbt as well as Snowflake built on top of a protocol, a standard: the SQL standard, one of the few that exist in the data world. So even though there is a lot of criticism of the prevalence of the SQL standard, it is what makes these companies succeed.
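One way to see why a shared standard shifts the integration burden is to count connectors. The sketch below is a simplified model with hypothetical numbers: it assumes that without a standard every tool needs its own connector to every system, while with a standard each tool and each system only has to speak the standard once. Real SQL dialects differ, so treat this as an idealized upper bound on the benefit.

```python
def point_to_point(n_tools, n_systems):
    """Connectors needed when every tool integrates with every system directly."""
    return n_tools * n_systems


def via_standard(n_tools, n_systems):
    """Connectors needed when each tool and each system implements
    one shared standard (for example SQL) exactly once."""
    return n_tools + n_systems


# Hypothetical market: 20 tools and 100 data systems.
tools, systems = 20, 100
print("point-to-point:", point_to_point(tools, systems))
print("via a standard:", via_standard(tools, systems))
```

In this toy model the work also moves: each system vendor builds its own side of the standard, which matches the observation that most integrations are now built by other companies.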

Notice, there are many more standards & protocols you can build on, and you can even try to create your own. There is just none so far that covers the breadth of data as well as the HTTP & REST standards, or Bitcoin & Ethereum, cover their domains.

Databricks, Airbyte, OS

Being a large company, Databricks also employs yet another strategy to deal with the snowflake problem: open source.

Even though Databricks is not all open, they open-sourced Delta Lake, and of course, Apache Spark has always been open source. Both pave the way for others to integrate with these solutions and take this burden off of Databricks. A lot of companies are now building Databricks connectors, like the company Airbyte for their data ingestion solution.

Speaking of Airbyte, the company made open source their primary strategy and has argued at length that the snowflake problem can only be solved by open source. So while the Airbyte core is not open, all connectors for sources & targets are.

Going beyond just OS, to the marketplace

The extension to this strategy would be to create a marketplace - a place to connect people who need “connectors” with people who provide them. And yet I haven’t seen any company do that so far. The benefit of a marketplace is that participants have monetary incentives, meaning the quality of the connectors would likely increase. It’s what made the company Shopify so successful.

There is no need to base a marketplace on open source; a lot of the successful implementations are not based on open source at all. The data world though is so far lacking any of them (at least I don’t know about them).

Summary

There we have it: three business strategies in use, and one extension observable in other markets. None of them runs on all-open OS, yet all are currently in action and successful.

Would love to hear some thoughts on this, and additional strategies you’ve observed.


Want to recommend this or have this post public?

This newsletter isn’t a secret society, it’s just in private mode… You may still recommend & forward it to others. Just send me their e-mail, ping me on Twitter/LinkedIn and I’ll add them to the list.

If you really want to share one post in the open, again, just poke me and I’ll likely publish it on Medium as well so you can share it with the world.
