🐰 Airbyte's wrong turn; Metadata lake; A modern data stack in 5 mins; ThDPTh #41 🐰

Hi,
This week I got caught off guard by Airbyte, a hot new data integration start-up I’ve been watching closely.
The company is betting that open source is the future of data integration, and yet it decided to close the source on its core.
Mmmmh.
I’m Sven, I collect “Data Points” to help understand & shape the future, one powered by data, not electricity anymore.
Sven’s Thoughts
If you only have 30 seconds to spare, here is what I would consider actionable insights for investors, data leaders, and data company founders.
Airbyte took a wrong “license” turn.
If you’re in the same situation, don’t fall for the tyranny of the OR; embrace your genius of the AND.
Think about buyer-side pricing and figure out how to drive community engagement to the max, without speed limits like closed cores.
Take a look at the idea of the metadata lake.
Then think about whether you agree, or whether you should instead focus on how to connect individual data pieces together, the underlying “graph”.
There might be a business opportunity here both in providing & connecting metadata lakes if one wants such a thing.
Getting started & providing modularity are still key challenges in the data space, in every single area. Again, I feel there is more than one business opportunity here.
🔥 What: Airbyte, a young and very fast-growing data integration startup, just changed the license on its core from MIT to ELv2 (Elastic License 2.0), thereby moving away from open source to a protective license that does not allow others to monetize the core.
It’s a move that’s been a somewhat common reaction from open-source-based companies fearing “commoditization”. I’ve already written a lengthy piece about why I think these companies fall for “the tyranny of the OR”.
🐰 My perspective: Based on my research so far, this seems to be a move in the opposite direction of where the company has to go if it wants to win in this market. Especially in the early stages. Worse, they might not notice it because they got good momentum going and will be on the rise for quite some time.
Airbyte rightly realizes that nailing the “connector” problem (or data snowflake problem, as I like to call it) is key to the data integration space. But the fear of big companies providing a hosted Airbyte solution drives them in the wrong direction.
If you as a company want to have as much “connector contribution” as possible, you’ll have to create strong incentives for the developers of the connectors. That means, as Tobi Lütke of Shopify puts it, leaving all the money on the table and giving it to the developers. That, in turn, means getting as many consumers onto your “platform” as possible, and that means getting widespread adoption!
Widespread adoption is what you get when big companies come and host your product, because you then get a lot of distribution for free. That means you will not make money on part of that usage, but that’s just a question of how you spin your business. It’s a huge uplift and a big incentive for your creators to create more connectors, which in turn will spin your flywheel faster.
What’s even worse is that keeping the connectors + specification open source while closing the core means the company just opened up an easy vector of attack: a competitor can replace the core with something truly open source, reuse the existing connectors, and drive adoption by giving more money to the developers.
Finally, the pricing strategy Airbyte mentioned in that announcement differs from what I previously took to be a “buyer-side” strategy. It turns out they seem to be going for a “company-size” pricing model. In my open-source pricing article, I also explore why this actually hinders the profits of the company and the growth of the tool, and again opens a path to commoditization.
It’s a way to go, and other companies work their closed core quite well. Indeed, lots of companies don’t have any open source at all. But if you’re at that crossroads, you should think really deeply about why you are actually in fear of “commoditization” (and read my post on it!), whether it might actually be a great opportunity, or whether you can have it all: no commoditization while still having almost everything in an open core!
🎁 What: Metadata is data about data. Today, there are many different forms of metadata, like performance metadata or user metadata. Prukalpa makes the case that we’re now at a point, driven by the explosion of kinds of metadata, that justifies the creation of a metadata lake.
Whereas the data lake enables easy access to all data we get through centralization, the metadata lake enables easy identification of valuable data pieces through centralization. Key to that of course is a graph structure connecting pieces in the metadata lake as well as the data lake.
🐰 My perspective: I’m not sure Prukalpa is right. Very likely some organizations are at this point. But I also think that centralization in itself is not the final solution to any data problem. Barr Moses points out in her “data discovery 2.0” article that we need to acknowledge the distributed nature of data even in the metadata layer.
I also feel like the graph structure connecting pieces is actually the crucial ingredient here, otherwise, there is no “value identification” throughout the metadata lake. But if the graph is the key ingredient, you don’t really need to centralize anything, you just need to take care of the graph.
It’s interesting that we struggle so much with such simple concepts that ultimately come down to a simple question: “Eh, what exactly is this weird attribute you’re providing me with over your API?”
I feel like we’re still not at the point of tackling the elephant in the room.
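To make the “graph instead of centralization” point a bit more concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the identifiers, the two metadata sources, and the helper functions are made up for illustration and do not come from Prukalpa’s or Barr Moses’s articles. The idea is only that the metadata stays in its own system, while a small graph of links answers questions like “where does this weird attribute end up?”.

```python
from collections import deque

# Hypothetical metadata that stays in the system that owns it
# (nothing is copied into a central "lake").
warehouse_meta = {"orders.customer_id": {"owner": "sales-team", "type": "int"}}
api_meta = {"crm_api.cust_ref": {"owner": "crm-team", "type": "string"}}

# The graph stores only links between identifiers, not the metadata itself.
edges = {
    "orders.customer_id": ["crm_api.cust_ref"],
    "crm_api.cust_ref": ["dashboard.revenue_by_customer"],
    "dashboard.revenue_by_customer": [],
}


def downstream(graph, start):
    """Return every node reachable from start, in breadth-first order."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


def describe(node, *sources):
    """Look a node up in whichever source system knows about it."""
    for source in sources:
        if node in source:
            return source[node]
    return {}  # no system has metadata for this node


if __name__ == "__main__":
    for node in downstream(edges, "orders.customer_id"):
        print(node, describe(node, warehouse_meta, api_meta))
```

In this sketch the only shared artifact is the `edges` mapping; each team could keep serving its own metadata however it likes.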
humansofdata.atlan.com
Setting up a data stack in 5 mins.
What: Tuan Nguyen wrote this short piece about setting up a “modern data stack” within 5 minutes using Terraform on GCP. I’m not so much interested in the how of setting this up on GCP, but rather in the meta point about setting up a complete data stack within 5 minutes.
🐰 My perspective: We are currently at a weird point in time where startups have a huge advantage over incumbent companies, because incumbents usually have a locked-in data stack, whereas startups can launch one within minutes. Even more interesting, they can launch a better data stack in minutes! And still, I consider this a key challenge of the data space.
I feel it’s not just about launching a data stack, but about modularization and being able to exchange parts. There still seems to be a lot of coupling inside a “modern data stack” which really shouldn’t be there.
So yes, there are still a lot of business opportunities right there! Getting up quickly, wrapping stuff & modularizing things: all of these seem to be nowhere near where they should be.
I only know of a few companies even going in that direction. GoodData and their “headless BI” concept come to mind, as does what Tabular plans to do (in like 5 years).
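As a rough illustration of what “being able to exchange parts” could mean, here is a small Python sketch. It is not taken from Tuan Nguyen’s article or from any vendor; the `Extractor` and `Loader` interfaces and all class names are assumptions made up for this example. The pipeline only depends on two tiny interfaces, so either side can be swapped without touching the other.

```python
from typing import Iterable, Protocol

Row = dict


class Extractor(Protocol):
    """Hypothetical interface: anything that can produce rows."""
    def extract(self) -> Iterable[Row]: ...


class Loader(Protocol):
    """Hypothetical interface: anything that can accept rows."""
    def load(self, rows: Iterable[Row]) -> int: ...


class ListExtractor:
    """Stand-in for a real source connector."""
    def __init__(self, rows):
        self.rows = list(rows)

    def extract(self):
        return iter(self.rows)


class MemoryLoader:
    """Stand-in for a warehouse: keeps rows in a list."""
    def __init__(self):
        self.stored = []

    def load(self, rows):
        before = len(self.stored)
        self.stored.extend(rows)
        return len(self.stored) - before


class CountingLoader:
    """A different 'destination' that only counts rows."""
    def __init__(self):
        self.count = 0

    def load(self, rows):
        n = sum(1 for _ in rows)
        self.count += n
        return n


def run_pipeline(extractor: Extractor, loader: Loader) -> int:
    """Move rows from any extractor to any loader; returns rows loaded."""
    return loader.load(extractor.extract())


if __name__ == "__main__":
    source = ListExtractor([{"id": 1}, {"id": 2}])
    print(run_pipeline(source, MemoryLoader()))    # same pipeline,
    print(run_pipeline(source, CountingLoader()))  # different destination
```

The coupling I complain about above shows up whenever `run_pipeline` would need to know which concrete loader it is talking to.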
towardsdatascience.com
🎄 Thanks => Feedback!
Thanks for reading this far! I’d also love it if you shared this newsletter with people whom you think might be interested in it.
Data will power every piece of our existence in the near future. I collect “Data Points” to help understand & shape this future.
If you want to support this, please share it on Twitter, LinkedIn, or Facebook.
And of course, leave feedback if you have a strong opinion about the newsletter!
P.S.: I share things that matter, not the most recent ones. I share books, research papers, and tools. I try to provide a simple way of understanding all these things. I tend to be opinionated. You can always hit the unsubscribe button!
Data; Business Intelligence; Machine Learning, Artificial Intelligence; Everything about what powers our future.