Thoughtful Friday #22: The 2022 Data World in Three Words: I was wrong - 6 Truths I didn't understand 365 days ago.
I’m Sven, and this is Thoughtful Friday. We’re talking about how to build data companies, how to build great data-heavy products & tactics of high-performance data teams. I’ve also co-authored a book about the data mesh part of that.
Let’s dive in!
Thanks to newsletter subscriber P. for a fun discussion this week. He helped me to formulate the open-source based lessons even better and made me pick this article for this week in the first place.
60,000 words in newsletters, a dozen articles, and a book.
That's what I wrote about the data space in 2022.
I write, mostly to learn. Yet when I look back at 2022, most of what I learned was things I got wrong, things I had to unlearn, and things I had to learn anew.
So, following my own advice to take unlearning in the data space seriously, here is the top 6 of my unlearnings.
(1) The data mesh is here to stay, but it's not coming as fast as expected.
The data mesh keeps haunting me; whenever I think I'm done writing about it, it comes back. At the beginning of the year, I was finishing the book "Data Mesh in Action", going through all kinds of data meshes out in the real world and drafting data mesh blueprints for start-ups. My position at the beginning of the year was: "the data mesh is coming to every company & every industry, fast".
And while I still believe this to be the case, 2022 turned out to be the year of a shift towards pragmatism when it comes to the data mesh.
Barr Moses, CEO of Monte Carlo, predicted for 2023 the idea of data meshes coupled with fat, emphasis on fat, central platforms.
And while a lot of companies try to adopt "ideas from the data mesh", the adoption of the data mesh itself remains a hard task only a few companies tackle head-on.
The reason for this is simple: the data mesh is first and foremost a cultural, process, and people shift, and these things are notoriously hard to change.
The reason I believe the data mesh is the future is just as simple: data is going to eat up every single value proposition; in 10-20 years, every product is going to be all about data, and about nothing else.
To extract value from data at scale, companies will have to decentralize at some point in time, and thus, enter into the data mesh.
Thanks to Zhamak Dehghani we now have a roadmap for this, and a name. But progress is still going to be unevenly distributed, owned by a lucky few, and take time.
(2) "Just do what the software engineers do" isn't going to cut it in the data space.
I tend to argue that the data mesh is yet another decentralization move, just like all the others the software engineering & product world has already performed. So at the beginning of the year, it only seemed natural to me to keep saying "just transfer practice X from software engineering to data and you'll be so much better off".
"DataOps: Taking the Dev world into data" and the rise of "Data Reliability Engineers" are both examples of this idea: take what works in the software engineering world and see whether it works well in the data world.
2022 was the year I realized it just doesn't work like that. After failing a lot (for literally years) at trying to help data developers adopt software engineering best practices, I learned two lessons the hard way, and they changed my perspective:
(1) Data people still largely have a different background, culture & set of practices than software engineers.
(2) The data value creation process looks different than the software one.
(1) might seem obvious to a lot of people, but this lesson goes way deeper than is commonly accepted. Most data product managers come from a technical background; in the software world, this isn't the case. Business analysts turning analytics engineers have even less exposure to software engineering than most data engineers. Let alone data engineers & data scientists, who often have a quantitative or science background.
(2) Software that gets released is exposed to end users and creates value. Data-heavy products, on the other hand, get released and only create value once the data hits the software. This sounds trivial, but for data-heavy products, the data is the main part. This makes the whole "software delivery process" for data longer and more complex. The focus shifts onto the data, and software tools are traditionally ill-equipped to handle this. Leaving us with a whole lot of nothing.
Some random implications of these two lessons are:
The data mesh as "just another decentralization movement" makes sense only if the company culture supports it.
Machine learning integrated into products only makes sense if the company's product management openly embraces it.
If your company considers data to be a sidekick, then no big investment in best practices makes sense.
Pushing data into your CI system makes sense only as long as it stays small. Yes, it's called "dbt - data build tool", but it is not the same as a software build process, not for the vast majority of cases.
Speaking of companies that consider data to be a sidekick...
(3) 99.9% of companies are not aware of what data is important and what to do with it.
At the beginning of 2022, I believed 90% of companies were wrong on these big ideas. But over the year, two things happened:
I realized more kinds of data are important than I believed before (see below)
A few major trends actually moved the world into the opposite direction!
(2) As much as I love dbt, it is built to make you focus on well-structured, modelled data. This movement makes the true issues seem farther off. The problem is that the supposed alternative, focusing only on well-structured, modelled data, isn't an alternative at all.
I believe that the following four things are completely underrated:
Real-time data use (not to be confused with event-driven architectures)
Unstructured data (and every use case that builds on top of it)
Company-external data (and every use case that builds on top of it)
Turning data into actions
You might argue that this is obvious, and you knew it already. But I think the extent to which this is true is astonishing.
(4) Turning data into action is important. Yes, yes, we get it... But do you? What did your data team do over the last 4 weeks? Think about it: what efforts were truly targeted at helping to turn data into actions? If your data team isn't the right place to look, is someone else at the company helping to do this? If not, you're in serious trouble.
How about a simple decision: what is more important, taking a couple of hours to ingest a new data source, or spending a couple of hours helping a decision maker understand data? Only one of these options helps to turn data into action.
(3) Company-external data is huge, and next to no company utilizes it. It's only the Amazons (hello "minimal pricing mechanism") and Netflixes of the world, and the investment world, that take external data seriously. Period. It's completely out of scope for most, and yet it is where all the growth happens.
(2) Unstructured data. I love dbt. But dbt helped to launch a revolution that focuses on small, selected data sources, on well-structured & modelled data. That's a good thing from one perspective, and a terrible thing from another. It makes unstructured data look like the unwanted cousin. And it's not. It's the most important data source you have, and yet in 2022, it became less important.
(1) Would you rather ingest a new data source or cut your data refresh interval from 2 hours to 30 minutes? My feeling is most people will choose the first option. That's kind of what the dbt movement does: make these kinds of tasks easy. So real-time data became less important in 2022. And yet, looking at things like the pandemic and the war in Ukraine, it seems like "real-time" data is becoming the only thing important to any business.
(4) Even a collapsed crypto world has so much to teach the data world.
At the beginning of 2022, I had a faint interest in the crypto world. Then I took a closer look. Oh boy, I should've looked way earlier. What I realized over the course of 2022 is that the crypto world has a similar set of physics to the data world. And even better, it seems like the crypto world is 2-3 years ahead of the data world.
The three laws of crypto physics:
Systems thinking tops product thinking.
Community-led innovation tops company-internal innovation in terms of breadth.
Open tops closed.
These laws don't apply to every sector, not at all. But they have held very steady in the data space. And so I shifted my perspective there, and am starting to learn a lot from the crypto space.
Talking about openness...
(5) Open source is hard, like really hard, to pull off. And nobody talks about it.
Publishing open source for business, a lot of companies pull this off, right? Google is leveraging open-source solutions like Kubernetes to achieve huge business goals. Red Hat, Automattic, GitLab, all built successful companies on open source. It should be obvious and easy to publish open source to achieve business goals.
And yet, as I kept writing about it in 2022, I realized it is not. And worse, no one is talking about it. I literally could not find a single book on the topic of publishing open-source software for business benefits.
What I did see in 2022 is that it is hard, way harder than most people think, to use open source for business. Google just adopted Apache Iceberg, and that will make it incredibly hard for companies like Dremio or Tabular to pull off their business models (both are based on Apache Iceberg).
Companies like dbt are in a constant (albeit so far successful) struggle with their openness. And every day, new founders seem to ask the same questions: do I open source? How much do I open source? And then how do I make money?
Caveat: I do believe it's not only possible, but incredibly valuable, to incorporate publishing open source for business purposes into your repertoire. It's the reason I keep writing extensively about it.
(6) Open Source is not the only option for data companies, but key challenges need to be solved.
In How to Become The Next 30 Billion $$$ Data Company, I argued that the only way to become a great data company is to rely heavily on open-source.
I don't think that's true anymore. I've studied the rise of many data companies over the year and I now think I have a better understanding of what I was trying to get at.
My new understanding is that you will need to rely heavily on three things:
some kind of standard/protocol,
a large degree of openness,
and network effects as a key part of your company strategy.
You might be able to do this with open source, you might cover only the first two with open source, or you may not use open source at all. Given the lesson before, I'm more inclined to recommend not defaulting to open source, but rather thinking deeply about your strategy in each of these categories.
While open source is the way a large number of companies entered the data space, it might not be the best option. I suggest first focusing on these three pillars, and then deciding whether it makes sense to utilize open source to make your business happen.
Thanks to P. for making me put this into much nicer words.
Now it's your turn: feel free to tell me I'm still completely wrong by hitting reply!
What did you think of this edition?
Want to recommend this or share this post publicly?
This newsletter isn’t a secret society, it’s just in private mode… You may still recommend & forward it to others. Just send me their email, or ping me on Twitter/LinkedIn, and I’ll add them to the list.
If you really want to share one post in the open, again, just poke me and I’ll likely publish it on Medium as well so you can share it with the world.