Platforms; Dbt; Apache Iceberg in 2 sentences; ThDPTh #47
I just read the simplest explanation of a table format like Apache Iceberg.
Go read it too! See below…
I’m Sven, I collect “Data Points” to help understand & shape the future, one powered by data.
Sven's Thoughts
If you only have 30 seconds to spare, here is what I would consider actionable insights for investors, data leaders, and data company founders.
- Table formats like Iceberg, Hudi, and Delta Lake challenge the Snowflakes of this world. By bringing the same functionality, plus the ability to exchange parts on the fly, table formats might completely wipe out the vendor lock-ins Snowflake et al. are currently building up.
- If the user needs a cheat sheet, your product probably needs a usability update.
- Platforms are hard to define, but taking a general perspective actually makes them much more manageable. Baldwin & Woodard take the architecture perspective that distinguishes just three parts of a platform, which makes it really easy to manage the big picture of ANY (and I mean any) platform.
- Using the architecture perspective, an internal developer platform is managed the same way an external marketplace is.
- There is likely no platform where the so-called “complements” do not change over time or in cross-section. Although it is tempting to think so.
🔥What: Erika Pullum shares her tips & tricks for dbt on GitHub, with lots of links, articles for beginners, and plenty of expert knowledge, like how to run the parent models of a model identified by a specific tag.
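As a hedged sketch of that last trick (the tag name "nightly" is hypothetical): dbt's graph operators let you select tagged models together with their ancestors, so something like the following should run the tagged models plus everything upstream of them:

```shell
# Select all models tagged "nightly" plus their parents — the "+" prefix
# walks upstream through the DAG. Run from inside a dbt project.
dbt run --select +tag:nightly

# To preview which models would be selected without running anything:
dbt ls --select +tag:nightly
```

The same `--select` syntax works for `dbt build` and `dbt test` as well.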
🐰My perspective: I always love knowledge sharing, so it’s great that Erika shares her knowledge with other dbt users. And make no mistake, dbt is a great tool, one that filled a huge hole, and I’m still amazed that no one went on to create yatot (yet-another-transformation-only-tool). But I stumbled over the idea that I’d need a cheat sheet for it.
I don’t have a Mac cheat sheet; that thing is simply pleasant to use on its own. My theory for why dbt needs a cheat sheet and my Mac doesn’t is that dbt is not yet fully integrated into a full development cycle, one which is almost self-explanatory. That might simply be a symptom of the immaturity of the analytics engineering space, but it might also be an opening for dbt to grow into.
I don’t know the right answer, I just stumbled over it.
🔥What: Turns out, whenever I talk to people about “platforms” we often mean different things. So let’s explore that a bit.
There are apparently a lot of different definitions of “platform economy”, “platform products”, or “platforms” in general. A lot of them apply only to “current platforms”, meaning the likes of Uber & Airbnb. This also means they don’t carry the deeper variety of strategic insights one might gain from a more general definition.
Here’s one definition from one of my personal favorite Nobel prize winners:
“The central role of “platform” products and services in mediating the activities of disaggregated “clusters” or “ecosystems” of firms [or other entities] has been widely recognized.”
(Rochet and Tirole, 2003)
However, this post is not about this definition but about an extension: a look at the generic architecture of all platforms that follow the above generic definition. Baldwin & Woodard make a good case for three such elements:
1. The core of the platform — a modularized complex system meant to evolve all the time
2. The interface of the platform — stable & fixed for most of the time
3. The complements, the heterogeneous & ever-changing things that connect to the platform (be that users or tools or datasets)
🐰My perspective: I like the following simple idea: If platforms are hard to define, but share a common architecture, we can just as well look into the architecture to draw our strategic conclusions.
An example developer platform:
Let’s understand this a bit more. Take a company-internal developer platform that allows for the creation & deployment of REST APIs. Suppose a developer could deploy an API with a simple check-in of a JSON config:
{ "name": "foo", "endpoint_url": "bar", "image": "<docker image with an exposed endpoint>" }
The platform then creates a few Docker containers, deploys them, handles the routing, load balancing, etc.
That’s a developer platform. The “complements”, the changing parts we mediate between, are:
- the developers
- the deployment infrastructure (e.g. a Kubernetes cluster)
The fixed part is the interface: the JSON config format, the place this JSON has to be checked into, the documentation, and so on…
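A minimal sketch of what using that fixed interface might feel like; the file path, registry, and API name are all hypothetical:

```shell
# The developer only ever touches the fixed interface: one JSON file,
# checked into an agreed-upon place. Everything behind it can evolve.
cat > apis/orders.json <<'EOF'
{ "name": "orders", "endpoint_url": "/v1/orders", "image": "registry.internal/orders:1.4.2" }
EOF
git add apis/orders.json
git commit -m "deploy: orders API"
git push  # the platform picks this up: containers, routing, load balancing
```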
Strategic insights for such a developer platform:
If we take the architectural perspective, we can quickly derive a few high-level strategic insights.
1) The interface, as said, should stay fixed, or at the very least new versions should be backward compatible. This enables developers to extract the most value from the platform without being interrupted in their flow.
2) The interface should abstract almost everything about the “deployment infrastructure” away, because that is a pretty specialized topic requiring specialized skills. To mediate between these two, it is necessary to remove as much friction as possible.
3) Deployment infrastructure pieces are constantly evolving. Make sure to keep on evolving the platform to deliver more and more value to the developers.
4) Developers vary both over time and in cross-section, meaning you will have to target different sets of developers, e.g. by allowing advanced (optional!) configurations for the more infrastructure-savvy developers.
Application to data mesh self-serve platforms:
Data mesh self-serve platforms have even more moving parts because the complements are split into more units:
1) data producers
2) domain data products
3) data consumers
It is tempting to think because data platforms are internal, the “complements” are fixed, but they are not. You’ll build such a platform iteratively, meaning the first “complements” will be very different from the last. In addition, things like “data products” will change all the time, whereas consumers & producers change more slowly, but change as well.
So the complements of this kind of internal platform are extremely volatile, both in cross-section and over time. To keep the interface as stable as possible, you only have one good option: not upfront design, which won’t work because of the temporal variation, but a minimalist approach, e.g. the idea of the “Thinnest Viable Platform” (TVP).
Mediating between three different complements necessarily means a higher degree of modularization than in the example above. You might’ve already guessed this, but it is also apparent from the major data mesh articles, which put a lot of focus on the modularization & separation of the platform interfaces.
So to put it into simple words, for a data mesh platform you’ll have to:
- start as minimal as possible (with a reasonable amount of foresight ;)), and then
- iterate quickly and a lot, by
- using lots of modularization.
Generic strategic insights for platform building:
The whole purpose of a platform is to “mediate”, or in other words: To deliver maximal value to the mediated parties.
However, that can be very asymmetric. For a developer platform, you’re not delivering value to the tooling of course but to the developers. For a data mesh platform, a lot of the value delivery is on the data consumer side, a bit less on the producer side, and none on the data product side.
So first, you’ll have to understand your value in terms of the changing set of complements. Then you keep the platform interface as fixed as possible while iterating on the kernel to deliver more value to the changing set of complements. That means three things:
1) modularizing the kernel, possibly exposing parts of it (keep the stable interface in mind though! Think a lot about optionality here)
2) adding new complements (adding new developer tools, or tooling for a new data producer persona)
3) increasing the value for the existing complements
I find it truly important to have a good mental model in mind, and this one suits all platforms really well, be it the next Stripe or “just” your internal wiki. The mental models of network effects & “the flywheel” will help you understand the “complements” side of things really well, but they will not tell you about the interface & the modularity of the kernel side of things.
They also make it harder to take a single perspective on internal & external platforms, whereas the architecture perspective clearly captures the point, and might even help you understand at which point you can kick things off and turn an internal platform into an external one (that’s exactly what Amazon did with AWS).
🔥 What: Ryan Blue from Tabular just wrote down a great explanation of table formats like Delta Lake, Hudi &, of course, Apache Iceberg, the project he co-founded.
To sum it up in two sentences…
“I’ve found that the easiest way to think about it is to compare table formats to file formats. A table format tracks data files just like a file format tracks rows. And like a file format, the organization and metadata in a table format help engines to quickly find the data for a query and skip the rest.”
🐰 My perspective: Table formats bring to object stores what databases already have. Why? Because databases try not to distinguish between “files” and “rows”: from the beginning, they were made to abstract away the file storage, which still sits beneath every single database.
So are table formats simply reinventing something that is already there? Yes and no. From my perspective, they take what is there and rebuild a specific set of database features on top of it, like ACID guarantees (but not necessarily those!).
But the fun part is, this doesn’t lead to a database. On the contrary, it leads to a very modularized, database-like data lake. And “modularized” is the fun part! It’s almost impossible to exchange the compute engine on almost all databases, but with table formats, it’s as easy as pie.
It’s just as easy to exchange the storage system, which again is really hard in the database world (although Snowflake now has something called “external tables”).
If all goes well, with table formats you will be able to get out of vendor lock-in and have the data-X you need, instead of the one you bought.
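A hedged sketch of what that engine swap looks like in practice, assuming both engines are already configured against the same Iceberg catalog (the catalog, schema, and table names are hypothetical, and each engine names the shared catalog locally):

```shell
# Write a table with one engine (Spark, via its Iceberg integration)…
spark-sql -e "CREATE TABLE lake.db.events (id BIGINT, ts TIMESTAMP) USING iceberg"
spark-sql -e "INSERT INTO lake.db.events VALUES (1, current_timestamp())"

# …and read the very same table with a completely different engine (Trino).
trino --execute "SELECT count(*) FROM iceberg.db.events"
```

No export, no copy: both engines just read the same table metadata and data files in the object store.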
Con: There is, of course, one downside that is hard to ignore: storage & computation are coupled in databases for a reason, namely that storing data a certain way (say, columnar) makes computation easier. I have no idea yet where that is headed, but I like the modularization aspect.
🎁 Notes from the ThDPTh community
I am always stunned by how many amazing data leaders, VCs, and data companies read this newsletter. Here are some of the readers’ recent noteworthy pieces.
Pierre Brunelle, the CPO of noteable.io (I had to type that in three times till the auto-correct stopped auto-correcting!) explains his vision for the collaborative notebook platform noteable.io.
Prukalpa Sankar, the co-founder of Atlan, recently published an article called Data Governance Has a Serious Branding Problem in TDS.
Only for email-based subscribers & interesting data-related topics: To share your projects, just reply to the email and leave me a quick note with your name, company, and one sentence about your article/project. I select a small list of the most fitting pieces.
🎄 Thanks => Feedback!
Thanks for reading this far! I’d also love it if you shared this newsletter with people whom you think might be interested in it.
Data will power every piece of our existence in the near future. I collect “Data Points” to help understand & shape this future.
If you want to support this, please share it on Twitter, LinkedIn, or Facebook.
And of course, leave feedback if you have a strong opinion about the newsletter! So?
It is terrible | It’s pretty bad | average newsletter… | good content… | I love it!
P.S.: I share things that matter, not the most recent ones. I share books, research papers, and tools. I try to provide a simple way of understanding all these things. I tend to be opinionated. You can always hit the unsubscribe button!
Data; Business Intelligence; Machine Learning; Artificial Intelligence; everything about what powers our future.