Data is Expensive; Dbt Refactoring; Data Mesh Platform Architecture; ThDPTh #46

Nov 18, 2021

This week I just started to prepare a workshop on refactoring in the data space, something that seems to be really hard for people.

Just then I stumbled over the “Dbt Labs Refactoring Guide” which surprised me with one thing: It’s missing the essential first step to refactoring.

Read about it below…

I’m Sven, I collect “Data Points” to help understand & shape the future, one powered by data.

Sven’s Thoughts

If you only have 30 seconds to spare, here is what I would consider actionable insights for investors, data leaders, and data company founders.

- The future of the data tooling market might look very different than it currently does. Instead of big X technologies, we might see more miniX technologies that can easily be combined into small components.

- The future of building data applications might look different than it currently does. Instead of big centralized pots of things, like ELT tools, data lakes, and serving layers, we might end up with what the software engineering side already uses successfully: cuts along business lines, with separate “pots” for e.g. the marketing data applications, and so on.

- Testing is still completely underrated in the data world. Almost all data developers still shy away from testing first and developing second. Even the dbt refactoring guide forgets to mention the word “test”, which should be the very first one.

- Platforms with heterogeneous user sets in cross-section need a lot more modularization than others. When taking a look at the common logical architecture of data mesh platforms, we will realize that it’s pretty modularized. But that fact doesn’t come from the specifics of a data platform but rather from the fact that it mediates at the very least between three parties.

Modern Data Stacks & Money!

🔥 What: Benn Stancil discusses something I would summarize as “now we got a bunch of cool new tools, but turns out, they all cost a lot of money and we feel like it’s not really worth it…”. He talks about the modern data stack which brought us all a long way, but also into the hands of dozens of companies. It’s a fractured space and the costs of data become quite diffuse. Benn thinks one of two things will happen: Data will become really valuable and we will all simply stop caring about the cost. Or two: market consolidation, a bloodbath of companies.

🐰 My perspective: As for what will happen, I am pretty convinced that the “bloodbath” will happen, and that it will happen again and again for one simple reason: all these companies will at some point compete in a bunch of major markets governed by network effects, which will inevitably lead to a bloodbath, as I’ve outlined a bunch of times in this newsletter.

But then again, I don’t think you think about Google as some bloody slayer of Search Engines (which it is following this analogy). So that will feel just fine. But it’s not gonna solve the cost problems, because

1) Winner-takes-all-markets create temporary monopolies and at least it’s not clear that this is a good thing for the consumers.

2) I don’t think the market structure is the problem!

I am much more concerned with the general idea. I think something very different will happen: We will start cutting technologies & data apps horizontally, along business lines, not along central technology lines anymore. Then the question simply becomes “how much does the marketing support data system cost?”, which is easy to answer because it will be wrapped into one thing!

That does not mean I think there will be an ETL tool just for CRM data in the future; I simply think today’s solutions are too fat.

It does mean I don’t think the centralized data lake is a concept that will last. Neither is the centralized data warehouse, nor the centralized data orchestrator.

A concept that might be here to last is “ETS” (extract, transform, serve) technologies, things that I’ve not seen much of on the market yet as a replacement for all this technological centralization. These are technologies roughly the same size as the current “data lake”, just cut very differently, allowing a whole business component to be wrapped into one.

But also simply smaller technology pieces, like a “miniLake” and a “mini-airbyte” which allow me to build a component just for the “marketing support system” where:

- my “mini-airbyte” lets me choose, via configuration, just the 2–3 connectors I need and is itself a really small piece of technology (without a GUI, …)

- my “miniLake” manages itself and is not accessible from anywhere else

- my “miniHex” allows me to have a small webserver serving just my 5–10 dashboards containing the marketing support system which then is integrated into my marketing automation tooling.

I hope you realize that this would be a 1–1 translation of what happened in software engineering. Now we are nowhere close to this world, but I do think it’s the step necessary to truly unravel the problem Benn is outlining.

benn.substack.com

Rewriting Legacy SQL for dbt

🚀 What: A guide to migrating legacy SQL code into modular (!) dbt models by refactoring. dbt Labs suggests a simple 5-step process consisting of:

1. Migrating code 1:1

2. Implementing sources over raw table references

3. Implementing CTEs

4. Separating a few of the CTEs into standardized layers

5. Auditing the output

🐰 My perspective: Ok, I’m a big advocate for applying software engineering best practices to data. Refactoring is one such practice, as is modularizing. However, I would have loved it if dbt Labs had taken the preferable route for refactoring, which is summed up in one single quote:

“Before you start refactoring, check that you have a solid suite of tests. These tests must be self-checking.” — Martin Fowler & Kent Beck in “Refactoring”.

Why? Because while the process above might work for migrating legacy code to Dbt in most circumstances, it does not work well in general for refactoring SQL code.

The thing is, refactoring code should be a central activity of every data developer.

I truly think refactoring is the fastest way to understand code, and thus to develop & advance it.

Thankfully, this is easily done even in dbt. I would suggest the following process, which works with or without dbt:

1. Write a test! For whatever you want to refactor. Run it, see it passing.

2. Do your refactoring.

3. Run your test again, check that it still works. (Which is the point of refactoring!)
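As a minimal sketch of these three steps, assuming a made-up `orders` table and a made-up revenue query (neither comes from the dbt guide), the test below pins down the output of the legacy SQL and then checks that a refactored version with a CTE still produces exactly the same rows. It uses only Python’s built-in `sqlite3` module.

```python
import sqlite3

# Hypothetical legacy query: revenue per customer via a nested subquery.
LEGACY_SQL = """
SELECT customer_id, SUM(amount) AS revenue
FROM (SELECT * FROM orders WHERE status = 'paid') AS paid
GROUP BY customer_id
"""

# The same logic after refactoring the subquery into a named CTE.
REFACTORED_SQL = """
WITH paid_orders AS (
    SELECT customer_id, amount FROM orders WHERE status = 'paid'
)
SELECT customer_id, SUM(amount) AS revenue
FROM paid_orders
GROUP BY customer_id
"""


def make_fixture_db():
    """Create an in-memory database with a handful of hand-written test rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (customer_id INTEGER, amount REAL, status TEXT)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, 10.0, "paid"), (1, 5.0, "paid"), (2, 7.0, "refunded"), (3, 2.5, "paid")],
    )
    return conn


def run_sorted(conn, sql):
    """Run a query and sort the rows so the comparison ignores row order."""
    return sorted(conn.execute(sql).fetchall())


def test_refactoring_preserves_output():
    conn = make_fixture_db()
    expected = [(1, 15.0), (3, 2.5)]
    # Step 1: the test passes against the legacy SQL.
    assert run_sorted(conn, LEGACY_SQL) == expected
    # Step 3: the same test still passes after the refactoring.
    assert run_sorted(conn, REFACTORED_SQL) == expected
```

Functions named `test_*` are picked up by pytest, or you can simply call `test_refactoring_preserves_output()` yourself after every small change.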

How does that work in Dbt? There are a lot of ways, one simple one is:

1. Run “dbt compile --select model_X” => this will generate plain SQL for you, written to the target/compiled folder of your project (useful if e.g. you just migrated your legacy SQL into dbt).

2. Take the plain SQL and use e.g. Python to create a few rows of test data inside e.g. an SQLite database, execute the plain SQL against them, and test the result against your business logic.

3. Do your refactoring, run your test again.
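A rough sketch of step 2 could look like the following. The file path, the project and model names, the `orders` table, and the business rule are all invented for illustration, and the sketch assumes the compiled SQL only uses table names and syntax that SQLite understands, which will often require some adaptation for a real warehouse dialect. If no compiled file is found, a stand-in query is used so the example runs on its own.

```python
import sqlite3
from pathlib import Path

# Hypothetical location: dbt writes compiled SQL under target/compiled/,
# but the project and model names used here are made up.
COMPILED_PATH = Path("target/compiled/my_project/models/customer_revenue.sql")

# Stand-in for the compiled model so the sketch runs without a dbt project.
FALLBACK_SQL = """
SELECT customer_id, SUM(amount) AS revenue
FROM orders
WHERE status = 'paid'
GROUP BY customer_id
"""


def load_model_sql(path=COMPILED_PATH):
    """Read the compiled SQL if it exists, otherwise use the stand-in query."""
    return path.read_text() if path.exists() else FALLBACK_SQL


def run_model(sql, orders):
    """Load test rows into a fresh in-memory SQLite database and run the model."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE orders (customer_id INTEGER, amount REAL, status TEXT)"
        )
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)
        # The model returns (customer_id, revenue) pairs, so dict() maps id -> revenue.
        return dict(conn.execute(sql).fetchall())
    finally:
        conn.close()


def test_only_paid_orders_count_as_revenue():
    # Business rule under test (an assumption): refunded orders are not revenue.
    orders = [(1, 10.0, "paid"), (1, 99.0, "refunded"), (2, 4.0, "paid")]
    assert run_model(load_model_sql(), orders) == {1: 10.0, 2: 4.0}
```

Unlike a plain before/after comparison, this test states the expected result explicitly, so it also catches the case where the legacy SQL was wrong to begin with.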

This is a simple process that can be automated in a few lines of code and run automatically while you’re developing. There are also a few dbt-specific ways, which I outlined a couple of months back in this repository: Dbt A TDD Workflow.

Yes, this might not feel like fun, but it’s the only way to ensure your business logic works — the only way to get your job done. And it is in my experience truly the fastest way to work with data.

docs.getdbt.com

Logical Architecture Planes of Self-Serve Data Mesh Platforms

☀️ What: Zhamak outlines the capabilities of a self-serve data platform in three planes, where a plane is “integrated yet separate”. Each plane has a set of related capabilities and thus serves a specific group of platform users through exposed physical interfaces. She outlines three exemplary planes, the “data infrastructure provisioning plane”, the “data product developer experience plane” and the “data mesh supervision plane” each targeted at different user groups.

🐰 My perspective: Zhamak uses the word “plane” to distinguish it from a “layer”, a term with strong hierarchical implications. She only wants a weak hierarchy.

From my perspective, this division is a result of the platform nature in general and actually should be found in any kind of platform. It stems from three facts:

1. Everything should be exposed via interfaces, even the layers inside software should be connected via interfaces.

2. A platform often serves a heterogeneous set of platform users, in this example at least three different kinds of users.

3. Platforms have to have a very flexible core because the interfaces should stay extremely fixed. That implies modularity at the core level.

If you take AWS as a platform, you can find a very similar architecture: one user can use the AWS CloudFormation “interface” to create their infrastructure, while a more advanced AWS user could use the APIs of the underlying services directly (or use another IaC provider) to interact with them.

If we take this perspective, then I think the key takeaway from this section of the data mesh article applies to all platforms with a heterogeneous set of platform users and is… “Keep your interfaces really fixed, lean towards a lot of modularity, and design the inner modules of the platform kernel in line with the heterogeneous set of platform users.”

The simple fact is that if you are mediating between three parties, as in the case of a data mesh self-serve platform, you will end up with far more modularity than a platform serving a single group of users, hence the rise of “plane architectures” and a much higher degree of modularization.

martinfowler.com

🎄 Thanks!

Thanks for reading this far! I’d also love it if you shared this newsletter with people whom you think might be interested in it.

Data will power every piece of our existence in the near future. I collect “Data Points” to help understand & shape this future.

If you want to support this, please share it on Twitter, LinkedIn, or Facebook.

And of course, leave feedback if you have a strong opinion about the newsletter!

P.S.: I share things that matter, not the most recent ones. I share books, research papers, and tools. I try to provide a simple way of understanding all these things. I tend to be opinionated. You can always hit the unsubscribe button!

By Sven Balnojan

Data; Business Intelligence; Machine Learning, Artificial Intelligence; Everything about what powers our future.

Three Data Point Thursday