Thoughtful Friday #16: Why you should DAG twice
Let’s dive in!
DAGs in general are a bad thing, in a way.
Data systems should use fault-tolerant communication too, just like software systems.
Orchestration per se is a great thing, though! But the current DAG-centric view of it produces a few bad practices (IMO!).
This post was originally called “why DAGs & data orchestrators suck”. But after a few text exchanges with a very engaged data analytics engineer, I decided to switch the perspective toward making things better, instead of focusing on the downsides.
So let us instead talk about what a data orchestrator could do to be even more effective.
We'll first take a look at the status quo, the main claims of most data orchestrators, and why these claims per se are great. Then we'll look at why a data orchestrator also makes us prone to "bad practices", and conclude with what a data orchestrator could do to make us more effective.
1. Orchestration Status Quo
Let us look at Fivetran and their idea of orchestration. They claim the following benefits of orchestration:
“1. Without a well-defined execution order, data is prone to errors, e.g. data can be outdated, transformations can be performed with incomplete data or results can be returned later than required by the business.
2. Changes in schemas upstream can result in unidentified errors in subsequent tasks which expected a different input.
3. Changes in the data model downstream can cause unidentified errors by making input data incompatible.
4. The whole data journey can be very inefficient, for example, if multiple parallelizable processes are always run strictly sequentially.” (taken from Data Orchestration Explained – and Why You Shouldn't DIY)
This is a very decent list. Points 1–3 can be boiled down to "make dependencies explicit", and point 4 is one implication of making dependencies explicit: if we make dependencies explicit, we also make non-dependent tasks visible. Then we can go on to utilize both kinds of visibility!
Key idea: So data orchestration boils down to making dependent (encoded in a DAG) and non-dependent (encoded in two separate DAGs) tasks visible, and then utilizing that knowledge.
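To make that concrete, here is a minimal Python sketch (with made-up ingest/transform tasks; not from any particular orchestrator) of what "utilizing" both visibilities means: the two non-dependent tasks get submitted together and may run in parallel, while the dependent task waits for its upstream result.

```python
# Hypothetical sketch: three tasks, where "transform" depends on
# "ingest", but "ingest_other_source" is independent of both.
from concurrent.futures import ThreadPoolExecutor

def ingest():
    return "raw_rows"

def ingest_other_source():
    return "other_rows"

def transform(raw):
    return f"cleaned({raw})"

with ThreadPoolExecutor() as pool:
    # Non-dependent tasks: submitted together, may run in parallel.
    fut_a = pool.submit(ingest)
    fut_b = pool.submit(ingest_other_source)
    # Dependent task: must wait for its upstream result.
    cleaned = transform(fut_a.result())

print(cleaned)         # cleaned(raw_rows)
print(fut_b.result())  # other_rows
```

The dependency edge lives in `fut_a.result()`: the only place where one task truly has to wait on another.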
The main tool to accomplish this is the so-called DAG, which covers the dependent side of things. But here I feel that, since this is the only concept we have, the focus of data orchestration strays a lot from the actual goals (as promised above).
2. Bad Practices
In my opinion, the focus on DAGs enforces just one side of the promise of orchestration: making dependencies visible, which come in the black-and-white form of a DAG.
But as outlined above, the true value of orchestration isn't in "making dependencies visible" at all; it is in utilizing them. And there, I feel, a lot of bad practices creep into data workflows, like …
Creating large DAGs. Dependencies in general are a bad thing; as explained above, you want to have independent things so you can run them in parallel, and so that if one thing fails, the others still work.
Focusing on the DAG, and not its output. The DAG itself has no value at all to the data system it powers; only its output produces value. Making the DAG, and not its output, the key thing is like having a software team obsessed with the code instead of the actual compiled/running software the code powers.
Thinking in black and white: DAG or no DAG. Dependencies are never truly black and white. If I have a model which joins a fact table and a dimension table together, neither of these is a hard dependency. If things break, I can still join two old versions together, I can make sure the join works even if the dimension table is out of date to get the new facts in, or I can join and load only the new dimension.
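Here is a tiny Python sketch of that "grey" dependency, with a made-up dimension loader that fails: instead of failing the whole join, we fall back to the last known dimension snapshot and still get the new facts out. All names and data here are illustrative.

```python
# Hypothetical sketch: join new facts against the freshest dimension
# snapshot we can get, degrading gracefully instead of failing.
def load_dimension():
    raise RuntimeError("dimension source is down")  # simulate a failure

last_known_dimension = {1: "Germany", 2: "France"}  # stale but usable

def get_dimension():
    global last_known_dimension
    try:
        last_known_dimension = load_dimension()
    except Exception:
        pass  # degrade gracefully: keep the stale snapshot
    return last_known_dimension

new_facts = [(1, 100), (2, 250)]  # (dim_key, amount)
dim = get_dimension()
joined = [(amount, dim.get(key, "unknown")) for key, amount in new_facts]
print(joined)  # [(100, 'Germany'), (250, 'France')]
```

The point is that "dimension load failed" does not have to mean "no output": the join still runs, just against slightly older dimension data.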
Not managing “hidden dependencies”. If we’re saying two microservices A->B have a dependency, we usually mean they communicate through some interface. To add redundancy, making the communication fault-tolerant, we can add caches, retries, etc. That’s because the intersection is truly minimal.
When we have a DAG A->B we usually mean a heavier dependency. For instance, A might be an ingestion job into a database, and B the transformation. In that case, A & B share an interface (possibly a table) AND some infrastructure (the database). Sadly, data orchestrators don't actually do anything with the data or the infrastructure, so they are not able to do anything about the hidden dependency. This makes it hard to manage the dependency at all in quite a few cases.
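As a sketch of what managing that hidden dependency could look like, here is the microservice-style fault tolerance applied to the shared table: a generic retry-with-backoff wrapper around a (made-up) flaky table read. The `read_table` function and its failure pattern are purely illustrative.

```python
# Hypothetical sketch: treat the shared table as an interface and wrap
# reads in retries with exponential backoff, the way service-to-service
# calls are made fault-tolerant.
import time

def with_retries(fn, attempts=3, base_delay=0.01):
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff

calls = {"n": 0}

def read_table():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("database briefly unavailable")
    return ["row1", "row2"]

rows = with_retries(read_table)
print(rows)  # ['row1', 'row2']
```

Today this kind of wrapper usually lives in user code; the argument here is that the orchestrator, which already knows both sides of the A->B edge, would be a natural place for it.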
So if data orchestrators are a good thing, but might encourage bad practices, how can we mitigate that?
3. Encouraging Good Practices
Of course, you can start to avoid these bad practices by yourself. However, I would really love it if data orchestrators started helping you make good choices.
Data orchestrators obviously already know the good practices of data pipelines. Some of them, as outlined above, include:
Decouple DAGs, break them down, and remove dependencies.
Run workloads in parallel by default if possible.
Have fault-tolerant sub-DAGs. If one part breaks but the rest is only "a bit dependent", run it anyway.
Rerun failed tasks if possible.
Focus on getting the output there.
Reduce duplication by making sub-DAGs and DAGs easily configurable.
Possibly enable resending of messages, optional caching layers, … (or at least incorporate them in some way into the orchestrator!).
Make the data a first-class citizen, not just the output of the "workload".
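Pulling a few of these together, here is a minimal sketch of "fault-tolerant sub-DAGs" with a small blast radius: when a task fails, only its downstream tasks are skipped, while independent branches keep running. The DAG, the tasks, and the deliberately failing `ingest_a` are all made up for illustration.

```python
# Hypothetical sketch: run a DAG so that one failure only takes out
# its own downstream tasks, not the independent branches.
dag = {  # task -> upstream dependencies (insertion order is topological)
    "ingest_a": [],
    "ingest_b": [],
    "transform_a": ["ingest_a"],
    "transform_b": ["ingest_b"],
    "report": ["transform_a", "transform_b"],
}

def run_task(name):
    if name == "ingest_a":
        raise RuntimeError("source A is down")  # simulated failure
    return f"{name}: ok"

def run_dag(dag):
    status = {}
    for task in dag:
        if any(status.get(up) != "ok" for up in dag[task]):
            status[task] = "skipped"   # upstream not ok: don't run
            continue
        try:
            run_task(task)
            status[task] = "ok"
        except Exception:
            status[task] = "failed"    # alert developers here
    return status

print(run_dag(dag))
# ingest_a failed, so transform_a and report are skipped,
# but ingest_b and transform_b still ran.
```

The blast radius of `ingest_a` is exactly its downstream cone; the `ingest_b` branch delivers its output as usual, which is the behaviour argued for below.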
Good practices should lead to a simple state for your data system. Workloads should only break if true human intervention is needed. If that happens, developers should be alerted swiftly. And everything else should keep working as usual. The blast radius should be minimal.
Key idea: A great data orchestrator helps us build data systems that by default don't "break". Only parts of them break, and if they do, the rest keeps on operating and developers are alerted immediately.
The Data World talks about DAGs
I’m very thankful to Nick Schrock for challenging the draft of this post a LOT. So I’m quite happy to see that the data world is already starting to move in the right direction, although I feel there is still a long way to go, simply because the mindset of the data developer is very often still focused more on the technical process than on the data output.
The Dagster 1.0 launch together with GA is a big step toward a tool that puts more of the data output at the center of the work; I’m happy to see that happen.
Additionally, Benn Stancil put out an article called “Down with the DAG” illustrating much better than I could why the data output perspective is so crucial.
One key point from Benn’s post and one I argued with Nick over is the following idea:
Key Point: If really only the output of your data product/application is important, because it’s the thing that delivers the value, then you have a lot of freedom in implementing this or very similar output. That means you can design your dependencies and your DAGs in multiple different ways. There is no “DAG” inherent in a certain data output; the data developer builds the DAG, and they may choose to build a very different one, a better one, a loosely coupled one, with a very similar result which might end up being much more valuable.
What did you think of this edition?
Want to recommend this or have this post public?
This newsletter isn’t a secret society, it’s just in private mode… You may still recommend & forward it to others. Just send me their e-mail, or ping me on Twitter/LinkedIn, and I’ll add them to the list.
If you really want to share one post in the open, again, just poke me and I’ll likely publish it on Medium as well so you can share it with the world.