Airflow at Scale, SQL+Jinja or Not, X is the next big thing; ThDPTh #64
I’m Sven, and this is the Three Data Point Thursday. The email that helps you understand and shape the one thing that will power the future: data. I’m also writing a book about the data mesh part of that.
Another week of fun!
What’s the next big thing in data this year?
Apache Airflow at scale: possible, if you treat Airflow as an orchestrator only.
SQL + Jinja is not going to cut it; we need a true library, a language.
🐰🐰🐰🐰🐰🐰🐰🐰🐰🐰🐰🐰🐰🐰🐰
For Data it's the year of X (fill in the X)
What: Turns out I like crypto. But the part I’m sharing here reminded me sooo much of stuff going on in the data space right now that I had to share it…
“Step 1: You hear a strange new word (“NFTs”, “DeFi”, “Eeeeeeth”, etc.)
Step 2: You ignore it
Step 3: You hear it again
Step 4: Your cousin Bobby quits his job, because he made $8M from that strange new word
Step 5: You immediately Google the strange word
Step 6: Your head hurts. The concept is confusing. You secretly hate Bobby.
Step 7: You become a believer
First, it was Bitcoin, then it was Ethereum, then it was DeFi, then it was NFTs, and I believe next…is Stablecoins”
My perspective: OK, so now you gotta do some filling in. What are those strange words to you? And much more importantly: what are the stablecoins of data? ELT seems to be an oldie already. Observability and the modern data stack, however, seem to be at the forefront. As is the “data mesh”. But just as with stablecoins, the question is: for which one of these fancy new data words is the “why now?” moment, well, now?
Now it’s your turn: go read the piece and then conduct your own little analysis. What’s the X?
Apache Airflow Problems and Solutions
What: Eitan Chazbani from Databand shares his thoughts on running Airflow at scale, what problems it poses and how to solve them.
My perspective: I like how Eitan spells out the following thought: “you can run Airflow at scale, but its permissiveness makes it easy for you to mess it up”.
Eitan makes one good point right at the beginning: Airflow is a pure orchestrator! Or at least it should be used as such. It’s not supposed to store data or execute the work itself.
I always find it hard to believe that the alternatives, Dagster and Prefect, are supposed to scale better.
I get that they address a bunch of issues and are more opinionated. But scaling is something that comes with the robustness of code, because you want to be able to scale in the majority of environments. And there, Airflow simply has a much larger contributor base, more contributions, and more life-years.
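To make the “orchestrator only” idea a bit more tangible, here is a minimal sketch in plain Python. It is not Airflow code and not how Airflow is implemented; the task names, the returned locations, and the orchestrate function are all made up for illustration. The point it tries to show: the orchestrator only knows the order of things and small references to results, while the data and the heavy lifting live somewhere else.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical stand-ins for external systems (a warehouse, a cluster, ...).
# Each one would submit work elsewhere; here it only returns a small reference.
def extract() -> str:
    return "s3://example-bucket/raw/orders"      # made-up location

def transform() -> str:
    return "warehouse.analytics.orders_clean"    # made-up table name

def report() -> str:
    return "dashboard:orders"                    # made-up dashboard id

TASKS = {"extract": extract, "transform": transform, "report": report}

# Maps each task to the set of tasks it depends on.
DEPENDENCIES = {"extract": set(), "transform": {"extract"}, "report": {"transform"}}

def orchestrate(tasks, dependencies):
    """Trigger tasks in dependency order and keep only the references they return."""
    references = {}
    for name in TopologicalSorter(dependencies).static_order():
        references[name] = tasks[name]()  # the actual work happens elsewhere
    return references

if __name__ == "__main__":
    for task, reference in orchestrate(TASKS, DEPENDENCIES).items():
        print(task, "->", reference)
```

Nothing in orchestrate ever touches a row of data, which is roughly the discipline the article asks for.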
What: Furcy Pin recalls a legendary article written by Maxime Beauchemin and in particular a part about mountains of templated SQL, and why it’s a total antipattern.
“It’s pretty clear to me that combining SQL with Jinja templating doesn’t provide the proper foundation for these emerging constructs.”, Maxime
Furcy goes really deep in this post, starting with the history of data frames, and also provides a POC on GitHub to make the point.
My perspective: Furcy’s points are sound, as are many other perspectives around the industry. On a meta level we have two facts about SQL right now:
1. SQL as a language is massively popular due to its simplicity.
2. SQL as a language is massively limited due to its simplicity.
Ouch, there you’ve got the problem. Of course, we all know the easy solution to tackling 2 while keeping 1: a really good abstraction on top of SQL. Something that allows me to write:
“SELECT * from table”
But also
“order_alphabetically(SELECT * from table, custom_module_function())”
And have both executed in a similar fashion.
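As a rough illustration of that wish (not a solution to it), here is a minimal Python sketch using the standard-library sqlite3 module. Query, order_alphabetically, sort_column, and the fruits table are all names I made up for this example; sort_column plays the role of custom_module_function() from the pseudo-call above.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    """Hypothetical wrapper: a piece of SQL that can be run or composed further."""
    sql: str

    def run(self, conn: sqlite3.Connection) -> list:
        return conn.execute(self.sql).fetchall()


def sort_column() -> str:
    """Stand-in for custom_module_function(): ordinary Python picks the column."""
    return "name"


def order_alphabetically(query: Query, column: str) -> Query:
    """Return a new Query that wraps the given one as a subquery and sorts it."""
    if not column.isidentifier():  # crude guard, since the name is pasted into SQL
        raise ValueError(f"not a plain column name: {column!r}")
    return Query(f"SELECT * FROM ({query.sql}) ORDER BY {column}")


def demo() -> list:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE fruits (name TEXT)")
    conn.executemany("INSERT INTO fruits VALUES (?)", [("pear",), ("apple",), ("fig",)])
    plain = Query("SELECT * FROM fruits")
    ordered = order_alphabetically(plain, sort_column())
    # Both objects are executed in exactly the same way.
    return [plain.run(conn), ordered.run(conn)]


if __name__ == "__main__":
    for rows in demo():
        print(rows)
```

Under the hood this is still just string composition, so it shares the weaknesses the article describes. It only shows the shape of the interface: plain SQL and a Python function applied to SQL end up as the same kind of object.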
I don’t have a good answer, and I haven’t seen one out there yet. Sounds like a good business opportunity, one that will “grow the pie” for everyone but also one that needs a high level of systems thinking. I’m curious how this will play out in the future.
I recommend you read the article, though, because it makes a good point about the limitations of SQL and what we would really like to do with it.
towardsdatascience.com
🎄 Thanks => Feedback!
Thanks for reading this far! I’d also love it if you shared this newsletter with people whom you think might be interested in it.
Data will power every piece of our existence in the near future. I collect “Data Points” to help understand & shape this future.
If you want to support this, please share it on Twitter, LinkedIn, or Facebook.
And of course, leave feedback if you have a strong opinion about the newsletter!
P.S.: I share things that matter, not the most recent ones. I share books, research papers, and tools. I try to provide a simple way of understanding all these things. I tend to be opinionated. You can always hit the unsubscribe button!