Discover more from Three Data Point Thursday
🚀 The Future of BI is OS, md5() in SQL, Kitagawa on Platforms; ThDPTh #12 🚀
What the future of BI looks like, how to generate proper unique keys in SQL, and a final look at how to build data platforms.
Data will power every piece of our existence in the near future. I collect “Data Points” to help understand & shape this future.
If you want to support this, please share it on Twitter, LinkedIn, or Facebook.
🔥 (1) The Future of BI is Open Source
Maxime Beauchemin, the creator of both Apache Airflow and Superset, just published a great piece about why the future of business intelligence is open source. I totally agree with him and still find it mind-boggling that open source is only now catching up in this space. In BI, and in fact in most data topics, the cost of implementing something is usually governed by two drivers:
1. the “source” of data, meaning the number of different sources and their intrinsic complexity,
2. the “target”, meaning the number of use cases and their complexity and quality requirements.
Since this is the case, customers of BI tools, data integration tools, etc. have very heterogeneous needs. This is the perfect place to apply open source. Take a look at any purchased data tool in your pipeline. Does it fit all your use cases? I bet not. I bet that for at least 20% of your use cases you have to customize the hell out of it.
Read the piece. After reading it, I doubt you will go for anything other than an open-source solution for your BI stack.
While “software is [still actively] eating the world”, it’s also clear that open source is taking over software. Simply put, open source is a superior approach at building and distributing software…
💥 (2) The Most Underutilized SQL Function
Tristan Handy published a short article about the md5() hashing function in SQL. Simply put, the md5() function turns its input into an id that is unique for all practical purposes (accidental collisions are extremely unlikely). It’s a hashing function, so the same input always yields the same result, but reversing it isn’t feasible. Tristan Handy writes:
“[I believe] every single data model in your warehouse should have a rock-solid unique ID.”
And I agree. In fact, I see three benefits of having a contextless md5 id:
A unique md5 id is ONE join key, possibly replacing a combination of keys that together make a data set unique, and as such it speeds up joining AND development.
It abstracts away “domain knowledge”, which is a great thing, and thus kills the hidden assumptions that come with “domain knowledge”.
It stops you from exposing these keys to end-users, which hopefully helps you understand the true requirements.
I truly believe data teams should stay as close to the source data as possible. That includes not making ANY assumptions about the uniqueness of anything a “source” provides. Suppose you are handed a data set that contains:
An “item id”
An “order id”
You might assume that order id + item id is a unique combination, so you could join over “item id & order id”, but that makes your SQL statements more complex than they need to be. So a single join key would be better here. Why not add another column “itemId-orderId”?
If you use this column, you’ll indirectly assume that the combination itemId-orderId is unique, very likely without checking and certainly without making it visible to other developers. But what if updated orders actually add a new row with the same items? The solution would be to use:
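md5(itemId || '-' || orderId), i.e. both values concatenated as text with a separator (use your dialect's concatenation operator), so that different pairs cannot produce the same string
Implement a uniqueness check on the md5 column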
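Here is a minimal sketch of those two steps, run from Python against an in-memory SQLite database so that it is self-contained. The table name order_items, the column name order_item_key, and the sample rows are all made up for illustration. SQLite has no built-in md5(), so the sketch registers one; warehouses such as PostgreSQL ship their own. I also put a '-' separator between the two ids before hashing, which is my own assumption about how to concatenate safely: without it, the pairs (1, 23) and (12, 3) would both hash the string "123".

```python
import hashlib
import sqlite3


def md5_hex(value: str) -> str:
    """Return the hex MD5 digest of a text value."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()


conn = sqlite3.connect(":memory:")
# SQLite has no md5() function, so register a one-argument version for this demo.
conn.create_function("md5", 1, md5_hex)

# Hypothetical source table; nothing here is declared unique on purpose.
conn.execute(
    "CREATE TABLE order_items (order_id INTEGER, item_id INTEGER, quantity INTEGER)"
)
conn.executemany(
    "INSERT INTO order_items VALUES (?, ?, ?)",
    [
        (23, 1, 1),
        (3, 12, 2),
        (3, 12, 5),  # an updated order adds a new row with the same item
    ],
)

# Step 1: build a single surrogate key from item id and order id.
KEYED = """
    SELECT md5(CAST(item_id AS TEXT) || '-' || CAST(order_id AS TEXT)) AS order_item_key,
           order_id,
           item_id,
           quantity
    FROM order_items
"""

# Step 2: the uniqueness check. Any row returned here breaks the assumption.
duplicate_keys = conn.execute(
    f"""
    SELECT order_item_key, COUNT(*) AS n
    FROM ({KEYED}) AS keyed
    GROUP BY order_item_key
    HAVING COUNT(*) > 1
    """
).fetchall()

for key, n in duplicate_keys:
    print(f"key {key} appears {n} times")
```

With these sample rows the check reports one duplicated key, which is exactly the “updated order” case: the assumption fails loudly instead of silently fanning out a join.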
Take a look at some of your SQL statements and see where you apply uniqueness assumptions but really shouldn’t. Also, check where you could reduce complex to simple joins by using just one id.
There’s a single SQL function that I have come to use surprisingly often. What is it? md5()
☀️ (3) A final look at platforms by Justin Kitagawa
I talked a lot about platforms in my last newsletter. I have one last thought to share, and then I’ll leave you alone. Justin Kitagawa leads the dev platform efforts at Twilio. He describes some important shifts they made to arrive at a real “platform” feeling. In particular, he makes a lot of important points that focus on the “product side” of things. His four principles apply very well to data platforms or X-as-a-Service constructs:
API First. After all, it’s for developers, and interfaces are the foundational concept of platforms.
Self Service Platform. You’ll want people to do things themselves. In particular, you want them to be able to use it without talking to you, let alone needing an expert on their team.
Declarative over Imperative, to reduce cognitive load. Declarative constructs usually take away the “how”, which should be hidden inside the platform, and focus only on the “what”.
Design with empathy (for the developer/end-user). After all, it is a product they should love to use!
I’d like to add a fifth one:
Build Best Practices in. If you want developers to adhere to best practices on your platform, then make it worth their while! Either build the practices in, or at least explain and show how following them makes developers faster and better.
Ok, enough of data platforms.
Justin Kitagawa talks about Twilio’s DevOps culture of “You build it, you run it”, and the evolution, tenets, and lessons learned of Twilio’s internal Platform.
🎄 In other news & Thanks
P.S.: I share things that matter, not the most recent ones. I share books, research papers, and tools. I try to provide a simple way of understanding all these things. I tend to be opinionated. You can always hit the unsubscribe button!
Data; Business Intelligence; Machine Learning, Artificial Intelligence; Everything about what powers our future.