Real-Time dbt + trino, Cookies, Top50 Data; ThDPTh #66
I’m Sven, and this is the Three Data Point Thursday. The email that helps you understand and shape the one thing that will power the future: data. I’m also writing a book about the data mesh part of that.
Time to read this newsletter: 6 minutes.
Another week of data nuggets:
- The top 50 data companies have finally been identified
- Real-time data is becoming exponentially more important
- Real-world data science is messy but valuable, and not about fancy algorithms
What: a16z took the time to put together some data on the top 50 data start-ups, including valuation, funding, category, and more. It’s the first time I’m seeing such an analysis.
My perspective: A must-read.
What: Michiel De Smet shares a demo setup using trino, dbt, and a local data lake topped off with Hive & Iceberg. It delivers real-time data with trino as the query engine and a simple lambda architecture where:
- on a schedule, the full data is loaded into the data lake
- on a query, the delta is pushed down into the data source to get real-time results
For demo purposes, the architecture does not implement any sampling mechanism on the real-time path, as lambda architectures sometimes do.
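The batch-plus-delta pattern can be sketched in a few lines of Python. This is my own illustration, not code from Michiel’s demo; the table names, rows, and timestamps are all hypothetical stand-ins for the data lake snapshot and the push-down query:

```python
# Batch view: the full snapshot loaded into the data lake on a schedule.
batch_view = [
    {"order_id": 1, "amount": 100, "updated_at": "2021-11-01T00:00:00"},
    {"order_id": 2, "amount": 250, "updated_at": "2021-11-01T00:00:00"},
]
batch_loaded_at = "2021-11-01T00:00:00"

def fetch_delta(since):
    """Stand-in for the push-down query against the live source:
    only rows changed after the last batch load are fetched."""
    live_source = [
        {"order_id": 2, "amount": 300, "updated_at": "2021-11-04T12:00:00"},
        {"order_id": 3, "amount": 50, "updated_at": "2021-11-04T13:00:00"},
    ]
    # ISO-8601 strings compare correctly as plain strings.
    return [row for row in live_source if row["updated_at"] > since]

def query_realtime():
    """Serving layer: merge the batch view with the real-time delta,
    letting fresh rows win on conflicting keys."""
    merged = {row["order_id"]: row for row in batch_view}
    for row in fetch_delta(batch_loaded_at):
        merged[row["order_id"]] = row
    return sorted(merged.values(), key=lambda r: r["order_id"])
```

In the demo itself this merge happens in SQL via dbt models that trino executes, but the idea is the same: a cheap, stale batch view plus a small, expensive real-time delta.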
My perspective: I feel that neither data architectures nor the importance of real-time data gets enough attention these days. Real-time data will become the theme of the next decade, and for data to be used effectively, we need good architectures to enable it.
So I love that Michiel explains a complicated architecture so plainly and simply. Check it out and think about it. The obvious drawback of this solution is querying a live database on demand, which needs either throttling or some kind of caching layer/copy of the production system so that analytical queries don’t crash production.
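One simple mitigation for the on-demand load (my sketch, not something from the article) is a short TTL cache in front of the push-down query, so repeated dashboard refreshes hit the production database at most once per interval:

```python
import time

class TTLCache:
    """Caches the result of an expensive live-source query for `ttl` seconds,
    trading a little freshness for far fewer hits on production."""

    def __init__(self, fetch, ttl=30):
        self.fetch = fetch              # the real push-down query callable
        self.ttl = ttl
        self._value = None
        self._fetched_at = -float("inf")
        self.source_hits = 0            # counts actual queries to the source

    def get(self):
        now = time.monotonic()
        if now - self._fetched_at > self.ttl:
            self._value = self.fetch()  # only refresh when the TTL expired
            self._fetched_at = now
            self.source_hits += 1
        return self._value
```

A 30-second TTL is usually invisible to a human looking at a dashboard, yet it collapses a burst of identical queries into a single production hit. A read replica is the heavier alternative when even that one hit is too risky.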
What: This is an article by Alan Schelten, a data scientist and colleague of mine, about the practical troubles of doing learning-to-rank.
My perspective: I think this is the first time I’m sharing something written by a colleague in the newsletter, but I truly enjoyed reading this piece, and it fits the newsletter very well. It’s practical and explains the troubles the everyday data scientist deals with. I particularly like both the chosen optimization target, “customer satisfaction”, and the explanation that follows of how to pick a proxy to measure exactly that indirectly.
I think the summary is pretty simple: machine learning & data science in real life is not a Kaggle project. It is messy, and almost nothing is about cool algorithms. But it can also be amazingly effective: in this example, we’re talking about a possible 1% gain in revenue.
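To make the proxy idea concrete, here is a small, generic illustration — my assumption for the sake of example, not the metric from Alan’s article: ranking quality measured by NDCG over a binary “was this shown item bought” signal, which is one common way to turn a fuzzy goal like satisfaction into something a learning-to-rank model can optimize against:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: relevant items near the top count more."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances))

def ndcg(relevances):
    """NDCG: DCG normalized by the best possible ordering, so 1.0 is ideal."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Hypothetical proxy labels: 1 = the shown item was bought, 0 = ignored.
# Here the relevant items sit in positions 2 and 4 of the ranking.
shown_order = [0, 1, 0, 1]
score = ndcg(shown_order)
```

The messy part the article gets at is everything around such a formula: the labels are biased by what was shown, purchases are a lossy stand-in for satisfaction, and improving the proxy is only useful if it actually moves the real target.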
🎄 Thanks => Feedback!
Thanks for reading this far! I’d also love it if you shared this newsletter with people whom you think might be interested in it.
If you want to support this, please share it on Twitter, LinkedIn, or Facebook.
And of course, leave feedback if you have a strong opinion about the newsletter! So?
P.S.: I share things that matter, not the most recent ones. I share books, research papers, and tools. I try to provide a simple way of understanding all these things. I tend to be opinionated. You can always hit the unsubscribe button!
Data; Business Intelligence; Machine Learning; Artificial Intelligence; everything about what powers our future.