SCDs, data-centric AI, data science infrastructure; ThDPTh #73
I’m Sven, and this is Three Data Point Thursday, the email that helps you understand and shape the one thing that will power the future: data. I’m also writing a book about the data mesh part of that.
Time to Read: 3 mins
Another week of data thoughts:
- SCDs again.
- Data-Centric AI is still interesting, and worth a look.
- Data Science infrastructure is still a huge bowl of spaghetti.
🐰🐰🐰🐰🐰🐰🐰🐰🐰🐰🐰🐰🐰🐰🐰
Slowly Changing Dimensions
What: Cedric Chin takes a look at how changes in “dimensional data” have been handled over time. Although “dimensional” doesn’t really cut it: it’s about anything that changes, where you don’t get a “history” unless you record it yourself.
My perspective: I like his explanation of why the functional approach proposed by Maxime Beauchemin is better. Boiled down to a few lines of code, Beauchemin’s approach (keep a full snapshot of the dimension in every date partition) makes using historical data as simple as this:
-- With current attribute
-- (the {{ ... }} macro is rendered by the templating engine before the query runs)
SELECT *
FROM fact a
JOIN dimension b ON
    a.dim_id = b.dim_id AND
    b.date_partition = '{{ latest_partition('dimension') }}'
-- With historical attribute
SELECT *
FROM fact a
JOIN dimension b ON
    a.dim_id = b.dim_id AND
    a.date_partition = b.date_partition
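To make the two joins concrete, here is a minimal sketch in Python using the standard-library sqlite3 module. Everything in it is an assumption for illustration: the table contents, the tier column, and the latest_partition helper, which merely stands in for the templating macro in the queries above by looking up the newest date_partition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dimension (dim_id INTEGER, tier TEXT, date_partition TEXT);
    CREATE TABLE fact (order_id INTEGER, dim_id INTEGER, date_partition TEXT);

    -- One full snapshot of the dimension per day: customer 1 upgrades on day 2.
    INSERT INTO dimension VALUES
        (1, 'free', '2021-01-01'),
        (1, 'paid', '2021-01-02');

    INSERT INTO fact VALUES
        (100, 1, '2021-01-01'),
        (101, 1, '2021-01-02');
""")


def latest_partition(conn, table):
    """Hypothetical stand-in for the templating macro: newest partition of a table."""
    # The table name is interpolated directly; acceptable only because it is hard-coded here.
    (value,) = conn.execute(f"SELECT MAX(date_partition) FROM {table}").fetchone()
    return value


# With current attribute: every fact row sees the newest snapshot.
current = conn.execute(
    """
    SELECT a.order_id, b.tier
    FROM fact a
    JOIN dimension b ON
        a.dim_id = b.dim_id AND
        b.date_partition = ?
    ORDER BY a.order_id
    """,
    (latest_partition(conn, "dimension"),),
).fetchall()

# With historical attribute: every fact row sees the snapshot of its own day.
historical = conn.execute(
    """
    SELECT a.order_id, b.tier
    FROM fact a
    JOIN dimension b ON
        a.dim_id = b.dim_id AND
        a.date_partition = b.date_partition
    ORDER BY a.order_id
    """
).fetchall()

print("current:   ", current)
print("historical:", historical)
```

In this toy data, order 100 shows up as “paid” in the current view but as “free” in the historical view, because the customer was still on the free tier on the day of that order.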
Resource: https://www.holistics.io/blog/scd-cloud-data-warehouse/
Data-Centric AI Overview
My perspective: The article claims that 80% of AI projects fail. I would like to point out that 80% of all product ideas fail, not just AI ones. And IMHO a product idea should be just that, not divided into an “AI” or “non-AI” idea. That sounds a lot like what I like to call an “implementation detail”.
Nevertheless, the idea of data-centric AI is pretty simple:
Over the past 10 years, almost all research effort went into improving models, and practitioners followed with the same focus. Data-centric AI points out that, as a result, improving the data sets probably yields a higher return than improving the algorithm.
Twitter summary of data-centric AI: Focus on the data, not the algorithm.
As usual, I’m not sure about the numbers, but I like the general perspective of putting the data first, and then the code/algorithm.
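A contrived sketch of what “data first” can look like in code: the model stays fixed and deliberately simple, and only the training labels change. The data points, the mislabeled examples, and the threshold model are all made up for illustration; this toy shows the mechanics and proves nothing about real projects.

```python
def fit_threshold(samples):
    """Fit a deliberately simple model: the midpoint between the two class means."""
    zeros = [x for x, label in samples if label == 0]
    ones = [x for x, label in samples if label == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2


def accuracy(threshold, samples):
    """Share of samples where 'x above threshold' matches the label."""
    correct = sum((x > threshold) == bool(label) for x, label in samples)
    return correct / len(samples)


# Made-up data: the true class boundary sits at x = 5.
# In the noisy set, the points 8 and 9 carry the wrong label.
noisy_train = [(1, 0), (2, 0), (3, 0), (4, 0), (6, 1), (7, 1), (8, 0), (9, 0)]
# The "data-centric" step: fix the labels, leave the model code untouched.
clean_train = [(x, int(x > 5)) for x, _ in noisy_train]

test_set = [(0.5, 0), (3.5, 0), (4.8, 0), (5.2, 1), (5.4, 1), (6.5, 1), (9.5, 1)]

noisy_acc = accuracy(fit_threshold(noisy_train), test_set)
clean_acc = accuracy(fit_threshold(clean_train), test_set)
print(f"noisy labels: {noisy_acc:.2f}, cleaned labels: {clean_acc:.2f}")
```

The two wrong labels drag the learned threshold from 5.0 to 5.5, which is enough to misclassify the two test points closest to the boundary; fixing the labels recovers them without touching the model.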
Data Science & Infrastructure
What: The start-up iterative.ai released a plugin (a “provider”) for Terraform, the most common infrastructure-as-code tool, built specifically to make machine learning workloads easy to run.
My perspective: For any machine learning engineer or data scientist, the world of infrastructure must look like a huge bowl of spaghetti.
The fun part: It’s their job to pick out the 2-3 good strands needed to fully realize their idea, run it at scale, and get it into production.
There are already a few tools like Metaflow trying to tackle that challenge, but I also keep on thinking about one fundamental truth:
The idea that “software is a platform” is simply a lie. No software, abstraction, or plugin is ever going to take away the “pain” of dealing with data structures, how data is stored, and how computations are executed.
In fact, there is only one good solution for you: Fully embrace the data structures & computations.
That doesn’t mean it’s not going to be easier in the future. But I have a strong feeling that only two ideas will win in the end:
One, for the data people out there: The strong, 10-100x data people will be the ones who fully embrace how data is stored & how computations are run. They will be able to write code that utilizes this knowledge and aligns with the machine, rather than trying to hand this over to an abstraction layer.
Two, for businesses building these infrastructure abstractions: The winning solutions will fall into two categories. The first group will indeed abstract everything away; these will target the non-experts and the quick-n-easy wins. The second group will target everyone else and do so by helping them utilize the data storage formats, the computational strategies, and the like. Their goal will be to enable everyone to operate on a higher level.
Obviously, I can’t see into the future; I’m just extrapolating from the gaming industry and companies like Unity, which showcase exactly this development.
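As a small illustration of what “embracing how data is stored” can mean, here is a hedged sketch contrasting a row-oriented and a column-oriented layout for the same records. The records and field names are made up, and plain Python containers only mimic the layouts; the point is which data each aggregate has to touch, not the speed of this particular code.

```python
from array import array

# Made-up records; the field names are illustrative only.
rows = [
    {"user_id": 1, "country": "DE", "amount": 10.0},
    {"user_id": 2, "country": "US", "amount": 25.5},
    {"user_id": 3, "country": "DE", "amount": 4.5},
]

# Row-oriented: each record is stored together, so an aggregate over one
# field still has to walk through every full record.
total_from_rows = sum(row["amount"] for row in rows)

# Column-oriented: each field is stored on its own, so the same aggregate
# reads a single typed array and never touches the other fields.
columns = {
    "user_id": array("q", (row["user_id"] for row in rows)),
    "country": [row["country"] for row in rows],
    "amount": array("d", (row["amount"] for row in rows)),
}
total_from_columns = sum(columns["amount"])

print(total_from_rows, total_from_columns)
```

Both layouts give the same answer; what differs is how much unrelated data sits in the way of the computation, which is exactly the kind of knowledge an abstraction layer tends to hide.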
Resource: https://dvc.org/blog/terraform-provider
🎄 Thanks => Feedback!
Thanks for reading! I’d also love it if you shared this newsletter with people who you think might be interested in it.
And of course, please provide me with feedback:
It is terrible | It’s pretty bad | Average newsletter… | Good content… | I love it!