Prof. Wider on Data Mesh, TASTI, MDM, and AI; ThDPTh #54
I covered Sonal Goyal’s tool zingg.ai in the last newsletter, which led me down a small (Twitter) rabbit hole this week.
I ended up finding both a pretty interesting research article that seems completely underrated from a business-tech perspective and a piece by her on the problem she is taking on…
I’m Sven, I collect “Data Points” to help understand & shape the future, one powered by data.
If you only have 30 seconds to spare, here is what I would consider actionable insights for investors, data leaders, and data company founders.
- The Data Mesh paradigm is becoming practical. In contrast to the more theoretical & visionary articles published so far, a new ebook by Wider & Schultze turns much more to the practical side of things.
- In 10–15 years, almost all data will be unstructured, yet businesses are way behind in extracting value from unstructured data. Turning unstructured data like videos into structured data by using object detection, labeling, and the like is one way of extracting that value.
- Technologies for querying unstructured data are still immature. Most current techniques require very heavy lifting, like training a completely new neural network for a simple query such as “show me all videos with 10 black cars in them”. TASTI is an interesting approach for cutting at least some of that complexity. And it sounds to me like TASTI could scale to a cloud solution, which would both eliminate the need for training networks per query and enjoy tremendous data network effects.
What: Arif Wider and Max Schultze produced this lovely ebook about data meshes from a more implementation-heavy perspective. It covers a lot of ground but is more practical, as is to be expected, since it comes from one of the more public first implementers of the data mesh paradigm.
My perspective: I haven’t read most of it, but at a glimpse the content sounds really great. They cover a lot of topics we’re currently writing about in our book as well, like using business capabilities and, specifically, the journey into the data mesh world itself. Besides, I am a pretty big fan of Arif’s writing; in particular, his amazing article(s) on CD4ML cannot be recommended enough.
What: Unstructured data is the data type that will make up 90% of the datasphere in 10–15 years. So what do we do with that? How do we extract value from that? Certainly not by watching it.
What we’re currently able to do is turn unstructured data like a video into structured data using big fat neural networks. These networks, for example, label frames or detect objects.
But once we have a structured abstraction of a bunch of videos, we still need some way of searching through and querying them, e.g. to count the number of black cars in all videos.
Now as you might imagine, doing all this abstraction work means lots and lots of crunching on huge GPUs. TASTI uses a different approach. For a query like “find the number of black cars in all videos”, it takes one network that counts black cars, runs that network on just a small sample of videos, and then uses similarity measures to extrapolate from this small set to the complete space of videos.
So basically, TASTI consults an expert car counter on a few videos and then makes a good guess at the counts on all the other videos based on that expert’s judgments.
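For intuition, here is a toy sketch of that sample-then-extrapolate idea. This is not the paper’s actual implementation: the “embedding”, the expensive “oracle” model, and all the data are made-up stand-ins, and real systems use learned embeddings and far smarter index structures.

```python
import math
import random

random.seed(0)

# Stand-in for a cheap, task-agnostic embedding (e.g. a small CNN's
# feature vector). In this toy, each "video" already IS its features.
def cheap_embedding(item):
    return item

# Stand-in for the expensive oracle model, e.g. a big object detector
# that counts black cars in a video. Here: 1 if the features sum > 0.
def expensive_oracle(item):
    return 1 if sum(item) > 0 else 0

# 1000 toy "videos", each an 8-dimensional feature vector.
items = [[random.gauss(0, 1) for _ in range(8)] for _ in range(1000)]
embeddings = [cheap_embedding(x) for x in items]

# Run the expensive model on a small random sample only.
sample_idx = random.sample(range(len(items)), 50)
sample_labels = {i: expensive_oracle(items[i]) for i in sample_idx}

# Answer the query for every video by reusing the oracle's answer for
# the most similar sampled video in embedding space.
def approximate_label(i):
    if i in sample_labels:
        return sample_labels[i]
    nearest = min(sample_idx,
                  key=lambda j: math.dist(embeddings[j], embeddings[i]))
    return sample_labels[nearest]

approx = [approximate_label(i) for i in range(len(items))]
exact = [expensive_oracle(x) for x in items]
agreement = sum(a == e for a, e in zip(approx, exact)) / len(items)
print(f"agreement with running the oracle everywhere: {agreement:.0%}")
```

The point of the trick: the expensive model ran 50 times instead of 1000, and the cheap similarity lookup did the rest.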
My perspective: Unstructured data is the data question of the next 10–15 years. This stuff is going to make up almost everything we see in the data space. Companies going in this direction will be able to leverage extreme data network effects, building extremely well-tuned systems that will probably surpass humans at pattern recognition on a large scale.
I also really like the ideas used in the approach mentioned in this paper. With some smart generalizations, this could turn into a pretty interesting start-up idea.
I hope someone will pick up a few of these ideas and found a company around them.
What: Sonal Goyal makes a case for her current approach to solving the “hey, we got five different partial addresses for this customer, can someone please join them together?” problem.
My perspective: The piece is short, but something about the problem is pretty appealing, probably because I know the problem myself. I also know the usual solution: human labor. Repeated human expert knowledge & pattern recognition really is the standard way of merging these “customer attributes” into one master data set. So from a high-level perspective, this problem seems both worth solving and solvable with machine learning.
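To make the problem concrete, here is a toy sketch of rule-of-thumb record matching. This is not zingg’s actual algorithm: the records, fields, similarity measure, and threshold are all made-up stand-ins, and real entity-resolution systems learn what “similar enough” means instead of hard-coding it.

```python
from difflib import SequenceMatcher

# Five partial records that all describe the same (fictional) customer,
# plus one unrelated record.
records = [
    {"name": "Jon Smith",   "address": "12 Main Street, Springfield"},
    {"name": "John Smith",  "address": "12 Main St., Springfield"},
    {"name": "J. Smith",    "address": "12 Main Street"},
    {"name": "John Smith",  "address": ""},
    {"name": "Smith, John", "address": "Main Street 12, Springfield"},
    {"name": "Ada Jones",   "address": "99 Elm Road, Shelbyville"},
]

def similarity(a, b):
    """Crude record similarity: average fuzzy match over shared fields."""
    scores = []
    for field in ("name", "address"):
        if a[field] and b[field]:
            scores.append(SequenceMatcher(
                None, a[field].lower(), b[field].lower()).ratio())
    return sum(scores) / len(scores) if scores else 0.0

# Greedy clustering: put each record into the first cluster whose
# representative it resembles closely enough, else open a new cluster.
THRESHOLD = 0.6
clusters = []
for rec in records:
    for cluster in clusters:
        if similarity(rec, cluster[0]) >= THRESHOLD:
            cluster.append(rec)
            break
    else:
        clusters.append([rec])

for i, cluster in enumerate(clusters):
    print(f"cluster {i}: {[r['name'] for r in cluster]}")
```

Even this crude sketch groups the five “Smith” variants together and keeps the unrelated record apart; the hard part, which ML tackles, is doing this reliably at scale without a hand-tuned threshold.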
🎁 Notes from the ThDPTh community
I am always stunned by how many amazing data leaders, VCs, and data companies read this newsletter. Here I share some noteworthy recent pieces that readers shared with me:
Nothing here this week.
🎄 Thanks => Feedback!
Thanks for reading this far! I’d also love it if you shared this newsletter with people whom you think might be interested in it.
Data will power every piece of our existence in the near future. I collect “Data Points” to help understand & shape this future.
If you want to support this, please share it on Twitter, LinkedIn, or Facebook.
And of course, leave feedback if you have a strong opinion about the newsletter! So?
Data; Business Intelligence; Machine Learning, Artificial Intelligence; Everything about what powers our future.