Effective DS Infra, MetaCards, Data Domains; ThDPTh #55
Ville Tuulos seems pretty busy lately: he’s close to finishing his book and just put out an interesting piece on MetaCards, a new feature for machine learning pipelines built with MetaFlow.
Read about it below!
I’m Sven, I collect “Data Points” to help understand & shape the future, one powered by data.
If you only have 30 seconds to spare, here are what I would consider the actionable insights for investors, data leaders, and data company founders.
Data Science Infrastructure is still a major bottleneck to data success. But it is getting really good! Companies today can seemingly make a serious difference in the success rate of their data science projects simply by providing good basic infrastructure for data science teams to use.
MetaFlow is still going strong. The data science infrastructure framework MetaFlow keeps adding cool features like the MetaCards idea, which now allows embedding the usual training reports and visuals inside the pipelines.
Business capabilities over domains in the data world. For some reason, in all data contexts, the idea of business capabilities seems to be more accessible than the idea of domains. If you want to work with DDD in the data world, it is worth taking a look at business capability mapping as an entry point.
What: Ville Tuulos managed a team at Netflix to create MetaFlow, a framework for getting data science & machine learning projects into production quickly. Now he’s close to finishing his book on the subject of data science infrastructure. It is currently in Manning’s MEAP program with 9 chapters ready. I think the gist of the book can be expressed best by a few quotes from the book. The main idea is …
“Technically, everything presented in this book has been possible to implement for decades, if time and cost were not of concern. However, for the past seven decades, nothing in this problem domain has been easy.”
The book uses the term data science as an umbrella term for data science, ML, and AI. The following quote describes why it is so important to think about data science infrastructure.
“Today, most data science applications can be supported by generalized infrastructure” as opposed to custom infrastructure or even a custom application.
One of Ville’s main claims in this direction is that infrastructure exists today that lets anyone build a bridge between the Possible and the Easy, over the “Valley of Complexity”. I know from personal experience that a lot of data science projects fall into that valley and never emerge again.
My perspective: I love that Ville decided to put the effort into writing about this topic. I’m just learning how much effort goes into a book, so I applaud everyone who chooses a subject that may seem “unsexy”. And FWIW I think everything with the word Infrastructure in it becomes unsexy, except Infrastructure as Code.
I remember and still see the struggles all around the community with data science projects that get stuck in exactly that phase. Data scientists search for proper architectures and try tool after tool, but none of them seems to solve just the right problem, and the result is a lot of unicorn solutions for just this one problem.
And I also see that there are powerful and relatively easy-to-use frameworks out there that, if applied consistently, can alleviate a lot of the pains of data scientists and help them deliver business value instead of prototypes.
The MEAP is already great, so I recommend that everyone take a look at it.
What: Metaflow can be summed up by this quote:
“Today, the world is a different place. You don’t need a PhD to develop a jaw-dropping computer vision demo or a robust model for predicting sales.”
MetaFlow is just that: a great framework for enabling everyone to do production-ready machine learning with all the tools you already know, such as scikit-learn, TensorFlow, and so on.
MetaCards are a cool new feature that lets you embed visual reports into the pipelines.
My perspective: This kind of feature can already be found in a few other solutions such as CML, where it is one of the core ideas. Since I really enjoy the ease of MetaFlow, this feature is a great addition to the package.
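To make the idea of a report that lives inside the pipeline a bit more tangible, here is a minimal sketch in plain Python. It is not MetaFlow's actual API: the `card` decorator, the `render_report` helper, the `train` step, and the loss values are all hypothetical and exist only for illustration. The point is merely that a step can append report components while it runs, and the pipeline stores the rendered report next to the step's result.

```python
import html
from functools import wraps


def render_report(title, components):
    """Render collected (kind, payload) components as a small HTML snippet."""
    parts = [f"<h1>{html.escape(title)}</h1>"]
    for kind, payload in components:
        if kind == "text":
            parts.append(f"<p>{html.escape(payload)}</p>")
        elif kind == "table":
            rows = "".join(
                f"<tr><td>{html.escape(str(key))}</td>"
                f"<td>{html.escape(str(value))}</td></tr>"
                for key, value in payload.items()
            )
            parts.append(f"<table>{rows}</table>")
    return "".join(parts)


def card(step_fn):
    """Hypothetical decorator (not MetaFlow's API): hands the step a list to
    append report components to and keeps the rendered report on the step."""
    @wraps(step_fn)
    def wrapper(*args, **kwargs):
        components = []
        result = step_fn(components, *args, **kwargs)
        wrapper.last_report = render_report(step_fn.__name__, components)
        return result

    wrapper.last_report = None
    return wrapper


@card
def train(report, learning_rate):
    # Stand-in for real training: the loss curve is made up for illustration.
    losses = [1.0 / (1 + epoch * learning_rate * 10) for epoch in range(5)]
    report.append(("text", f"Trained for {len(losses)} epochs"))
    report.append(("table", {"final_loss": round(losses[-1], 4),
                             "learning_rate": learning_rate}))
    return losses[-1]


final_loss = train(0.1)
report_html = train.last_report  # HTML report produced as part of the step
```

The design choice worth noticing is that the report is a by-product of running the step, not a separate notebook someone has to remember to update.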
What: Piethein Strengholt has been writing about data meshes for quite some time from his own perspective, which I enjoy. In this article, he tackles the ideas of domain-driven design and translates them to the data world.
My perspective: I like how Piethein focuses on business capabilities, something Jacek, my coauthor on the Data Mesh in Action book, is also very keen on. Business capabilities seem to be much easier to explain to non-software developers, and they might even work better in the data world. Since the two concepts, domains and business capabilities, are not mutually exclusive, I’d consider business capabilities to be a more easily accessible perspective on domains.
🎁 Notes from the ThDPTh community
I am always stunned by how many amazing data leaders, VCs, and data companies read this newsletter. Here I share some of the readers’ recent noteworthy pieces that were shared with me:
Nothing here this week.
🎄 Thanks => Feedback!
Thanks for reading this far! I’d also love it if you shared this newsletter with people whom you think might be interested in it.
Data will power every piece of our existence in the near future. I collect “Data Points” to help understand & shape this future.
If you want to support this, please share it on Twitter, LinkedIn, or Facebook.
And of course, leave feedback if you have a strong opinion about the newsletter! So?
P.S.: I share things that matter, not the most recent ones. I share books, research papers, and tools. I try to provide a simple way of understanding all these things. I tend to be opinionated. You can always hit the unsubscribe button!
Data; Business Intelligence; Machine Learning, Artificial Intelligence; Everything about what powers our future.