🐰 #26 Future of Data Engineering, Personalization & ML Observability; ThDPTh #26 🐰

Jul 01, 2021

How RudderStack sees the future of data engineering, different approaches to personalization in machine learning models, and what ML observability actually is.

Data will power every piece of our existence in the near future. I collect “Data Points” to help understand & shape this future.

If you want to support this, please share it on Twitter, LinkedIn, or Facebook.

(1)🔮 RudderStack, Future of Data Engineering

RudderStack highlights a few interesting points in a recent article. The first is the rise of C-level data executives, which is already happening.

The second is a shift towards data becoming important in every single development team, something the data mesh paradigm, and platform teams as a concept more generally, are already carrying forward.

They feel that moving data will become commoditized, and I agree, although I am still very worried about when we will get there. I feel like it’s still going to take some time because the important parts of that problem are not yet addressed, not by RudderStack and so far not by anyone else in a meaningful manner.

Finally, they mention real-time data, which I think will play a huge role in the future and which I’ve already written about. RudderStack’s team has noticed this too and integrated quite a bit of event data into their tool, but I’m not sure this is enough to move in the right direction.

So that’s the future of data engineering according to RudderStack. I don’t think the list is very comprehensive, but I do feel the points they make are sound. Still, because it’s not very comprehensive, it makes me worry a little bit about RudderStack’s general direction ;)

The Future of Data Engineering. In our previous post, “The Data… | by RudderStack | Jun, 2021 | Medium

The data engineering megatrend impacts companies across industries. Know the big changes to the field of data and for the role of data engineer.

rudderstack.medium.com

(2)🔥 Patterns for ML Personalization

I really like the depth Eugene Yan provides in this overview. But first, let’s back up a second:

“Personalization is the process of customizing each individual’s experience. It’s how an electronics geek gets different recommendations from a cooking hobbyist, and how they might get different results from the same search query (e.g., “Apple”)”

This is the problem, and Eugene provides a nice little summary at the end which I have to share with you:

When to use which? Here’s a rough heuristic:

- Want to continuously explore while minimizing regret? Bandits

- Starting with neural recsys and want something simple? Embeddings+MLP

- Have long-term user histories and sequences? Sequential

- Have sparse behavior data but lots of item/user metadata? Graphs

- Want generic embeddings for multiple problems? User models
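To make the first item of the heuristic a bit more tangible, here is a minimal sketch of an epsilon-greedy bandit. It is not taken from Eugene’s article; the class name, the `simulate` helper, and the click rates are hypothetical and only illustrate the explore/exploit trade-off behind “minimizing regret”.

```python
import random


class EpsilonGreedyBandit:
    """Epsilon-greedy bandit over a fixed set of items (arms)."""

    def __init__(self, n_arms, epsilon=0.1, seed=None):
        self.epsilon = epsilon            # share of traffic used for exploration
        self.counts = [0] * n_arms        # how often each arm was shown
        self.values = [0.0] * n_arms      # running mean reward per arm
        self.rng = random.Random(seed)

    def select_arm(self):
        # Explore: with probability epsilon pick a random arm.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.counts))
        # Exploit: otherwise pick the arm with the best estimate so far.
        return max(range(len(self.values)), key=self.values.__getitem__)

    def update(self, arm, reward):
        # Incremental update of the running mean reward.
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


def simulate(click_rates, steps=5000, epsilon=0.1, seed=0):
    """Hypothetical environment: each arm has a fixed click probability."""
    bandit = EpsilonGreedyBandit(len(click_rates), epsilon, seed)
    env = random.Random(seed + 1)
    for _ in range(steps):
        arm = bandit.select_arm()
        reward = 1.0 if env.random() < click_rates[arm] else 0.0
        bandit.update(arm, reward)
    return bandit


bandit = simulate([0.05, 0.2, 0.6])
print(bandit.counts)   # most of the traffic should end up on the last arm
```

A fixed epsilon keeps exploring forever, which is the “continuously explore” part; real systems often use more refined strategies, but the loop of select, observe, update stays the same.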

Now, if you want to know some of the details, just dive right into the article, which is written really well!

By the way, I also enjoyed the welcome series to Eugene’s newsletter, “How to be an effective data scientist”.

Patterns for Personalization in Recommendations and Search

A whirlwind tour of bandits, embedding+MLP, sequences, graph, and user embeddings.

eugeneyan.com

(3)📣 What is ML Observability?

Aparna Dhinakaran wrote a decent article describing ML observability that I’d like to share. I’ll recap a short bit. If we productionize ML systems, we will soon hit a few typical problems:

- Training/serving skew (production data is different from training data, and the system fails to work properly…)

- Changing data distribution (“oh it’s summer, people buy different stuff…”)

- Messy data (“oh, these are all the same articles! Don’t you match duplicates?”)

So where does observability come in to cure these problems? Well of course, it doesn’t cure any of them. ML observability means making these things transparent, which reduces the time to detection & to correction.
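As a rough illustration of what “making these things transparent” can look like for a changing data distribution, here is a minimal sketch of a Population Stability Index (PSI) check between training data and live data. This is one common drift metric, not necessarily what Aparna’s article proposes; the function name `psi`, the bin count, and the sample data are assumptions made for this example.

```python
import math
import random


def psi(expected, actual, bins=10):
    """Population Stability Index between a reference and a live sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0   # fall back to 1.0 if all values are equal

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the reference range go into the edge bins.
            counts[min(max(idx, 0), bins - 1)] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


rng = random.Random(0)
training = [rng.gauss(0, 1) for _ in range(5000)]
same_season = [rng.gauss(0, 1) for _ in range(5000)]
summer = [rng.gauss(1, 1) for _ in range(5000)]   # hypothetical shifted feature

print(psi(training, same_season))   # close to zero: no drift
print(psi(training, summer))        # clearly larger: the distribution moved
```

A widely used rule of thumb treats a PSI below 0.1 as stable and above 0.25 as a significant shift, but these thresholds are conventions, not guarantees. The check does not fix anything; it only shortens the time until someone notices.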

Then, Aparna describes an approach to achieving observability that I’m not sure I agree with completely, but I find the general idea behind it very useful. After all, monitoring & observability really are just “statistical process control”, which is exactly what a good agile production system in the physical world would implement.

In my mind, the problem should be tackled more from the data-as-code perspective first, with “statistical process control” layered on top, but again, I also find her approach useful.
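For readers who haven’t met “statistical process control” before, here is a minimal sketch of the classic idea: derive control limits from a stable baseline period and flag every later measurement that falls outside them. The metric (a daily model accuracy), the numbers, and the function names are made up for illustration.

```python
import statistics


def control_limits(baseline, sigmas=3.0):
    """Lower and upper control limits from a stable baseline period."""
    mean = statistics.fmean(baseline)
    std = statistics.stdev(baseline)
    return mean - sigmas * std, mean + sigmas * std


def out_of_control(values, limits):
    """Indices of measurements that fall outside the control limits."""
    lower, upper = limits
    return [i for i, v in enumerate(values) if not lower <= v <= upper]


# Hypothetical daily accuracy of a model during a stable period.
baseline = [0.91, 0.92, 0.90, 0.93, 0.91, 0.92, 0.90, 0.92]
limits = control_limits(baseline)

# New measurements; the third day drops sharply.
print(out_of_control([0.91, 0.92, 0.80, 0.93], limits))   # → [2]
```

The same pattern applies to any monitored quantity, such as the share of missing values in a feature or the number of rows in a daily batch.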

🎄 In Other News & Thanks

Thanks for reading this far! I’d also love it if you shared this newsletter with people whom you think might be interested in it.

P.S.: I share things that matter, not the most recent ones. I share books, research papers, and tools. I try to provide a simple way of understanding all these things. I tend to be opinionated. You can always hit the unsubscribe button!

By Sven Balnojan

Data; Business Intelligence; Machine Learning, Artificial Intelligence; Everything about what powers our future.


Three Data Point Thursday
