Iceberg, Photobox, Data Versioning; ThDPTh #59

A couple of weeks ago, I hit an iceberg.
A tabular Iceberg. Just when I was trying out their beta, it appeared on the horizon. Two hours later, I crashed into it.
Sadly, I didn’t get what I wanted to work. So it’s all the better that they now provide a simpler demo setup for Iceberg, so you don’t have to experience the same…
I’m Sven, I collect “Data Points” to help understand & shape the future, one powered by data.
Sven’s Thoughts
If you only have 30 seconds to spare, here is what I would consider actionable insights for investors, data leaders, and data company founders.
Iceberg & Tabular are on the “trial radar”. I enjoy both the Apache Iceberg project and what Tabular is currently working on. Both make working with data simply easier for everyone.
Data things are complex, don’t let people crash into them. Iceberg is currently mostly used by big corps (I assume, correct me if I am wrong!), mostly because the first versions aimed at huge amounts of data. However, Iceberg has the potential to be a simple abstraction layer that any company could use. For something this complex, it’s essential to make it approachable and easy to use. And right now, Iceberg is notoriously hard to use for anyone except the Sparkyspark experts out there.
Photobox unboxes an interesting Data Platform. Photobox shows that it’s pretty easy to build a modern data platform using events and AWS-native technology.
Data Versioning is still a thing. And it’s still completely underrated, a niche topic one could say. I am glad to see companies like treeverse.io taking the initiative to push that topic further, both with content and with products.
Docker, Spark, and Iceberg: The Fastest Way to Try Iceberg!
What: It’s a blog post from the company tabular.io, which is building around the Apache Iceberg project to create a seamless experience for Iceberg. It shows how to quickly set up a local version of Iceberg with a notebook and Spark, and play around with the most basic things.
My perspective: FWIW, waaaah, finally guys. I almost broke my finger trying to get that to run myself, even using your cool GDocs documentation for the Tabular version of this ;)
Iceberg is complex and hard to understand, so I am really excited about now having something I can show to people which they can get running in a couple of minutes and then understand the benefits of Iceberg themselves.
Iceberg as a table format is an important step toward making the difference between data lakes and data warehouses disappear. It is also a great tool for solving a lot of challenges with data lakes out-of-the-box.
If you have a data lake inside your company, I really recommend checking out this mini-demo and seeing whether this could benefit you.
What: Stefano Solimito from Photobox writes about the new data platform they rolled out for the company. He shares the typical past problems, and how they settled on a new standardized architecture using events and the CloudEvents specification. Everything is built on top of AWS using a bunch of AWS-native technologies.
My perspective: I always enjoy practical articles that are full of real problems and experience from the data space. This is one of them. The team has built a really simple, and thus likely very evolutionary, architecture.
At the front, they have an API Gateway (AWS native), which takes either single events or batches of events. These are dumped into one Kinesis stream and picked up by Kinesis Firehose for further processing. They have an integrated PII-hashing service which I particularly like. They are also putting a lot of effort into pushing responsibility onto the data producers.
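To make the ingestion idea a bit more concrete, here is a minimal Python sketch of what such a front door might do: accept a single event or a batch, wrap each payload in a CloudEvents-style envelope, and hash PII fields before anything is written to a stream. Everything specific in it is an assumption for illustration and not taken from the Photobox article: the field names in PII_FIELDS, the HASH_KEY, the event type and source strings, and the choice of HMAC-SHA256 for hashing. Only the envelope attributes (specversion, id, source, type, plus the optional time and datacontenttype) come from the CloudEvents 1.0 specification.

```python
import hashlib
import hmac
import json
import uuid
from datetime import datetime, timezone

# Hypothetical choices for illustration; not taken from the Photobox article.
PII_FIELDS = {"email", "phone"}
HASH_KEY = b"replace-with-a-secret-key"


def hash_pii(payload: dict) -> dict:
    """Return a copy of payload with PII field values replaced by keyed hashes."""
    hashed = {}
    for key, value in payload.items():
        if key in PII_FIELDS and value is not None:
            digest = hmac.new(HASH_KEY, str(value).encode("utf-8"), hashlib.sha256)
            hashed[key] = digest.hexdigest()
        else:
            hashed[key] = value
    return hashed


def to_cloud_event(event_type: str, source: str, payload: dict) -> dict:
    """Wrap a payload in the required CloudEvents 1.0 attributes plus data."""
    return {
        "specversion": "1.0",
        "id": str(uuid.uuid4()),
        "source": source,
        "type": event_type,
        "time": datetime.now(timezone.utc).isoformat(),
        "datacontenttype": "application/json",
        "data": hash_pii(payload),
    }


def normalize(body) -> list:
    """Accept either a single event payload (dict) or a batch (list of dicts)."""
    return body if isinstance(body, list) else [body]


if __name__ == "__main__":
    batch = normalize({"order_id": 42, "email": "jane@example.com"})
    events = [
        to_cloud_event("com.example.order.created", "/shop/checkout", p)
        for p in batch
    ]
    print(json.dumps(events, indent=2))
```

A keyed hash keeps the same input mapping to the same output, so hashed values can still be joined on downstream without exposing the raw value. Whether the real platform works this way is not something the article summary above confirms.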
All in all, it’s a very interesting read, well worth checking out. In particular, the focus on AWS-native technologies, CloudEvents, and the data producers is worth noting.
What: Paul Singman, the dev rel person at treeverse.io, wrote an article about data versioning. He includes examples, best practices, and a lot of good stuff.
My perspective: Underrated.
🎄 Thanks => Feedback!
Thanks for reading this far! I’d also love it if you shared this newsletter with people whom you think might be interested in it.
Data will power every piece of our existence in the near future. I collect “Data Points” to help understand & shape this future.
If you want to support this, please share it on Twitter, LinkedIn, or Facebook.
And of course, leave feedback if you have a strong opinion about the newsletter!
P.S.: I share things that matter, not the most recent ones. I share books, research papers, and tools. I try to provide a simple way of understanding all these things. I tend to be opinionated. You can always hit the unsubscribe button!