External Data, Apache Iceberg, Data Version Control; Three Data Point Thursday #83
I’m Sven and I’m writing this to help you build excellent data companies, build great data-heavy products & become a high-performance data team.
Every other Thursday, I share my opinion on three pieces of content about the data world.
Shameless plugs: Check out Data Mesh in Action (co-author, book) and Build a Small Dockerized Data Mesh (author, liveProject in Python).
Let’s dive in!
If you’re not using external data as a core part of your business, you’re missing out.
External data is here to stay, and it is becoming more and more important to every business.
Every data team should adopt one form of data versioning.
Apache Iceberg is getting adopted by major cloud vendors much faster than other technologies.
Data companies like Dremio or Tabular, which bet their business models on it, will have to change.
🐰🐰🐰🐰🐰🐰🐰🐰🐰🐰🐰🐰🐰🐰🐰
(1) How to use external data for business success
What: Aaser and McElhaney of McKinsey make the case that companies should make more use of external data, using the COVID-19 pandemic as an example.
Commonly used successful external data sources:
Weather data
News, IP, legal
Public data
Web-harvested data, app data, reviews, ratings,...
Panel data
Business data (Revenue,...)
Geospatial and Satellite
My perspective: Richard Craib, founder of Numerai, a crypto-powered data science competition, used to tell a success story about external data.
He saw how big hedge funds and investment firms got an edge over the competition by pouring money into external data. They source social media mentions, they use satellite images to count the cars in the parking lots of businesses, they do everything possible to get an edge using data.
And it works. It works in other sectors as well; the COVID-19 pandemic and the war in Ukraine are just very obvious demonstrations of how external data could benefit every business.
The truth is, the world is becoming more and more complex, while the availability of external data is exploding.
These two forces together make the value of external data extremely potent, and more so every year. It’s a development that I feel has escaped almost every company, but it shouldn’t.
The companies that already rely on external data are becoming huge. Amazon literally integrates external data into its flywheel: its pricing mechanism scrapes the web (external data) and matches competitors’ prices, so it always offers the lowest price for a given product.
Netflix relies on external data to generate new content, Walmart uses data like weather data to redistribute goods across the US.
If you look closely, almost all of the big data-crunching companies heavily rely on external data as a core part of their business strategy. So the question is: why don’t you?
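Amazon’s actual pricing system is of course proprietary, but the idea is easy to sketch: treat scraped competitor prices as an external data feed and match the lowest one, down to a cost floor. All names and numbers below are illustrative, not Amazon’s method:

```python
# Toy sketch of external-data-driven price matching.
# competitor_prices would come from a web-scraping pipeline (external data);
# the function itself just picks the matching price, never below cost + margin.

def match_lowest_price(competitor_prices, our_cost, min_margin=0.05):
    """Match the lowest competitor price, but never drop below
    our cost plus a minimum margin."""
    floor = our_cost * (1 + min_margin)
    if not competitor_prices:
        return floor  # no external signal: fall back to cost-plus pricing
    lowest = min(competitor_prices)
    return max(lowest, floor)

# Example: competitors sell at $19.99, $18.49, $21.00; our cost is $15.00.
print(match_lowest_price([19.99, 18.49, 21.00], our_cost=15.0))  # 18.49
```

The interesting part isn’t the three lines of logic; it’s that the input only exists if you invest in sourcing external data in the first place.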
(2) Iceberg now runs on Google BigLake
What: After AWS, Google has also announced support for Apache Iceberg in BigLake.
My perspective: Apache Iceberg is an amazing project, yet another step in the convergence of databases and data lakes.
But what will happen to data companies whose business model is built on Apache Iceberg, like Dremio or Tabular?
Both of them essentially solve the problem of “how do I manage a data lakehouse?”. And both of them will need to step up their game and sharpen their value proposition, because that problem just got a whole lot more competition from the big players.
Apache Iceberg is an interesting open-source project to watch, as it has been adopted by major players like Google and AWS far faster than other tools, well before open-source-based start-ups could start to profit from it.
But then again, Dropbox also came out on top, even after Google Drive launched. Sometimes, big-company offerings just aren’t going to cut it. So the task for Dremio and Tabular is to figure out how to become “the Dropbox of the data lakehouse”.
Resource: https://cloud.google.com/blog/products/data-analytics/announcing-apache-iceberg-support-for-biglake
(3) What is data version control? A practice everyone needs to adopt.
What: Einat Orr, co-founder of Treeverse, the company behind lakeFS, shares her thoughts on data version control.
My perspective: The title is spot on. Data version control is a completely underrated practice every data engineer can and should adopt.
Enough said.
Resource: https://lakefs.io/blog/data-version-control/
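lakeFS itself works on top of object stores, and I won’t reproduce its API here. But the core idea of data version control — every committed state of a dataset gets an immutable, addressable ID you can always roll back to — fits in a few lines. All names below are illustrative, not the lakeFS API:

```python
# Minimal sketch of the idea behind data version control (tools like
# lakeFS or DVC implement this at scale on object storage): each commit
# is an immutable, content-addressed snapshot of the dataset.
import hashlib
import json


class DataRepo:
    def __init__(self):
        self.commits = {}                # commit id -> dataset snapshot
        self.branches = {"main": None}   # branch name -> latest commit id

    def commit(self, branch, dataset):
        """Store an immutable snapshot and point the branch at it."""
        snapshot = json.dumps(dataset, sort_keys=True)
        commit_id = hashlib.sha256(snapshot.encode()).hexdigest()[:12]
        self.commits[commit_id] = dataset
        self.branches[branch] = commit_id
        return commit_id

    def checkout(self, commit_id):
        """Retrieve any historical version of the data."""
        return self.commits[commit_id]


repo = DataRepo()
v1 = repo.commit("main", {"users": [1, 2, 3]})
repo.commit("main", {"users": [1, 2, 3, 4]})  # new version lands on main
print(repo.checkout(v1))  # the old version is still fully reproducible
```

That reproducibility — being able to pin a pipeline or an experiment to an exact version of the data — is why every data engineer should adopt some form of it.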
How was it?
Want to recommend this, or see a post shared publicly?
This newsletter has an open rate of roughly 50%, but each edition is shared on average 2.5 times by everyone who reads it!
So keep up the sharing, and I’m happy to add everyone who contacts me to the newsletter.
If you want to share one post in the open, just poke me and I’ll likely publish it on Medium as well so you can share it with the world.