📰 Tabular Icebergs, Firebolt & Data Meshes; ThDPTh #36 📰
New tabular data company on the horizon! What the future holds for Firebolt and how to build a Kafka-based data mesh.
Data will power every piece of our existence in the near future. I collect "Data Points" to help understand & shape this future.
If you want to support this, please share it on Twitter, LinkedIn, or Facebook.
(1) Iceberg & Tabular
Tabular just announced its Series A, led by a16z. The company is looking to capitalize on Apache Iceberg. Iceberg is an interesting project in itself: basically an analytical table format (with an engine layer in between) that makes working with huge analytical tables easy.
Delta Lake and Apache Hudi are trying to do something similar, but currently it's Iceberg that's used at Netflix, Uber, and the like.
The company Tabular is, as a16z's Martin Casado describes it, trying to become a "headless database": an abstraction layer that allows data developers to focus on creating business logic, not infrastructure magic.
I can really sympathize with that idea. What's most interesting, I think, is that Tabular seems to be headed for competition with Databricks and the like. So it will be really interesting to see how things play out and whether Tabular can finally make Iceberg accessible to companies without a team of 20 infrastructure data engineers.
Investing in Tabular - Andreessen Horowitz
Data systems have long involved a tradeoff between flexibility and ease of use. Cloud data warehouses are well-integrated systems…
(2) Firebolt
I applaud every effort to create new technologies in the data space; I think much more innovation has to happen here, and that we're basically in the stone age of data.
Firebolt is a very interesting project. I spent some time digging through all the material provided by Firebolt, Snowflake, and Redshift. Firebolt basically claims to be a lot faster than Snowflake.
…
Here's my short & very simplified perspective on the Redshift-Snowflake-Firebolt trio:
The short version: Postgres, Redshift, Snowflake & Firebolt mainly differentiate themselves by focusing on different questions. Each question emerged after the previous one had been "solved". But nothing is actually stopping Redshift or Snowflake from solving Firebolt's question as well. Indeed, as far as I can tell, Snowflake has 99% of the technology Firebolt is currently using in place, with one difference: the lack of "native nested array storage".
The Redshift insight: Redshift realized, in my words, that analytical data was becoming a thing, so read speed is essential; thus columnar storage and query result caches for databases were born.
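To make the columnar point concrete, here's a toy sketch of my own (nothing Redshift-specific): an analytical query that touches one field only has to read one compact column instead of scanning every full record.

```python
# Toy illustration (my own, not Redshift internals): why columnar layout
# helps analytical reads. The query only needs 'price', so a column store
# reads one contiguous array instead of every full record.
rows = [{"id": i, "price": i * 2, "note": "x" * 100} for i in range(100_000)]

# Row layout: scanning drags every record, bulky 'note' payload included,
# through memory/disk just to get at one field.
total_from_rows = sum(r["price"] for r in rows)

# Column layout: the 'price' values are stored together at write time,
# so the same aggregation touches only the bytes it actually needs.
price_column = [r["price"] for r in rows]  # built once, on write
total_from_column = sum(price_column)

assert total_from_rows == total_from_column
```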
But that wasn't enough for Snowflake; they realized analytics workloads need more than the traditional model…
The Snowflake insight: Compute & storage should scale independently, because for analytical workloads, for the important stuff, we simply want to be able to throw money at the problem and make it faster, no matter the amount of data.
The Firebolt founders in turn realized that now that we scale compute & storage individually, now that the data isn't stored where the computation happens, something else becomes key to analytical workloads…
The Firebolt insight: With the cloud and the separation of compute & storage, the key problem is to reduce the amount of data moving between distributed storage and the compute instances.
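As a minimal, hypothetical sketch of what "reducing data movement" can look like (my own illustration, not Firebolt's code): keep cheap min/max metadata about each remote chunk locally, and fetch only the chunks that can possibly match a filter.

```python
# Hypothetical sketch (not Firebolt's actual code): sparse min/max
# metadata ("zone maps") kept locally lets the engine skip remote
# chunks entirely, so their bytes never cross the network.
from dataclasses import dataclass

@dataclass
class Segment:
    """A chunk of a column in remote storage, plus locally held metadata."""
    path: str     # location in distributed storage (names invented)
    min_val: int  # smallest value of the indexed column in this chunk
    max_val: int  # largest value of the indexed column in this chunk

def segments_to_fetch(segments: list[Segment], lo: int, hi: int) -> list[str]:
    """Return only the paths whose value range can overlap [lo, hi];
    everything else is pruned using metadata alone."""
    return [s.path for s in segments if s.max_val >= lo and s.min_val <= hi]

# A query filtering on, say, order_id BETWEEN 150 AND 200:
segments = [
    Segment("s3://bucket/part-0", 0, 99),
    Segment("s3://bucket/part-1", 100, 199),  # overlaps -> must fetch
    Segment("s3://bucket/part-2", 200, 299),  # overlaps -> must fetch
    Segment("s3://bucket/part-3", 300, 399),
]
print(segments_to_fetch(segments, 150, 200))
# -> ['s3://bucket/part-1', 's3://bucket/part-2']
```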
Firebolt's idea is important: data and computing ability are both growing exponentially, and they will likely grow in parallel, so this problem isn't going away. So what is the key point here? If I submit a computation that needs data, the crucial step now is to determine which data is needed. That's of course not completely obvious until we've completed the computation; hence the dilemma of compute & storage separation: something has to travel through the network. Firebolt does a good job of reducing that with three main things:
Their own file system (F3) and extensive use of indices (which basically tell us which data lives where; the sketch above illustrates the idea)
Their support for nested JSON. Basically, if you have a nested document {A:1; B:{X:Y}}, what they do is store {X:Y} in a separate table, making it much easier to use indices again (see the toy sketch after this list).
Lots and lots of query optimizations. Why is this so important? Because this is the step that actually makes sure less data is transmitted over the wire!
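Here's a toy sketch of the nested-JSON idea from point 2 (my own illustration, not Firebolt's implementation): pull the nested object out into its own table, keyed by the parent's id, so both tables stay flat and index-friendly.

```python
# Hypothetical sketch: flattening a nested document into a parent table
# plus a separate child table, so both stay flat and indexable.
doc = {"A": 1, "B": {"X": "Y"}}

main_table = []    # flat rows of the parent documents
nested_table = []  # flattened rows of nested objects, joinable by id

def flatten(doc_id: int, document: dict) -> None:
    """Split one document into a flat parent row plus child rows."""
    parent_row = {"id": doc_id}
    for key, value in document.items():
        if isinstance(value, dict):
            # Nested object -> its own table, linked by the parent id.
            for k, v in value.items():
                nested_table.append({"parent_id": doc_id, "key": k, "value": v})
        else:
            parent_row[key] = value
    main_table.append(parent_row)

flatten(1, doc)
print(main_table)    # [{'id': 1, 'A': 1}]
print(nested_table)  # [{'parent_id': 1, 'key': 'X', 'value': 'Y'}]
```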
And that's it. My key takeaway after going through all the marketing material of both Firebolt & Snowflake is that these are the only three unique things Firebolt has going. And (1) actually isn't unique: Snowflake seems to employ a very similar strategy, especially with regard to the sizes of partitions.
Summary: So where does that leave us? In my opinion, two things will happen. First, Snowflake might catch up to Firebolt in the speed comparisons; from the publicly available material, it seems they are only missing the nested JSON support and maybe some query optimization (focused on retrieving less data). Second, I don't see any open-source innovation in this space, which is otherwise so prone to exactly that. So I'm betting on an open-source analytical database that is able to take on both Firebolt & Snowflake emerging sometime soon.
Why we invested in Firebolt: Snowflake ...
For many years, Snowflake has been the byword for the entire cloud data warehouse market…
medium.com
🔮 (3) A Kafka-Based Data Mesh at Gloo.us
The idea of Kafka-based data meshes is quite appealing. Jacek Majchrzak already described such a design in 2019, and the company Gloo.us followed up with their own version of a data mesh.
Here's a little summary of their journey. Like most companies, they first focused on evangelizing the data product idea and decentralizing data ownership. This step is definitely the most important one in any data mesh, as it's the only thing that stops people from treating data as a by-product.
Their data mesh features a schema registry, which is really important as it is the means to track versions of data products. Most data products are Kafka topics, although some are supplied via REST APIs or other channels, and as far as I understand, those are still tracked in the schema registry.
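As a rough illustration of how a schema registry tracks data product versions, here's a hedged Python sketch against Confluent's Schema Registry client; the subject name and schema are invented, and this is not Gloo's actual setup.

```python
# Hypothetical sketch: registering a data product's schema so versions
# can be tracked centrally. Subject name and schema are made up.
from confluent_kafka.schema_registry import Schema, SchemaRegistryClient

client = SchemaRegistryClient({"url": "http://localhost:8081"})

avro_schema = Schema(
    schema_str="""
    {
      "type": "record",
      "name": "UserSignedUp",
      "fields": [
        {"name": "id", "type": "string"},
        {"name": "score", "type": "int", "default": 0}
      ]
    }
    """,
    schema_type="AVRO",
)

# Re-registering a changed schema under the same subject creates a new
# version -- this is what makes data product versions trackable.
schema_id = client.register_schema("user-signed-up-value", avro_schema)
print(schema_id)
```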
Currently, they use KSQL streams as a kind of consumer-side transform of existing data streams. Finally, they also outline how a derived data product is created and then fed back into Kafka.
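To sketch what such a derived, re-fed data product could look like in plain Python (a hypothetical illustration; Gloo uses KSQL for this step, and all topic names here are invented):

```python
# Hypothetical sketch of a consumer-side transform that derives a new
# data product from an existing Kafka topic and feeds it back into Kafka.
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "derived-product-builder",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})

consumer.subscribe(["source-data-product"])  # an upstream team's topic

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None or msg.error():
            continue
        event = json.loads(msg.value())
        # Consumer-side transform: keep only the fields this derived
        # product promises in its (registry-tracked) schema.
        derived = {"id": event["id"], "score": event.get("score", 0)}
        producer.produce("derived-data-product", json.dumps(derived).encode())
except KeyboardInterrupt:
    pass
finally:
    producer.flush()
    consumer.close()
```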
If you're trying to build something similar, I suggest you take a look at the article; I enjoyed it.
Catching Data In a Data Mesh: Applied (Part II) | by Trey Hicks...
At Gloo.us, we're building technologies that connect individuals looking to start a growth journey…
medium.com
In Other News & Thanks
Thanks for reading this far! I'd also love it if you shared this newsletter with people who you think might be interested in it.
P.S.: I share things that matter, not the most recent ones. I share books, research papers, and tools. I try to provide a simple way of understanding all these things. I tend to be opinionated. You can always hit the unsubscribe button!
Data; Business Intelligence; Machine Learning; Artificial Intelligence; everything about what powers our future.