Making Data; Three Data Point Thursday #87
I’m Sven and I’m writing this to help you (1) build excellent data companies, (2) build great data-heavy products, (3) become a high-performance data team & (4) build great things with open source.
What I’ve got for you today:
No one is “making data”. You should.
Don’t blindly adopt BSL licenses.
Fivetran thinks databases still can’t do computing.
ThDPTh has been growing fast since I opened it up again, so welcome to 20% new readers since last month!
Congrats, you just joined a smart bunch of data leaders, VCs, data company founders, and curious data people interested in building the future of data.
(Pro data tip: if absolute numbers are tiny, share relative ones; they look much better ;))
Share ThDPTh
I just created the “20 Point Questionnaire To Assess The Strength Of Your Data Startup Idea”.
Just recommend the ThDPTh and respond to this email with “SHARE: [link to your recommendation]” and you’ll receive this cool giveaway.
🐰🐰🐰🐰🐰🐰🐰🐰🐰🐰🐰🐰🐰🐰🐰
(1) Who is making data?
What: Cassie explains her point in one quote:
“My favorite way of explaining the difference between data science and data engineering is this:
If data science is “making data useful,” then data engineering is “making data usable.”
These disciplines are so exciting that it’s easy to get ahead of ourselves and forget that before we can make data usable (let alone useful), we need to make data in the first place.
But what about “making data” in the first place?
The art of making good data is terribly neglected.”
Cassie argues that data is like the ingredients of a great dish, yet we do nothing to celebrate making it.
My perspective: Data-centric AI is all about “making data”, but so far it is a niche field, limited to the ML and AI realm.
I really like to remind people of the story of Capital One (not called that back then), which spent years investing in “making data” and in turn generated $$$ billions.
This is to say, I’m 100% on Cassie's side here and would like to point out that the business value of focusing on “making data” is completely underrated.
(2) Cockroach on its License
What: Cockroach Labs, the company behind CockroachDB, changed its licensing model in 2019. I think this article is a great read on it.
“But our past outlook on the right business model relied on a crucial norm in the OSS world: that companies could build a business around a strong open source core product without a much larger technology platform company coming along and offering the same product as a service. That norm no longer holds.”
Their new (and current) model works with a time restriction like this:
“Our BSL protects CockroachDB’s current code from being used as a DBaaS without an enterprise license for a period of three years. After 3 years this restriction lapses and the code becomes open source (per our current Apache license) and is free to use for any purpose.”
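The time restriction in that quote can be sketched as a small date calculation. This is only an illustration under assumptions: I assume the three-year clock starts at a given release date, and the function names and the example date are hypothetical, not anything taken from the license text.

```python
from datetime import date

# Taken from the quoted "period of three years".
RESTRICTION_YEARS = 3


def restriction_lapse_date(release_date: date) -> date:
    """Date on which the DBaaS restriction would lapse (assumed: release date + 3 years)."""
    target_year = release_date.year + RESTRICTION_YEARS
    try:
        return release_date.replace(year=target_year)
    except ValueError:
        # Feb 29 has no counterpart in a non-leap year; fall back to Feb 28.
        return release_date.replace(year=target_year, day=28)


def is_dbaas_restricted(release_date: date, today: date) -> bool:
    """True while the restriction still applies to code released on release_date."""
    return today < restriction_lapse_date(release_date)


# Hypothetical release date, for illustration only.
example_release = date(2019, 6, 4)
print(restriction_lapse_date(example_release))
print(is_dbaas_restricted(example_release, date(2021, 1, 1)))
```

The point of the sketch: the protection is a moving window over recent code, not a permanent wall around the product.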
My perspective: The Cockroach approach, a BSL with a time lapse, is viable. But I try to warn people of two things:
Licenses are just that, a piece of paper: They are not business reality and they don’t offer “complete protection”. They are just one of many tools. Think about the “Elastic war”.
There is more than one answer: The “norm” described above doesn’t hold, but this answer is one from a defensive position. Other companies are able to come up with an answer that comes from a position of strength, one that enables them to thrive when other companies start to offer the same product as a service (e.g. Automattic).
So, if you feel like your open-source-based business is in danger of being “eaten up by others”, then ask yourself:
Am I able to do serious product pivoting to make my product stronger when others offer it as a service? (Yes, that means the product you’re “selling” becomes something different, not the thing others offer.)
If I do adopt BSL (or similar licenses), what business moves could other big fish make to render my BSL ineffective?
What, then, is the best course of action to grow my business (not “protect it”)?
Resource: https://www.cockroachlabs.com/blog/oss-relicensing-cockroachdb/
(3) Fivetran tries to travel back in time and fails
What: Fivetran’s VP of Product Marketing Catlyn Origitano explains why Fivetran normalizes your data inside their compute facilities and why you should want this from every ELT tool.
“Fivetran is the exception. We normalize your data within our own virtual private cloud (VPC), so you’ll never have to worry about data ingestion processes devouring your warehousing compute bill. We’ve made that decision specifically to support the most efficient data stack possible — and save you costs in the process.”
My perspective: Cat's argument is wrong and misleading. To be clear: you always want a compute-optimized engine to handle your computing. That’s how you save the most money and get the best value: by using a tool that does the task you want at a large scale. It’s called “economies of scale”.
Every single modern data store comes paired with such an optimized compute engine, and you need it anyway for all your transformations. So there is only one place you want to do all your computing: on top of your data store. That includes “normalization”.
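To make “normalization on top of your data store” concrete, here is a minimal sketch. It uses an in-memory SQLite database as a stand-in for a warehouse. The table names, columns, and rows are all hypothetical, and real ELT normalization is of course more involved. The idea is only that the raw data lands as-is, and plain SQL inside the store splits it into normalized tables you can inspect and test.

```python
import sqlite3

# Hypothetical raw rows, as an ingestion tool might land them: one wide row per order.
RAW_ROWS = [
    (1, "ada@example.com", "Ada", 10.0),
    (2, "ada@example.com", "Ada", 25.0),
    (3, "bob@example.com", "Bob", 7.5),
]


def normalize_in_store(conn: sqlite3.Connection) -> None:
    """Split the wide raw table into customers and orders using SQL in the store itself."""
    conn.executescript(
        """
        CREATE TABLE customers AS
        SELECT DISTINCT customer_email AS email, customer_name AS name
        FROM raw_orders;

        CREATE TABLE orders AS
        SELECT order_id, customer_email AS email, amount
        FROM raw_orders;
        """
    )


conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse
conn.execute(
    "CREATE TABLE raw_orders ("
    "order_id INTEGER, customer_email TEXT, customer_name TEXT, amount REAL)"
)
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?)", RAW_ROWS)
normalize_in_store(conn)

# Because the step runs in the store, it can be tested there too.
customer_count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
orphan_count = conn.execute(
    "SELECT COUNT(*) FROM orders o "
    "LEFT JOIN customers c ON o.email = c.email "
    "WHERE c.email IS NULL"
).fetchone()[0]
print(customer_count, order_count, orphan_count)
```

Since the raw table stays in place next to the normalized ones, you keep the lineage and can rerun or change the step whenever the structure needs to evolve.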
Fivetran itself might profit from economies of scale here and cut down the margin (my assumption) to try to sell this idea. In the end, you will probably end up saving 10-20% on your “normalization” costs, something I hope no data engineer is truly concerned about.
The real problem happens afterward: Once you hand “normalization” off to someone else, you start to lose lineage, testability, and flexibility in your data structures. That loss is way more expensive than the tiny cost savings that come from doing “normalization” inside the Fivetran structures.
You also lose the ability to switch to a better (and/or cheaper) computing engine, and computing is already well commoditized.
I remember Drew Banin, who said in a recent “Analytics Everywhere” episode: “trust the data warehouse!”
It does a great job at these things. It’s been built to do just this stuff. It’s not 2001 anymore. Back then we did ETL because databases couldn’t do compute. Today it’s the opposite: data stores are optimized to do exactly that at petabyte scale.
Trust the data warehouse, and don’t try to travel back in time.
…
Now I feel bad for bashing Fivetran, so here is a constructive spin on it. I brainstormed a few things that would make sense in the cloud (and would require Fivetran to build a better product):
Reduce the margin on “normalization” to close to zero => this would produce a vast inflow of “normalization workloads” and thus truly push computing prices down for Fivetran.
Go beyond “normalization” and enrich the data, making it more useful.
Do PII data detection & flagging.
Leverage the huge amount of normalization data for, e.g., outlier detection.
Go beyond just normalization (who cares about that anyway?) and join normalized Shopify + e.g. Salesforce data to produce outputs.
…
Resource: https://www.fivetran.com/blog/how-the-fivetran-approach-to-data-normalization-cuts-compute-costs
New articles by me
Shameless plugs of things by me
Check out Data Mesh in Action (co-author, book)
and Build a Small Dockerized Data Mesh (author, liveProject in Python).
And find me on Medium for more unique content.
I truly believe that you can take a lot of shortcuts by reading pieces from people with real experience who are able to condense their wisdom into words.
And that’s what I’m collecting here, little pieces of wisdom from other smart people.
You’re welcome to email me with questions or raise issues I should discuss. If you know a great topic, let me know about it.
If you feel like this might be worthwhile to someone else, go ahead and pass it along. Finding good reads is always a challenge, and they will appreciate it.
Until next week,
Sven