3-Level data lakes, dbt snapshotting, Nbdocs; ThDPTh #74
I’m Sven, and this is the Three Data Point Thursday. The email that helps you understand and shape the one thing that will power the future: data. I’m also writing a book about the data mesh part of that.
Time to Read: 5 mins
Another week of data thoughts:
- Data lakes got 2 additional levels over the last 3-4 years. Use them!
- Snapshotting data is a great idea, but hard to do well.
- Nbdocs looks cool.
🐰🐰🐰🐰🐰🐰🐰🐰🐰🐰🐰🐰🐰🐰🐰
3-Levels of Data Lakes
What: For your data lake, “If you’re operating at the file level, you should consider leveling up to the table, then to the repository level”. That sums up this talk pretty well. It’s a talk from Data Council Austin by Paul Singman, DevRel at treeverse.io, the company behind the open-source framework lakeFS.
My perspective: It’s pretty interesting how much the data lake space has been transformed in the last couple of years. Basically, 4-5 years ago you were stuck with a level-0 data lake; now you’ve got two more levels to go. That’s pretty awesome. The levels Paul Singman sees and discusses are:
Level 0: Your basic “file-level” data lake. It consists of your files, like Parquet or CSV, inside object storage like AWS S3.
Level 1: Your “table-level”. It requires a table format like Apache Hudi, Iceberg, or Delta Lake on top of your object storage, and brings you lots of benefits for very few downsides.
Level 2: Your “repository level”. It brings you git-like operations such as branching, committing, and merging on top of technically either Level 0 or Level 1, although Paul puts it on top of Level 1. This allows for a lot of convenient operations and makes data much more reliable, because changes can be tried out in isolation before they reach the main branch.
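To make the three levels a bit more tangible, here is a toy sketch in Python. It is my own illustration, not how Hudi, Iceberg, Delta Lake, or lakeFS are implemented: the `ToyLake` class, its method names, and the file paths are all made up, and the merge is a plain fast-forward.

```python
from dataclasses import dataclass, field


@dataclass
class ToyLake:
    # Level 0: object storage is just a mapping from path to file contents.
    objects: dict = field(default_factory=dict)
    # Level 1: each table snapshot is an immutable tuple of the files it contains.
    snapshots: list = field(default_factory=lambda: [()])
    # Level 2: branches are named pointers to snapshot ids, similar to git refs.
    branches: dict = field(default_factory=lambda: {"main": 0})

    def commit(self, branch, new_files):
        """Write files, then publish them together as a new snapshot on `branch`."""
        self.objects.update(new_files)
        parent = self.snapshots[self.branches[branch]]
        self.snapshots.append(parent + tuple(new_files))
        self.branches[branch] = len(self.snapshots) - 1
        return self.branches[branch]

    def create_branch(self, name, source):
        """Branching copies a pointer, not the data."""
        self.branches[name] = self.branches[source]

    def merge(self, source, target):
        """Toy fast-forward merge: point `target` at the snapshot of `source`."""
        self.branches[target] = self.branches[source]

    def files(self, branch):
        """Return the files visible on `branch`."""
        return self.snapshots[self.branches[branch]]


lake = ToyLake()
lake.commit("main", {"orders/part-0.parquet": b"..."})
lake.create_branch("experiment", "main")
lake.commit("experiment", {"orders/part-1.parquet": b"..."})
main_before_merge = lake.files("main")
lake.merge("experiment", "main")
main_after_merge = lake.files("main")
```

The point of the sketch: readers of `main` never see the new file until the merge, even though it already sits in object storage.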
Resource:
Properly doing dbt snapshots
What: This is a short guide on working with dbt snapshots. Snapshotting means saving potentially mutable data as immutable timestamped snapshots inside the database. In particular, the guide focuses on working with multiple environments where you’re best off saving snapshots not inside the “snapshotting database” but externally.
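If the idea of snapshotting is new to you, here is a minimal sketch of it in plain Python. It is not dbt's implementation: the `snapshot` function and the `valid_from`/`valid_to` field names are my own, chosen to mirror the general "slowly changing dimension" idea, and deleted rows are ignored for brevity.

```python
from datetime import datetime


def snapshot(history, current_rows, now):
    """Record the current state of a mutable table as timestamped versions.

    history: list of dicts with keys id, status, valid_from, valid_to
    current_rows: dict mapping id -> status (the mutable source table)
    """
    open_rows = {r["id"]: r for r in history if r["valid_to"] is None}
    for row_id, status in current_rows.items():
        existing = open_rows.get(row_id)
        if existing is not None and existing["status"] == status:
            continue  # unchanged since the last snapshot, nothing to record
        if existing is not None:
            existing["valid_to"] = now  # close the old version
        history.append(
            {"id": row_id, "status": status, "valid_from": now, "valid_to": None}
        )
    return history


history = []
snapshot(history, {1: "pending"}, datetime(2022, 1, 1))
snapshot(history, {1: "shipped"}, datetime(2022, 1, 2))  # status changed
snapshot(history, {1: "shipped"}, datetime(2022, 1, 3))  # no change
```

After these three runs, `history` holds two versions of order 1. The "pending" version cannot be recomputed from the source table anymore, which is exactly why the snapshot data needs to be treated as precious.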
My perspective: I really like how the Montreal Analytics folks are digging deep into what it takes to use dbt in a high-performance team. The solution they use is very in line with my perspective:
If you snapshot data, you’re creating new data. That newly created data should be backed up and saved just like your production database or your codebase.
However, your database in itself should be easy to recompute, easy to shut down, and launch/fill again. So these two things should really be separated.
That line of thought is very much in line with the functional data engineering approach. If you use a functional approach, your task is to separate parts that are easy to recompute from the ones that are not. That might mean:
To have an immutable staging area with multiple layers of backup that restores super fast
and a mutable transformation/model/metric layer sitting on top of this, which is super easy to recompute based on the immutable staging area.
IMHO, computing & storage are cheap and becoming cheaper by the day. So for me, the consequence is to not sweat recomputations. They will make your life much simpler and your systems more robust.
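A tiny sketch of that separation, under my own assumptions: `STAGING` stands in for the immutable, backed-up staging area, and `build_daily_revenue` is a made-up model that is a pure function of it. Neither name comes from the guide.

```python
from types import MappingProxyType

# Immutable staging area: read-only mapping of day -> (order_id, amount) tuples.
STAGING = MappingProxyType(
    {
        "2022-03-01": (("o1", 20), ("o2", 5)),
        "2022-03-02": (("o3", 12),),
    }
)


def build_daily_revenue(staging):
    """Pure function: the same staging input always gives the same output."""
    return {
        day: sum(amount for _, amount in orders)
        for day, orders in staging.items()
    }


# The model layer is disposable: throw it away and rebuild it at any time.
model_layer = build_daily_revenue(STAGING)
model_layer.clear()  # simulate dropping the transformation layer
rebuilt = build_daily_revenue(STAGING)
```

Because the model layer depends only on the staging data, losing it costs you a recomputation and nothing else.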
Resource: https://blog.montrealanalytics.com/using-dbt-snapshots-with-dev-prod-environments-e5ed63b2c343
NBdocs by Outerbounds
What: A framework for writing “technical” documentation inside a notebook, with support for testing the examples in the docs as well.
My perspective: The data-X stack, the set of tools data people use today, is maybe 10% as complete & productive as the one software engineers use. So I like every single effort to make data workers more productive by giving them better tools. I also like the approach of understanding that data work simply is different from code work (which is what IDEs are optimized for).
Resource: https://github.com/outerbounds/nbdoc
Thanks for reading!
What did you think of this edition?
-🐰🐰🐰🐰🐰 I love it, will forward!
-🐰 It is terrible ( = I just made it to this link b.c. I was looking for the unsubscribe button)
Want to recommend this or have this post public?
This newsletter isn’t a secret society, it’s just in private mode… You may still recommend & forward it to others. Just send me their e-mail, ping me on Twitter/LinkedIn and I’ll add them to the list.
If you really want to share one post in the open, again, just poke me and I’ll likely publish it on Medium as well so you can share it with the world.