lakeFS, boundary-layer, SQLPad; ThDPTh #6
#6 Versioned Data Lakes, Declarative DAGs, and Shared SQL with SQLPad.
Data will power every piece of our existence in the near future. I collect “Data Points” to help understand this future.
If you want to support this, please share it on Twitter, LinkedIn, or Facebook.
The three data points for today are next-gen data lakes with lakeFS, declarative DAGs with boundary-layer, and fast data engineer onboarding with SQLPad.
1 lakeFS, versioning and branching data
lakeFS is a tool that provides a versioning layer on top of your AWS S3 or GCS data lake. It enables automatic versioning and branching of your data. The team also publishes lots of best practices, e.g. showing how to set up a data mesh using lakeFS. It’s open-source and evolving fast, so I suggest you take a look!
Start with the docs, which are really well written, then head over to the blog post about data quality, and finally take a look at how to use lakeFS with Apache Airflow.
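To give you a taste of the workflow, here’s roughly what branching and committing look like with the lakectl CLI (the repository and branch names are made up; check the docs for the exact flags):

```shell
# Create an isolated branch of the data lake to experiment on
lakectl branch create lakefs://my-repo/experiment --source lakefs://my-repo/main

# Write new data to the branch; it stays invisible on main until merged
lakectl fs upload lakefs://my-repo/experiment/events/2021-06-01.parquet --source ./events.parquet

# Commit the change with a message, much like in git
lakectl commit lakefs://my-repo/experiment -m "add June events"

# If the data passes your quality checks, merge it back into main
lakectl merge lakefs://my-repo/experiment lakefs://my-repo/main
```

The nice part is that a failed quality check simply means you never merge — main stays clean.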
2 Boundary-layer, declarative DAGs
DAGs, or directed acyclic graphs, have become the concept data scientists and data engineers alike use for their data pipelines. A data pipeline represented by a DAG usually contains both a “graph”, meaning the steps and the logic chaining them together, and the possibly complex transformation logic inside the steps.
This violates the “Single Level of Abstraction” principle and thus makes DAGs really hard to understand. With the “Composed Method” pattern, developers aim to keep all code within one unit on the same “level”. Since DAGs often contain two or more separate levels, this can be solved by extracting one from the other. Declarative DAG tools aim to do just that, and Etsy’s boundary-layer seems the most promising. It’s built for Apache Airflow and lets you declare the DAG in YAML, which then compiles down to an Apache Airflow DAG.
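To make that concrete, a boundary-layer workflow looks roughly like this (the operator names and fields here are illustrative; see Etsy’s repo for the exact schema):

```yaml
# dag.yaml -- compiled by boundary-layer into an Airflow DAG
name: my_example_dag
dag_args:
  schedule_interval: '@daily'
operators:
  - name: extract_events
    type: bash
    properties:
      bash_command: 'echo "extracting events"'
  - name: load_events
    type: bash
    upstream_dependencies:
      - extract_events
    properties:
      bash_command: 'echo "loading events"'
```

The YAML only wires steps together; the messy transformation logic lives elsewhere, which is exactly the separation of levels described above.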
Of course, the composed method can also be applied in an ordinary Python DAG using plain old Python. The benefit of declarative DAG tools is that they enforce this method, not that they are the only way to do it.
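Here’s a minimal sketch of what that looks like in plain Python: the transformation details sit in small named functions, so the pipeline itself reads at a single level of abstraction (the function names and sample data are made up):

```python
# Transformation logic lives in small, named functions ...
def extract_orders(raw_rows):
    """Keep only rows that represent completed orders."""
    return [row for row in raw_rows if row["status"] == "completed"]


def total_revenue(orders):
    """Sum up the order amounts."""
    return sum(order["amount"] for order in orders)


# ... so the pipeline reads as one level of abstraction, like a DAG should.
def run_pipeline(raw_rows):
    orders = extract_orders(raw_rows)
    return total_revenue(orders)


rows = [
    {"status": "completed", "amount": 10.0},
    {"status": "cancelled", "amount": 99.0},
    {"status": "completed", "amount": 5.5},
]
print(run_pipeline(rows))  # 15.5
```

In Airflow these would be separate tasks, but the structuring principle is the same.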
The ThoughtWorks technique: declarative data pipelines.
Etsy’s boundary-layer tool.
3 SQLPad, fast data engineer onboarding
I remember being onboarded as a data guy: getting some SQL editor, asking someone to tell me the connection strings I needed, getting to know the databases, and so on. When working on a ticket, I usually had to hack together completely new SQL.
But versioning and configuring connections and SQL is actually really easy! And having a nice-looking UI and being able to share credentials speeds up development quite a bit. I’ve been using SQLPad for querying and simple visualizations for quite some time and have enjoyed it.
You can use a combination of versioned and seed connections to have both a versioned, shared set of connections as well as a “custom set” for each developer if needed.
I simply like to use SQLPad as a local SQL editor run inside Docker, with versioned connections and queries that can be shared with the team. But you can of course also deploy SQLPad and put its data onto persistent storage.
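For reference, a minimal Docker setup might look like this (the connection is defined via SQLPad’s double-underscore environment variables; the names, credentials, and hosts are placeholders):

```yaml
# docker-compose.yml -- a local SQLPad with one versioned connection
version: '3'
services:
  sqlpad:
    image: sqlpad/sqlpad:latest
    ports:
      - '3000:3000'
    environment:
      SQLPAD_ADMIN: 'admin@example.com'
      SQLPAD_ADMIN_PASSWORD: 'change-me'
      # Connections defined here live in version control with the rest of the repo
      SQLPAD_CONNECTIONS__warehouse__name: 'Warehouse'
      SQLPAD_CONNECTIONS__warehouse__driver: 'postgres'
      SQLPAD_CONNECTIONS__warehouse__host: 'db.internal'
      SQLPAD_CONNECTIONS__warehouse__database: 'analytics'
    volumes:
      - ./sqlpad-data:/var/lib/sqlpad
```

A new data engineer runs `docker-compose up` and has every shared connection ready on day one — that’s the onboarding speed-up.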
In other news
I finished an article about trunk-based development and how it’s an amazing technique for data people.
P.S.: I share things that matter, not the most recent ones. I share books, research papers, and tools, and I try to provide a simple way of understanding all of them. I tend to be opinionated, though; if that’s not for you, you can always hit the unsubscribe button!
Data; Business Intelligence; Machine Learning, Artificial Intelligence; Everything about what powers our future.