💥 Functional Data & ML Engineering, Evolutionary Data Architectures; ThDPTh #13 💥

Apr 01, 2021

Why functional data engineering is the right approach to batch ETL, how machine learning can use a functional approach as well, and how to build evolutionary data architectures.

Data will power every piece of our existence in the near future. I collect “Data Points” to help understand & shape this future.

If you want to support this, please share it on Twitter, LinkedIn, or Facebook.

🎄 (1) Functional Data Engineering

Two years ago, Maxime Beauchemin, the creator of both Apache Airflow and Superset, published an article about why the functional paradigm is as important in data engineering as it is in software engineering. I very much agree, and I feel this idea is still not completely absorbed by the community. Indeed, I think it carries over to machine learning just as well, a field where truly functional programming is usually not practiced.

Functional programming avoids state and mutable data. Good functions are self-contained “wrappers” whose output depends only on their input, which makes them testable, unlike a lot of what happens in data pipelines. I particularly agree with the following quote:

“Thinking of partitions as immutable blocks of data and systematically overwriting partitions is the way to make your tasks functional. A pure task should always fully overwrite a partition as its output.”

In short: always delete and rewrite everything. That way you get reproducible functions inside your data pipelines. If that takes too long, don’t fall back to “updates & inserts”; use partitions and keep overwriting them.
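To make the “always overwrite the partition” idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the table name `payments`, the `ds=` directory layout, the JSON file format and the helper names are hypothetical and not taken from Maxime’s article. The point is only that the transformation is a pure function of its inputs and that the task replaces the whole partition instead of updating rows in place.

```python
from __future__ import annotations

import json
import shutil
from pathlib import Path


def build_partition(raw_rows: list[dict], ds: str) -> list[dict]:
    """Pure transformation: the output depends only on raw_rows and ds."""
    return [
        {"ds": ds, "user": row["user"], "amount_cents": round(row["amount"] * 100)}
        for row in raw_rows
        if row["ds"] == ds
    ]


def overwrite_partition(warehouse: Path, table: str, ds: str, rows: list[dict]) -> Path:
    """Delete the partition if it exists, then write it again in full."""
    partition_dir = warehouse / table / f"ds={ds}"
    if partition_dir.exists():
        shutil.rmtree(partition_dir)  # no updates, no inserts: start clean
    partition_dir.mkdir(parents=True)
    out_file = partition_dir / "part-0.json"
    out_file.write_text(json.dumps(rows, sort_keys=True))
    return out_file


def run_task(warehouse: Path, raw_rows: list[dict], ds: str) -> Path:
    """One 'functional' task: raw data in, exactly one fully rewritten partition out."""
    return overwrite_partition(warehouse, "payments", ds, build_partition(raw_rows, ds))
```

Running `run_task` twice for the same `ds` leaves exactly the same single file behind, which is what makes reruns and backfills safe.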

The way I think about this is in terms of immutable infrastructure. I believe you should be able to recreate every piece of data that is not in a raw form in an instant. You should be able to delete your data warehouse and recreate it, with all its data, in at most a couple of hours.

What about backups? Backups are for speed of recovery, not for storing past data. If you use something like a snapshotting mechanism in a slowly changing dimension context to generate data that will soon be historical, you might argue that’s not possible.

I disagree. I actually think data generated from snapshots should be kept in a data-as-code repository, just like anything else. This is obviously only possible if you have a completely functional approach to your data pipelines.

But if you do, you gain an unparalleled level of simplicity and robustness in your data structures.

Functional Data Engineering — a modern paradigm for batch data processing | by Maxime Beauchemin | Medium

Batch data processing — historically known as ETL — is extremely challenging. It’s time-consuming, brittle, and often unrewarding. Not only that, it’s hard to operate, evolve, and troubleshoot. In…

maximebeauchemin.medium.com

🚀 (2) Functional Machine Learning with FKLearn

The unicorn Nubank open-sourced its machine learning framework fklearn in 2019, combining the best of the usual machine learning frameworks with the paradigms of functional programming.

In particular, the functional approach fklearn advocates allows you to:

Build everything in a reproducible fashion, without the state or mutable objects that are very common in machine learning.
Make a model production-ready with just a few extra steps, because reproducibility is built in.

I’m a fan of what functional programming has to offer, as I already explained with respect to data engineering and ETL in a past newsletter. Playing around with fklearn taught me that we really benefit from the same rigor in the realm of machine learning.
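Here is a rough sketch of what “no state, no mutable objects” can look like for a model. It deliberately does not import fklearn; it only mimics, as far as I understand the fklearn documentation, the shape of its learners, which return a prediction function, the transformed training data and a log. The name `mean_learner` and the toy “model” (always predict the training mean) are my own assumptions for illustration.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, float]
PredictFn = Callable[[List[Row]], List[Row]]


def mean_learner(train: List[Row], target: str) -> Tuple[PredictFn, List[Row], dict]:
    """Fit a trivial model that always predicts the training mean of `target`."""
    mean = sum(row[target] for row in train) / len(train)

    def predict(rows: List[Row]) -> List[Row]:
        # Return new rows with a prediction column; the inputs are never mutated.
        return [{**row, "prediction": mean} for row in rows]

    # The log records what was learned, so the run can be inspected and reproduced.
    log = {"learner": "mean_learner", "target": target, "mean": mean}
    return predict, predict(train), log
```

Because the fitted “model” is just a closure, the same `predict` function can be applied to training data, test data or production data without any hidden state changing in between.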

Introducing fklearn: Nubank’s machine learning library (Part I) | by Lucas Estevam | Building Nubank | Medium

At Nubank we rely heavily on machine learning to make scalable data-driven decisions. While there are many other ML libraries out there (we use Xgboost, LGBM, and ScikitLearn extensively for…

medium.com

☀️ (3) Evolutionary (Data) Architectures

I just read a good portion of the book “Building Evolutionary Architectures”, featured by ThoughtWorks. Even though it’s about the more general architecture of systems, it struck me that data teams and data architectures should be built to be just as evolutionary and adaptable as the overall architecture.

And to be fair, the book targets both levels: the higher-level architecture and the team-specific one. A couple of very data-specific things came to mind while reading it.

In my newsletter ThDPTh #4, I introduced “RedCI”, a CI system focused on AWS Redshift. I don’t know the details of the implementation, but building a CI system just for one database will produce coupling, and this coupling will lead to a non-evolutionary architecture as it will be hard to evolve away from it.

Instead, the better choice is to build an adapter for an existing CI system (which the team might have done), allowing you to simply write a new adapter if you want to change the underlying database.
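A minimal sketch of the adapter idea, under loud assumptions: I don’t know how RedCI is implemented, so none of this reflects it. `WarehouseAdapter`, `RecordingAdapter` and `ci_deploy` are hypothetical names, and the adapter here only records statements instead of talking to a real database. What it shows is that the CI step depends on a small interface, not on one specific database.

```python
from typing import List, Protocol


class WarehouseAdapter(Protocol):
    """The only thing the CI step needs to know about a database."""

    def run_sql(self, sql: str) -> None: ...


class RecordingAdapter:
    """Stand-in adapter that records statements instead of executing them."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.executed: List[str] = []

    def run_sql(self, sql: str) -> None:
        self.executed.append(sql)


def ci_deploy(adapter: WarehouseAdapter, migrations: List[str]) -> int:
    """Generic CI step: apply the migrations through whichever adapter is plugged in."""
    for sql in migrations:
        adapter.run_sql(sql)
    return len(migrations)
```

Switching the underlying database then means writing one new class with a `run_sql` method; `ci_deploy` and the rest of the pipeline stay untouched.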

In general, there is a simple technique to get a better handle on whether your data architecture is able to evolve. For every part of your toolchain, ask:

“What is this tool exceptionally great at? And what can it do for which I have better tools?”

It turns out this will help you decouple a LOT and become much more flexible and evolutionary. Some examples you might come up with:

Visualization tools like Redash/Superset are great at visualizations and user management. For SQL transformations, you have better options. So ideally every single “query” in Redash/Superset looks like a “Select * from materialized_view … where filter blablabla”, utilizing the great parametrization mechanisms these tools have, but nothing more.
The permission management system on your database/lake is probably much better and more suitable for handling permissions. So you should opt to keep permissions close to the data and map them back into the visualization tool.
A tool like Google Tag Manager is great at delivering asynchronous tags. It can also diff, version & deploy things. But for that, you very likely have much better tools, like a CI or CD system. So use the GTM API to deploy through your CI system, not the GTM mechanisms for deployment.
Your Superset also has a versioning mechanism, but you’ve got git! So use your familiar versioning system and not the built-in one.
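The first example in the list can be sketched like this. It is only an illustration under assumptions: SQLite stands in for the warehouse, a plain table built with `CREATE TABLE ... AS` stands in for a materialized view (SQLite has none), and `raw_orders`, `revenue_by_country` and `dashboard_query` are made-up names. The transformation lives in the warehouse; the “dashboard” only runs a thin, parametrized select.

```python
import sqlite3


def build_warehouse() -> sqlite3.Connection:
    """All transformation logic lives here, next to the data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE raw_orders (country TEXT, amount REAL);
        INSERT INTO raw_orders VALUES ('DE', 10.0), ('DE', 5.0), ('US', 7.0);
        -- Stand-in for a materialized view: a precomputed table.
        CREATE TABLE revenue_by_country AS
        SELECT country, SUM(amount) AS revenue
        FROM raw_orders
        GROUP BY country;
        """
    )
    return conn


def dashboard_query(conn: sqlite3.Connection, country: str) -> list:
    """What the visualization tool runs: select, filter by parameter, nothing more."""
    return conn.execute(
        "SELECT * FROM revenue_by_country WHERE country = ?", (country,)
    ).fetchall()
```

If the visualization tool is replaced, only `dashboard_query` has to be rebuilt in the new tool; the logic in `build_warehouse` stays where it is.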

Look at it this way and you’ll suddenly realize your lock-in has become a lot less painful. If you now want to exchange your tag management solution, at least it is doable. If you want to replace your visualization tool, that’s easy as pie, because all the important stuff is kept in materialized views, etc.

Turns out, by using tools that are best for a specific purpose, you’ll get to use better tools AND reduce coupling to build an easy-to-evolve architecture.

This is very much my take and just one specific application to the data world. I suggest you read the book for all the rest.

P.S.: The approach to tools highlighted above actually comes from Sam Newman and his excellent book “Building Microservices”, in which he discusses how to treat third-party systems in a microservices architecture. Data teams simply happen to have a lot of third-party systems.

Building Evolutionary Architectures | ThoughtWorks

This practical guide gives you the lowdown on building evolutionary architecture, to support your organization in a fast-changing world.

www.thoughtworks.com

🎄 In other news & thanks

Thank you for taking the time to read it till the end!

P.S.: I share things that matter, not the most recent ones. I share books, research papers, and tools. I try to provide a simple way of understanding all these things. I tend to be opinionated. You can always hit the unsubscribe button!

By Sven Balnojan

Data; Business Intelligence; Machine Learning, Artificial Intelligence; Everything about what powers our future.


Three Data Point Thursday