Protecting open source, how to build analytics pipelines, catching tax cheaters; ThDPTh #76

Sep 15, 2022

I’m Sven, and this is the Three Data Point Thursday. We’re talking about how to build data companies, how to build great data-heavy products & tactics for high-performance data teams. I’ve also co-authored a book about the data mesh part of that.

Time to Read: 4 minutes

Another week of data thoughts:

Build flat data pipelines at first, and scale up later.
AI is used to catch tax cheaters using aerial images, and it works.
The open source landscape changed, and that’s not necessarily bad.

🐰🐰🐰🐰🐰🐰🐰🐰🐰🐰🐰🐰🐰🐰🐰

Data & Modern Analytics Pipelines

What: Patrik Braborec from GoodData explains his perspective on modern data pipelines. He introduces a GitLab-based data pipeline that runs on a schedule or on a change of code. The stages are:

stages:
  # pre-merge
  - extract_load
  - transform
  - analytics_staging
  # post-merge
  - analytics_prod

He shows how to trigger these pipelines in practice, either on a change of code or on a schedule. In both cases, the extract & load process is kicked off, the transformations run, the tests run inside a staging environment, and the result is deployed to production.
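
To make the trigger logic concrete, here is a minimal Python sketch. It is not Patrik's GitLab configuration. The stage names come from the list above, but the trigger names ("merge_request", "merge_to_main", "schedule") and the mapping from trigger to stages are my own assumptions about how such a combined pipeline could behave.

```python
# Stage names are taken from the stage list above; everything else is hypothetical.
PRE_MERGE = ["extract_load", "transform", "analytics_staging"]
POST_MERGE = ["analytics_prod"]


def stages_for(trigger: str) -> list[str]:
    """Return the stages a combined code + data pipeline would run.

    Assumed triggers (not from the article):
      - "merge_request": a code change that is not merged yet
      - "merge_to_main": the code change has been merged
      - "schedule":      a timed run that picks up new data
    """
    if trigger == "merge_request":
        return list(PRE_MERGE)
    if trigger == "merge_to_main":
        return list(POST_MERGE)
    if trigger == "schedule":
        # Assumption: a scheduled run goes all the way to production.
        return PRE_MERGE + POST_MERGE
    raise ValueError(f"unknown trigger: {trigger!r}")


print(stages_for("schedule"))
# ['extract_load', 'transform', 'analytics_staging', 'analytics_prod']
```

The point of the sketch: one set of stages, several triggers. Nothing here separates "code changed" from "data changed" yet.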

My perspective: In data development, we usually think about two pipelines, the code pipeline and the data pipeline, which work separately but intersect to create value.

What Patrik is essentially doing is combining the two pipelines into one, which I think is a great way to develop when you’re starting out on a new project or working on any smaller project. This method will get you started with a proper CI setup quickly, without introducing too much complexity.

It also means he’s deploying an exact copy of the production environment (including the data) to the staging environment. Again, this is something I generally recommend.

So when should you switch from the one pipeline view to the “two pipelines” view?

IMHO you should at some point, and that point comes as soon as you want to change only one of them. For instance, if you’re looking to change a piece of code that’s just a mapping table, you will not want to run a complete extract & load process. You just want to test your logic (which in the case of mapping tables is usually quite trivial and only concerns downstream models).

So you will want to separate the logic tests out into the code pipeline. However, this also means you essentially have to introduce what’s today called “data observability” into the data pipeline, once that pipeline only runs on schedule, i.e. on data changes.
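
The split can be sketched as a small routing function in Python. This is only an illustration under my own assumptions: the folder names "mappings/" and "models/", the step names "logic_tests" and "data_observability_checks", and the rule that code-only changes skip extract & load are all hypothetical and not taken from Patrik's article.

```python
from typing import Optional

# Hypothetical repository layout: changes under these paths only need logic tests.
CODE_ONLY_PREFIXES = ("mappings/", "models/")


def plan_run(changed_files: Optional[list[str]]) -> list[str]:
    """Decide which steps to run once code and data pipelines are separated.

    changed_files is None for a scheduled run (data changed, code did not).
    """
    if changed_files is None:
        # Data pipeline: full run, with checks on the data itself.
        return ["extract_load", "transform", "data_observability_checks", "analytics_prod"]
    if changed_files and all(f.startswith(CODE_ONLY_PREFIXES) for f in changed_files):
        # Code pipeline: skip extract & load, only test the logic.
        return ["logic_tests"]
    # Anything else falls back to the combined "one pipeline" behavior.
    return ["extract_load", "transform", "analytics_staging"]


print(plan_run(["mappings/country_codes.csv"]))
# ['logic_tests']
```

A mapping-table change now costs one quick test run, while the scheduled run carries the data checks.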

I encourage you to take a look at the article because I really like the simplicity of this method.

Resource: https://medium.com/gooddata-developers/how-to-build-a-modern-data-pipeline-cfdd9d14fbea

The Batch on Catching Tax Cheaters using AI

What: Andrew Ng’s newsletter “The Batch” shares insights into a system developed by Google and Capgemini. It takes aerial images of residential areas, detects swimming pools, and checks them against a register to see whether they are properly registered.
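
The "detect, then check the register" step is simple to picture in code. Here is a rough Python sketch, assuming the image model has already produced detections. The Detection fields, the parcel IDs, and the 0.9 confidence threshold are all made up for illustration and say nothing about how the real system works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    parcel_id: str     # hypothetical ID linking an image region to a property
    confidence: float  # model score between 0 and 1


def unregistered_pools(detections, registered_parcels, min_confidence=0.9):
    """Return parcel IDs with a confidently detected pool and no register entry."""
    flagged = {
        d.parcel_id
        for d in detections
        if d.confidence >= min_confidence and d.parcel_id not in registered_parcels
    }
    return sorted(flagged)


detections = [Detection("A-1", 0.97), Detection("B-2", 0.95), Detection("C-3", 0.40)]
print(unregistered_pools(detections, registered_parcels={"B-2"}))
# ['A-1']
```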

My perspective: This is AI in action, and the effectiveness is astonishing. We’re talking about 94% accuracy over 20,000 detected pools.

On a different note, I think these types of AI-applied pipelines, going from “I have this request” to “AI delivers the results with great accuracy”, make great data business plans.

A lot of data start-ups are focused on providing solutions for companies to “build this yourself easily”. However, by not building out the full pipeline, I feel these start-ups are missing out on a good portion of the exponential growth of data, and thus on future growth.

Would love to see more of this as a business.

Resource: https://read.deeplearning.ai/the-batch/issue-161/

HBR on Open Source

What: HBR published this interesting article about the changes inside the open source landscape, namely the shift from mostly developer-driven open source to company-involved open source. The authors talk about how companies can play a constructive role within this environment.

My perspective: I enjoyed the read. It’s a great business perspective on the space, without too many insider deep dives.

Resource: https://hbr.org/2021/09/the-digital-economy-runs-on-open-source-heres-how-to-protect-it

What did you think of this edition?

-🐰🐰🐰🐰🐰 I love it, will forward!

-🐰🐰🐰 Average newsletter...

-🐰 It is terrible ( = I just made it to this link b.c. I was looking for the unsubscribe button)

Want to recommend this or have this post public?

This newsletter isn’t a secret society, it’s just in private mode… You may still recommend & forward it to others. Just send me their e-mail or ping me on Twitter/LinkedIn, and I’ll add them to the list.

If you really want to share one post in the open, again, just poke me and I’ll likely publish it on Medium as well so you can share it with the world.
