DataOps Testing, Airbnb’s Quality Initiative, Testing with dbt; ThDPTh #3
Data will power every piece of our existence in the near future. I collect “Data Points” to help understand this near future.
If you want to support this, please share it on Twitter, LinkedIn, or Facebook.
Here are your weekly three data points: DataOps Testing, Airbnb’s Quality Initiative, Testing with dbt.
1 DataOps, Value and Innovation Pipelines, DataKitchen
In DataOps, companies aim for one error or fewer per year. To an average data guy that might sound crazy! A lot of what data people do is fixing errors, figuring out why data is incorrect, and digging through masses of ETL (or ELT) code to analyze problems.
In DataOps we distinguish two pipelines: the value pipeline and the innovation pipeline. If you get an error in the value pipeline, it’s because NEW data crashed something. If there’s an error in the innovation pipeline, it’s because you just pushed a new feature and messed things up.
How to test value pipelines:
Testing the incoming data: is it free from issues? For example, “are all zip codes 5 digits?”…
Testing the transformation process: did it work as expected? Did we get less than 10% new dimensional data?
Testing the outputs: is the stuff we show to end-users correct? Does the number of customers increase? Did our web traffic roughly match last week’s traffic?
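The three kinds of value pipeline tests above can be sketched in a few lines of Python. This is only an illustration, not DataKitchen's implementation: the function names, the row layout (dicts with a "zip" field), and the 20% tolerance for "roughly matching" last week's traffic are all assumptions made for the example.

```python
import re

# Assumption for this sketch: zip codes are US-style, exactly five ASCII digits.
ZIP_RE = re.compile(r"[0-9]{5}")


def invalid_zip_codes(rows):
    """Input test: return every zip code that is not exactly 5 digits."""
    return [row["zip"] for row in rows if not ZIP_RE.fullmatch(str(row["zip"]))]


def new_dimension_share(known_keys, incoming_keys):
    """Transformation test: share of incoming dimension keys never seen before."""
    incoming = set(incoming_keys)
    if not incoming:
        return 0.0
    return len(incoming - set(known_keys)) / len(incoming)


def roughly_matches(current, previous, tolerance=0.2):
    """Output test: is `current` within +/- `tolerance` of `previous`?"""
    if previous == 0:
        return current == 0
    return abs(current - previous) / previous <= tolerance


if __name__ == "__main__":
    rows = [{"zip": "12345"}, {"zip": "1234"}, {"zip": "12a45"}]
    print("bad zip codes:", invalid_zip_codes(rows))
    print("new dimension share:", new_dimension_share(["a", "b", "c"], ["a", "b", "d"]))
    print("traffic plausible:", roughly_matches(1100, 1000))
```

In a real pipeline each check would run on every load and stop (or flag) the run when it fails, for example when new_dimension_share(...) comes back at 0.1 or higher.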
For testing innovation pipelines the idea of continuous deployment is great. I really like continuous deployment for all kinds of things. The basic idea is that you must trust your testing enough that you are willing to let every commit be promoted into the production environment automatically. If you do so, you’ll gain incredible speed. But for that to happen you’ll also have to have incredible trust in your tests! And that means having a great development environment with fixed data sets.
Since both pipelines meet at the business logic level, the business logic tests you write for the transformations also serve as tests for the innovation pipeline.
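A business logic test against a fixed data set could look roughly like the sketch below. The transformation (total_revenue_per_customer), the sample orders, and the expected totals are all hypothetical. The point is only that input and expected output are frozen, so any commit that changes the result fails the test before it reaches production.

```python
def total_revenue_per_customer(orders):
    """Hypothetical business-logic transformation under test."""
    totals = {}
    for order in orders:
        customer = order["customer"]
        totals[customer] = totals.get(customer, 0) + order["amount"]
    return totals


# Fixed input data set, kept in the repository next to the code.
FIXED_ORDERS = [
    {"customer": "alice", "amount": 30},
    {"customer": "bob", "amount": 20},
    {"customer": "alice", "amount": 12},
]

# Expected output, worked out once by hand and reviewed.
EXPECTED_TOTALS = {"alice": 42, "bob": 20}


def test_total_revenue_per_customer():
    """Fails as soon as a code change alters the result for the fixed data."""
    assert total_revenue_per_customer(FIXED_ORDERS) == EXPECTED_TOTALS


if __name__ == "__main__":
    test_total_revenue_per_customer()
    print("business logic test passed")
```

A test runner such as pytest would pick up the test_ function automatically, and the same assertion could run in the CI step that gates automatic promotion.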
That’s how you set up proper testing for your data that will get you below the threshold of one error per year.
2 Making Big Moves Like Airbnb’s Quality Initiative & Chain-Link Logics
In 2019 Airbnb launched its “Data Quality Initiative”, a large restructuring of a lot of parts of the company around data. I really like the idea because it incorporates a core principle of working with data:
“With data, centralized efforts are sometimes the key! Because data processes in a company exhibit what Prof. Rumelt calls a chain-link logic. But to increase the quality of a chain-link logic system, you need large centralized efforts, not stepwise incremental efforts.”
How come? Data in a company follows a chain-link logic because if you increase the quality in “just one part”, it won’t do anything for the decision-making capability of the company. Or at least very little. If you increase just the quality of your data-producing systems, it won’t change much unless your data teams also pick it up. If you increase the quality of your data team’s transformation processes, it will produce a slight increase in quality, but it’s still “shit in, shit out”. If your data team has lots of erroneous data, end-users will get used to that and not trust the system that much. Even if you do a lot of education and help them work with the system, it won’t increase the whole company’s decision-making capability by much.
Why is that? Because of quality matching! The individual units will tend to match the quality of the other systems. If the tech teams produce shitty data, the data team will probably be a little lower on the quality scale as well, as will the end-users.
And to change that, simply increasing the quality of any individual step by 10x will do very little for the whole (certainly not a 10x improvement). Instead, we need big centralized efforts!
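A toy model makes the argument concrete. The sketch below assumes, purely for illustration, that the quality of the whole chain equals the quality of its weakest link. The link names and scores are made up, and real organizations are of course messier than a min() call.

```python
def chain_quality(link_qualities):
    """Assumed toy model: the chain is only as good as its weakest link."""
    return min(link_qualities.values())


# Made-up quality scores for the three links discussed above.
baseline = {"data producers": 1, "data team": 1, "end-users": 1}

# Incremental effort: one link gets 10x better, the others stay put.
one_link_improved = {**baseline, "data team": 10}

# Centralized effort: every link is raised together.
all_links_improved = {name: 10 for name in baseline}

if __name__ == "__main__":
    print("baseline:", chain_quality(baseline))
    print("one link improved:", chain_quality(one_link_improved))
    print("all links improved:", chain_quality(all_links_improved))
```

Under this assumption the 10x improvement of a single link leaves the overall score unchanged, while raising all links together moves it.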
Two key moves stick out to me:
1. The concept of data set ownership, meaning every data set should have a clear owner responsible for its SLAs.
2. The degree of decentralization Airbnb chose. They placed data engineering teams together with product units, which is a great move in a company of that size, as I explain in my article on different kinds of decentralization in data organizations. It’s also what HubSpot is currently doing.
3 Testing with dbt & Snapshots
dbt is an amazing transformation tool, mostly because it comes with a whole set of best practices, which makes working with it simply fun. But dbt also comes with testing capabilities that help to run many of the value & innovation pipeline tests needed in the DataOps methodology.
Since dbt is the “T” in EL(T), the incoming data tests shouldn’t really be placed there. And yet there is a place for testing incoming data. dbt has a simple snapshotting mechanism that records how mutable incoming data changes over time. It can be used to build slowly changing dimensions. A colleague of mine pointed me to the great quote at the bottom of this post about snapshotting:
“The best time to start snapshotting your data was twenty years ago.
The second best time is today.”
I think this discussion explains the second use case very well: a regression test based on the changing raw data, which, by the way, you should also test in your EL tool.
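To make the snapshotting idea tangible, here is a minimal Python sketch of a type-2 slowly changing dimension. It is not how dbt implements snapshots (dbt does this in SQL inside the warehouse). The function name, the valid_from / valid_to columns, and the sample order rows are assumptions for the example, and deleted source rows are not handled.

```python
def snapshot(history, current_rows, key, now):
    """Record changes of `current_rows` in `history` (updated in place).

    history:      list of dicts, the source columns plus valid_from / valid_to
    current_rows: the source table as it looks right now
    key:          name of the column that identifies a row
    now:          timestamp of this snapshot run
    """
    # The currently valid version of each key is the one with no end date.
    open_rows = {h[key]: h for h in history if h["valid_to"] is None}
    for row in current_rows:
        previous = open_rows.get(row[key])
        if previous is not None:
            if all(previous.get(col) == val for col, val in row.items()):
                continue  # nothing changed since the last run
            previous["valid_to"] = now  # close the outdated version
        history.append({**row, "valid_from": now, "valid_to": None})
    return history


if __name__ == "__main__":
    history = []
    snapshot(history, [{"id": 1, "status": "pending"}], "id", "2021-01-01")
    snapshot(history, [{"id": 1, "status": "shipped"}], "id", "2021-01-02")
    for version in history:
        print(version)
```

Because old versions are kept instead of overwritten, the history gives you a stable basis for regression tests: a rerun of yesterday's transformation can use the raw data as it looked yesterday.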
Finally, a cool new dbt feature called exposures will help you showcase your output tests to the end-users, which I think is vital. In essence, “exposures” are a way of distinguishing dbt models between “core models, not meant to be used in reports” and “reporting models, meant to be used directly in reports”.
P.S.: I share things that matter, not the most recent ones. I share books, research papers, and tools. I try to provide a simple way of understanding all these things. I tend to be opinionated, but you can always hit the unsubscribe button!
Data; Business Intelligence; Machine Learning, Artificial Intelligence; Everything about what powers our future.