🐰 #19 E(t)L(T); Manual Data Checks; Rise of the Data Engineer; ThDPTh #19 🐰
What’s the t in EtLT? How to conduct manual data checks and the rise of the data engineer.
Data will power every piece of our existence in the near future. I collect “Data Points” to help understand & shape this future.
If you want to support this, please share it on Twitter, LinkedIn, or Facebook.
(1)🔮 E(t)LT!
With the rise of EL(T) over ETL, we took a great step towards much simpler and better processes in the data world. But it is becoming apparent that in some cases a little (t), as in E(t)L(T), is actually needed, because for some data sources we only want parts of the data, not the complete raw dump. And in the big picture, it makes sense to have a general, open-source connector that gets “all the data” and then applies a small transformation/masking/extraction step to keep only what’s needed.
This is especially true for raw data containing PII, which we’d really like to mask on the fly during extraction. So I’m really happy that both of the currently trending open-source extraction solutions, Meltano and Airbyte, have this topic on their radar. Meltano has already discussed it in an issue and put it on their roadmap for June; Airbyte discussed it in their open Slack channel and follows a very similar line of thought.
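To make the idea concrete, here’s a minimal sketch of what that small (t) could look like: a masking step applied between extract and load. All field names and records are made up for illustration; this is not Meltano’s or Airbyte’s actual implementation.

```python
import hashlib

# Hypothetical set of fields we consider PII in the extracted records.
PII_FIELDS = {"email", "phone"}

def mask_value(value: str) -> str:
    """Replace a PII value with a stable, non-reversible hash."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]

def mask_record(record: dict) -> dict:
    """The small (t): mask PII fields, pass everything else through."""
    return {
        key: mask_value(str(value)) if key in PII_FIELDS else value
        for key, value in record.items()
    }

# A raw record as a generic connector might deliver it ...
raw = {"customer_id": 42, "email": "jane@example.com", "revenue": 199.0}
# ... masked on the fly before it ever reaches the warehouse.
print(mask_record(raw))
```

Because the hash is stable, the masked column can still be used for joins and deduplication downstream, while the actual PII never lands in the warehouse.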
I’m sharing an article by the company Xplenty that describes this exact use case.
An exploration of the emerging ETLT data integration solution: what it is, how it works, and who can most benefit from it.
(2)📣 A Checklist for Manual Data Checks
“Hi boss, here’s the ad hoc query result you asked for, the list of all customers with revenue over XX$ on product Y”… “Cool! But wait, why is this customer there twice? And I’m pretty sure customer Z doesn’t use that product at all.”
… Sounds familiar?
The article, written by Robert Yi, argues that you should check and document your assumptions about the data you use for analysis, because data will always come back and surprise you if you assume anything about it. The article contains a checklist of three checks to get the data context you need.
I recommend these kinds of checks for most of your work, especially checking & documenting your assumptions. If you’re doing ad hoc analysis, this means checking some basic SQL queries into version control; if you’re building a dbt model, it means integrating the appropriate raw data checks.
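As an illustration, here’s a small sketch of assumption checks on an ad hoc result like the one above. The three checks here (uniqueness, completeness, value range) are my own illustrative picks, not necessarily the article’s exact checklist, and the data is invented:

```python
def check_assumptions(rows):
    """Return a list of violated assumptions for an ad hoc result set."""
    problems = []
    # 1. Uniqueness: each customer should appear only once.
    ids = [r["customer_id"] for r in rows]
    if len(ids) != len(set(ids)):
        problems.append("duplicate customers")
    # 2. Completeness: revenue should never be missing.
    if any(r["revenue"] is None for r in rows):
        problems.append("missing revenue")
    # 3. Range: revenue should be positive for this query.
    if any(r["revenue"] is not None and r["revenue"] <= 0 for r in rows):
        problems.append("non-positive revenue")
    return problems

rows = [
    {"customer_id": 1, "revenue": 120.0},
    {"customer_id": 1, "revenue": 120.0},  # duplicate sneaks in via a join
    {"customer_id": 2, "revenue": None},   # missing value in the source
]
print(check_assumptions(rows))  # → ['duplicate customers', 'missing revenue']
```

Checking a script like this (or the equivalent SQL) into version control alongside the query documents exactly what you assumed when you handed the numbers to your boss.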
I realize the article is trying to sell the author’s product, but I’m sharing it anyway because the content is sound, and so is the open-source tool the author created.
(3)🚀 The Rise of the Data Engineer
Written in 2017 by Maxime Beauchemin, this article is still very interesting because it discusses a future of data engineering that has now arrived for most data engineering teams. But unlike with other articles, I don’t have much to say; I’d simply like to point you to the article and some quotes:
“Unlike data scientists — and inspired by our more mature parent, software engineering — data engineers build tools, infrastructure, frameworks, and services. In fact, it’s arguable that data engineering is much closer to software engineering than it is to a data science.”
True and very important. Data scientists are simply not data engineers, and a data science team will very likely not build a great data ecosystem. Focus & a passion for actual engineering are important here. A small but: there’s now also the “machine learning engineer”, because running machine learning systems at scale actually requires a lot of engineering as well.
“To a modern data engineer, traditional ETL tools are largely obsolete because logic cannot be expressed using code.”
Again, very true.
“The data engineering team will often own pockets of certified, high quality areas in the data warehouse. At Airbnb for instance, there’s a set of “core” schemas that are managed by the data engineering team, where service level agreements (SLAs) are clearly defined and measured, naming conventions are strictly followed, business metadata and documentation is of the highest quality, and the related pipeline code follows a set of well defined best practices.”
🎄 In Other News & Thanks
Thanks for reading this far! I’d also love it if you shared this newsletter with people whom you think might be interested in it.
P.S.: I share things that matter, not the most recent ones. I share books, research papers, and tools. I try to provide a simple way of understanding all these things. I tend to be opinionated. You can always hit the unsubscribe button!
Data; Business Intelligence; Machine Learning; Artificial Intelligence; everything about what powers our future.