Breaking Down the Modern Data Stack: Practical Insights for Leveraging Analytics Progress
This is the Three Data Point Thursday, making your business smarter with data & AI.
Let’s dive in!
Actionable Insights
If you only have a few minutes, here’s what you need to know about the modern data stack:
Bob Muglia has the best definition. Others simply miss essential parts. The MDS isn’t just about open source or dbt; it is about SaaS, Cloud, Snowflake, and more. It is the wrapper around the progress in analytics over the last years.
You should try to go for a 100% SaaS MDS. But try not to build up too many dependencies (yes, that’s possible; you don’t have to embrace every single feature!)
If not 100% SaaS, look to the cloud vendors. They have semi-SaaS versions of Airflow, Snowflake, or any other piece you need.
Teach people SQL and bet on it. No, there’s no replacement coming.
Modularity can be dangerous. Try to consolidate your MDS; modularity leads to scattered stacks that bring joy to the data team but not to the company that employs the data team.
Don’t forget to close the loop. Tracking systems like Google Analytics are gold, as are CRM systems that integrate well with your data stack. Use them.
In the 7th century, an Indian mathematician called Brahmagupta decided to formalize a concept that secretly pulled the strings of number theory for centuries: the number zero, previously used only as a placeholder. Suddenly, a world opened up, calculations became more accessible, and new concepts popped up around this new number, such that today, zero may be the most powerful of natural numbers (there’s an argument that zero is not a natural number at all, but I like to stick with this perspective, my Analysis 101 professor made a good case).
When I read Bob Muglia's book “The Datapreneurs,” it reminded me a lot of Brahmagupta and the number zero. I think we all feel it and know that something has changed in analytics. Bob Muglia makes the case that the MDS is just that, the center of the incredible progress in data analytics over the past decade.
Maybe the MDS concept is just a nice wrapper around the amorphous mass of progress in data analytics, or maybe there truly is a hidden “zero” here. I don’t think it matters much. What does matter is a straightforward question:
How can each of us profit from the progress in analytics? How can you make the most of the advances the modern data stack brings?
Unfortunately, while the question is simple, the answer is not. Because the MDS isn’t a single thing, it comes in pieces, thoughts, and best practices.
You can create a modern data stack inside Databricks, all with data ingestion, transformation, data science notebooks, and BI - in one big piece.
You can add Fivetran for data ingestion, dbt Cloud for transformation, and keep Databricks for a managed SaaS MDS.
You can use Snowflake + dbt (self-hosted) + Meltano and top it off with Tableau. You end up with a mix of DIY, SaaS, and lots of (usually well-working) integrations.
And you can do so much more… The question is, should you?
Let’s step back for a second to define the modern data stack and then dive into the pros and cons of each nitty-gritty detail.
Defining the modern data stack
Bob Muglia defines the modern data stack as an industry-wide solution to data analytics that shares five characteristics (well, Bob gives us the first three, but I added the last two based on remarks in his book):
It is delivered via software services
It leverages the public cloud for scale and low cost
It models data for use by a SQL data warehouse (SQL as an interface)
It is interoperable, providing modularity
It provides a closed loop, from collection of data to the use of it
What most people overlook
If you look into other definitions of the modern data stack, they will discard some of these characteristics. But, in my experience, these five characteristics are essential to a practical and helpful definition:
Most people like to exclude data collection technologies like Google Analytics from the MDS, but you simply can’t. Those tools provide a massive step forward and close the loop.
Similarly, you cannot ignore tools on the other side of the loop like HubSpot or Salesforce that emerged in the past decade and went out of their way to offer integrations such that, finally, other tools can push data into the places that need them.
And even though I’m a big fan of open source technologies, it is a matter of fact that open source technologies need expert knowledge. Per definition, 90% of companies don’t have that, so for 90% of companies to benefit from the MDS shifts, they have to happen as SaaS products, period.
"You may choose to ignore reality, but you cannot ignore the consequences of ignoring reality." - commonly attributed to Ayn Rand.
Now, let us get actionable. Let’s examine each characteristic in detail and determine what YOU should do…
Ch. 1 Delivered via Software Services (SaaS)
When Salesforce launched in 1999, the incumbent and by far the market leader in the US was Siebel Systems, now part of Oracle. The cost to set up the CRM system designed by Siebel for the average company was around $500,000 to $1.5 million.
When Salesforce launched, the cost was $50/user/month; for the average company, that was 95% less! It was a simple shift in pricing, but it unlocked CRM systems for the 95% of the market that was previously using Excel sheets and paper.
That unlocking was only possible because of the SaaS nature of Salesforce, a model Salesforce pioneered for all of business & enterprise software.
It took a decade longer to arrive in the data world, but today you can find Software-as-a-Service solutions for every part of the data analytics chain.
However, everything has benefits and downsides, so let’s explore them in detail.
Benefits of the SaaS model
While everyone loves to complain about their Snowflake costs, I don’t think people appreciate the vast cost efficiencies they enjoy. Companies only complain about Snowflake costs because they are so amazingly low compared to the alternatives they could never afford!
In 2005, of course, a mid-sized company with a sales staff of 100 people would complain about paying $60k a year for Salesforce, but only because they would’ve never been able to pay for the Siebel system in the first place and never knew what the alternative would cost.
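To make the comparison concrete, here is a rough back-of-the-envelope calculation using the figures quoted above. Treat it as a sketch, not a pricing model: it compares one year of subscription fees against a one-time setup cost, ignores licenses, maintenance, and discounts, and the function and constant names are purely illustrative.

```python
# Figures taken from the text; everything else is an illustrative assumption.
PRICE_PER_USER_PER_MONTH = 50      # Salesforce launch price in USD
SALES_STAFF = 100                  # mid-sized company from the example
ON_PREM_SETUP_LOW = 500_000        # lower end of the quoted on-premise setup cost
ON_PREM_SETUP_HIGH = 1_500_000     # upper end of the quoted on-premise setup cost


def annual_saas_cost(users: int, price_per_user_per_month: float) -> float:
    """Yearly subscription cost for a per-seat SaaS plan."""
    return users * price_per_user_per_month * 12


def savings_pct(saas_cost: float, on_prem_cost: float) -> float:
    """How much cheaper the SaaS option is, in percent of the on-premise cost."""
    return (1 - saas_cost / on_prem_cost) * 100


saas = annual_saas_cost(SALES_STAFF, PRICE_PER_USER_PER_MONTH)
print(f"Annual SaaS cost: ${saas:,.0f}")
print(f"Cheaper than low-end setup:  {savings_pct(saas, ON_PREM_SETUP_LOW):.0f}%")
print(f"Cheaper than high-end setup: {savings_pct(saas, ON_PREM_SETUP_HIGH):.0f}%")
```

Under these assumptions, the first year comes to $60,000, which is roughly 88% to 96% below the quoted setup cost range.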
The two significant benefits of SaaS are low cost and flexibility in usage. You pay only for what you use, and you can switch your provider quickly, given that you don’t have to pay $500,000 for customizations to your system, which would be very common when buying servers on your own.
Downsides of the SaaS model
People do complain about Snowflake prices for a reason. Lower price points make solutions affordable to more people, but they also make it easy to use more than you need. So, with SaaS services, you need to watch the cost.
Second, while these solutions are flexible, you have to ensure you stay flexible. That means you should take regular backups and watch whether you’re integrating too deeply into one solution. Because once you take the flexibility away, you’re in a pretty bad spot.
Actionable insights for the MDS
Do try to go for SaaS solutions; they are cheap and flexible.
Watch your cost and your level of integration, keep backups, and try to decouple yourself where possible. Keep an eye on alternatives and be prepared to switch a service.
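As one way to picture the "keep backups and decouple" advice, here is a minimal sketch that dumps every table of a database into plain CSV files, a format any other tool can read. It uses Python's built-in sqlite3 module as a stand-in for a warehouse, and the function name, table, and data are hypothetical; a real warehouse would have its own export mechanism.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def export_tables_to_csv(conn: sqlite3.Connection, target_dir: Path) -> list[Path]:
    """Dump every user table to one CSV file each and return the written paths."""
    target_dir.mkdir(parents=True, exist_ok=True)
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
        )
    ]
    written = []
    for table in tables:
        # Table names come from the database catalog, so quoting them is enough here.
        cursor = conn.execute(f'SELECT * FROM "{table}"')
        path = target_dir / f"{table}.csv"
        with path.open("w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow([col[0] for col in cursor.description])  # header row
            writer.writerows(cursor)
        written.append(path)
    return written


# Tiny demo database with one hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "acme"), (2, "globex")])

backup_dir = Path(tempfile.mkdtemp())
backup_files = export_tables_to_csv(conn, backup_dir)
print([p.name for p in backup_files])
```

The design choice is the point here: the backup lands in a vendor-neutral format, so switching providers later does not depend on the provider you are leaving.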
Ch. 2 Public Cloud
Data is still mass-migrating to the cloud, leaving fat old servers hidden in the company’s proprietary data centers.
While the public cloud spurred the SaaS movement and made companies like Salesforce or dbt Labs possible in the first place, it also drives analytics transformations inside companies.
While you can have a modern data stack entirely in a managed space, most companies opt to have parts of their data stack inside their own rented infrastructure inside one of the big public clouds, AWS, Google, or Azure.
The public cloud essentially shares the benefits and downsides of SaaS solutions. It provides low cost and standardization, while you may face the downsides of lock-in and of producing waste because of these low costs (compared to the alternative you cannot afford).
An added benefit of public clouds is their support of data analytics and their increasingly significant part in the modern data stack. You can run a modern data stack orchestrator like Apache Airflow yourself or opt for a managed solution with Astronomer, but you can now also use Amazon Managed Workflows for Apache Airflow (MWAA), the AWS-managed option. AWS also offers notebook solutions similar to Databricks and ETL tools to tackle most modern data stack needs.
Actionable insights
If you don’t want to consider SaaS for parts of your data stack, the next consideration should be a managed public cloud version.
If that still doesn’t tickle your fancy, use public clouds to build that part of your data stack.
Watch your cost and your level of integration, keep backups, try to decouple yourself where possible - and bear in mind that this is pretty hard with public clouds.
Ch. 3 SQL as an interface
SQL has become the analytics language because it has become the integration layer, wiring everything together - including the people using it. The analytics engineering movement has emerged as basically an SQL-ninja movement. Business analysts learn SQL to manipulate the data in any way they want, data engineers focus on SQL-based transformations, and almost any data store is now SQL-based.
And while many movements try to argue that SQL is a terrible language for doing analytics, the simple fact is that it’s the winner for now. It’s the interface wiring all of these individual parts together, with two simple, actionable implications for you:
You should bet on SQL; even if people tell you “it’s not powerful enough,” you should avoid non-SQL-supportive technologies.
You should teach many people SQL; people are the most critical integration piece.
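To illustrate what "SQL as an interface" means in practice, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for a warehouse, and the table and column names are made up for illustration. The transformation itself is plain SQL, so the same query should run largely unchanged on most SQL engines.

```python
import sqlite3

# In-memory SQLite database as a stand-in for a cloud data warehouse.
conn = sqlite3.connect(":memory:")

# Hypothetical raw table, as an ingestion tool might load it.
conn.execute("CREATE TABLE raw_orders (customer TEXT, amount REAL, status TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [
        ("acme", 120.0, "paid"),
        ("acme", 80.0, "paid"),
        ("globex", 50.0, "refunded"),
        ("globex", 200.0, "paid"),
    ],
)

# The transformation is plain SQL: revenue per customer from paid orders only.
REVENUE_PER_CUSTOMER = """
    SELECT customer, SUM(amount) AS revenue
    FROM raw_orders
    WHERE status = 'paid'
    GROUP BY customer
    ORDER BY customer
"""

revenue = conn.execute(REVENUE_PER_CUSTOMER).fetchall()
for customer, total in revenue:
    print(customer, total)
```

Anyone who can read the query can follow the logic, whether they are an analyst, an engineer, or a tool that speaks SQL.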
Ch. 4 Modularity
Building on this familiar interface of SQL is the characteristic of the modern data stack I find most important and most overlooked: It is modular! You can exchange pieces on the fly, bundle two pieces together if you want, and make your whole data stack one big fat Databricks stack if you want. The technology allows you to exchange almost anything.
While this sounds great in theory, it has two big fat practical problems.
Scattered data stacks
You might lose the modularity because no tool is built to make itself easy to exchange
While data engineers and most data teams hate to hear the truth, looking from the company level, choosing one tool is usually better than having two. Integration and consolidation suit the company; they build knowledge and expertise and reduce costs.
And yet, most data teams create scattered data stacks consisting of tons of individual pieces that need multiple experts to run, even if they are managed as SaaS tools. I think it’s the downside of modularity; many people believe having two individual tools doing a 10% better job is a good thing. But in reality, the integration and complexity costs are rarely worth it.
On the other hand, if a tool comes along that provides a 50% improvement in effectiveness, and with the pace of technological progress today that happens around every six months, you do want to be able to switch. But that means, no matter what tool you use, you want to try to stay flexible, stick to the common aspects of the tool, and keep it SQL.
Actionable insights
Don’t fall for the scattering fallacy; try to consolidate your tools and keep it simple.
Try to stay flexible; don’t buy into super-specific functions of tools unless the benefit is big (50+% big).
Ch. 5 Closed loops
Bob Muglia also explores the idea of closed loops in “The Datapreneurs” as one key feature of the MDS. Data stacks today can close the loop from data generation to collection, transformation, storage, and analysis, and then back to collecting data on the new decisions made from that analysis.
They close the loop by providing great data collection technologies and sending data into many different places, like CRM systems or marketing automation systems that generate data that goes into the analytics pipelines.
Data today flows in a circle, not in a pipeline. Even though fundamental to the modern data stack, the idea of the pipeline is outdated.
The question regarding the closed-loop characteristic of the MDS is thus whether you should use it. Just because you can send data to lots of places doesn’t mean you should; just because you can collect a lot of data doesn’t mean it’s a wise investment. There is such a thing as having too much data; there is a concept of waste and complexity in business.
A second point to consider is that data always flows in a closed loop; it is always collected and used. The critical property of the MDS is to enable this flow to happen within one linked technology chain. So, the second question becomes whether you want to pull parts of this loop into the MDS or enable them with a second technology.
Unfortunately, there aren’t any accessible, actionable insights for the closed loop. Yes, closed loops are good, but they highly depend on your company. The only good advice is…
Not to listen to any advice regarding reverse ETL, data collection, and closed loops.
You’ll need to find a custom plan for your company, which is highly dependent on your current tech stack in the rest of the company and the decision-making cycles you have in place.
Note on sources
Bob Muglia defines the modern data stack with only the first three characteristics but talks heavily about modularity and its closed-loop feature in his book “The Datapreneurs.”
As for the story of Salesforce, I recommend the book “Behind the Cloud” by Marc Benioff on the rise of Salesforce.