Roast Your Own Data - How To Ensure Your Data Is Ready For The Grind
This newsletter is brought to you by “The 7 Data Product Strategies” - the ebook for product managers and entrepreneurs who want to build amazing data-heavy products.
Marco Arment, ex-CTO of Tumblr and creator of Instapaper and Overcast, knows his coffee. Judging by his personal blog, I'd say he's almost obsessive about it. Seth Godin once explained that he learned the single most crucial lesson about coffee making from Marco: roast your own beans.
The reason is decay. Freshly roasted coffee beans are only good for a few days, and only if they are stored properly, which means airtight. That means most ground coffee and all the prepackaged bags you buy are no good to start with! I find that astonishing. The only reasonable alternative to roasting yourself is to buy from a local store where you know when the beans were roasted, and to buy only the small amount you know you'll use within a couple of days.
Which brings me to data science and machine learning. I find it similarly astonishing that most people inside the industry think that what they call “raw data” is an excellent input to the data science project they are building.
Most data people think their time is best invested in building more transformations, monitoring their pipelines, and training better ML models. But it likely is not.
Alan’s beans
Alan is a machine learning engineer who works on complex projects. He’s developing a machine-learning solution to match duplicate entries. I know that sounds super boring, but people build businesses on it.
Now Alan has been working on this problem for over a year. Knowing when two things are duplicates of each other isn't easy; you need a "ground truth", a set of items labeled as such. Luckily, he has that.
Alan works at a company with a bike-tracking app that lets people register their bikes. Customer service can mark sets of bikes as duplicates, so Alan already has a labeled ground truth.
For the past 14 months, he has used that data and iteratively improved his algorithm by minor percentage points, tweaking hyperparameters and combining different algorithms.
After learning about roasted beans, however, Alan thought: Maybe I could roast my own beans? Maybe I could produce higher-quality data myself?
He took some time away from algorithm tweaking and developed something: a simple labeling tool that presents people with candidate pairs and lets them decide whether they are duplicates.
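A tool like that doesn't need to be fancy. Here is a minimal sketch of what such a labeling loop could look like in Python; the file names and columns (candidate_pairs.csv with record_a and record_b, labels.csv) are hypothetical placeholders, not Alan's actual setup.

```python
import csv

# Hypothetical input: candidate pairs pulled from the production data.
# "record_a" and "record_b" stand in for whatever fields a human needs
# to see to judge whether two entries are duplicates.
with open("candidate_pairs.csv", newline="") as f:
    pairs = list(csv.DictReader(f))

labels = []
for pair in pairs:
    print(f"\nA: {pair['record_a']}\nB: {pair['record_b']}")
    answer = input("Duplicate? [y/n/skip] ").strip().lower()
    if answer in ("y", "n"):
        labels.append({**pair, "is_duplicate": answer == "y"})

# Write the freshly roasted ground truth for the matching model.
if labels:
    with open("labels.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(labels[0].keys()))
        writer.writeheader()
        writer.writerows(labels)
```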
It turned out that one month of labeling data this way produced better results than the previous 14 months of algorithm tweaking on the old data set.
In my mind, Alan just became a barista. He started treating the data as green beans.
He didn't take what he was given but instead asked the question: "What does my perfect dataset look like?" And then he invested the effort to roast his own beans! It's a small shift in perspective, but it can compress 14 months of work into one.
The lesson: Stop thinking of your raw data as a fixed input. You’re in charge of thinking up your perfect input, and you’re in charge of helping to create your perfect input. Start to ask yourself, “What does my perfect dataset look like?”
Bree’s beans
Bree leads a data engineering team. Her team built a platform that ingests data from all over the place. One place is the production database of the company's bike-tracking app, which holds all the user behavior data.
Bree is also tasked with building dashboards on top of this. She constantly fights quality issues. Business analysts come to her and complain about missing data. More often than not, once she hunts it down, it is due to some weird behavior in the customer data. She keeps thinking, "The app team really needs to fix this at some point," and applies a little workaround on her side.
But not today; today, Bree had a very discouraging conversation with the product manager of the tracking app. He told her, in essence, that they are not going to change; they need their data to behave the way it does, and she will have to live with it.
But, over a coffee, Bree talks to Alan and learns of his new barista approach to data. She wonders whether this could be applied to her problem, too.
She notices one key fact: the data she is using is a by-product of the software engineering team's work; it is not made for her. No one is roasting her beans.
So she decides to change how she views data. Incoming raw data is no longer her input. From today on, every data pipeline starts with a step called "create the input data": it runs input schema tests, casts timestamps into the proper format, and, if necessary, cuts things short.
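As a rough sketch of what such a first step could look like, here is a small pandas example; the column names (user_id, event_time, comment) and the length limit are invented for illustration, not Bree's actual schema.

```python
import pandas as pd

EXPECTED_COLUMNS = {"user_id", "event_time", "comment"}  # hypothetical schema
MAX_COMMENT_LENGTH = 500  # hypothetical limit: "cut things short"

def create_input_data(raw: pd.DataFrame) -> pd.DataFrame:
    """First pipeline step: turn whatever arrives into the data we actually want."""
    # Input schema test: fail loudly if the source changed under us.
    missing = EXPECTED_COLUMNS - set(raw.columns)
    if missing:
        raise ValueError(f"Input schema test failed, missing columns: {missing}")

    df = raw.copy()

    # Cast timestamps into the proper format; unparseable values become NaT
    # instead of silently breaking downstream dashboards.
    df["event_time"] = pd.to_datetime(df["event_time"], errors="coerce", utc=True)

    # Cut things short where necessary.
    df["comment"] = df["comment"].astype(str).str.slice(0, MAX_COMMENT_LENGTH)

    return df
```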
The lesson: Stop thinking incoming data is made for you. Instead, start thinking of your input data as something you create yourself. You will always need to mold it to your purposes. Never rely on others to do your job. Your data is your key asset, so you had better have full control over its quality, and that means you're the one who raises it to the quality you need.
Charlie’s beans
Charlie is in charge of the recommendation engine of our bike-tracking app. Based on where you bike, it recommends nice routes in your neighborhood. But you know, things change. Sometimes several people ride to a new supermarket, and Charlie's recommendation engine suddenly starts recommending a trip to the supermarket as the new weekend trip.
So Charlie has learned to live with it. He puts monitoring in place to alert him to such deviations, and when that happens, he gets the on-call person to do a quick rollback of the recommendation engine.
Charlie is now assuming that his recommendation engine has to break from time to time.
Until he meets Bree, who tells him about her new barista approach to data. Charlie wonders whether this could apply to him, too. He realizes that no one is checking his coffee beans, no one is controlling the quality, so he has to!
What if he stops thinking the incoming data is too big to check row by row? Maybe there is a way.
And sure enough, after talking to the rest of the team, they come to a realization: if they update the recommendation engine only during working hours, but every 30 minutes instead of just once a day as before, each batch of new data becomes much smaller, and they can indeed run a larger set of tests over it.
In addition, because they monitor the behavior afterward, they are able to catch errors fast, with little impact, and can fix them without involving the on-call person.
The lesson: Stop thinking your incoming data is too big to be checked row by row. It may well be too big to check by hand, but that doesn't mean you can't ensure the quality of your inputs row by row! You can test every row automatically (just like a bean roaster doesn't inspect every single bean by hand). Use automation and test the quality of each row based on the inputs you need.
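To make that concrete, here is a minimal sketch of automated row-level checks in plain Python; the field names (trip_id, distance_km, start_time) and the thresholds are invented for illustration, not Charlie's real schema.

```python
from datetime import datetime

def check_row(row: dict) -> list:
    """Return a list of problems with one incoming record; empty means the row is clean."""
    problems = []
    if not row.get("trip_id"):
        problems.append("missing trip_id")
    try:
        distance = float(row.get("distance_km"))
        if not 0 < distance < 500:
            problems.append(f"implausible distance_km: {distance}")
    except (TypeError, ValueError):
        problems.append(f"missing or non-numeric distance_km: {row.get('distance_km')}")
    try:
        datetime.fromisoformat(str(row.get("start_time")))
    except ValueError:
        problems.append(f"unparseable start_time: {row.get('start_time')}")
    return problems

def validate_batch(rows):
    """Split a (now much smaller) 30-minute batch into clean rows and rejects."""
    clean, rejected = [], []
    for row in rows:
        problems = check_row(row)
        if problems:
            rejected.append((row, problems))
        else:
            clean.append(row)
    return clean, rejected
```

Rejected rows can then be logged or routed to a quarantine table, so the recommendation engine only ever sees rows that passed every check.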
We’re all Alan, Bree and Charlie
We’ve all been there, and we all will be there again. We all keep on taking the data we get as input. We consider it a given, but it is not. It never is. You can roast your own beans, you can check every bean for quality, and you can put yourself into the creator's shoes.
When we change our way of thinking, the magic happens.
Let's keep the magic happening.
Or at least enjoy a good cup of coffee.
Actionable Questions To Change Today
Think of your current data project, be it a fancy app, a data warehousing pipeline, or a machine learning solution. It is surely based on data, on incoming data.
Where is that data coming from? Make a list of the key sources, the key inputs to your process.
If you were to dream up your dream input dataset, what would it look like?
What does 100% quality look like? What can YOU do to make that happen?
If you could check your data row by row, what would you check?