Thoughtful Friday #11: Realtime data is underrated
Let’s dive in!
Time to Read: 4 minutes
Real-time data is underrated, not overrated.
Real-time data produces higher quality insights.
Real-time data is easier to maintain.
Google "Real-time data is underrated" and you'll likely end up at an article by Hunter Walk explaining why real-time data is overrated and why the truly underrated data is the past.
The idea is pretty simple: most people want "real-time" but usually just need something "fresher than daily". History is long, and the last hour is short by comparison, so it's reasonable to assume that most of the information in data is hidden in the past, not in the fresh last hour.
So it does sound like "real-time" or even "fresh data" is more of a trap than a benefit.
However, I've lately been challenging that thought. I think it might be becoming a self-fulfilling prophecy. I do see one huge benefit of fresh data: freshness also makes that same data higher in quality.
Before diving in, a definition: I call data real-time or near real-time if it is fresh, meaning it appears soon after the physical event happens, whatever that means in your context. Whether a pipeline pulls once every half an hour or an event stream pushes records into the system 15 seconds after they happen, I don't make a distinction here. If it's fresh to the end-user, then it's real-time for the purpose of this piece.
Let me tell you why I believe fresh data can be so beneficial to the quality of data:
Because real-time data makes real-time observability possible
Because real-time data makes the data increments small and thus easy to fix
Because real-time data makes the data increments small and thus easy to understand
1. Real-time data => real-time data observability
If the data freshness is set to "daily" for your API, your dashboard, or your machine learning system, I often observe a specific kind of learned behavior emerging: if the data seems wrong, people wait. They wait for the "daily load to finish", or they wait another day. And if they do raise a flag on the first day, the data people behind the system will sometimes wait too: for the system to finish the load, or for the next day's load to fix the problem automatically.
That's a lot of waiting, for one simple reason: the feedback cycles are really long! Compare that to writing code and getting feedback from a unit test within seconds. This is a crazy long feedback cycle, one that produces very reactive and passive behavior.
If you flip that around, however, and set your data freshness to an hour or less, you will likely get errors reported almost immediately.
That's what real-time data observability is: catching errors in the outputs almost immediately.
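To make that a bit more concrete, here is a minimal sketch of what a per-batch check could look like. Everything in it is an assumption for illustration: the BatchReport type, the check_batch function, the one-hour freshness limit and the 50% row-count tolerance are made up, not taken from any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class BatchReport:
    """Hypothetical summary of one loaded batch."""
    loaded_at: datetime
    row_count: int


def check_batch(report, now, expected_rows,
                max_age=timedelta(hours=1), tolerance=0.5):
    """Return a list of problems for one batch (empty list if it looks healthy)."""
    problems = []
    # Freshness: the batch should have arrived within the agreed window.
    if now - report.loaded_at > max_age:
        problems.append("stale")
    # Volume: a batch far smaller or larger than usual is suspicious.
    if abs(report.row_count - expected_rows) > tolerance * expected_rows:
        problems.append("unexpected row count")
    return problems
```

Run against every hourly batch, a check like this raises a flag within the hour. Run against a daily load, the same check can only speak up once a day.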
Faster feedback makes problems both easier and faster to fix, which increases the quality of your data. But there's a second effect that makes fixing problems easier this way.
2. Real-time data => small batches => easier to fix
If you import 10,000 orders each day on a daily schedule, you have 10,000 possible places to check for an error. If you chop that into small one-hour bits, each batch holds only around 400 orders (assuming they are spread evenly over the day), so you end up with a lot less pain.
By reducing the "batch size" you're making your search field smaller. You're also making the blast radius smaller: if a batch fails at 1 pm, you still have the data for roughly half a day ready to be explored. And you'll fix the error faster, so you end up with much higher data quality.
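Here is a small sketch of the batch-size argument in Python. The data is synthetic and the function names (synthetic_orders, split_into_hourly_batches, usable_batches) are hypothetical; the only assumption about the orders is that they are spread evenly over one day.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def synthetic_orders(day_start, count=10_000):
    """Made-up orders as (order_id, placed_at), spread evenly over 24 hours."""
    step_ms = 24 * 3600 * 1000 // count
    return [(i, day_start + timedelta(milliseconds=i * step_ms))
            for i in range(count)]


def split_into_hourly_batches(orders):
    """Group order ids by the hour in which the order was placed."""
    batches = defaultdict(list)
    for order_id, placed_at in orders:
        hour = placed_at.replace(minute=0, second=0, microsecond=0)
        batches[hour].append(order_id)
    return dict(batches)


def usable_batches(batches, failed_hour):
    """Batches loaded before the failed hour are still there to explore."""
    return {hour: ids for hour, ids in batches.items() if hour < failed_hour}


day = datetime(2024, 1, 1)
batches = split_into_hourly_batches(synthetic_orders(day))
search_field = max(len(ids) for ids in batches.values())
still_usable = usable_batches(batches, failed_hour=day.replace(hour=13))
```

With these synthetic numbers, search_field is the most orders you would ever have to look through for one failed batch, instead of all 10,000, and still_usable holds every batch from before the 1 pm failure.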
Finally, contrary to what Hunter Walk explains, I believe there is also a lot of value in digesting real-time data.
3. Real-time data => small batches => easier to digest
It's true that a lot of information is hidden in past data. But that doesn't mean that fresh data isn't important. The real question is how we process the data into information.
Yes, the latest 10 tweets are probably not the most important for me, but I also do not process them the same way I would if I were to search for the "most important tweets". I skim them for "outliers", and that's it. But that kind of skimming is only possible for me if I happen to have a "short batch size".
Likewise, a salesperson would probably benefit from being able to check, on the day of a visit, what a customer recently purchased, skim through those purchases, and possibly be alerted to any outliers.
Now you might not count this as "higher data quality", but it sure sounds to me like data that is easier to digest is of higher quality.
4. Twitter Voices
I discussed the topic on Twitter for a bit and got a few very interesting responses I’d like to share. The first is a great point about how freshness/“real-time” and “event-driven” seem to get conflated.
The thing is, real-time to a developer for some reason always means “event-driven”, which is much more of a technical idea, whereas real-time to an end-user simply means “I want the stuff right away (like within seconds)”. Whether that happens via a “micro-batch” process every second or right when the event happens is, most of the time, of no concern to the end-user.
The problem with this conflation is that developers consider “event-driven” a lot more expensive than batch/micro-batches. But the end-user value will most of the time be there with batch/micro-batches too. So in reality, real-time/fresh data isn’t more expensive to do; it’s easier to do than a “daily batch”.
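For the developers reading this, a micro-batch can be as plain as a polling loop with a watermark. The sketch below is only an illustration under my own assumptions: fetch_since and load are hypothetical callables you would supply yourself, and positions are assumed to be increasing integers such as an auto-increment id.

```python
import time


def run_micro_batches(fetch_since, load, interval_seconds=1.0,
                      max_runs=None, sleep=time.sleep):
    """Poll for new records on a fixed interval and load whatever is new.

    fetch_since(watermark) is assumed to return (position, record) pairs
    with position > watermark; load(records) writes them to the destination.
    """
    watermark = 0
    runs = 0
    while max_runs is None or runs < max_runs:
        rows = fetch_since(watermark)
        if rows:
            load([record for _, record in rows])
            # Remember how far we got so the next run only sees newer rows.
            watermark = max(position for position, _ in rows)
        runs += 1
        sleep(interval_seconds)
    return watermark
```

No event bus and no streaming framework are involved here, yet with a short interval the end-user still gets data that feels real-time.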
Finally, Gavin Johnson also echoes what I described as a “self-fulfilling prophecy” above.
So, watch out for that! Stop underrating real-time/fresh data and take it seriously.