Unstructured Data Unravelled
This is the Three Data Point Thursday, making your business smarter with data & AI.
Want to share anything with me? Hit me up on Twitter @sbalnojan or Linkedin.
Let’s dive in!
"What one man can invent, another can discover." (Sherlock Holmes, in The Adventure of the Blue Carbuncle)
“Unstructured data” is a misleading term for describing the video, still images, audio, [...]. While people easily comprehend these data types, computers struggle with them. [...] It makes much more sense to call these data types “complex” (Bob Muglia in The Datapreneurs)
A lot of data taxonomies and terms are stuck in the past. The term “unstructured data” is one of them, at least in most people’s heads. Let’s put a new perspective on it.
Unstructured rooms
One of the most significant pain points of robot vacuums is that you have to clean up before they can do their thing.
You remove obstacles and organize your rooms first.
Unstructured data is a lot like robot vacuums. It’s not about the data; it’s about what you want to do with it. The robot shouldn’t just stand around, but to make it work, you need to put some structure on the room you want cleaned.
No data is truly unstructured; “unstructured” is just the term we use when we would like the data in a different structure than the one it is in, so that we can do something valuable with it. The vacuum robot is terrific in an empty room, but an empty room is not what you want cleaned.
The one thing we want to do with data is to throw it into a tool with a standardized input format: a notebook, the pandas library, a database, an ETL tool, whatever you have going on. And that requires order: structure.
A good definition of unstructured/structured data
Ok, here’s a helpful definition of vacuum robots... eh, of structured and unstructured data:
Structured data: For the sake of business, structured data is already in the format you want it to be in because the tool you have can process it.
Unstructured data: Data in a format that differs from the one you want. That’s it.
Example: If you have access log files from your servers, you usually consider them unstructured because you don’t have a tool that can process them as they are.
If you plan on doing access log analysis to fight off DDoS attacks, you’ll want to put a structure on them, a tabular structure.
But, from the perspective of the tool writing these logs, they have a clear, well-defined structure (how could they not? All data is generated into a structure because of its digital nature).
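To make the log example concrete, here is a minimal sketch of putting a tabular structure on access log lines. It assumes the logs follow the NCSA Common Log Format, which is only an assumption, since no specific server or format is named here. The sample lines, the name LOG_PATTERN, and the helper parse_line are made up for illustration.

```python
import re

# Assumption: lines follow the NCSA Common Log Format.
# Your server may write a different layout; adjust the pattern accordingly.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


# Made-up sample lines, for illustration only.
SAMPLE_LINES = [
    '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '198.51.100.2 - - [10/Oct/2023:13:55:37 +0000] "POST /login HTTP/1.1" 401 -',
    'this line is not an access log entry',
]

# Keep only the lines that fit the expected structure.
rows = [row for row in map(parse_line, SAMPLE_LINES) if row is not None]

for row in rows:
    print(row["ip"], row["status"], row["request"])
```

Each row is now a plain dict with the same keys, which is the kind of input a DataFrame, a database table, or most other tabular tools can take. Nothing was added to the data; the structure the log writer already used was just translated into the one the next tool wants.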
What should you care about?
The key question to ask yourself is never, “Is this data unstructured?” or “Do I need to make more use of unstructured data?” or “Do we need to create more structured data?”
… but rather, “What do I want to use this data for, and what is the cost of transforming it into a suitable format?”
As with the rooms in my house, there is far more unstructured data out there than structured data. That is natural: use cases differ, and most tools are special-purpose tools.
This is my process in a nutshell
For every piece of unstructured data, I ask myself:
What would I wish to do with this data? (Example: fight off DDoS attacks)
What tool do I need to make this work? (Example: Feed it into TensorFlow)
What format do I need for that? (Example: A tabular format with columns for IP, request, header, etc.)
What’s the benefit + cost of pulling this off?
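As a rough illustration of where these questions lead, here is a small sketch that takes rows already in the tabular shape from the third question and flags IPs with unusually many requests. Note the swap: instead of a TensorFlow model, it uses a plain count against a threshold, which is a deliberately simpler stand-in. The rows, the function name flag_heavy_ips, and the threshold value are all hypothetical.

```python
from collections import Counter

# Hypothetical rows, already in a tabular shape (columns: ip, request).
ROWS = [
    {"ip": "203.0.113.7", "request": "GET / HTTP/1.1"},
    {"ip": "203.0.113.7", "request": "GET / HTTP/1.1"},
    {"ip": "203.0.113.7", "request": "GET /login HTTP/1.1"},
    {"ip": "198.51.100.2", "request": "GET /about HTTP/1.1"},
]


def flag_heavy_ips(rows, threshold):
    """Return IPs with more than `threshold` requests, sorted for stable output."""
    counts = Counter(row["ip"] for row in rows)
    return sorted(ip for ip, n in counts.items() if n > threshold)


# The threshold is an arbitrary illustrative value, not a recommendation.
suspects = flag_heavy_ips(ROWS, threshold=2)
print(suspects)  # ['203.0.113.7']
```

If a check this simple already covers your use case, the cost side of the last question is small. If it does not, that tells you what the heavier tool would have to deliver to be worth the transformation effort.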
Key lesson: Never discard data because you think it’s “unstructured.” All data is, and no data is.
Note: Don’t get confused by statements like “modelled vs. unmodelled data is a better taxonomy” - I don’t think so.
A good taxonomy is actionable and useful. Structured vs. unstructured is, when combined with the process above, and it has proven very valuable for me over the years.
Whether you want to turn something into a “model” or a “materialized view” is an implementation detail, not a significant point of concern.
If you want another perspective on data taxonomies, read All about data provenance by Cassie Kozyrkov.
Here are some special goodies for my readers
👉 The Data-Heavy Product Idea Checklist - Got a new product idea for a data-heavy product? Then, use this 26-point checklist to see whether it’s good!
The Practical Alternative Data Brief - The executive summary of (our perspective on) alternative data - filled with exercises to play around with.
The Alt Data Inspiration List - Dozens of ideas on alternative data sources to get you inspired!