Thoughtful Friday #13: Encrypt Every Analytical Data Store, Completely!
I’m Sven, and this is Thoughtful Friday. The email that helps you understand and shape the one thing that will power the future: data. I’m also writing a book about the data mesh part of that.
Let’s dive in!
Time to Read: 6 min
Encrypting everything means turning your data into gibberish, all of it, at the physical storage level
It should still be as easily accessible as before (almost)
Changing the default here turns implicit decisions into explicit ones, which in turn brings lots of benefits
This is technologically possible, but not yet implemented at scale
🔮🔮🔮🔮🔮🔮🔮🔮🔮🔮🔮🔮🔮🔮🔮🔮🔮
(Werner Vogels with his “Encrypt Everything” shirt. Always makes me think of Obadiah Stane, the villain from Iron Man 1.)
What if… every single piece of data in your data stores, your analytical databases, your data lakes, everything, were encrypted - not readable?
Please pause a minute and try not to say “then everything will be slow and unusable!”. Let’s save that discussion for further down. Just think about the impact of having everything encrypted, yet still accessible and usable.
What Does “Accessible Encryption” Look Like?
So what does encryption look like? It simply means my database table no longer contains
“Jim, 5 orders”
but possibly
“JzdF, 455dertf”.
It could also mean that the database columns aren’t telling me anything anymore, or that there is NO database at all! But it also means I have some abstraction on top of it that helps me see what kind of data is “possibly” in there.
It means I can access this data easily and natively using my usual SQL calls. It should mean that, for the developer, almost nothing changes.
Key Point: Accessible encryption means that for the developer, analyst, or data engineer as the end user of data, almost nothing changes.
It just means three things:
As a developer, I somehow need my access rights to be set such that encrypted data is automatically decrypted for me.
The physical storage of data is unreadable to anyone.
Someone or something has control over the access rights to encrypt/decrypt data.
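To make that concrete, here is a minimal, hypothetical sketch of what accessible encryption could feel like. The table and column names are made up, and in reality the key would come from a KMS rather than being generated inline; the point is just that the stored bytes are gibberish while the end user keeps writing ordinary SQL.

```python
# A minimal, hypothetical sketch of "accessible encryption": the physical table
# stores only ciphertext, and a decryption function registered in the SQL engine
# gives entitled users their familiar queries back.
import sqlite3
from cryptography.fernet import Fernet

key = Fernet.generate_key()   # stand-in for a key held by a KMS
f = Fernet(key)

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders_raw (customer TEXT, orders TEXT)")

# What is physically stored is gibberish only.
con.execute(
    "INSERT INTO orders_raw VALUES (?, ?)",
    (f.encrypt(b"Jim").decode(), f.encrypt(b"5 orders").decode()),
)

# Decryption is exposed as a SQL function, only in sessions allowed to hold the key.
con.create_function("decrypt", 1, lambda token: f.decrypt(token.encode()).decode())

# The end user's SQL barely changes.
for row in con.execute("SELECT decrypt(customer), decrypt(orders) FROM orders_raw"):
    print(row)   # ('Jim', '5 orders')
```

The same pattern translates to a warehouse or Spark setup with a decryption UDF and a view on top of the raw table.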
Notice that I am imagining all data, not just personal data, to be encrypted, because simple is good. And simple is almost antifragile: it will protect us from things we don’t know can go wrong, and it will help us with stuff we didn’t know we need.
Now let’s talk about the possible benefits of encrypting all analytical data.
Benefits of Encrypting All Analytical Data
If nothing changes for the developer, then the true benefits of encryption don’t really show up there. Instead, they show up where the data is “owned”.
First, by physically storing gibberish, security increases manifold. You gain protection against any kind of data leak, and against all the things you cannot foresee.
Second, physical encryption makes your security mesh simpler. If you encrypt everything, there is no need to restrict access inside the company to the data warehouse if you’re worried about people “seeing stuff they shouldn’t see”.
But it does kick the can down the road a bit because you will still need to control the encryption process.
Key Point: The soft benefits of encrypting everything are, IMHO, all driven by changing the default option, going from implicit decisions to necessarily explicit decisions. The hard benefits are one or two additional security layers, period.
Third, this practice would make a concept called “crypto shredding” possible as a default practice. If the “e-mail” attribute of customer 123 is encrypted with a specific key and you no longer want to store that data, you don’t actually need to “erase” anything; you just need to throw away the key. That’s it.
The beauty of crypto shredding? You can still do a “count over all customers” even if some of them are crypto-shredded. In a sense, you get a certain layer of immutability over your analytical data.
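Here is a small, hypothetical sketch of that idea, assuming one key per customer held outside the table (the names, keys, and structure are all made up): throwing away a key makes that customer’s ciphertext permanently unreadable, while aggregates over the rows keep working.

```python
# A hypothetical sketch of crypto shredding, with one key per customer kept
# outside the table: "deleting" customer 123 just means throwing the key away;
# no row is ever touched.
from cryptography.fernet import Fernet

keys = {123: Fernet.generate_key(), 456: Fernet.generate_key()}  # e.g. held in a KMS

table = [
    {"customer_id": 123, "email": Fernet(keys[123]).encrypt(b"jim@example.com")},
    {"customer_id": 456, "email": Fernet(keys[456]).encrypt(b"ada@example.com")},
]

# Crypto-shred customer 123: only the key disappears.
del keys[123]

def read_email(row):
    # Without the key, the ciphertext is permanently unreadable.
    key = keys.get(row["customer_id"])
    return Fernet(key).decrypt(row["email"]).decode() if key else None

print([read_email(r) for r in table])          # [None, 'ada@example.com']
print("count of all customers:", len(table))   # still 2, shredded or not
```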
Fourth, encrypting everything means you’re basically making the question of data access a permanent part of the conversation. It means turning to explicit decisions on data access instead of the implicit “analysts should have access to everything” kind of decision. This might sound like extra effort, but I believe that data sparsity is actually a valuable thing, especially in analytics.
It means you will always follow up the question “should we use this data?” with “is it really necessary to use this data?”.
Fifth, you’re making things a lot simpler. By making encryption of everything the default choice, you’re eliminating options. Do we need to encrypt this? Secure that? In fact, several European companies have restructured all of their data systems into “two-part” systems, one part with normal data and a second part with encrypted data, just to comply with the GDPR. This adds a huge amount of complexity to every single work step. If instead we simply encrypt everything, there’s no choice left to make.
And even if you still default to opening up everything for everyone, you gain another layer of security.
Making it Reality
Let’s talk about the non-fun parts of this: making it accessible for anyone working with data, and making sure it doesn’t slow anything down.
Key Point: Encrypting everything even in an analytical context is possible without a major loss of speed. But it is also not yet implemented at scale.
There is an interesting company called Evervault that works on providing this kind of experience to software developers. And although they do not yet focus on encrypting all analytical data, they do focus on making encryption as seamless an experience as working without it.
There is also a great talk by Gidon Gershinsky from the Weizmann Institute called “Efficient Spark Analytics on Encrypted Data” that covers the speed side of things. He talks about an open-source implementation of an encryption scheme intertwined with Apache Spark, but the most important insight is that encryption is never the bottleneck in his analyses!
Encryption adds only about 10% to the duration of analytical queries in Spark, and there is still a lot to come on the algorithmic side that could speed that up further. Going from 1 second to 1.1 seconds sounds like an easy trade-off for the benefits described above, if you ask me.
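For a feel of what this already looks like in open source, here is a sketch along the lines of Parquet’s modular (columnar) encryption as surfaced in recent Spark releases. Take it as an assumption-laden example rather than a recipe: I’m assuming Spark 3.2+ with parquet-mr 1.12+, the property names follow the Parquet/Spark documentation, the in-memory KMS is a test-only mock, and the keys and path are demo values.

```python
# A sketch of writing and reading Parquet files with columnar encryption from
# PySpark. The InMemoryKMS is a test-only mock (swap in a real KMS client for
# production); keys and path below are demo values.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("encrypted-analytics-sketch")
    # Factory that reads the encryption properties below.
    .config("spark.hadoop.parquet.crypto.factory.class",
            "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
    # Test-only KMS that keeps master keys in memory.
    .config("spark.hadoop.parquet.encryption.kms.client.class",
            "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")
    # Demo master keys (base64-encoded, 16 bytes each) -- never hardcode these.
    .config("spark.hadoop.parquet.encryption.key.list",
            "key_data:AAECAwQFBgcICQoLDA0ODw==, key_footer:AAECAAECAAECAAECAAECAA==")
    .getOrCreate()
)

df = spark.createDataFrame([("Jim", 5), ("Ada", 3)], ["customer", "orders"])

# Encrypt the "customer" column and the file footer; "orders" stays plaintext here.
(df.write
   .option("parquet.encryption.column.keys", "key_data:customer")
   .option("parquet.encryption.footer.key", "key_footer")
   .mode("overwrite")
   .parquet("/tmp/orders_encrypted"))

# Reading is transparent as long as the KMS hands out the keys.
spark.read.parquet("/tmp/orders_encrypted").show()
```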
Also, encryption might actually happen in a way that is congruent with computation. Imagine metadata tables like the ones Iceberg keeps, which store information about the encrypted data such as the minimum and maximum values per file. For query planning, you would only need to decrypt those few values and would lose almost no time.
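As a toy illustration of that planning idea (the manifest structure, column, and file names are made up), only the tiny min/max statistics ever get decrypted; the data files themselves stay untouched:

```python
# Planning over encrypted data: decrypt only per-file min/max stats to prune
# files before touching any actual data.
from cryptography.fernet import Fernet

stats_key = Fernet(Fernet.generate_key())

# Pretend manifest: per data file, encrypted min/max of an "order_ts" column.
manifest = [
    {"file": "part-000.parquet",
     "min": stats_key.encrypt(b"2024-01-01"), "max": stats_key.encrypt(b"2024-03-31")},
    {"file": "part-001.parquet",
     "min": stats_key.encrypt(b"2024-04-01"), "max": stats_key.encrypt(b"2024-06-30")},
]

def decrypt_stat(token: bytes) -> str:
    return stats_key.decrypt(token).decode()

# Planning a "WHERE order_ts >= '2024-05-01'" query needs only the stats.
candidates = [m["file"] for m in manifest if decrypt_stat(m["max"]) >= "2024-05-01"]
print(candidates)   # ['part-001.parquet']
```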
Another idea would be to use encryption algorithms that respect the orderings used for computation. For instance, if I always sort data by first name, then one could use an algorithm that turns “John” into something like “J2034drfr” while preserving the sort order, making the computation way faster.
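And here is a deliberately insecure toy sketch of that order-preserving idea, restricted to a small integer domain for simplicity; real order-preserving or order-revealing encryption schemes are far more involved, but the principle that ciphertext order can mirror plaintext order is the same:

```python
# A deliberately INSECURE toy of order-preserving encoding: each value maps to
# a cumulative sum of keyed random gaps, so ciphertext order equals plaintext
# order and sorts/range filters still work on the encoded values.
import random

def make_order_preserving_encoder(secret_seed: int, domain_max: int):
    rng = random.Random(secret_seed)          # the "key" is just a seed here
    total, table = 0, []
    for _ in range(domain_max + 1):
        total += rng.randint(1, 1_000)        # strictly positive gaps
        table.append(total)                   # hence strictly increasing
    return lambda x: table[x]

encode = make_order_preserving_encoder(secret_seed=42, domain_max=100)

orders = [17, 5, 3]
print(sorted(orders))                                   # [3, 5, 17]
print(sorted(encode(x) for x in orders))                # same relative order, encoded
print([x for x in orders if encode(x) < encode(10)])    # range filter on ciphertext
```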
But the reality is also that I couldn’t find a single implementation of this in action, even though it would already be doable using dbt, Spark, or probably most other computational frameworks.
FWIW, it’s not clear to me where such encryption should happen. The compute side of things is an obvious choice, but you could also implement it in the abstraction layer between compute and storage, like an Apache Iceberg or Hudi table format, or a data catalog like Hive, which basically tells me “where is data X?”.
Summary - now it’s up to you
So, companies are already building things up, and speed is probably not an issue. But as far as I know, no one is talking about it or using it, even though it sounds like a very cool thing.
So now it’s up to you to decide whether this actually makes sense, whether you want to try it out with a simple implementation inside some dbt macros, or whether you want to respond to this post and tell me this is nuts and simply won’t work.
If you want to deep dive into Evervault, I suggest you take a look at their site.
So, how did you like this Thoughtful Friday?
It is terrible | It’s pretty bad | average newsletter... | good content... | I love it, will forward!