Discover more from Three Data Point Thursday
The vector database hype explained - the story of Victor, Hector, and Lecter
Let me tell you a story of three guys. Three very directional guys, Victor, Hector, and Lecter.
Victor loves numerical shirts. FWIW, the villain Vector from the movie “Despicable Me” is born “Victor Perkins.”
Act 1: The shirts
Victor is from Dallas, Texas, and he loves wearing this particular simple one.
Before he got this one, he had another one from a store called the “Document shirt shop,” but it just took up too much space because of all the text. He’s a minimalist; he likes easy-to-store, short numerical shirts.
In our story, there are also Hector and Lecter, both from Texas, and both with their own favorite numerical shirt.
One day, they visit New York together. Two women, Eve and Bianca, walk up to them and say, “Hey, we love your shirts! You’re all from different places in Texas, right? Are they close to each other or not?”
They explain it to Eve, who in turn explains it to Bianca: “So the guy with the  is in the middle of Texas, close to the Gulf, while the  and  guys are close to each other at the top.”
Eve: “Weird, why then do your shirts say 1,2,3? 2 is just as close to 1 as it is to 3, isn’t it?”
True, they realize. So they decide to mix up their shirts. They find this excellent new shop in NY called the “embedding shirt shop.” They come out with these new ones. When they meet Eve and Bianca again, they are at first confused…
Eve: “What’s up with these shirts? Now I have no idea what that’s supposed to mean; the ones before at least were simple....”
The three friends smile and draw a square on the ground, then each of them takes his place inside the square.
Eve and Bianca are impressed: “Wow, that’s amazing! Now we can tell right away how close you live to each other.”
So, if we get you right, your new shirts are from the embedding shirt shop, right? And they are:
Easier to store than the ones from the document shop
Easier to compare than the plain numerical ones.
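To make the comparison concrete, here is a minimal sketch in Python. It assumes the embedding shirts are simply approximate (latitude, longitude) pairs, and the assignment of Hector to Arlington and Lecter to Houston is made up for the example; only Victor's Dallas comes from the story. Treating latitude and longitude as flat coordinates is a rough simplification, but it is enough to show the idea.

```python
import math

# Hypothetical 2-D "embedding shirts": approximate (latitude, longitude) pairs.
# Which friend lives where (apart from Victor) is an assumption for this example.
shirts = {
    "Victor": (32.78, -96.80),  # Dallas
    "Hector": (32.74, -97.11),  # Arlington (assumed)
    "Lecter": (29.76, -95.37),  # Houston (assumed)
}

def distance(a, b):
    """Straight-line (Euclidean) distance between two coordinate pairs."""
    return math.dist(a, b)

# With the plain numerical shirts 1, 2, 3, shirt 2 is equally far from 1 and 3,
# so the numbers say nothing about who lives close to whom.
plain_gap_12 = abs(1 - 2)
plain_gap_23 = abs(2 - 3)

# With the embedding shirts, the distances differ and carry real information.
victor_hector = distance(shirts["Victor"], shirts["Hector"])
victor_lecter = distance(shirts["Victor"], shirts["Lecter"])

print(f"Plain shirts: gap 1-2 = {plain_gap_12}, gap 2-3 = {plain_gap_23}")
print(f"Victor-Hector: {victor_hector:.2f}")
print(f"Victor-Lecter: {victor_lecter:.2f}")
```

The plain numbers give the same gap everywhere, while the two-number shirts show that Victor and Hector are much closer to each other than Victor and Lecter.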
Act 2: Bianca is dating
Bianca goes on a date with Victor. She likes him, but he has to go back to Dallas. So Bianca decides to search for more dudes like him.
How? It turns out New York visitors from Texas all love those new embedding shirts. Bianca gets everyone into the square, sorts them by shirts, and then picks someone from the red circle in the middle.
Eve: “Sounds like these embedding shirts are really useful for finding similar items, recommending things, and ranking things. That’s basically everything this fancy ‘machine learning’ stuff is all about.”
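Bianca's search can be sketched as a small nearest-neighbor lookup: compute the distance from Victor's shirt to everyone else's, sort, and keep the closest few. The visitors "Ann" and "Bob" and all coordinates below are invented for illustration.

```python
import math

# Hypothetical visitors and their 2-D embedding shirts (made-up coordinates).
visitors = {
    "Victor": (32.78, -96.80),
    "Hector": (32.74, -97.11),
    "Lecter": (29.76, -95.37),
    "Ann": (32.90, -96.70),
    "Bob": (30.27, -97.74),
}

def most_similar(query_name, people, k=2):
    """Return the k people whose shirts are closest to query_name's shirt."""
    query = people[query_name]
    others = [
        (math.dist(query, vec), name)
        for name, vec in people.items()
        if name != query_name
    ]
    others.sort()  # smallest distance first
    return [name for _, name in others[:k]]

matches = most_similar("Victor", visitors, k=2)
print(matches)
```

Sorting everyone by distance is the "get everyone into the square" step; picking the first few entries is the red circle around Victor.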
Act 3: The closet, storing shirts.
Our unhappy Victor is back home in Dallas, sad that he couldn’t stay with Bianca. But at least he and his two friends got a bunch of different embedding shirts to take home. He now has a ton of them!
Previously, he used to store his shirts like this:
That was great for his old shirts: it was easy for him to find his favorite number, and if he wanted a different one, he knew exactly where to look.
But with his embedding shirts, this somehow doesn’t work. He has no idea how to put them into a linear order: does the shirt representing Arlington go before the Dallas one or after it?
Victor visits Lecter one day and promptly asks about this problem: “Hey Lecter, how do you organize your embedding shirts in a way you can quickly pick out the ones that belong together?”
Lecter: “Oh, I got this fantastic new organizing system. It’s called a vector database mat; let me show you. But don’t tell my girlfriend; she will think this is a mess, when it really is genius!”
Victor: “Wait, so you just have a bunch of big piles on this mat - and you put those shirts representing close places close to each other? That’s awesome! I have to get one of those. So these vector database mats:
Allow you to store these vectors
Allow you to store them in a way so you can quickly find the “close ones.”
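Lecter's mat can be sketched as a toy grid index: each shirt is dropped onto the pile for its grid cell, and a search only looks at the query's pile and the piles right around it. This is a deliberately simple stand-in, not how any particular vector database works; real products typically use more sophisticated approximate nearest-neighbor indexes. The class name `VectorMat` and its methods are hypothetical.

```python
import math
from collections import defaultdict

class VectorMat:
    """Toy 'vector database mat': shirts are dropped into piles by grid cell."""

    def __init__(self, cell_size=1.0):
        self.cell_size = cell_size
        self.piles = defaultdict(list)  # grid cell -> list of (label, vector)

    def _cell(self, vec):
        # Map a 2-D vector to the grid cell (pile) it falls into.
        return (
            math.floor(vec[0] / self.cell_size),
            math.floor(vec[1] / self.cell_size),
        )

    def add(self, label, vec):
        """Store a shirt on the pile that matches its position."""
        self.piles[self._cell(vec)].append((label, vec))

    def nearby(self, query, k=3):
        """Search only the query's pile and the eight piles around it."""
        cx, cy = self._cell(query)
        candidates = []
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                candidates.extend(self.piles.get((cx + dx, cy + dy), []))
        candidates.sort(key=lambda item: math.dist(query, item[1]))
        return [label for label, _ in candidates[:k]]

mat = VectorMat(cell_size=1.0)
mat.add("Dallas", (32.78, -96.80))
mat.add("Arlington", (32.74, -97.11))
mat.add("Houston", (29.76, -95.37))

close_ones = mat.nearby((32.80, -96.90), k=2)
print(close_ones)
```

The trade-off is the usual one for this kind of index: the search skips far-away piles entirely, which is fast, but a shirt sitting just outside the searched piles would be missed.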
Act 4: The hype
Victor: “I love your shirt and your mat. This is a totally unrelated question: Do you know why these vector databases have been so hyped lately?”
Lecter: “Yes, I do! Let’s walk over to my computer, and I’ll show you. You know about ChatGPT and similar things, right? These tools are great at general tasks like answering questions in text form, based on lots of text.”
“So let’s ask a question about a recent Marvel movie, Wakanda Forever.”
OK, so you might know that the knowledge of ChatGPT ends in 2021. No big deal, right? Well, it points to something more general. ChatGPT and all of these recent breakthroughs, and a lot that will come, catch on so well because they are general. Good at general things. Not specific ones.
So if you wanted ChatGPT to write something about the manual for your kitchen appliance, it probably wouldn’t be able to do so. But since it is excellent at answering questions in general, you can do the following:
Lecter: “So I just pasted in a snippet from Wikipedia to provide context. And then ChatGPT picks up from there. But what snippet do you use? That’s where you can leverage the vector database. Create lots of vectors from Wikipedia movie snippets, use your vector database to find the ones closest to ‘Wakanda Forever’, then feed them into ChatGPT as context, and you’re done!”
“And that’s it; since you already know about documents, vectors, embeddings, and vector databases, you just learned why these are so hot and will stay hot!”
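Lecter's recipe (embed the snippets, find the ones closest to the question, paste them in as context) can be sketched end to end in a few lines. Everything here is a stand-in: `embed` is a toy word-count embedding rather than a trained model, the list `index` plays the role of the vector database, the snippets are placeholders rather than real Wikipedia text, and the final prompt is only printed instead of being sent to any chat model.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: lowercase word counts. Real systems use a trained model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Placeholder snippets standing in for stored Wikipedia movie text.
snippets = [
    "placeholder snippet about the movie wakanda forever",
    "placeholder snippet about a kitchen appliance manual",
    "placeholder snippet about shirts from texas",
]
index = [(embed(s), s) for s in snippets]  # stand-in for the vector database

def build_prompt(question, k=1):
    """Find the k closest snippets and paste them in front of the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[0]), reverse=True)
    context = "\n".join(text for _, text in ranked[:k])
    return f"Context:\n{context}\n\nQuestion: {question}"

prompt = build_prompt("who stars in wakanda forever")
print(prompt)
```

Swapping the toy pieces for a real embedding model and a real vector database changes the quality and the scale, but not the shape of the pipeline.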