Skip to content

The Magic of Maybe - Understanding Bloom Filters

Hey there! Welcome back to “Tech Bytes with Pratyay”—your weekly shortcut to computer science on the go.

I want to take you back to my college data science class .In the class, the professor spends 15 minutes, on a topic called "Bloom Filters." treated like a minor footnote, not important enough for a full lecture. But it turns out, this "footnote" is one of the most clever and widely used data structures in modern computing, and I wish someone had explained why back then.

So today, we're giving the Bloom Filter the attention it deserves. We're going to uncover how it lets services like Google and Instagram check if a username is taken almost instantly.

Let's set the stage. Imagine you're signing up for a new Instagram account, with billions of previous existing users. You type in your desired username. The system has to check if that name is already in its massive database. How does it do that? - Maybe it searches a list. But searching an unsorted list of billions of users is erratically slow. - what if the list is sorted alphabetically? A binary search would be fast for reads, but think about writes. - Every time a new user signs up, you might have to shift billions of records just to maintain the sorted order. That makes your application incredibly write-heavy - we turn to our trusty computer science tool: trees. A balanced binary search tree, like an AVL or Red-Black tree, would be great for searching - For a perfectly balanced tree with 7 billion users, a lookup would take about log2​(7,000,000,000), which is roughly 33 comparisons - Thirty-three steps is blazing fast! But for a system that gets thousands of sign-ups a minute, the cost of constantly rebalancing that massive tree is still significant - Every write operation involves complex rotations and pointer changes to keep the tree perfectly balanced.

  • the question is, can we do even better? What if we could check for a username not in 33 steps, but in 1? What if we could get an answer in constant time? Seems too good to be true right?

This is where the Bloom Filter comes in. It's a probabilistic data structure, and that word "probabilistic" is the key. It doesn't give you a perfect "yes" or "no" answer. Instead, it gives you one of two answers: - "Definitely not here." - "Possibly here

Let's imagine this. You have a 10x10 row of lightbulbs, all turned off initially.

Now, every time a new username is created, like "Pratyay," we use a digital blender (Hash function) to pick three random lightbulbs and turn them ON. So for "Pratyay," maybe bulbs #15, #82, and #51 are now lit up.

Okay, so what happens when you try to sign up with a new name?

Let's say you try "Pratyay." We use the same secret recipe, and it tells us to check bulbs #15, #82, and #51. We look, and all three are glowing. So we say, "Sorry, that name might be taken."

But now you try the name "Priyanshu." The recipe tells you to check a different set of bulbs, maybe #22, #90, and #15. You go and look. Bulb #15 is on, but #22 and #90 are dark.

And here's the trick: If even one of the bulbs it points to is OFF, you know for sure that name has never been seen before. You can say with total certainty: "It's all yours!"

That’s the core idea. If all the lights are on, it’s a "maybe." If even one light is off, it’s a "definite no."

Notice the trick? A "possibly here" answer could be a false positive—maybe another username just happened to check the same boxes. But a "definitely not here" answer is always correct. There are no false negatives.

So, back to our username checker.

  1. You enter a username.
  2. The system first checks a Bloom filter.
  3. If the filter says "definitely not here," you get the green light instantly. No database lookup needed.
  4. If the filter says "possibly here," then and only then does the system perform the expensive database search to be 100% sure.
    • Mostly they don't even do that, they just suggest you an alternative
  5. This simple trick filters out the vast majority of requests, saving immense computational resources.

You see this everywhere:

  • Instagram uses it for checking for username, Google for emails.
  • Google Chrome uses it to check for malicious URLs.
  • Medium uses it to check which articles you've already read, so it doesn't recommend them to you again.
  • Even Bitcoin uses it to help lightweight clients check for transactions related to their wallet without downloading the entire blockchain.

So, what are the key takeaways? - Speed: Bloom filters are incredibly fast, offering constant time complexity, written as O(1) or more accurately O(k) for k hash functions, regardless of how many items are stored. - Space: They are extremely space-efficient compared to storing the actual data. - The Trade-off: They are probabilistic. They can have false positives but have zero false negatives. This makes them perfect as a first line of defense to avoid slow, expensive operations.

wrapping this up: the Bloom filter is a brilliant example of how trading absolute certainty for incredible speed can solve massive-scale problems. It’s a gatekeeper that says "no" with confidence but says "maybe" with caution.

That’s your byte-sized note from “Tech Bytes with Pratyay.” Today we went over a data structure that was likely skipped in your college class but is secretly powering the web you use every day.

Next week, we'll dive into the world of Git. We all use git commit and git push, but what is actually happening under the hood? How does it store your history so efficiently?

If something clicked for you, don’t forget to follow, like, and share! What's a tech concept you wish was explained better? Tell me your story, and let’s bust more tech myths together.