You have two database replicas, each with millions of records. How do you check if they are perfectly in sync?
"I'll stream all the data from one replica to the other and compare them, record by record."
You've just chosen the slowest, most expensive, and most network-intensive way possible.
Systems like Git and Bitcoin rely on Merkle trees instead.
A Merkle Tree (or hash tree) is a way to verify the integrity of a large set of data without having to compare the data itself. You do it by comparing a small tree of hashes.
Your job isn't to compare the data. Your job is to compare a recursively-generated "fingerprint" of the data.
Imagine you have 8 records:
◦ Level 0: Calculate the hash of each individual record. (H1, H2, ... H8)
◦ Level 1: Hash pairs of those hashes, where + means concatenation. (H12 = hash(H1 + H2), H34, H56, H78)
◦ Level 2: Hash pairs of the next level. (H1234 = hash(H12 + H34), H5678)
◦ Root: Hash the final two to get a single "Root Hash".
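The four levels above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular database does it: SHA-256, the names `h` and `build_levels`, the sample records, and the choice to duplicate the last hash when a level has an odd number of entries are all assumptions made for the example.

```python
import hashlib

def h(data: bytes) -> bytes:
    """Hash helper (SHA-256 is an assumption; any cryptographic hash works)."""
    return hashlib.sha256(data).digest()

def build_levels(records: list[bytes]) -> list[list[bytes]]:
    """Return every level of the tree: leaf hashes first, root level last."""
    if not records:
        raise ValueError("need at least one record")
    level = [h(r) for r in records]           # Level 0: hash each record
    levels = [level]
    while len(level) > 1:
        if len(level) % 2 == 1:
            # Assumption: pad an odd level by repeating its last hash.
            level = level + [level[-1]]
        # Hash each adjacent pair: parent = hash(left + right)
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

# Hypothetical sample data: 8 records, as in the walkthrough.
records = [f"record-{i}".encode() for i in range(1, 9)]
levels = build_levels(records)
root = levels[-1][0]

print([len(level) for level in levels])  # [8, 4, 2, 1]
print(root.hex())
```

Each replica would build this tree locally, so only the 32-byte root has to cross the network for the first check.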
Now, to check if the two replicas are in sync, you just compare their Root Hashes.
If the roots match: The data is identical (short of a hash collision, which is astronomically unlikely with a cryptographic hash). Done.
If the roots differ: You compare their children (H1234 and H5678). Say only H5678 differs, so you can ignore the entire H1234 half.
You then "walk the tree" downwards, comparing hashes at each level until you find the exact record (e.g., Record 7) that is out of sync.
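Here is a rough sketch of that walk, assuming both replicas hold the same number of records in the same order and that the count is a power of two. The names `build_levels` and `find_diffs` are made up for this example, and both trees sit in memory here; in a real system each replica would only send the hashes the other side asks for.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_levels(records):
    """Leaf hashes first, root last (assumes a power-of-two record count)."""
    level = [h(r) for r in records]
    levels = [level]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def find_diffs(a, b):
    """Walk two trees top-down; return (differing leaf indices, comparisons)."""
    comparisons = 0
    diffs = []
    stack = [(len(a) - 1, 0)]                 # start at the root
    while stack:
        depth, idx = stack.pop()
        comparisons += 1
        if a[depth][idx] == b[depth][idx]:
            continue                          # identical subtree: skip it all
        if depth == 0:
            diffs.append(idx)                 # reached a differing record
        else:
            # Descend into both children; push right first so left pops first.
            stack.append((depth - 1, 2 * idx + 1))
            stack.append((depth - 1, 2 * idx))
    return diffs, comparisons

# Hypothetical replicas: identical except for Record 7 (index 6).
replica_a = [f"record-{i}".encode() for i in range(1, 9)]
replica_b = list(replica_a)
replica_b[6] = b"record-7-stale"

diffs, comparisons = find_diffs(build_levels(replica_a), build_levels(replica_b))
print(diffs, comparisons)  # [6] 7
```

The walk touches the root, then two hashes per level on the way down, which is why one bad record out of 8 costs 7 comparisons instead of 8 full record transfers.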
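You found the needle in the haystack with just a handful of hash comparisons: 1 + 2 × log2(n) when a single record differs, so 7 for our 8 records and only about 41 for a million. Variations of this idea show up in Git (content-addressed objects), Cassandra (anti-entropy repair), and Bitcoin (the Merkle root in every block header).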
So, the real question isn't: "How do I compare two massive datasets?"
It's: "How can I create a hierarchical fingerprint of my data that allows me to verify its integrity and pinpoint differences with logarithmic efficiency?"
___________________________________________________________________




I was just reading about and implementing a Merkle tree after seeing it on Cursor's blog and learning that it's used in both Git and Bitcoin. I was looking for other sources and papers, and now you have an article. I love it.
My implementation: https://github.com/sidkhuntia/dsalgo/tree/main/merkle_tree
It is mostly to check for file changes.