Scaling Vector Search: Embedding-Based Retrieval
Your new RAG-powered app is amazing. But it's slow. Finding the right context from millions of documents is crushing your database. "Just do a vector search" is the new "add an index". This simple advice hides a terrifying scalability issue.
The magic keyword is Approximate Nearest Neighbor (ANN). The entire field of vector search at scale is built on a brutal trade-off: sacrifice perfection for speed. Your job isn't to find the *single best* match. Your job is to design a system that finds a *99% good enough* match in milliseconds, not minutes.
You're building an e-commerce site with 100M products. A junior engineer implements an exact, brute-force search. For every user query, the system compares the query vector to all 100M product vectors. The system collapses.
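To make the failure concrete, here's a minimal sketch of that exact scan, assuming normalized float32 embeddings in NumPy (names like `product_vecs` and the 768-dim size are illustrative, not from any real system):

```python
import numpy as np

# product_vecs: (100_000_000, 768) matrix of normalized product embeddings
# query_vec:    (768,) normalized query embedding
def exact_search(product_vecs: np.ndarray, query_vec: np.ndarray, k: int = 10):
    # For normalized vectors, cosine similarity is just a dot product,
    # but this still touches every one of the 100M rows on every query.
    scores = product_vecs @ query_vec
    top_k = np.argpartition(scores, -k)[-k:]        # O(N) selection of the k best
    return top_k[np.argsort(scores[top_k])[::-1]]   # sort only the k winners
```

Every query is a full pass over the matrix: O(N) in both compute and memory bandwidth, with no way to cache or shortcut.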
A single search takes 30 seconds. Users see endless spinners and leave. You didn't build a search feature; you built a very efficient Denial-of-Service machine that attacks your own database.
A senior engineer knows exact search is impossible at this scale. They implement an ANN index like HNSW (Hierarchical Navigable Small World) or IVF-PQ (Inverted File with Product Quantization).
Instead of a linear scan, the index pre-organizes the vectors. HNSW builds a layered graph it can greedily navigate toward the query; IVF-PQ clusters the vectors into partitions and checks only the few partitions closest to the query.
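A sketch of the HNSW route, using FAISS as one possible library and reusing `product_vecs` and `query_vec` from above (the parameters `M=32` and `efSearch=64` are illustrative starting points, not tuned values):

```python
import faiss

dim = 768
index = faiss.IndexHNSWFlat(dim, 32)   # M=32: graph edges per node
index.hnsw.efConstruction = 200        # build-time candidate list size
index.hnsw.efSearch = 64               # query-time candidate list size

# Build the navigable graph once, offline. Default metric is L2, which
# ranks identically to cosine for normalized vectors.
index.add(product_vecs)

# Each query now walks the graph instead of scanning 100M rows.
distances, ids = index.search(query_vec.reshape(1, -1), k=10)
```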
The search now takes milliseconds instead of seconds. The trade-off? Recall drops to ~99%: the index might return the 6th most similar item instead of the 5th. But the user gets a full page of relevant results *instantly*. The system is fast, scalable, and the user is happy.
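In the IVF-PQ variant, that speed/recall dial is explicit: `nprobe` controls how many partitions each query scans. Another sketch with illustrative sizes, assuming a `training_sample` of representative vectors alongside the names used above:

```python
import faiss

dim = 768
quantizer = faiss.IndexFlatL2(dim)                     # coarse quantizer over cluster centroids
index = faiss.IndexIVFPQ(quantizer, dim, 4096, 64, 8)  # nlist=4096 partitions, 64 PQ codes x 8 bits
index.train(training_sample)                           # learn centroids from a sample of the data
index.add(product_vecs)                                # vectors stored compressed (64 bytes each)

index.nprobe = 16                                      # partitions scanned per query:
                                                       # raise for recall, lower for speed
distances, ids = index.search(query_vec.reshape(1, -1), k=10)
```

Raising `nprobe` (or `efSearch` in HNSW) buys recall with latency; that one knob is the business trade-off made tunable.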
So, the real design question isn't "How do I find the most similar vectors?" It's this: "For this user's query, what is the business impact of a 99% accurate result delivered in 50ms, versus a 100% perfect result delivered in 30 seconds?"
The answer determines your entire retrieval architecture.

