Vector Databases — Embeddings & ANN Search

Vector databases went from niche to ubiquitous on the back of the LLM boom, because they answer a question keyword search can't: "find me things that mean the same, not just things that share the same words." They store high-dimensional embeddings and find the nearest ones to a query vector fast, which is the retrieval half of RAG (retrieval-augmented generation) — the pattern behind most production LLM apps, including the context assembly we discussed in harness engineering and building with LLM APIs.

⚡ Quick Takeaways

Embeddings turn meaning into geometry — a model maps text/images to vectors so that semantically similar items land close together.
Search = nearest neighbors — embed the query, then find the closest stored vectors by cosine/dot-product similarity.
Exact kNN is too slow at scale (O(n·d)), so vector DBs use approximate nearest neighbor (ANN) — trading a little recall for huge speed.
HNSW and IVF are the dominant ANN indexes; quantization (PQ) compresses vectors to save memory.
A vector DB adds metadata filtering, persistence, CRUD, and scaling on top of a raw ANN library (Pinecone, Weaviate, Milvus, pgvector, Qdrant).
RAG is the killer use case — retrieve relevant chunks by similarity and stuff them into the LLM's context window.
Hybrid search (semantic + lexical) usually beats either alone.

tldr

An embedding model converts text/images into vectors where distance ≈ semantic similarity. A vector database stores these and answers "what's nearest to this query vector?" Because exact nearest-neighbor search costs O(n·d) per query, it uses approximate algorithms (HNSW, IVF) that reach ~95–99% recall at a fraction of the cost. On top of the index it adds filtering, persistence, and scaling. The dominant application is RAG: embed a query, retrieve the most similar chunks, and feed them to an LLM as context.

Embeddings: Meaning as Geometry

An embedding is a fixed-length vector (say 768 or 1536 numbers) produced by a model from a piece of content. The model is trained so that semantically similar inputs map to nearby vectors: "dog" and "puppy" land close together; "dog" and "tax form" land far apart. This works across modalities — text, images, audio — and it's what lets you search by meaning rather than exact words. The vector itself is opaque (no single dimension "means" anything human-readable), but distances between vectors are meaningful, and that's all search needs.

Similarity Metrics

"Nearest" requires a distance measure. The common ones:

Metric	Measures	Notes
Cosine similarity	Angle between vectors	Ignores magnitude; the default for text embeddings
Dot product	Angle & magnitude	Fast; equals cosine if vectors are normalized
Euclidean (L2)	Straight-line distance	Sensitive to magnitude; used in some models

For most text use cases the vectors are normalized and cosine/dot-product are effectively equivalent. The key point for interviews: similarity search ranks by one of these metrics, returning the k closest vectors.
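To make the table concrete, here is a minimal sketch of the three metrics in plain Python. The two example vectors are made up for illustration, and the helper names are not from any particular library. It also checks the claim that dot product equals cosine once vectors are normalized.

```python
import math

def dot(a, b):
    """Dot product: grows with both alignment and magnitude."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Vector length (L2 norm)."""
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; ignores magnitude."""
    return dot(a, b) / (norm(a) * norm(b))

def euclidean_distance(a, b):
    """Straight-line (L2) distance between a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(a):
    """Scale a to unit length."""
    n = norm(a)
    return [x / n for x in a]

# Made-up example vectors (illustrative only).
a = [3.0, 4.0]
b = [4.0, 3.0]
a_unit, b_unit = normalize(a), normalize(b)

print(cosine_similarity(a, b))   # 24 / 25 = 0.96
print(dot(a_unit, b_unit))       # same value, up to float rounding
```

For unit vectors the squared L2 distance is 2 − 2·cosine, so all three metrics produce the same ranking once vectors are normalized.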

The Core Problem: Approximate Nearest Neighbor

The naive way to find the nearest vectors is to compute the distance from the query to every stored vector and keep the top k: O(n·d) work per query, where n is the number of vectors and d the number of dimensions. At millions or billions of vectors, with d in the hundreds, that's far too slow for interactive search. So vector databases use approximate nearest neighbor (ANN) algorithms that find most of the true nearest neighbors in sublinear time. The trade is recall (the fraction of the true top-k you actually return) for speed, and well-tuned ANN reaches ~95–99% recall while being orders of magnitude faster.
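For reference, the exact baseline that ANN replaces can be sketched in a few lines. This is an illustrative brute-force scan over a tiny made-up dataset, assuming dot-product scoring on normalized vectors. The function name exact_top_k is hypothetical.

```python
import heapq

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def exact_top_k(query, vectors, k):
    """Score every stored vector against the query: O(n*d) work per query."""
    scored = ((dot(query, v), i) for i, v in enumerate(vectors))
    # Keep the k highest-scoring (score, index) pairs, best first.
    return heapq.nlargest(k, scored)

# Tiny made-up dataset of unit-length vectors.
vectors = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [0.8, 0.6]]
query = [1.0, 0.0]

top = exact_top_k(query, vectors, k=2)
print([i for _, i in top])  # indices of the two best matches: [0, 3]
```

Every query touches all n vectors, which is the cost ANN indexes are designed to avoid.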

the central trade-off

Vector search is fundamentally recall vs latency vs memory. Exact search is 100% recall but slow; ANN sacrifices a sliver of recall for massive speedups, and tuning the index (and quantization) trades memory against both. "How approximate can you afford to be?" is the defining design question.

ANN Algorithms: HNSW and IVF

Two families dominate:

HNSW (Hierarchical Navigable Small World) — builds a multi-layer graph where each vector links to its near neighbors. Search starts at a sparse top layer and "greedily" hops toward the query, descending layers to refine. It gives excellent recall and speed, and is the default in most vector DBs; the cost is higher memory and build time.
IVF (Inverted File Index) — clusters vectors (e.g. via k-means) into buckets; a query only searches the few clusters nearest its vector instead of all of them. Cheaper memory, slightly lower recall, tunable by how many clusters you probe.
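The IVF idea can be sketched as follows. This is a toy version under stated assumptions: the centroids are hard-coded rather than learned with k-means, the data is made up, and the names build_ivf, ivf_search, and nprobe are illustrative (nprobe mirrors the "how many clusters you probe" knob). The example is chosen so that probing one cluster misses a true neighbor and probing two recovers it.

```python
import math

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_ivf(vectors, centroids):
    """Assign each vector id to the bucket of its nearest centroid."""
    buckets = {c: [] for c in range(len(centroids))}
    for i, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(v, centroids[c]))
        buckets[nearest].append(i)
    return buckets

def ivf_search(query, vectors, centroids, buckets, k, nprobe):
    """Scan only the nprobe buckets whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))
    candidates = [i for c in order[:nprobe] for i in buckets[c]]
    return sorted(candidates, key=lambda i: l2(query, vectors[i]))[:k]

# Assumption: centroids would normally come from k-means; fixed here for clarity.
centroids = [[0.5, 0.5], [10.0, 10.0]]
vectors = [[0.0, 0.0], [1.0, 1.0], [4.9, 4.9],
           [10.0, 10.0], [9.0, 9.0], [5.6, 5.6]]
buckets = build_ivf(vectors, centroids)
query = [5.2, 5.2]

print(ivf_search(query, vectors, centroids, buckets, k=2, nprobe=1))  # [2, 1]
print(ivf_search(query, vectors, centroids, buckets, k=2, nprobe=2))  # [2, 5]
```

With nprobe=1 the search never looks at vector 5, which sits just across the cluster boundary, so it returns a worse second result. Raising nprobe restores the true top-2 at the cost of scanning more vectors.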

HNSW: greedy hops through a layered graph

layer 2 (sparse):   ●───────────●          start here, hop toward query
                    │           │
layer 1:        ●──●──●─────●───●──●        descend, refine
                │  │  │     │   │  │
layer 0 (all):  ●●●●●●●●●●●●●●●●●●●●●        finest neighbors → top-k

  each step: move to the neighbor closest to the query vector
  → finds ~nearest neighbors in O(log n)-ish hops, not O(n)
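The greedy descent in the diagram can be sketched in code. This is a deliberately simplified illustration, not the full HNSW algorithm: the layers are hand-built chains over nine made-up 1-D points, and each layer keeps a single current node, whereas real HNSW builds its graph probabilistically and keeps a candidate list at the bottom layer. All function names are hypothetical.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def chain(nodes):
    """Toy graph: link each node to its immediate neighbors in the list."""
    graph = {n: [] for n in nodes}
    for left, right in zip(nodes, nodes[1:]):
        graph[left].append(right)
        graph[right].append(left)
    return graph

def greedy_layer(query, vectors, graph, entry):
    """Hop to the neighbor closest to the query until no neighbor improves."""
    current, hops = entry, 0
    while True:
        best = min(graph[current], key=lambda n: dist(query, vectors[n]))
        if dist(query, vectors[best]) >= dist(query, vectors[current]):
            return current, hops
        current, hops = best, hops + 1

def layered_search(query, vectors, layers, entry):
    """Run the greedy walk on each layer, top to bottom, reusing the result as the next entry."""
    total_hops = 0
    for graph in layers:
        entry, hops = greedy_layer(query, vectors, graph, entry)
        total_hops += hops
    return entry, total_hops

# Nine made-up points at 0, 2, ..., 16; sparse layers contain fewer nodes.
vectors = [[float(2 * i)] for i in range(9)]
layers = [chain([0, 4, 8]), chain([0, 2, 4, 6, 8]), chain(list(range(9)))]

nearest, hops = layered_search([12.5], vectors, layers, entry=0)
print(nearest, hops)  # node 6 (value 12.0), reached in 3 hops
```

On this toy data the layered walk needs 3 hops, while walking only the bottom chain from node 0 needs 6. The sparse upper layers act as long-range shortcuts.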

To cut memory, vectors are often quantized: product quantization (PQ), for example, compresses each vector into a compact code, accepting a little more recall loss in exchange for a large memory reduction. Real systems combine these (IVF + PQ, or HNSW + PQ).
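A back-of-the-envelope calculation shows the scale of the savings. The numbers here are assumptions chosen for illustration: 768-dimensional float32 vectors, 10 million of them, and a PQ configuration of 96 subvectors with 8-bit codes. Codebook and index overhead are ignored.

```python
def raw_vector_bytes(dim, bytes_per_value=4):
    """Size of one uncompressed vector (float32 by default)."""
    return dim * bytes_per_value

def pq_code_bytes(num_subvectors, bits_per_code=8):
    """Size of one PQ code: one small integer per subvector."""
    return num_subvectors * bits_per_code // 8

# Assumed, illustrative configuration.
dim = 768
num_vectors = 10_000_000
num_subvectors = 96          # each subvector covers 768 / 96 = 8 dimensions

raw = raw_vector_bytes(dim)            # 3072 bytes per vector
code = pq_code_bytes(num_subvectors)   # 96 bytes per vector
compression = raw / code               # 32x smaller

raw_total_gb = raw * num_vectors / 1e9   # about 30.72 GB
pq_total_gb = code * num_vectors / 1e9   # about 0.96 GB

print(raw, code, compression)
print(raw_total_gb, pq_total_gb)
```

Under these assumptions the raw vectors need roughly 30 GB of memory while the PQ codes fit in about 1 GB, which is why quantization is usually the first lever for memory cost.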

What a Vector Database Adds

You can run an ANN library (FAISS, hnswlib) in-process, so why a database? Because production needs more than the index:

Metadata filtering — "nearest vectors where tenant = X and date > Y." Combining filters with ANN correctly is non-trivial and a core vector-DB feature.
Persistence & CRUD — durably store, update, and delete vectors (re-indexing on change), not just a static in-memory index.
Scaling & availability — sharding across nodes, replication, and horizontal scale.
Operations — backups, monitoring, access control.
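One reason filtering is non-trivial can be shown with a small sketch. It compares two naive strategies on made-up data: filtering after the nearest-neighbor search versus filtering before it. An exact scan stands in for the ANN index, and the item layout and function names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def post_filter(query, items, k, predicate):
    """Take the k nearest first, then drop non-matching items (may return fewer than k)."""
    top = sorted(items, key=lambda it: -dot(query, it["vector"]))[:k]
    return [it["id"] for it in top if predicate(it)]

def pre_filter(query, items, k, predicate):
    """Restrict to matching items first, then rank only those."""
    allowed = [it for it in items if predicate(it)]
    ranked = sorted(allowed, key=lambda it: -dot(query, it["vector"]))
    return [it["id"] for it in ranked[:k]]

# Made-up items with a tenant field as metadata.
items = [
    {"id": "a", "tenant": "X", "vector": [1.0, 0.0]},
    {"id": "b", "tenant": "Y", "vector": [0.9, 0.1]},
    {"id": "c", "tenant": "Y", "vector": [0.8, 0.2]},
    {"id": "d", "tenant": "X", "vector": [0.2, 0.8]},
]
query = [1.0, 0.0]

def is_tenant_x(item):
    return item["tenant"] == "X"

print(post_filter(query, items, k=2, predicate=is_tenant_x))  # ['a']
print(pre_filter(query, items, k=2, predicate=is_tenant_x))   # ['a', 'd']
```

Post-filtering silently returns fewer than k results when the nearest vectors belong to other tenants. Pre-filtering fills k but, on a real index, the filtered subset may no longer match how the ANN structure was built. Handling this well is part of what a vector database provides.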

Options span dedicated systems (Pinecone, Weaviate, Milvus, Qdrant) and extensions to existing databases (pgvector for PostgreSQL, vector search in Elasticsearch/Redis) — the latter attractive when you don't want a new datastore.

RAG: the Killer Use Case

Retrieval-augmented generation is why vector DBs exploded. An LLM has a fixed context window and no knowledge of your private data, so RAG retrieves the relevant pieces and puts them in front of the model:

RAG pipeline

INDEX (offline):
   docs → chunk → embed each chunk → store vectors + metadata

QUERY (online):
   user question → embed → ANN search → top-k relevant chunks
                → stuff chunks into the LLM prompt as context
                → LLM answers grounded in the retrieved text

This is exactly the "context assembly" job from harness engineering: the vector DB is how the harness finds the right chunks to put in a finite context window. Chunking strategy and retrieval quality matter as much as the model.
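The pipeline above can be sketched end to end in a few lines. Everything here is a stand-in: toy_embed counts a handful of vocabulary words instead of calling a real embedding model (so it is lexical, not semantic), the search is an exact scan rather than ANN, the documents are made up, and the final LLM call is omitted. Only the shape of the flow is the point.

```python
import math
import re

# Assumption: a tiny fixed vocabulary replaces a learned embedding model.
VOCAB = ["refund", "shipping", "days", "password", "reset", "email"]

def toy_embed(text):
    """Stand-in embedding: normalized counts of a few vocabulary words."""
    words = re.findall(r"[a-z]+", text.lower())
    vec = [float(words.count(w)) for w in VOCAB]
    length = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / length for x in vec]

def chunk(doc, max_words=12):
    """Naive fixed-size chunking by word count."""
    words = doc.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def build_index(docs):
    """INDEX (offline): chunk each doc, embed each chunk, store vector + text."""
    return [(toy_embed(c), c) for d in docs for c in chunk(d)]

def retrieve(index, question, k=2):
    """QUERY (online): embed the question and return the k most similar chunks."""
    q = toy_embed(question)
    ranked = sorted(index, key=lambda e: -sum(a * b for a, b in zip(q, e[0])))
    return [text for _, text in ranked[:k]]

def build_prompt(question, chunks):
    """Place the retrieved chunks into the prompt as context."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

docs = [
    "Refund requests are accepted within 30 days of delivery.",
    "Standard shipping takes 5 business days.",
    "To reset your password, use the reset link sent by email.",
]
index = build_index(docs)
question = "How do I reset my password?"
prompt = build_prompt(question, retrieve(index, question, k=1))
print(prompt)
```

In a real system the prompt string would then be sent to the LLM. Swapping toy_embed for a real model and the sorted scan for an ANN index leaves the structure unchanged.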

Hybrid Search, and Vector DBs vs Elasticsearch

Pure semantic search can miss exact matches (a specific product code, a rare name) that lexical search nails, and vice versa. Hybrid search combines vector similarity with keyword (BM25) scoring and fuses the rankings, usually beating either alone. This is where vector DBs and Elasticsearch converge: classic Elasticsearch is lexical, vector DBs are semantic, and each is adding the other's capability, so hybrid is becoming the norm.

Aspect	Vector DB (semantic)	Elasticsearch (lexical)
Matches on	Meaning (embeddings)	Terms (inverted index)
Great at	Paraphrase, concepts, cross-lingual	Exact terms, codes, names
Weak at	Exact/rare tokens	Synonyms, intent
Best together	Hybrid search fuses semantic + lexical rankings
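One common way to fuse the two rankings is reciprocal rank fusion (RRF), shown here as an example of a fusion method rather than the only option. The document ids and both input rankings are made up, and k=60 is the conventional default constant, used here as an assumption.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Score each id by the sum of 1 / (k + rank) over every ranking it appears in."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda d: scores[d], reverse=True)

# Made-up result lists from the two retrievers, best match first.
semantic = ["doc_a", "doc_b", "doc_c"]   # vector similarity
lexical = ["doc_c", "doc_a", "doc_d"]    # keyword (BM25) scoring

fused = reciprocal_rank_fusion([semantic, lexical])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

RRF uses only ranks, not raw scores, so it sidesteps the problem that cosine similarities and BM25 scores live on different scales. Documents that both retrievers rank highly rise to the top.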

Pitfalls

Recall tuning — ANN parameters trade recall for speed/memory; the defaults aren't always right for your accuracy needs.
Embedding consistency — query and documents must use the same embedding model; changing models means re-embedding everything.
Cost & dimensions — high-dimensional vectors at scale are memory-hungry; quantization and dimension choice matter.
Freshness & deletes — graph indexes like HNSW handle deletes/updates awkwardly; heavy churn may need periodic rebuilds.
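Recall tuning only works if you measure recall. A minimal sketch: compare the ids returned by the ANN index against the ids from an exact scan for the same query. The id lists below are made up, and recall_at_k is an illustrative helper name.

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true top-k that the approximate search returned."""
    exact = set(exact_ids)
    return len(exact & set(approx_ids)) / len(exact)

# Made-up results for one query with k = 4.
exact_ids = [2, 5, 7, 9]    # ground truth from a brute-force scan
approx_ids = [2, 5, 7, 1]   # what the ANN index returned

print(recall_at_k(approx_ids, exact_ids))  # 0.75
```

Averaging this over a sample of representative queries, and repeating it as index parameters change, gives a concrete way to choose a point on the recall/latency curve.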

takeaway

A vector database is "search by meaning": embed content into vectors, then find the nearest ones with an approximate-nearest-neighbor index (HNSW/IVF) that trades a sliver of recall for huge speed. On top of the index it adds filtering, persistence, and scale. Its defining use is RAG — retrieving the right context for an LLM — and in practice hybrid (semantic + lexical) search wins.

🎯 interview hot-takes

What's an embedding? A vector from a model where distance ≈ semantic similarity, so similar meanings are nearby points.
Why approximate nearest neighbor? Exact kNN is O(n·d) per query, too slow at millions of vectors; ANN (HNSW/IVF) reaches ~95–99% recall far faster by giving up a little accuracy.
HNSW vs IVF? HNSW = navigable small-world graph, high recall/speed, more memory; IVF = cluster into buckets, probe a few, cheaper but slightly lower recall.
What does a vector DB add over a library? Metadata filtering, persistence/CRUD, scaling, and ops — not just the raw ANN index.
What's RAG? Embed a query, retrieve the most similar chunks, and feed them into the LLM's context so answers are grounded in your data.