Vector Search Beyond Cosine Similarity: Graph-Based Approaches
Everyone's first vector search implementation looks the same: embed your data, store it in a flat array, compute cosine similarity against every vector, return the top K.
This works beautifully until you have 50,000 vectors. Then it works slowly. At 500,000 vectors, it doesn't work at all.
The next chapter of vector search is graph-based, and it changes everything.
The Brute-Force Wall
Cosine similarity over N vectors of dimension d is O(N·d) per query. There's no shortcut: you must compare the query against every single stored vector. At 100K vectors with 384 dimensions, that's 38.4 million multiply-accumulate operations per query.
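To make the cost concrete, here's a minimal brute-force top-K sketch in Python with NumPy. The array names and shapes are illustrative, not shodh-memory's internals:

```python
import numpy as np

def brute_force_top_k(query: np.ndarray, vectors: np.ndarray, k: int = 10) -> np.ndarray:
    """Exact top-K by cosine similarity: one pass over every stored vector."""
    # Normalize so a plain dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = v @ q                      # O(N * d) multiply-adds -- the entire cost
    return np.argsort(-scores)[:k]      # indices of the K most similar vectors

# 100K x 384 floats: every query touches all 38.4M products.
vectors = np.random.rand(100_000, 384).astype(np.float32)
query = np.random.rand(384).astype(np.float32)
print(brute_force_top_k(query, vectors, k=10))
```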
| Vectors | Brute-Force Latency | Vamana Latency |
|---|---|---|
| 1,000 | 0.5ms | 0.3ms |
| 10,000 | 5ms | 0.8ms |
| 100,000 | 52ms | 3.2ms |
| 1,000,000 | 520ms | 8.1ms |
Brute-force scales linearly. Graph-based search scales logarithmically. At scale, this isn't a performance improvement — it's the difference between feasible and impossible.
How Graph-Based Search Works
The insight behind Vamana (the algorithm behind Microsoft's DiskANN) is brilliant: build a navigable graph where each node connects to its approximate nearest neighbors.
1. Start at a random entry point
2. Look at current node's neighbors
3. Move to the neighbor closest to query
4. Repeat until no neighbor is closer than current node
5. The path you traversed contains your nearest neighbors
This is greedy graph traversal. It works because the graph has a small-world property: any two nodes are connected through a short path. Like six degrees of separation, but for vectors.
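Here's a minimal sketch of that greedy walk, assuming an adjacency-list graph and Euclidean distance. It captures the idea only; a production search like DiskANN's also maintains a bounded candidate list rather than a single current node:

```python
import numpy as np

def greedy_search(graph: dict[int, list[int]], vectors: np.ndarray,
                  query: np.ndarray, entry: int) -> list[int]:
    """Walk the graph, always moving to the neighbor closest to the query."""
    def dist(i: int) -> float:
        return float(np.linalg.norm(vectors[i] - query))

    current = entry
    path = [current]
    while True:
        # Pick the neighbor closest to the query.
        best = min(graph[current], key=dist, default=current)
        if dist(best) >= dist(current):
            return path            # local minimum: no neighbor improves on the current node
        current = best
        path.append(current)
```

The returned path is the trail of progressively closer nodes; its tail holds the approximate nearest neighbors.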
The critical design parameter is R — the maximum number of edges per node. Higher R means more connections, better recall, more memory. Lower R means faster search, less memory, lower recall. shodh-memory uses R=64 for the Vamana index, balancing 98%+ recall with sub-5ms latency.
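To show where R enters, here's a simplified degree-bounding step in the spirit of Vamana's pruning rule. The function name, the alpha value, and the exact covering test are illustrative assumptions, not shodh-memory's implementation:

```python
import numpy as np

def prune_neighbors(node: int, candidates: list[int], vectors: np.ndarray,
                    R: int = 64, alpha: float = 1.2) -> list[int]:
    """Keep at most R diverse edges: take the closest candidate first, then
    drop any remaining candidate that the chosen edge already 'covers'."""
    def d(a: int, b: int) -> float:
        return float(np.linalg.norm(vectors[a] - vectors[b]))

    pool = sorted(set(candidates) - {node}, key=lambda c: d(node, c))
    kept: list[int] = []
    while pool and len(kept) < R:
        best = pool.pop(0)
        kept.append(best)
        # alpha > 1 retains some longer edges, preserving the small-world shortcuts.
        pool = [c for c in pool if alpha * d(best, c) > d(node, c)]
    return kept
```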
Scaling Beyond 100K: SPANN
Vamana keeps the entire index in memory. At 100K vectors with 384 dimensions, the raw float32 vectors alone are about 150MB, plus the edge lists on top. At 1M vectors, that grows to roughly 1.5GB. On a Raspberry Pi with 1GB of RAM, that's a non-starter.
SPANN (Space Partition tree AND Navigable small-world graph) solves this by moving posting lists to disk while keeping centroids in memory. Think of it as IVF-PQ meets graph search:
1. Memory: centroids (small) → route the query to a cluster
2. Disk: posting lists → search within that cluster
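A toy sketch of that memory/disk split is below. The posting lists are stood in for by a loader callback; the function names, nprobe parameter, and storage layout are assumptions for illustration, not shodh-memory's actual on-disk format:

```python
import numpy as np

def spann_style_search(query: np.ndarray, centroids: np.ndarray,
                       load_posting_list, k: int = 10, nprobe: int = 4) -> list[int]:
    """Route via in-memory centroids, then score only the vectors in the
    few closest clusters, loaded from disk on demand."""
    # Step 1 (memory): find the nprobe centroids closest to the query.
    centroid_dists = np.linalg.norm(centroids - query, axis=1)
    cluster_ids = np.argsort(centroid_dists)[:nprobe]

    # Step 2 (disk): load those posting lists and score only their members.
    results = []
    for cid in cluster_ids:
        ids, vecs = load_posting_list(int(cid))   # e.g. read one file or mmap region
        dists = np.linalg.norm(vecs - query, axis=1)
        results.extend(zip(dists.tolist(), ids))
    results.sort()
    return [vec_id for _, vec_id in results[:k]]
```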
shodh-memory auto-switches from Vamana to SPANN when the vector count crosses 100K. The switch is invisible to the caller — same API, same results, different engine.
Product Quantization
SPANN uses Product Quantization (PQ) to compress vectors. A 384-dim float32 vector (1536 bytes) compresses to 48 bytes — a 32x reduction. The quality loss is surprisingly small: typically less than 2% recall degradation.
PQ works by splitting the vector into subspaces and quantizing each subspace independently. With 48 sub-quantizers of 8 dimensions each and 256 centroids per sub-quantizer, each subspace index fits in a single byte (256 = 2^8), giving 48 bytes per vector and fine-grained approximation at minimal storage cost.
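Here's a minimal PQ encoder and decoder, assuming trained codebooks are already available (e.g. from k-means). The 48 × 256 × 8 shape mirrors the numbers above; the training step and variable names are illustrative:

```python
import numpy as np

def pq_encode(vec: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    """Compress one 384-dim float32 vector (1536 bytes) into 48 bytes.

    codebooks: shape (48, 256, 8) -- 48 subspaces, 256 centroids each,
    8 dims per subspace (48 * 8 = 384).
    """
    m, k, d_sub = codebooks.shape
    subvectors = vec.reshape(m, d_sub)          # split into 48 chunks of 8 dims
    codes = np.empty(m, dtype=np.uint8)
    for i in range(m):
        # The index of the nearest centroid in this subspace fits in one byte.
        dists = np.linalg.norm(codebooks[i] - subvectors[i], axis=1)
        codes[i] = np.argmin(dists)
    return codes                                 # 48 bytes vs. 1536 bytes raw

def pq_decode(codes: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    """Approximate reconstruction: concatenate the chosen centroids."""
    return np.concatenate([codebooks[i, c] for i, c in enumerate(codes)])
```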
The Cognitive Connection
Interestingly, graph-based vector search mirrors how human memory works. You don't brute-force search through every memory you've ever had. You follow associations — one concept activates a related concept, which activates another. This spreading activation pattern is exactly what a navigable small-world graph does.
In shodh-memory, the vector search graph and the knowledge graph work in parallel. Vector search finds semantically similar memories. The knowledge graph finds associatively linked memories. The fusion of both produces retrievals that feel surprisingly human.
What Matters in Practice
Forget the benchmarks papers publish on ANN-Benchmarks. What matters for a memory system is:
Graph-based search answers all four better than flat search, IVF, or LSH. That's why the field has converged on it.