Benchmarking AI Memory Systems: Latency, Accuracy, and Scale
How do you measure whether an AI memory system is good? We share our methodology.
Key Metrics
1. Retrieval Latency
How fast can you get relevant context?
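Averages hide tail behavior, so we report percentiles rather than a single mean. A minimal sketch of how recorded latencies can be summarized (the latencies_ms list and millisecond units are assumptions, not part of the benchmark harness):

import statistics

def summarize_latencies(latencies_ms):
    """Summarize recall latencies (milliseconds) as p50/p95/p99."""
    qs = statistics.quantiles(latencies_ms, n=100)  # 99 percentile cut points
    return {
        "p50": statistics.median(latencies_ms),
        "p95": qs[94],  # 95th percentile
        "p99": qs[98],  # 99th percentile
    }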
2. Retrieval Quality
Do you get the RIGHT memories?
We use two metrics: precision@10 (the fraction of retrieved memories that are actually relevant) and mean reciprocal rank (MRR, which rewards ranking a relevant memory near the top).
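As a toy example (the IDs are made up, not benchmark data), suppose a query returns five memories and two of them are relevant:

retrieved = ["m3", "m7", "m1", "m9", "m4"]
relevant = {"m7", "m4"}

# Precision@5: 2 of the 5 retrieved memories are relevant -> 0.4
precision = len([m for m in retrieved if m in relevant]) / len(retrieved)

# MRR: the first relevant memory ("m7") sits at rank 2 -> 1/2 = 0.5
mrr = next((1 / (i + 1) for i, m in enumerate(retrieved) if m in relevant), 0.0)

print(precision, mrr)  # 0.4 0.5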
On our coding-assistant benchmark:
3. Memory Scaling
Does quality degrade with more memories?
In our tests, scaling is sub-linear: as the number of stored memories grows, retrieval quality degrades gracefully rather than falling off sharply. A sketch of how such a scaling run can be structured follows.
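This sketch repeats the test protocol at increasing memory counts; the counts and the seed_memories/run_benchmark helpers are hypothetical, not part of the published harness:

# Measure latency and quality at increasing memory counts.
for n in [1_000, 10_000, 100_000]:
    memory.clear()                            # hypothetical: reset the store
    seed_memories(memory, count=n)            # hypothetical: bulk-insert n memories
    stats = run_benchmark(memory, test_set)   # hypothetical: runs the protocol below
    print(n, stats["p95_latency"], stats["mean_precision"])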
Benchmark Methodology
Dataset
We created a coding-assistant memory dataset: each entry pairs a natural-language query with the IDs of the memories a good system should recall.
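The protocol below iterates over (query, relevant memory IDs) pairs, so each entry looks like this (the query text and IDs here are illustrative, not real dataset entries):

test_set = [
    ("how do we handle database connection pooling?", ["mem_012", "mem_047"]),
    ("preferred error-handling style in the API layer", ["mem_103"]),
]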
Test Protocol
import time

for query, relevant_memories in test_set:
    # Time a single recall call.
    start = time.perf_counter()
    results = memory.recall(query, limit=10)
    latency = time.perf_counter() - start

    # Precision@10 and MRR against the ground-truth relevant set.
    retrieved_ids = [r.id for r in results]
    precision = (len(set(retrieved_ids) & set(relevant_memories))
                 / len(retrieved_ids)) if retrieved_ids else 0.0
    mrr = compute_mrr(retrieved_ids, relevant_memories)

    record(latency, precision, mrr)
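The protocol assumes two helpers not shown above. A minimal sketch of what they might look like (the names match the protocol, but the bodies and storage format are assumptions):

results_log = []

def compute_mrr(retrieved_ids, relevant_ids):
    """Reciprocal rank of the first relevant memory (0.0 if none retrieved)."""
    for rank, mem_id in enumerate(retrieved_ids, start=1):
        if mem_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def record(latency, precision, mrr):
    """Append one measurement; aggregated after the full test set runs."""
    results_log.append({"latency": latency, "precision": precision, "mrr": mrr})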
Hardware
All benchmarks run on:
We also run the suite on a Raspberry Pi 4 (4 GB) as an edge-hardware comparison.
Comparison to Alternatives
We compared against:
The quality difference comes from our hybrid architecture (vectors + graph + temporal).
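To make "hybrid" concrete, here is a sketch of the kind of scoring such an architecture implies. The weights, decay constant, and memory fields (embedding, created_at) are illustrative assumptions, not the production scorer:

import math
import time

def cosine_similarity(a, b):
    # Plain cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_score(query_vec, memory, graph_distance, now=None):
    """Blend vector, graph, and temporal signals into one retrieval score."""
    now = now if now is not None else time.time()
    vector_score = cosine_similarity(query_vec, memory.embedding)
    graph_score = 1.0 / (1.0 + graph_distance)       # fewer hops -> higher score
    age_days = (now - memory.created_at) / 86_400
    temporal_score = math.exp(-age_days / 30.0)      # exponential recency decay
    # Illustrative weights; a real system would tune these.
    return 0.6 * vector_score + 0.25 * graph_score + 0.15 * temporal_score

The intuition: vector similarity finds semantically related memories, graph proximity promotes memories connected to the current working context, and the temporal term keeps recent memories competitive.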
Reproducibility
Our benchmarks are open source:
git clone https://github.com/varun29ankuS/shodh-memory
cd shodh-memory
cargo bench --bench memory_benchmarks