Benchmarking AI Memory Systems: Latency, Accuracy, and Scale
How do you measure whether an AI memory system is good? We share our methodology.
Key Metrics
1. Retrieval Latency
How fast can you get relevant context?
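Averages hide tail behavior, so we report percentiles rather than a single mean. A minimal sketch of how recorded latencies can be summarized (the latencies_ms list and millisecond units are assumptions, not part of the benchmark harness):

import statistics

def summarize_latencies(latencies_ms):
    """Summarize recall latencies (milliseconds) as p50/p95/p99."""
    qs = statistics.quantiles(latencies_ms, n=100)  # 99 percentile cut points
    return {
        "p50": statistics.median(latencies_ms),
        "p95": qs[94],  # 95th percentile
        "p99": qs[98],  # 99th percentile
    }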
2. Retrieval Quality
Do you get the RIGHT memories?
We use two metrics: precision@10 (the fraction of retrieved memories that are actually relevant) and mean reciprocal rank (MRR, which rewards ranking a relevant memory near the top).
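As a toy example (the IDs are made up, not benchmark data), suppose a query returns five memories and two of them are relevant:

retrieved = ["m3", "m7", "m1", "m9", "m4"]
relevant = {"m7", "m4"}

# Precision@5: 2 of the 5 retrieved memories are relevant -> 0.4
precision = len([m for m in retrieved if m in relevant]) / len(retrieved)

# MRR: the first relevant memory ("m7") sits at rank 2 -> 1/2 = 0.5
mrr = next((1 / (i + 1) for i, m in enumerate(retrieved) if m in relevant), 0.0)

print(precision, mrr)  # 0.4 0.5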
On our coding-assistant benchmark:
3. Memory Scaling
Does quality degrade with more memories?
In our tests, scaling is sub-linear: as the number of stored memories grows, retrieval quality degrades gracefully rather than falling off sharply. A sketch of how such a scaling run can be structured follows.
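This sketch repeats the test protocol at increasing memory counts; the counts and the seed_memories/run_benchmark helpers are hypothetical, not part of the published harness:

# Measure latency and quality at increasing memory counts.
for n in [1_000, 10_000, 100_000]:
    memory.clear()                            # hypothetical: reset the store
    seed_memories(memory, count=n)            # hypothetical: bulk-insert n memories
    stats = run_benchmark(memory, test_set)   # hypothetical: runs the protocol below
    print(n, stats["p95_latency"], stats["mean_precision"])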
Benchmark Methodology
Dataset
We created a coding-assistant memory dataset: each entry pairs a natural-language query with the IDs of the memories a good system should recall.
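The protocol below iterates over (query, relevant memory IDs) pairs, so each entry looks like this (the query text and IDs here are illustrative, not real dataset entries):

test_set = [
    ("how do we handle database connection pooling?", ["mem_012", "mem_047"]),
    ("preferred error-handling style in the API layer", ["mem_103"]),
]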
Test Protocol
import time

for query, relevant_memories in test_set:
    # Time a single recall call.
    start = time.perf_counter()
    results = memory.recall(query, limit=10)
    latency = time.perf_counter() - start

    # Precision@10 and MRR against the ground-truth relevant set.
    retrieved_ids = [r.id for r in results]
    precision = (len(set(retrieved_ids) & set(relevant_memories))
                 / len(retrieved_ids)) if retrieved_ids else 0.0
    mrr = compute_mrr(retrieved_ids, relevant_memories)

    record(latency, precision, mrr)
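The protocol assumes two helpers not shown above. A minimal sketch of what they might look like (the names match the protocol, but the bodies and storage format are assumptions):

results_log = []

def compute_mrr(retrieved_ids, relevant_ids):
    """Reciprocal rank of the first relevant memory (0.0 if none retrieved)."""
    for rank, mem_id in enumerate(retrieved_ids, start=1):
        if mem_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def record(latency, precision, mrr):
    """Append one measurement; aggregated after the full test set runs."""
    results_log.append({"latency": latency, "precision": precision, "mrr": mrr})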
Hardware
All benchmarks run on:
We also run the suite on a Raspberry Pi 4 (4 GB) as an edge-hardware comparison.
Comparison to Alternatives
We compared against:
The quality difference comes from our hybrid architecture (vectors + graph + temporal).
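To make "hybrid" concrete, here is a sketch of the kind of scoring such an architecture implies. The weights, decay constant, and memory fields (embedding, created_at) are illustrative assumptions, not the production scorer:

import math
import time

def cosine_similarity(a, b):
    # Plain cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_score(query_vec, memory, graph_distance, now=None):
    """Blend vector, graph, and temporal signals into one retrieval score."""
    now = now if now is not None else time.time()
    vector_score = cosine_similarity(query_vec, memory.embedding)
    graph_score = 1.0 / (1.0 + graph_distance)       # fewer hops -> higher score
    age_days = (now - memory.created_at) / 86_400
    temporal_score = math.exp(-age_days / 30.0)      # exponential recency decay
    # Illustrative weights; a real system would tune these.
    return 0.6 * vector_score + 0.25 * graph_score + 0.15 * temporal_score

The intuition: vector similarity finds semantically related memories, graph proximity promotes memories connected to the current working context, and the temporal term keeps recent memories competitive.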
Reproducibility
Our benchmarks are open source:
git clone https://github.com/varun29ankuS/shodh-memory
cd shodh-memory
cargo bench --bench memory_benchmarks