Running Embedding Models on Edge Devices: ONNX, Quantization, and Reality
Getting a neural network to run on a Raspberry Pi sounds like a stunt. But for shodh-memory, it's a requirement — if the embedding model can't run locally, the entire offline-first promise collapses.
Here's what it actually took to get MiniLM-L6-v2 running at 34ms per embedding on a $35 computer.
Why MiniLM-L6-v2
Model selection for edge deployment is a constrained optimization problem: embedding quality, inference latency, and model size all pull against each other, and the Pi's limited RAM sets a hard ceiling.
MiniLM-L6-v2 hits the sweet spot: 82% STS quality at 34ms latency with only 22MB. The 1.4% quality improvement from the L12 variant (MiniLM-L12-v2) costs 2x the latency. The jump to mpnet costs 10x the latency and 20x the model size. BGE-large doesn't fit in memory at all.
For a memory system where semantic similarity needs to be "good enough" for retrieval (the knowledge graph and temporal index correct for imprecision), 82% STS is more than sufficient.
ONNX Runtime: The Deployment Story
Training happens in PyTorch. Deployment happens in ONNX Runtime. The conversion is straightforward:
Export from PyTorch → ONNX
import torch

# model: the loaded MiniLM module; dummy_input: a sample (ids, mask, type_ids) batch
torch.onnx.export(model, dummy_input, "model.onnx",
                  input_names=["input_ids", "attention_mask", "token_type_ids"],
                  # every input needs dynamic axes, or its shape is frozen at export time
                  dynamic_axes={name: {0: "batch", 1: "seq"}
                                for name in ("input_ids", "attention_mask", "token_type_ids")})
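A quick sanity check, not part of the deployment path: load the exported file with ONNX Runtime's Python API and confirm the output shape. The dummy feed here is illustrative.

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
feed = {name: np.ones((1, 8), dtype=np.int64)
        for name in ("input_ids", "attention_mask", "token_type_ids")}
hidden = session.run(None, feed)[0]
print(hidden.shape)  # (1, 8, 384): per-token embeddings, 384-dim hidden size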
ONNX Runtime gives us three things that PyTorch doesn't: a compact native runtime with no Python dependency on the device, graph optimizations and a CPU execution provider that run well on ARM, and built-in INT8 quantization to shrink the model further.
The Tokenization Pipeline
The embedding pipeline is: Text → Tokenize → Model Inference → Mean Pooling → L2 Normalize.
Tokenization is often the overlooked bottleneck. We use the `tokenizers` crate from Hugging Face, which is native Rust. It runs the WordPiece tokenizer in ~50 microseconds, negligible compared to model inference. The full pipeline is sketched after the breakdown below.
Pipeline Latency Breakdown (Raspberry Pi 4):
Tokenization: 0.05ms (0.1%)
Model Inference: 33.00ms (97.1%)
Mean Pooling: 0.45ms (1.3%)
L2 Normalize: 0.50ms (1.5%)
Total: 34.00ms
Inference dominates. Everything else is noise.
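Here is a minimal sketch of that pipeline, assuming the model.onnx exported above, MiniLM's tokenizer.json on disk, and NumPy; the paths and the embed function are illustrative, not shodh-memory's actual API.

import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")   # WordPiece tokenizer shipped with MiniLM
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

def embed(text):
    enc = tokenizer.encode(text)
    feed = {
        "input_ids": np.array([enc.ids], dtype=np.int64),
        "attention_mask": np.array([enc.attention_mask], dtype=np.int64),
        "token_type_ids": np.array([enc.type_ids], dtype=np.int64),
    }
    hidden = session.run(None, feed)[0]              # (1, seq, 384) per-token embeddings
    mask = feed["attention_mask"][..., None]         # keep padding out of the pooled mean
    pooled = (hidden * mask).sum(axis=1) / mask.sum(axis=1)
    return pooled[0] / np.linalg.norm(pooled[0])     # L2-normalized 384-dim vector

The masked mean only differs from a plain mean once sequences are padded to a common length, which is exactly what batching (next section) requires.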
Batch Processing for Throughput
When storing multiple memories at once (e.g., during initial ingestion), batch processing amortizes the fixed per-inference overhead:
Single embedding: 34ms (29 embeddings/sec)
Batch of 8: 180ms (44 embeddings/sec)
Batch of 32: 650ms (49 embeddings/sec)
Diminishing returns above batch size 8 on a Pi — the model's compute dominates, and larger batches just increase memory pressure. We default to batch size 8 for optimal throughput-per-watt.
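A sketch of the batched path under the same assumptions as the single-embedding sketch above; tokenizer-level padding and truncation keep every row the same length, and the attention mask keeps the padding out of the pooled result. The 256-token cap is an assumed limit, not a documented one.

tokenizer.enable_truncation(max_length=256)            # assumed cap on sequence length
tokenizer.enable_padding(pad_id=0, pad_token="[PAD]")  # pad to the longest sequence in the batch

def embed_batch(texts):
    encs = tokenizer.encode_batch(texts)
    feed = {
        "input_ids": np.array([e.ids for e in encs], dtype=np.int64),
        "attention_mask": np.array([e.attention_mask for e in encs], dtype=np.int64),
        "token_type_ids": np.array([e.type_ids for e in encs], dtype=np.int64),
    }
    hidden = session.run(None, feed)[0]                       # (batch, seq, 384)
    mask = feed["attention_mask"][..., None]
    pooled = (hidden * mask).sum(axis=1) / mask.sum(axis=1)   # masked mean per row
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)

Callers would chunk their input into groups of 8 to match the default above.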
The Circuit Breaker
On edge devices, the embedding model can fail in creative ways: out of memory, thread pool exhaustion, corrupt model file, ONNX Runtime segfault. If the embedding pipeline fails, the entire memory system shouldn't go down.
We use a circuit breaker pattern:
States: Closed (normal) → Open (failing) → Half-Open (testing)
Closed: Forward requests to ONNX Runtime normally
Open: Skip embeddings, return zero vectors (graceful degradation)
Half-Open: Test one request, if OK → Closed, if fail → Open
When the circuit is open, shodh-memory degrades gracefully: vector search returns no results, but keyword search, knowledge graph traversal, and temporal retrieval still work. Memory is impaired but not dead.
This is how the brain handles injury. Damage to one system doesn't bring everything down. The other cognitive systems compensate.
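Here is a minimal sketch of that state machine; the class name, failure threshold, and cooldown are illustrative choices, not shodh-memory's actual configuration.

import time

class EmbeddingCircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown_secs=30.0, dim=384):
        self.failure_threshold = failure_threshold
        self.cooldown_secs = cooldown_secs
        self.zero = [0.0] * dim        # graceful-degradation output
        self.failures = 0
        self.opened_at = None          # None = Closed; a timestamp = Open / Half-Open

    def call(self, embed_fn, text):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_secs:
                return self.zero       # Open: skip the model entirely
            # cooldown elapsed: Half-Open, let exactly this request through as a probe
        try:
            result = embed_fn(text)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()   # trip (or re-trip) the breaker
            return self.zero
        self.failures = 0
        self.opened_at = None          # success: back to Closed
        return result

Returning the zero vector means vector search finds nothing rather than crashing, which is the degradation described above.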
Model Pinning and Integrity
We pin model URLs to specific HuggingFace commits with SHA-256 checksum verification. If the model file doesn't match the expected checksum, it's rejected and re-downloaded.
This prevents supply-chain attacks (a compromised model that exfiltrates data) and ensures reproducibility (same model bytes on every device, every time).
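A sketch of the integrity check using Python's hashlib; the pinned URL and digest are placeholders standing in for the values kept in configuration.

import hashlib
from pathlib import Path

PINNED_URL = "https://huggingface.co/<org>/<repo>/resolve/<commit>/model.onnx"  # placeholder pin
EXPECTED_SHA256 = "<expected hex digest>"                                       # placeholder digest

def model_is_valid(path: Path) -> bool:
    # hash the exact bytes on disk and compare against the pinned digest
    return hashlib.sha256(path.read_bytes()).hexdigest() == EXPECTED_SHA256

# On mismatch: delete the file and re-download from PINNED_URL before loading it.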
The Bottom Line
Running embeddings on edge devices is practical today. The trick is choosing the right model (small, fast, good enough), the right runtime (ONNX Runtime, not PyTorch), and building for failure (circuit breakers, graceful degradation).
A 34ms embedding is fast enough for memory. Not for real-time video processing. But for remembering what happened in a conversation? More than sufficient.