2026-01-15 · 10 min read

Running Embedding Models on Edge Devices: ONNX, Quantization, and Reality

embeddings · edge · performance

Getting a neural network to run on a Raspberry Pi sounds like a stunt. But for shodh-memory, it's a requirement — if the embedding model can't run locally, the entire offline-first promise collapses.

Here's what it actually took to get MiniLM-L6-v2 running at 34ms per embedding on a $35 computer.

Why MiniLM-L6-v2

Model selection for edge deployment is a constrained optimization problem:

| Model | Dimensions | Size | Quality (STS) | Speed (Pi 4) |
|-------|-----------|------|---------------|-------------|
| all-MiniLM-L6-v2 | 384 | 22MB | 82.0 | 34ms |
| all-MiniLM-L12-v2 | 384 | 33MB | 83.4 | 71ms |
| all-mpnet-base-v2 | 768 | 420MB | 86.1 | 340ms |
| BGE-large | 1024 | 1.3GB | 87.2 | OOM |

MiniLM-L6-v2 hits the sweet spot: an STS score of 82 at 34ms latency in only 22MB. The 1.4-point quality gain from L12 costs 2x the latency. The jump to mpnet costs 10x the latency and roughly 20x the model size. BGE-large doesn't fit in memory at all.

For a memory system where semantic similarity only needs to be "good enough" for retrieval (the knowledge graph and temporal index correct for imprecision), an STS score of 82 is more than sufficient.

ONNX Runtime: The Deployment Story

Training happens in PyTorch. Deployment happens in ONNX Runtime. The conversion is straightforward:

```python
import torch

# Export from PyTorch → ONNX with dynamic batch and sequence-length axes
torch.onnx.export(
    model, dummy_input, "model.onnx",
    input_names=["input_ids", "attention_mask", "token_type_ids"],
    dynamic_axes={"input_ids": {0: "batch", 1: "seq"}},
)
```
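
Before shipping the `.onnx` file, it's worth checking that the exported graph agrees with the original model. A minimal sketch using the Python `onnxruntime` API, assuming `model` is the Hugging Face transformers model and `dummy_input` is the `(input_ids, attention_mask, token_type_ids)` tuple passed to the export above:

```python
import numpy as np
import onnxruntime
import torch

session = onnxruntime.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

# Feed the same tensors to both runtimes
feeds = {name: t.numpy() for name, t in
         zip(["input_ids", "attention_mask", "token_type_ids"], dummy_input)}
onnx_hidden = session.run(None, feeds)[0]   # assumes output 0 is last_hidden_state

with torch.no_grad():
    torch_hidden = model(*dummy_input).last_hidden_state.numpy()

# The two runtimes should agree to within float32 noise
print("max abs diff:", np.abs(onnx_hidden - torch_hidden).max())
```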

ONNX Runtime gives us three things that PyTorch doesn't:

- **No Python dependency**: The Rust `ort` crate links against the C++ ONNX Runtime directly. No Python interpreter on the Pi.
- **Hardware optimization**: ONNX Runtime picks from the execution providers you register (CPU, CUDA, CoreML, NNAPI) based on what the hardware actually supports (see the sketch below).
- **Minimal footprint**: The ONNX Runtime shared library is ~15MB. Compare that to PyTorch's 800MB+ installation.
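
Provider selection is explicit rather than magic. A small sketch of what it looks like through the Python `onnxruntime` bindings (the Rust `ort` crate exposes equivalent provider registration; the preference list here is illustrative):

```python
import onnxruntime

# What this build of ONNX Runtime can actually use on this machine
available = onnxruntime.get_available_providers()   # e.g. ['CPUExecutionProvider'] on a Pi

# Express a preference order and fall back to CPU if nothing fancier is present
preferred = ["CUDAExecutionProvider", "CoreMLExecutionProvider", "CPUExecutionProvider"]
providers = [p for p in preferred if p in available] or ["CPUExecutionProvider"]

session = onnxruntime.InferenceSession("model.onnx", providers=providers)
print(session.get_providers())   # the providers the session actually bound to
```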

The Tokenization Pipeline

The embedding pipeline is: Text → Tokenize → Model Inference → Mean Pooling → L2 Normalize.

Tokenization is the part of the pipeline that usually gets overlooked when measuring latency. We use Hugging Face's `tokenizers` crate (native Rust); it runs the WordPiece tokenizer in ~50 microseconds — negligible compared to model inference.
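
Put together, the whole pipeline fits in a page. A hedged Python sketch of the same four steps (the production path is Rust via `ort` and `tokenizers`; the `tokenizer.json` and `model.onnx` file names are assumptions):

```python
import numpy as np
import onnxruntime
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")   # WordPiece tokenizer
session = onnxruntime.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

def embed(text: str) -> np.ndarray:
    # 1. Tokenize
    enc = tokenizer.encode(text)
    ids = np.array([enc.ids], dtype=np.int64)
    mask = np.array([enc.attention_mask], dtype=np.int64)
    types = np.array([enc.type_ids], dtype=np.int64)

    # 2. Model inference -> token-level hidden states, shape [1, seq, 384]
    hidden = session.run(None, {"input_ids": ids,
                                "attention_mask": mask,
                                "token_type_ids": types})[0]

    # 3. Mean pooling over real (non-padding) tokens
    mask_f = mask[..., None].astype(np.float32)
    pooled = (hidden * mask_f).sum(axis=1) / np.clip(mask_f.sum(axis=1), 1e-9, None)

    # 4. L2 normalize so cosine similarity becomes a plain dot product
    return pooled[0] / np.linalg.norm(pooled[0])
```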

```
Pipeline Latency Breakdown (Raspberry Pi 4):

Tokenization:     0.05ms   (0.1%)
Model Inference: 33.00ms  (97.1%)
Mean Pooling:     0.45ms   (1.3%)
L2 Normalize:     0.50ms   (1.5%)
---------------------------------
Total:           34.00ms
```

Inference dominates. Everything else is noise.

Batch Processing for Throughput

When storing multiple memories at once (e.g., during initial ingestion), batch processing amortizes the fixed per-inference overhead:

```
Single embedding:  34ms   (29 embeddings/sec)
Batch of 8:       180ms   (44 embeddings/sec)
Batch of 32:      650ms   (49 embeddings/sec)
```

Diminishing returns above batch size 8 on a Pi — the model's compute dominates, and larger batches just increase memory pressure. We default to batch size 8 for optimal throughput-per-watt.
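
For the batched path, the only new ingredient is padding, so every sequence in a batch has the same length and the attention mask tells the pooling step which tokens are real. A sketch reusing the `tokenizer` and `session` from the pipeline above (batch size 8 mirrors the default):

```python
import numpy as np

tokenizer.enable_padding()   # pad each batch to its longest sequence

def embed_batch(texts, batch_size=8):
    vectors = []
    for start in range(0, len(texts), batch_size):
        encs = tokenizer.encode_batch(texts[start:start + batch_size])
        ids = np.array([e.ids for e in encs], dtype=np.int64)
        mask = np.array([e.attention_mask for e in encs], dtype=np.int64)
        types = np.array([e.type_ids for e in encs], dtype=np.int64)
        hidden = session.run(None, {"input_ids": ids,
                                    "attention_mask": mask,
                                    "token_type_ids": types})[0]
        # Mask-aware mean pooling, then L2 normalize each row
        mask_f = mask[..., None].astype(np.float32)
        pooled = (hidden * mask_f).sum(axis=1) / np.clip(mask_f.sum(axis=1), 1e-9, None)
        vectors.extend(pooled / np.linalg.norm(pooled, axis=1, keepdims=True))
    return vectors
```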

The Circuit Breaker

On edge devices, the embedding model can fail in creative ways: out of memory, thread pool exhaustion, corrupt model file, ONNX Runtime segfault. If the embedding pipeline fails, the entire memory system shouldn't go down.

We use a circuit breaker pattern:

```
States: Closed (normal) → Open (failing) → Half-Open (testing)

Closed:    Forward requests to ONNX Runtime normally
Open:      Skip embeddings, return zero vectors (graceful degradation)
Half-Open: Test one request; if OK → Closed, if fail → Open
```

When the circuit is open, shodh-memory degrades gracefully: vector search returns no results, but keyword search, knowledge graph traversal, and temporal retrieval still work. Memory is impaired but not dead.
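
A minimal sketch of that state machine in Python (the failure threshold, cooldown, and zero-vector fallback are illustrative; the real breaker wraps the Rust embedding service):

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown_secs=30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_secs = cooldown_secs
        self.failures = 0
        self.opened_at = None          # None => Closed; timestamp => Open / Half-Open

    def call(self, embed_fn, text, dim=384):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_secs:
                return [0.0] * dim                  # Open: skip embedding, degrade gracefully
            # Half-Open: cooldown elapsed, let one request through as a probe

        try:
            vector = embed_fn(text)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold or self.opened_at is not None:
                self.opened_at = time.monotonic()   # (re)open the circuit
            return [0.0] * dim                      # degrade on this request either way
        else:
            self.failures = 0
            self.opened_at = None                   # success closes the circuit
            return vector
```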

This is how the brain handles injury. Damage to one system doesn't bring everything down. The other cognitive systems compensate.

Model Pinning and Integrity

We pin model URLs to specific HuggingFace commits with SHA-256 checksum verification. If the model file doesn't match the expected checksum, it's rejected and re-downloaded.

This prevents supply-chain attacks (a compromised model that exfiltrates data) and ensures reproducibility (same model bytes on every device, every time).
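
A hedged sketch of the pin-and-verify step (the URL pattern, commit placeholder, and checksum below are illustrative, not the real pinned values):

```python
import hashlib
import urllib.request
from pathlib import Path

# Placeholder pin: a commit-specific URL plus the SHA-256 we expect for those exact bytes
MODEL_URL = "https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2/resolve/<commit>/onnx/model.onnx"
EXPECTED_SHA256 = "<expected-checksum>"

def sha256_of(path: Path) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def ensure_model(path: Path) -> Path:
    # Re-download if the file is missing or its bytes don't match the pinned checksum
    if not path.exists() or sha256_of(path) != EXPECTED_SHA256:
        urllib.request.urlretrieve(MODEL_URL, path)
        if sha256_of(path) != EXPECTED_SHA256:
            path.unlink(missing_ok=True)
            raise RuntimeError("model checksum mismatch — refusing to load")
    return path
```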

The Bottom Line

Running embeddings on edge devices is practical today. The trick is choosing the right model (small, fast, good enough), the right runtime (ONNX, not PyTorch), and building for failure (circuit breakers, graceful degradation).

A 34ms embedding is fast enough for memory. Not for real-time video processing. But for remembering what happened in a conversation? More than sufficient.