LEANN: The Vector Index That Throws Out the Vectors


Yichuan Wang et al., UC Berkeley (2025)

Like everyone else building RAG systems right now, I’ve been living with the assumption that vector search = store embeddings somewhere and let an ANN index chew on them. That’s the mental model. Every library assumes it. Every vendor reinforces it.

So when I read the LEANN paper this week, four months after the repo dropped, I had a very specific reaction:

Wait… they just don’t store the vectors at all? And it works?

Yes. That’s the idea. And it’s genuinely clever.

The Storage Problem Nobody Really Talks About

If you’ve done anything non-trivial with RAG, you’ve seen the issue: embeddings are huge. The index is bigger. And if you’re trying to run something on-device, forget it.

LEANN includes a table that basically says it out loud:

Method            Index Size   Relative Size
Raw Dataset       76 GB        1.0×
PQ (compressed)   20 GB        0.26×
IVF               172 GB       2.26×
HNSW              188 GB       2.47×
DiskANN           270 GB       3.55×
LEANN             4 GB         0.05×

LEANN basically collapses 173 GB worth of embeddings into a 4 GB index. That alone makes the paper worth reading; Table 1 is where these numbers come from.
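To see why the non-LEANN numbers get so big, here's a quick back-of-envelope in Python. The corpus size and embedding dimension are illustrative assumptions of mine, not figures from the paper:

```python
# Back-of-envelope: why dense embeddings dominate index size.
# All numbers below are illustrative assumptions, not the paper's.
num_chunks = 60_000_000       # a Wikipedia-scale corpus split into passages
dim = 768                     # a typical embedding dimension
bytes_per_float = 4           # fp32

embedding_gb = num_chunks * dim * bytes_per_float / 1e9
print(f"Raw embeddings alone: ~{embedding_gb:.0f} GB")   # ~184 GB

# A graph index (HNSW, DiskANN) stores these vectors *plus* neighbor lists,
# which is how the index ends up several times larger than the raw text.
```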

The Insight: LLM Latency Makes Fast ANN a Waste of Engineering

The thing that really got me is the mindset shift the authors make. Every ANN system tries to shave off milliseconds of retrieval latency. But in RAG, that’s not where we’re spending time. LLM generation dominates. By a lot.

Method             Retrieval   Generation
Traditional ANN    0.03 s      20.9 s
LEANN              2.5 s       20.8 s

You squint and realize: 2.5 seconds vs 0.03 seconds does not matter when the model is going to think for 20+ seconds anyway. This tradeoff, accepting slower retrieval in exchange for massive storage savings, is the core of LEANN's value proposition.
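Plugging the table's numbers into a quick calculation makes the point concrete:

```python
# End-to-end latency using the numbers from the table above (seconds).
traditional = 0.03 + 20.9     # retrieval + generation
leann       = 2.5 + 20.8

print(f"Traditional ANN: {traditional:.2f}s end-to-end")
print(f"LEANN:           {leann:.2f}s end-to-end")
print(f"Slowdown:        {leann / traditional:.2f}x")   # ~1.11x, for a ~47x smaller index than HNSW
```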

So How Does LEANN Actually Work?

The idea is simple:

  1. Throw away dense embeddings entirely.
  2. Keep a tiny pruned proximity graph and PQ codes.
  3. At query time, recompute exact embeddings only for the promising candidates.

It sounds wrong until you remember that a graph ANN search only touches a tiny fraction of the nodes (roughly log N). Recomputing 20–200 embeddings per query? Totally fine, especially with GPU batching.
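Setting PQ aside for a second, here's a minimal sketch of what "recompute instead of load" looks like in a greedy graph search. The graph, text store, and embedding function are stand-ins I made up for illustration, not LEANN's actual API:

```python
import heapq
import numpy as np

def search_without_stored_vectors(query, graph, texts, embed, entry, budget=100, k=10):
    """Greedy best-first graph search that re-embeds nodes on the fly.

    graph: dict mapping node id -> list of neighbor ids (the pruned proximity graph)
    texts: dict mapping node id -> chunk text
    embed: any callable text -> np.ndarray (the same model used at index time)
    """
    q = embed(query)
    cache = {}  # node id -> embedding recomputed during this query

    def dist(node):
        if node not in cache:
            cache[node] = embed(texts[node])       # exact embedding, computed on demand
        return float(np.linalg.norm(q - cache[node]))

    visited = {entry}
    frontier = [(dist(entry), entry)]              # min-heap ordered by distance to query
    results = []

    while frontier and len(cache) < budget:        # rough cap on embeddings recomputed
        d, node = heapq.heappop(frontier)
        results.append((d, node))
        for nb in graph[node]:
            if nb not in visited:
                visited.add(nb)
                heapq.heappush(frontier, (dist(nb), nb))

    return [node for _, node in sorted(results)[:k]]
```

In a real system you'd batch the `embed` calls per hop rather than making them one at a time; that's where the GPU batching win on recomputation comes from.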

The two-level search works like this:

  • Approximate stage: Use PQ vectors (100× smaller) to estimate distances.
  • Exact stage: Recompute embeddings only for the top α% of candidates.
  • Traverse: Use those exact vectors to continue the graph search.

This is the clever part: The graph tells you where to spend compute, and the PQ codes tell you when not to.
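Here's roughly how that two-level step might look for one hop of the search, again with made-up names standing in for the PQ codec and embedding model:

```python
import numpy as np

def two_level_distances(query_vec, candidates, pq_codes, pq_decode, texts, embed, alpha=0.1):
    """Score a hop's candidates cheaply, then spend exact compute on the best few.

    pq_codes / pq_decode: any product-quantization codec (codes are far smaller than vectors)
    texts / embed: the corpus chunks and the embedding model
    alpha: fraction of candidates promoted to exact recomputation
    """
    # Level 1: approximate distances from the tiny PQ codes.
    approx = {c: float(np.linalg.norm(query_vec - pq_decode(pq_codes[c])))
              for c in candidates}

    # Promote only the most promising alpha fraction.
    n_exact = max(1, int(len(candidates) * alpha))
    promising = sorted(candidates, key=approx.get)[:n_exact]

    # Level 2: recompute exact embeddings just for those nodes.
    exact = {c: float(np.linalg.norm(query_vec - embed(texts[c])))
             for c in promising}

    # Exact scores override approximate ones and steer the next hop.
    return {**approx, **exact}
```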

Why This Paper Impressed Me

  1. It flips a core assumption in the vector-search world - Everyone else assumes embeddings must be stored. LEANN asks: what if we didn’t?
  2. It unlocks real on-device RAG - Most local RAG tools die on the hill of embedding size. LEANN sidesteps that entirely.
  3. It exposes how much vector-database infrastructure we don’t actually need - I’ve been saying this for a while: for most teams, a giant distributed vector DB is overkill. LEANN gives a technical reason why — if your index is tiny and compute is cheap(ish), you don’t need a full distributed system.
  4. I’m honestly surprised I didn’t know about it sooner - Four months. A working repo. In this AI moment, that’s like finding an undiscovered species.

The use cases that make sense here are on-device or local-first software experiences, and setups indexing very large datasets. In both cases, the common thread is data that is too large for the storage budget of a constrained environment.

At the end of the day, LEANN feels refreshingly pragmatic.

Links
Paper: https://arxiv.org/abs/2506.08276
Code: https://github.com/yichuan-w/LEANN