Engineering · 14 min · February 2026

Eval-driven retrieval, in practice

A pragmatic framework for shipping retrieval systems that hold up in production long after launch.

Famous Engineering

The demo works. The queries you wrote during development return sensible results. The retrieval latency is acceptable. You ship. Six weeks later, the product team is filing tickets about irrelevant results on queries you never thought to test — and you are debugging a system that was never instrumented to surface the information you need to fix it.

This is the standard failure mode for retrieval-augmented generation systems built without an evaluation framework. The problem is not the retrieval architecture — it is the absence of a systematic way to know whether the system is working before users tell you it is not.

The eval set comes first

Our rule at Famous Labs is that no retrieval system goes to production without a curated eval set of at least two hundred query-and-expected-result pairs, covering the full distribution of query types the system will encounter. This set is built before the retrieval architecture is finalised — which forces the team to think about what the system needs to do before optimising how it does it.

No retrieval system goes to production without a curated eval set covering the full distribution of query types. Build the evals before you build the pipeline.
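Concretely, an eval case can be as simple as a query paired with the IDs of the documents or chunks a correct retrieval should surface, tagged with a query type so coverage of the distribution can be checked. A minimal sketch of one possible record format — the EvalCase name, fields, and example IDs are illustrative, not a fixed schema:

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    """One query-and-expected-result pair in the eval set."""
    query: str                       # the user-facing query text
    expected_ids: list[str]          # IDs a correct retrieval must return
    query_type: str = "unspecified"  # tag used to check query-distribution coverage


# A couple of illustrative cases; a real set has at least two hundred.
EVAL_SET = [
    EvalCase(
        query="How do I rotate an expired API key?",
        expected_ids=["docs/auth/key-rotation"],
        query_type="how-to",
    ),
    EvalCase(
        query="refund policy for annual plans",
        expected_ids=["docs/billing/refunds", "docs/billing/annual-plans"],
        query_type="keyword",
    ),
]
```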

The eval set serves two purposes. It gates the initial deployment — if the system does not clear a defined threshold on the eval set, it does not ship. And it governs every subsequent change: a retrieval architecture change, a chunking strategy change, or a model upgrade is evaluated against the same set before it touches production. This is the only way to have a confident conversation about whether a change made things better or worse.
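As a sketch of what that gate might look like in code — recall@k, the 0.85 threshold, and the retrieve callable (assumed to take a query and return ranked document IDs) are placeholder choices, not the metric or threshold any particular system should use:

```python
def recall_at_k(retrieved_ids: list[str], expected_ids: list[str], k: int = 10) -> float:
    """Fraction of expected IDs found in the top-k retrieved results."""
    top_k = set(retrieved_ids[:k])
    hits = sum(1 for doc_id in expected_ids if doc_id in top_k)
    return hits / len(expected_ids)


def passes_eval_gate(retrieve, eval_set, k: int = 10, threshold: float = 0.85) -> bool:
    """Score a retrieval function against the eval set; gate on mean recall@k."""
    scores = [
        recall_at_k(retrieve(case.query), case.expected_ids, k=k)
        for case in eval_set
    ]
    mean_score = sum(scores) / len(scores)
    print(f"mean recall@{k}: {mean_score:.3f} over {len(scores)} cases")
    return mean_score >= threshold


# Run the same gate before the initial deployment and before any retrieval,
# chunking, or model change reaches production, e.g.:
# if not passes_eval_gate(my_retriever, EVAL_SET):
#     raise SystemExit("eval gate failed; change does not ship")
```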