Eval-driven retrieval, in practice
A pragmatic framework for shipping retrieval systems that hold up in production after launch.
The demo works. The queries you wrote during development return sensible results. The retrieval latency is acceptable. You ship. Six weeks later, the product team is filing tickets about irrelevant results on queries you never thought to test — and you are debugging a system that was never instrumented to surface the information you need to fix it.
This is the standard failure mode for retrieval-augmented generation systems built without an evaluation framework. The problem is not the retrieval architecture — it is the absence of a systematic way to know whether the system is working before users tell you it is not.
The eval set comes first
Our rule at Famous Labs is that no retrieval system goes to production without a curated eval set of at least two hundred query-and-expected-result pairs, covering the full distribution of query types the system will encounter. This set is built before the retrieval architecture is finalised — which forces the team to think about what the system needs to do before optimising how it does it.
No retrieval system goes to production without a curated eval set covering the full distribution of query types. Build the evals before you build the pipeline.
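As a rough illustration of what such a set might look like in code, here is a minimal sketch. The `EvalCase` shape, the `query_type` labels, and the `coverage_problems` helper are all hypothetical names introduced for this example. Only the two-hundred-pair floor comes from the rule above. It treats "covering the full distribution" as "at least one case per query type you have listed", which is a simplifying assumption.

```python
from collections import Counter
from dataclasses import dataclass

MIN_PAIRS = 200  # the two-hundred-pair floor from the rule above


@dataclass(frozen=True)
class EvalCase:
    """One query-and-expected-result pair (hypothetical shape)."""
    query: str
    expected_ids: frozenset  # IDs of the documents that should come back
    query_type: str          # illustrative label, e.g. "lookup" or "comparison"


def coverage_problems(cases, required_types, min_pairs=MIN_PAIRS):
    """Return a list of problems; an empty list means the set passes."""
    problems = []
    if len(cases) < min_pairs:
        problems.append(f"only {len(cases)} cases, need at least {min_pairs}")
    counts = Counter(case.query_type for case in cases)
    # Assumption: coverage means every listed query type has at least one case.
    for query_type in sorted(required_types):
        if counts[query_type] == 0:
            problems.append(f"no cases for query type '{query_type}'")
    return problems
```

Running a check like this before any pipeline work starts makes gaps in the set visible early, for example a query type the team listed but never wrote cases for.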
The eval set serves two purposes. It gates the initial deployment — if the system does not clear a defined threshold on the eval set, it does not ship. And it governs every subsequent change: a retrieval architecture change, a chunking strategy change, or a model upgrade is evaluated against the same set before it touches production. This is the only way to have a confident conversation about whether a change made things better or worse.
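The two roles described above could be sketched roughly as follows. The article does not name a metric, so recall@k is an assumption chosen for illustration. The function names, the threshold value, and the rule that a change must not score below the current production baseline are likewise assumptions, not a description of any specific pipeline.

```python
def recall_at_k(retrieved_ids, expected_ids, k):
    """Fraction of the expected documents found in the top-k results."""
    if not expected_ids:
        return 1.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(expected_ids)) / len(expected_ids)


def evaluate(retrieve, eval_set, k=5):
    """Mean recall@k of `retrieve` over (query, expected_ids) pairs."""
    if not eval_set:
        raise ValueError("eval set is empty")
    scores = [recall_at_k(retrieve(query), expected, k)
              for query, expected in eval_set]
    return sum(scores) / len(scores)


def may_ship(candidate_score, threshold, baseline_score=None):
    """Gate a deployment or a change on the eval score.

    Initial deployment: the score must clear the threshold.
    Later changes (assumption): the score must also not fall below
    the score of the system currently in production.
    """
    if candidate_score < threshold:
        return False
    if baseline_score is not None and candidate_score < baseline_score:
        return False
    return True
```

A usage example under the same assumptions: score the production retriever and the candidate on the same eval set, then call `may_ship(candidate, threshold=0.8, baseline_score=production)`. Because both numbers come from the same set, the comparison says whether the change helped or hurt, not whether the test queries changed.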