Did this commit make search worse? A per-commit eval harness with a regression gate
Search quality is easy to break and hard to notice breaking. A refactor that “felt fine” can quietly drop the right answer from rank 1 to rank 8. So MediaFind measures retrieval quality and latency on every commit, with statistics — and fails the build when a change makes things worse.
Most of MediaFind's surface area is a ranking problem. You type a phrase, we return a list, and the only thing that matters is whether the thing you wanted is near the top. The trouble with ranking quality is that it degrades silently. Unit tests stay green. The app still launches. Search still returns ten results — they're just slightly worse, and no human reviewing a diff will catch that the right segment slid from rank 1 to rank 8.
The only defense is to measure it, continuously, and treat a measured regression like a failing test. That's what the eval harness is: a layer that turns “search feels good” into a number with error bars, recomputed on every commit and gated in CI.
The shape of the system
Each capability has a small run_*_eval.py script that proves one quality dimension against a fixed gold set and prints a human-readable table. Underneath them sits eval/harness.py — the shared layer that makes those numbers comparable over time and defensible under noise. It's stdlib + NumPy only, with no model imports, so tests can pull it in for free.
1 · Metrics: the standard IR set
The harness implements the canonical information-retrieval metrics as pure functions over (a ranking, a gold set), each unit-tested:
| Metric | Answers the question… |
|---|---|
| Recall@k / Hit@k | Did the right answer make the top k at all? |
| MRR | How high up was the first correct result? (rank 1 vs rank 8 matters) |
| MAP | When there are several right answers, how well are they ordered overall? |
| nDCG@k | The same, weighted so a hit at the top counts more than one near the bottom |
These run against a dedicated retrieval corpus — clean text with known topics, several segments per topic, plus deliberate adversarial cross-topic vocabulary so ranking has real work to do. A toy “one segment per topic” fixture would score 100% on everything and tell you nothing.
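As a rough illustration of what "pure functions over (a ranking, a gold set)" can look like, here is a minimal sketch with binary relevance. The function names and signatures are assumptions made for this example, not the actual contents of eval/harness.py, and the code assumes a ranking contains no duplicate IDs.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranking: Sequence[str], gold: Set[str], k: int) -> float:
    """Fraction of the gold items that appear in the top k."""
    if not gold:
        return 0.0
    return len(set(ranking[:k]) & gold) / len(gold)


def hit_at_k(ranking: Sequence[str], gold: Set[str], k: int) -> float:
    """1.0 if at least one gold item is in the top k, else 0.0."""
    return 1.0 if any(doc in gold for doc in ranking[:k]) else 0.0


def reciprocal_rank(ranking: Sequence[str], gold: Set[str]) -> float:
    """1 / rank of the first correct result (0.0 if none is found).
    Averaging this over all queries gives MRR."""
    for rank, doc in enumerate(ranking, start=1):
        if doc in gold:
            return 1.0 / rank
    return 0.0


def average_precision(ranking: Sequence[str], gold: Set[str]) -> float:
    """Mean of precision@rank taken at each rank where a gold item appears.
    Averaging this over all queries gives MAP."""
    if not gold:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in gold:
            hits += 1
            total += hits / rank
    return total / len(gold)


def ndcg_at_k(ranking: Sequence[str], gold: Set[str], k: int) -> float:
    """Binary-relevance nDCG: log-discounted gain divided by the ideal gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranking[:k], start=1)
        if doc in gold
    )
    ideal_hits = min(len(gold), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0
```

Each function scores a single query. The per-query scores are what the statistics layer consumes, so it helps to keep them as a list rather than collapsing straight to a mean.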
2 · Statistics: a number isn't a result
Here's the trap that makes naïve eval worse than none: you change something, the score moves from 0.81 to 0.83, and you declare victory. But your gold set has maybe 40 queries. Is +0.02 a real improvement, or just an artifact of which queries happened to be in the set? Without an answer to that, you'll chase noise and ship regressions that look like wins.
So the harness never reports a bare number. It reports two things alongside it:
- Bootstrap confidence intervals. Resample the per-query scores with replacement thousands of times to get a CI on each metric: the honest width of “how sure are we this number is what it is?”
- A paired permutation test. When comparing system B to system A on the same queries, randomly swap which system each query's pair of scores is assigned to, many times, and ask how often a gap at least as large as the observed one appears by chance. That's a p-value on the delta, not an eyeball.
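Both ideas fit in a few lines of NumPy. The sketch below is one plausible way to write them, not the harness's real code: the function names, the default of 10,000 resamples, the 95% level, and the fixed seed are all assumptions. It uses a percentile bootstrap on the mean, and a two-sided sign-flip permutation test, which is equivalent to swapping the A/B labels within each query.

```python
import numpy as np


def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-query scores.
    Returns (mean, lower, upper)."""
    scores = np.asarray(scores, dtype=float)
    rng = np.random.default_rng(seed)
    # Each row is one resample of the query set, drawn with replacement.
    idx = rng.integers(0, len(scores), size=(n_resamples, len(scores)))
    means = scores[idx].mean(axis=1)
    lower, upper = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(scores.mean()), float(lower), float(upper)


def paired_permutation_test(scores_a, scores_b, n_permutations=10_000, seed=0):
    """Two-sided paired permutation test on mean(B) - mean(A).
    Returns (observed_delta, p_value)."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("both systems must be scored on the same queries")
    diffs = b - a
    observed = diffs.mean()
    rng = np.random.default_rng(seed)
    # Swapping the two systems on a query flips the sign of its difference.
    signs = rng.choice([-1.0, 1.0], size=(n_permutations, len(diffs)))
    permuted = (signs * diffs).mean(axis=1)
    # Small epsilon so floating-point ties count as "at least as extreme".
    extreme = np.sum(np.abs(permuted) >= abs(observed) - 1e-12)
    # The +1 terms keep the estimated p-value strictly above zero.
    p_value = (extreme + 1) / (n_permutations + 1)
    return float(observed), float(p_value)
```

With a gold set of around 40 queries, a +0.02 delta will often come back with a wide interval and an unimpressive p-value, which is exactly the signal not to celebrate yet.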
3 · Latency: the other axis
Quality you can't serve fast enough isn't quality. Every eval call is timed and summarized into p50 / p90 / p95 / p99 percentiles — the x-axis of the quality-vs-latency trade-off. A reranker that bumps nDCG by a point but doubles p99 latency is a real trade, and the harness surfaces both sides of it so the decision is made with eyes open rather than discovered later in a “why is search slow now” bug report.
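Timing and summarizing can be sketched as follows. The helper names `timed` and `latency_summary` are made up for this example, and the percentiles use NumPy's default linear interpolation; the real harness may compute them differently.

```python
import time

import numpy as np


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


def latency_summary(samples_ms):
    """Summarize per-call latencies into the p50 / p90 / p95 / p99 percentiles."""
    arr = np.asarray(samples_ms, dtype=float)
    p50, p90, p95, p99 = np.percentile(arr, [50, 90, 95, 99])
    return {
        "p50": float(p50),
        "p90": float(p90),
        "p95": float(p95),
        "p99": float(p99),
    }
```

One caveat worth keeping in mind: with a few dozen queries per suite, p99 is effectively the slowest call or two, so tail percentiles on small samples are noisy.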
4 · History: append-only, keyed by SHA
Every CI run appends one row per (suite, system) to eval/history.jsonl, keyed by the commit's git SHA. That append-only log is the spine of the whole system — it's what lets you answer “when exactly did MRR drop?” by reading a file instead of bisecting from memory. Two consumers read it:
The dashboard
A static dashboard.html rebuilt from the history — quality over time and a Pareto view of quality against latency, so you can see at a glance which commits are on the efficient frontier and which paid latency for nothing. It's regenerated without re-running anything (make dashboard), because the history already holds every number.
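The Pareto view comes down to a small computation over history rows. The sketch below assumes each row carries hypothetical `sha`, `ndcg`, and `p95_ms` fields; the real schema of eval/history.jsonl may differ. A row is on the frontier if no other row is at least as fast and at least as good, with one of the two strictly better.

```python
import json


def load_history(path):
    """Read a JSON Lines file: one JSON object per non-empty line."""
    rows = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                rows.append(json.loads(line))
    return rows


def pareto_frontier(rows, quality_key="ndcg", latency_key="p95_ms"):
    """Return the rows not dominated on (higher quality, lower latency)."""
    frontier = []
    best_quality = float("-inf")
    # Walk from fastest to slowest; on latency ties, the better row goes first.
    ordered = sorted(rows, key=lambda r: (r[latency_key], -r[quality_key]))
    for row in ordered:
        # A slower row earns its place only by strictly beating every faster one.
        if row[quality_key] > best_quality:
            frontier.append(row)
            best_quality = row[quality_key]
    return frontier
```

Rows that fall off the frontier are the commits that "paid latency for nothing": something at least as fast already scored at least as well.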
The regression gate
This is the part that gives the harness teeth. A pinned baseline.json records the numbers we've agreed are the floor. On each run, check_regression.py compares the latest result against that baseline and fails CI if quality dropped or latency rose beyond tolerance:
make eval # run all suites → history.jsonl + latest.json + dashboard
make eval-gate # FAIL the build if latest regressed vs baseline.json
make eval-baseline # bless the current numbers as the new floor
The workflow is deliberate: a regression is a red build, same as a failing unit test. Beating the baseline doesn't silently move the goalposts either — you pin a new floor explicitly with eval-baseline, after a clean run, as a conscious act. The whole thing runs in CI via eval.yml.
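The comparison at the heart of the gate can be sketched like this. Everything specific here is an assumption for illustration: the JSON layout (one entry per "suite/system" name), the metric keys, the tolerance values, and the file paths are not taken from the real check_regression.py.

```python
import json

QUALITY_TOLERANCE = 0.01  # hypothetical: allowed absolute drop in a quality metric
LATENCY_TOLERANCE = 0.10  # hypothetical: allowed relative rise in latency (10%)


def find_regressions(baseline, latest,
                     quality_keys=("ndcg", "mrr"), latency_keys=("p95_ms",)):
    """Compare latest results against the pinned baseline.
    Returns a list of human-readable failure messages (empty means pass)."""
    failures = []
    for name, base in baseline.items():
        current = latest.get(name)
        if current is None:
            failures.append(f"{name}: missing from latest results")
            continue
        for key in quality_keys:
            # A missing metric counts as the worst possible score.
            if key in base and current.get(key, 0.0) < base[key] - QUALITY_TOLERANCE:
                failures.append(f"{name}: {key} dropped {base[key]} -> {current.get(key)}")
        for key in latency_keys:
            limit = base.get(key, float("inf")) * (1 + LATENCY_TOLERANCE)
            if key in base and current.get(key, float("inf")) > limit:
                failures.append(f"{name}: {key} rose {base[key]} -> {current.get(key)}")
    return failures


def main(baseline_path="eval/baseline.json", latest_path="eval/latest.json"):
    """Return a process exit code: 0 if clean, 1 if anything regressed."""
    with open(baseline_path, encoding="utf-8") as fh:
        baseline = json.load(fh)
    with open(latest_path, encoding="utf-8") as fh:
        latest = json.load(fh)
    failures = find_regressions(baseline, latest)
    for message in failures:
        print("REGRESSION:", message)
    return 1 if failures else 0
```

A CI entry point would call `sys.exit(main())` so that any regression becomes a non-zero exit code, which is all a build system needs to turn the run red. Treating a suite that disappears from the latest results as a failure is a design choice in this sketch: a silently skipped suite should not pass the gate.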
Beyond ranking: a faithfulness suite for Ask
Retrieval metrics don't cover everything. Ask generates prose, and the failure mode there isn't a bad rank — it's a confident sentence the sources don't support. So the harness carries a separate faithfulness suite that checks generated answers against their cited segments: every claim should trace back to retrieved evidence. A RAG system that retrieves perfectly and then hallucinates the summary has still failed the user, and this is the suite that catches it.
Why bother, for a local app?
It would be easy to argue a desktop app doesn't need this rigor. We think the opposite. MediaFind ships search, reranking, embeddings, a local LLM, and a dozen keyless channels that all interact — the surface for a silent regression is enormous, and there's no server-side A/B test to catch it after the fact. The eval harness is how a small team changes ranking code confidently: the build tells you, with statistics, whether the thing you just shipped made search better or worse.
The full design lives in the repo's docs/EVAL.md. The principle underneath it is the one we apply everywhere: if a quality claim can be measured, measure it on every commit — and let the build, not a vibe, decide whether it shipped.