Quality engineering

Did this commit make search worse? A per-commit eval harness with a regression gate

Search quality is easy to break and hard to notice breaking. A refactor that “felt fine” can quietly drop the right answer from rank 1 to rank 8. So MediaFind measures retrieval quality and latency on every commit, with statistics — and fails the build when a change makes things worse.

Most of MediaFind's surface area is a ranking problem. You type a phrase, we return a list, and the only thing that matters is whether the thing you wanted is near the top. The trouble with ranking quality is that it degrades silently. Unit tests stay green. The app still launches. Search still returns ten results — they're just slightly worse, and no human reviewing a diff will catch that the right segment slid from rank 1 to rank 8.

The only defense is to measure it, continuously, and treat a measured regression like a failing test. That's what the eval harness is: a layer that turns “search feels good” into a number with error bars, recomputed on every commit and gated in CI.

The shape of the system

Each capability has a small run_*_eval.py script that proves one quality dimension against a fixed gold set and prints a human-readable table. Underneath them sits eval/harness.py — the shared layer that makes those numbers comparable over time and defensible under noise. It's stdlib + NumPy only, with no model imports, so tests can pull it in for free.

[Diagram: gold corpus (queries + answers) → run suites per commit → metrics (Recall/MRR/nDCG) and latency (p50/p90/p95/p99) → history.jsonl keyed by git SHA → dashboard and CI gate (pass / fail)]
One gold corpus, run each commit. Metrics and latency are appended to an append-only history keyed by git SHA — the single source the dashboard charts and the gate reads to pass or fail the build.

1 · Metrics: the standard IR set

The harness implements the canonical information-retrieval metrics as pure functions over (a ranking, a gold set), each unit-tested:

Metric             Answers the question…
Recall@k / Hit@k   Did the right answer make the top k at all?
MRR                How high up was the first correct result? (rank 1 vs rank 8 matters)
MAP                When there are several right answers, how well are they ordered overall?
nDCG@k             The same, weighted so a hit at the top counts more than one near the bottom

These run against a dedicated retrieval corpus — clean text with known topics, several segments per topic, plus deliberate adversarial cross-topic vocabulary so ranking has real work to do. A toy “one segment per topic” fixture would score 100% on everything and tell you nothing.
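As a rough illustration of the metrics in the table, here is a minimal sketch of each as a pure function over (a ranking, a gold set). The function names, the string document IDs, and the binary-relevance form of nDCG are assumptions for the example, not the harness's actual code; the sketch also assumes a ranking contains no duplicate IDs.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranking: Sequence[str], gold: Set[str], k: int) -> float:
    """Fraction of the gold items that appear in the top k."""
    if not gold:
        return 0.0
    return len(set(ranking[:k]) & gold) / len(gold)


def hit_at_k(ranking: Sequence[str], gold: Set[str], k: int) -> float:
    """1.0 if at least one gold item is in the top k, else 0.0."""
    return 1.0 if any(doc in gold for doc in ranking[:k]) else 0.0


def reciprocal_rank(ranking: Sequence[str], gold: Set[str]) -> float:
    """1 / rank of the first gold item (0.0 if none is found)."""
    for rank, doc in enumerate(ranking, start=1):
        if doc in gold:
            return 1.0 / rank
    return 0.0


def average_precision(ranking: Sequence[str], gold: Set[str]) -> float:
    """Mean of precision@rank taken at each rank where a gold item appears."""
    if not gold:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in gold:
            hits += 1
            total += hits / rank
    return total / len(gold)


def ndcg_at_k(ranking: Sequence[str], gold: Set[str], k: int) -> float:
    """Binary-relevance nDCG: discounted gain divided by the ideal ordering's gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranking[:k], start=1)
        if doc in gold
    )
    ideal_hits = min(len(gold), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0
```

MRR and MAP are then the means of `reciprocal_rank` and `average_precision` over all queries in the gold set.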

2 · Statistics: a number isn't a result

Here's the trap that makes naïve eval worse than none: you change something, the score moves from 0.81 to 0.83, and you declare victory. But your gold set has maybe 40 queries. Is +0.02 a real improvement, or just an artifact of which queries happened to be in the set? Without an answer to that, you'll chase noise and ship regressions that look like wins.

So the harness never reports a bare number. It reports two things alongside it: a 95% confidence interval and a p-value for the difference between the two systems.

“B beats A” is a claim, not a vibe. A real improvement reads as “+0.04 nDCG, 95% CI [0.01, 0.07], p = 0.008.” A coin flip reads as “+0.02, CI [−0.03, 0.06], p = 0.4.” The second one does not ship as a win, no matter how good the diff looks.
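The post doesn't say how the interval and the p-value are computed. One common way to get both from per-query scores is a paired bootstrap for the interval and a sign-flip permutation test for the p-value. The sketch below assumes that approach; the function names, defaults, and the stdlib-only implementation are illustrative, not the harness's API.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def paired_bootstrap_ci(
    a: Sequence[float],
    b: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean of b - a, CI low, CI high) from resampling queries with replacement."""
    if len(a) != len(b) or not a:
        raise ValueError("a and b must be non-empty and score the same queries")
    rng = random.Random(seed)
    diffs = [y - x for x, y in zip(a, b)]
    n = len(diffs)
    # Each resample draws n per-query differences with replacement.
    means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return mean(diffs), lo, hi


def sign_flip_p_value(
    a: Sequence[float],
    b: Sequence[float],
    n_permutations: int = 2000,
    seed: int = 0,
) -> float:
    """Two-sided p-value: how often random sign flips match or beat the observed gap."""
    if len(a) != len(b) or not a:
        raise ValueError("a and b must be non-empty and score the same queries")
    rng = random.Random(seed)
    diffs = [y - x for x, y in zip(a, b)]
    observed = abs(mean(diffs))
    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis, the A/B labels on each query are exchangeable.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed:
            extreme += 1
    # Add-one smoothing so the estimate is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)
```

Both functions take per-query scores for the same queries in the same order, which is what makes the comparison paired: query difficulty cancels out and only the change between systems is tested.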

3 · Latency: the other axis

Quality you can't serve fast enough isn't quality. Every eval call is timed and summarized into p50 / p90 / p95 / p99 percentiles — the x-axis of the quality-vs-latency trade-off. A reranker that bumps nDCG by a point but doubles p99 latency is a real trade, and the harness surfaces both sides of it so the decision is made with eyes open rather than discovered later in a “why is search slow now” bug report.
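A minimal sketch of the timing and summary step, assuming wall-clock timing with `time.perf_counter` and linear-interpolation percentiles. The helper names and the millisecond unit are assumptions for illustration; the harness may compute percentiles differently (for example with NumPy).

```python
import time
from typing import Callable, Dict, Iterable, List


def percentile(samples: List[float], q: float) -> float:
    """Linear-interpolation percentile for q in [0, 100]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac


def summarize_latency(samples_ms: List[float]) -> Dict[str, float]:
    """Collapse raw per-call latencies into the four reported percentiles."""
    return {f"p{q}": percentile(samples_ms, q) for q in (50, 90, 95, 99)}


def time_calls(search: Callable[[str], object], queries: Iterable[str]) -> List[float]:
    """Run `search` once per query and return each call's latency in milliseconds."""
    latencies = []
    for query in queries:
        start = time.perf_counter()
        search(query)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies
```

With only a few dozen queries, p99 is effectively the slowest call or close to it, so tail percentiles are more meaningful when each query is timed several times.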

4 · History: append-only, keyed by SHA

Every CI run appends one row per (suite, system) to eval/history.jsonl, keyed by the commit's git SHA. That append-only log is the spine of the whole system — it's what lets you answer “when exactly did MRR drop?” by reading a file instead of bisecting from memory. Two consumers read it:
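To make the append-only log concrete, here is a hedged sketch of writing and reading rows. The row schema (`sha`, `suite`, `system`, `metrics`) and the helper names are assumptions; the post only says that each row is keyed by git SHA with one row per (suite, system).

```python
import json
import subprocess
from pathlib import Path
from typing import Dict, List, Optional


def current_git_sha() -> str:
    """Ask git for the SHA of the checked-out commit."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


def append_row(path: Path, sha: str, suite: str, system: str, metrics: Dict[str, float]) -> None:
    """Append one JSON line; existing lines are never rewritten."""
    row = {"sha": sha, "suite": suite, "system": system, "metrics": metrics}
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(row, sort_keys=True) + "\n")


def load_history(path: Path) -> List[dict]:
    """Read every row back in the order it was appended."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def first_drop(rows: List[dict], metric: str, min_drop: float = 0.0) -> Optional[str]:
    """SHA of the first row whose metric fell by more than min_drop vs the previous row.

    Assumes `rows` is already filtered to one (suite, system) and is in commit order.
    """
    for prev, cur in zip(rows, rows[1:]):
        if prev["metrics"][metric] - cur["metrics"][metric] > min_drop:
            return cur["sha"]
    return None
```

A call such as `first_drop(rows, "mrr", 0.05)` is the "read a file instead of bisecting" query in miniature.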

The dashboard

A static dashboard.html rebuilt from the history — quality over time and a Pareto view of quality against latency, so you can see at a glance which commits are on the efficient frontier and which paid latency for nothing. It's regenerated without re-running anything (make dashboard), because the history already holds every number.
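The Pareto view comes down to one question per commit: does any other commit match or beat it on both quality and latency? A small sketch of that filter follows, with a hypothetical `Run` record standing in for whatever the dashboard actually reads from the history.

```python
from typing import Iterable, List, NamedTuple


class Run(NamedTuple):
    sha: str
    quality: float      # higher is better, e.g. nDCG
    latency_ms: float   # lower is better, e.g. p95


def pareto_frontier(runs: Iterable[Run]) -> List[Run]:
    """Keep runs that no other run matches or beats on both quality and latency.

    Exact ties on both axes keep only the first run encountered.
    """
    frontier: List[Run] = []
    best_quality = float("-inf")
    # Walk from fastest to slowest; a slower run earns its place only by
    # being strictly better in quality than everything faster than it.
    for run in sorted(runs, key=lambda r: (r.latency_ms, -r.quality)):
        if run.quality > best_quality:
            frontier.append(run)
            best_quality = run.quality
    return frontier
```

Commits that fall off the frontier are the ones that "paid latency for nothing": something faster already scores at least as well.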

The regression gate

This is the part that gives the harness teeth. A pinned baseline.json records the numbers we've agreed are the floor. On each run, check_regression.py compares the latest result against that baseline and fails CI if quality dropped or latency rose beyond tolerance:

make eval          # run all suites → history.jsonl + latest.json + dashboard
make eval-gate     # FAIL the build if latest regressed vs baseline.json
make eval-baseline # bless the current numbers as the new floor

The workflow is deliberate: a regression is a red build, same as a failing unit test. Beating the baseline doesn't silently move the goalposts either — you pin a new floor explicitly with eval-baseline, after a clean run, as a conscious act. The whole thing runs in CI via eval.yml.

[Diagram: the latest run (this commit) is compared against baseline.json on Δ quality and Δ latency. Within tolerance: build passes. Regressed: build fails.]
The gate turns a quality drop into a red build. Beating the baseline never auto-updates the floor — that's a deliberate make eval-baseline after a clean run.
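The comparison at the heart of the gate can be sketched in a few lines. Everything specific here is an assumption made for illustration: the flat metric-name-to-value layout, the `latency_` prefix convention, and the tolerance values. The real check_regression.py and baseline.json may be organized quite differently.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    latest: Dict[str, float],
    quality_tol: float = 0.01,   # absolute drop allowed on quality metrics
    latency_tol: float = 0.10,   # relative rise allowed on latency metrics
) -> List[str]:
    """Return one message per metric that moved past its tolerance."""
    failures: List[str] = []
    for name, floor in baseline.items():
        if name not in latest:
            failures.append(f"{name}: missing from latest run")
            continue
        value = latest[name]
        if name.startswith("latency_"):
            limit = floor * (1.0 + latency_tol)
            if value > limit:
                failures.append(f"{name}: {value:.1f} ms exceeds limit {limit:.1f} ms")
        else:
            limit = floor - quality_tol
            if value < limit:
                failures.append(f"{name}: {value:.3f} is below limit {limit:.3f}")
    return failures


def gate_exit_code(baseline: Dict[str, float], latest: Dict[str, float]) -> int:
    """Print each regression and return a process exit code (nonzero fails CI)."""
    failures = find_regressions(baseline, latest)
    for failure in failures:
        print(f"REGRESSION {failure}")
    return 1 if failures else 0
```

Note that an improvement never appears in the output and never changes the baseline. That matches the workflow described above, where moving the floor is a separate, explicit step.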

Beyond ranking: a faithfulness suite for Ask

Retrieval metrics don't cover everything. Ask generates prose, and the failure mode there isn't a bad rank — it's a confident sentence the sources don't support. So the harness carries a separate faithfulness suite that checks generated answers against their cited segments: every claim should trace back to retrieved evidence. A RAG system that retrieves perfectly and then hallucinates the summary has still failed the user, and this is the suite that catches it.
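The post doesn't describe how the faithfulness check is implemented. As a deliberately crude stand-in, the sketch below flags answer sentences whose content words mostly don't appear in any cited segment. The lexical-overlap heuristic, the stopword list, the threshold, and all names are assumptions; a real suite would likely need something stronger than word overlap to judge whether a claim is actually supported.

```python
import re
from typing import List, Set

# Tiny illustrative stopword list; a real check would use a fuller one.
_STOPWORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to", "was", "were"}


def _content_words(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens with stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in _STOPWORDS}


def split_claims(answer: str) -> List[str]:
    """Treat each sentence as one claim (a simplification)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]


def unsupported_claims(
    answer: str, cited_segments: List[str], min_overlap: float = 0.6
) -> List[str]:
    """Return claims whose best word overlap with any cited segment is below min_overlap."""
    segment_words = [_content_words(seg) for seg in cited_segments]
    flagged: List[str] = []
    for claim in split_claims(answer):
        words = _content_words(claim)
        if not words:
            continue
        best = max((len(words & seg) / len(words) for seg in segment_words), default=0.0)
        if best < min_overlap:
            flagged.append(claim)
    return flagged
```

A suite built on this shape would report the share of answers with zero flagged claims, and that number could be tracked and gated like any retrieval metric.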

Why bother, for a local app?

It would be easy to argue a desktop app doesn't need this rigor. We think the opposite. MediaFind ships search, reranking, embeddings, a local LLM, and a dozen keyless channels that all interact — the surface for a silent regression is enormous, and there's no server-side A/B test to catch it after the fact. The eval harness is how a small team changes ranking code confidently: the build tells you, with statistics, whether the thing you just shipped made search better or worse.


The full design lives in the repo's docs/EVAL.md. The principle underneath it is the one we apply everywhere: if a quality claim can be measured, measure it on every commit — and let the build, not a vibe, decide whether it shipped.
