Did this commit make search worse?

Most of MediaFind's surface area is a ranking problem. You type a phrase, we return a list, and the only thing that matters is whether the thing you wanted is near the top. The trouble with ranking quality is that it degrades silently. Unit tests stay green. The app still launches. Search still returns ten results — they're just slightly worse, and no human reviewing a diff will catch that the right segment slid from rank 1 to rank 8.

The only defense is to measure it, continuously, and treat a measured regression like a failing test. That's what the eval harness is: a layer that turns “search feels good” into a number with error bars, recomputed on every commit and gated in CI.

The shape of the system

Each capability has a small run_*_eval.py script that proves one quality dimension against a fixed gold set and prints a human-readable table. Underneath them sits eval/harness.py — the shared layer that makes those numbers comparable over time and defensible under noise. It's stdlib + NumPy only, with no model imports, so tests can pull it in for free.

One gold corpus, run each commit. Metrics and latency are appended to an append-only history keyed by git SHA — the single source the dashboard charts and the gate reads to pass or fail the build.

1 · Metrics: the standard IR set

The harness implements the canonical information-retrieval metrics as pure functions over (a ranking, a gold set), each unit-tested:

Metric	Answers the question…
`Recall@k` / `Hit@k`	Did the right answer make the top k at all?
`MRR`	How high up was the first correct result? (rank 1 vs rank 8 matters)
`MAP`	When there are several right answers, how well are they ordered overall?
`nDCG@k`	How well are the right answers ordered within the top k, with a hit at the top counting more than one near the bottom?

These run against a dedicated retrieval corpus — clean text with known topics, several segments per topic, plus deliberate adversarial cross-topic vocabulary so ranking has real work to do. A toy “one segment per topic” fixture would score 100% on everything and tell you nothing.
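
A minimal sketch of what "pure functions over (a ranking, a gold set)" can look like, using the standard binary-relevance definitions. The function names and signatures here are illustrative assumptions, not the actual contents of eval/harness.py, and the sketch assumes a ranking contains no duplicate ids.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranking: Sequence[str], gold: Set[str], k: int) -> float:
    """Fraction of the gold items that appear in the top k."""
    if not gold:
        return 0.0
    return len(set(ranking[:k]) & gold) / len(gold)


def hit_at_k(ranking: Sequence[str], gold: Set[str], k: int) -> float:
    """1.0 if at least one gold item is in the top k, else 0.0."""
    return 1.0 if any(doc in gold for doc in ranking[:k]) else 0.0


def reciprocal_rank(ranking: Sequence[str], gold: Set[str]) -> float:
    """1 / rank of the first gold item; MRR is the mean of this over queries."""
    for rank, doc in enumerate(ranking, start=1):
        if doc in gold:
            return 1.0 / rank
    return 0.0


def average_precision(ranking: Sequence[str], gold: Set[str]) -> float:
    """Mean of precision@rank at each gold hit; MAP is the mean over queries."""
    if not gold:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in gold:
            hits += 1
            total += hits / rank
    return total / len(gold)


def ndcg_at_k(ranking: Sequence[str], gold: Set[str], k: int) -> float:
    """Binary-relevance nDCG: log-discounted gain divided by the ideal gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranking[:k], start=1)
        if doc in gold
    )
    ideal_hits = min(len(gold), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Hypothetical query: two gold segments, found at ranks 2 and 4.
ranking = ["s3", "s1", "s7", "s2"]
gold = {"s1", "s2"}
print(recall_at_k(ranking, gold, 3))     # 0.5
print(reciprocal_rank(ranking, gold))    # 0.5
print(average_precision(ranking, gold))  # 0.5
```

Each function scores one query. The per-query scores, rather than only their mean, are what the statistics in the next section need.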

2 · Statistics: a number isn't a result

Here's the trap that makes naïve eval worse than none: you change something, the score moves from 0.81 to 0.83, and you declare victory. But your gold set has maybe 40 queries. Is +0.02 a real improvement, or just an artifact of which queries happened to be in the set? Without an answer to that, you'll chase noise and ship regressions that look like wins.

So the harness never reports a bare number. It reports two things alongside it:

Bootstrap confidence intervals. Resample the per-query scores with replacement, thousands of times, to get a CI on each metric: the honest width of “how sure are we this number is what it is?”
A paired permutation test. When comparing system B to system A on the same queries, randomly swap which system each query's score is assigned to, many times, and ask how often a gap at least as large as the observed one appears by chance. That's a p-value on the delta, not an eyeball.
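
As a rough illustration of the first item, here is a percentile bootstrap over per-query scores using only the standard library. The name `bootstrap_ci`, the default of 5000 resamples, and the fixed seed are assumptions made for this sketch; the real harness may use a different bootstrap variant.

```python
import random
from typing import Sequence, Tuple


def bootstrap_ci(
    scores: Sequence[float],
    n_resamples: int = 5000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) for a (1 - alpha) percentile-bootstrap CI."""
    if not scores:
        raise ValueError("need at least one per-query score")
    rng = random.Random(seed)  # fixed seed so reruns give the same interval
    n = len(scores)
    # Each resample draws n queries with replacement and records their mean.
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(scores) / n, lower, upper


# Hypothetical per-query reciprocal ranks for a 40-query gold set.
per_query = [1.0] * 25 + [0.5] * 10 + [0.0] * 5
point, lower, upper = bootstrap_ci(per_query)
print(f"MRR {point:.3f}, 95% CI [{lower:.3f}, {upper:.3f}]")
```

With only 40 queries the interval is wide, which is exactly the point: a small gold set cannot resolve small differences.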

“B beats A” is a claim, not a vibe. A real improvement reads as “+0.04 nDCG, 95% CI [0.01, 0.07], p = 0.008.” A coin flip reads as “+0.02, CI [−0.03, 0.06], p = 0.4.” The second one does not ship as a win, no matter how good the diff looks.
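
A sketch of the paired permutation test behind a p-value like the ones above. Swapping A and B for a single query just flips the sign of that query's score difference, so the test reduces to random sign flips. The function name, the two-sided comparison, and the add-one smoothing are assumptions for illustration, not necessarily what the harness does.

```python
import random
from typing import Sequence, Tuple


def paired_permutation_test(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_permutations: int = 10000,
    seed: int = 0,
) -> Tuple[float, float]:
    """Return (mean delta B - A, two-sided p-value) for paired per-query scores."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need the same non-empty set of queries for both systems")
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = sum(diffs) / n
    extreme = 0
    for _ in range(n_permutations):
        # Randomly reassign which system each query's scores belong to.
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs) / n
        # Small epsilon so float rounding does not hide exact ties.
        if abs(permuted) >= abs(observed) - 1e-12:
            extreme += 1
    # Add-one smoothing keeps the estimate away from an impossible p = 0.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Hypothetical per-query nDCG for two systems on the same 20 queries.
system_a = [0.5] * 20
system_b = [0.7] * 20
delta, p = paired_permutation_test(system_a, system_b)
print(f"delta {delta:+.2f}, p = {p:.4f}")
```

Because the test only uses per-query differences, it needs both systems scored on exactly the same gold queries, which a fixed gold set provides.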

3 · Latency: the other axis

Quality you can't serve fast enough isn't quality. Every eval call is timed and summarized into p50 / p90 / p95 / p99 percentiles — the x-axis of the quality-vs-latency trade-off. A reranker that bumps nDCG by a point but doubles p99 latency is a real trade, and the harness surfaces both sides of it so the decision is made with eyes open rather than discovered later in a “why is search slow now” bug report.
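
One way the timing and percentile summary could be written, as a sketch. `timed_call`, `percentile`, and `summarize_latency` are hypothetical names, and nearest-rank is just one of several common percentile definitions; the harness may interpolate instead.

```python
import math
import time
from typing import Any, Callable, Dict, Sequence, Tuple


def timed_call(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Run fn and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


def percentile(samples: Sequence[float], q: int) -> float:
    """Nearest-rank percentile: smallest sample with at least q% of samples <= it."""
    if not samples:
        raise ValueError("no latency samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_latency(samples_ms: Sequence[float]) -> Dict[str, float]:
    """Collapse raw per-call timings into the p50 / p90 / p95 / p99 summary."""
    return {f"p{q}": percentile(samples_ms, q) for q in (50, 90, 95, 99)}


# Hypothetical timings: 100 calls taking 1 ms, 2 ms, ..., 100 ms.
samples = [float(ms) for ms in range(1, 101)]
print(summarize_latency(samples))
# {'p50': 50.0, 'p90': 90.0, 'p95': 95.0, 'p99': 99.0}
```

Tail percentiles matter more than the mean here because one slow query in a hundred is what a user actually notices.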

4 · History: append-only, keyed by SHA

Every CI run appends one row per (suite, system) to eval/history.jsonl, keyed by the commit's git SHA. That append-only log is the spine of the whole system: it lets you answer “when exactly did MRR drop?” by reading a file instead of bisecting from memory. Two consumers read it: the dashboard and the regression gate.
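
To make the history concrete, here is a sketch of appending and reading JSON Lines rows. The row schema (`sha`, `suite`, `system`, `metrics`, `latency_ms`, `timestamp`) and the function names are assumptions for illustration; the real layout of eval/history.jsonl is documented in the repo, not here.

```python
import json
import subprocess
import time
from pathlib import Path
from typing import Dict, List, Tuple


def current_git_sha() -> str:
    """Ask git for the commit being evaluated (requires running inside a repo)."""
    return subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()


def append_history(
    path: str,
    sha: str,
    suite: str,
    system: str,
    metrics: Dict[str, float],
    latency_ms: Dict[str, float],
) -> dict:
    """Append one (suite, system) row; existing rows are never rewritten."""
    row = {
        "sha": sha,
        "suite": suite,
        "system": system,
        "metrics": metrics,
        "latency_ms": latency_ms,
        "timestamp": time.time(),
    }
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("a", encoding="utf-8") as fh:  # "a" = append-only
        fh.write(json.dumps(row, sort_keys=True) + "\n")
    return row


def metric_over_time(path: str, suite: str, system: str, metric: str) -> List[Tuple[str, float]]:
    """Read back (sha, value) pairs in the order they were appended."""
    series = []
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue
            row = json.loads(line)
            if row["suite"] == suite and row["system"] == system:
                series.append((row["sha"], row["metrics"][metric]))
    return series
```

With a layout like this, "when did MRR drop?" becomes a scan of `metric_over_time(...)` for the first SHA whose value falls below its predecessor.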

The dashboard

A static dashboard.html rebuilt from the history — quality over time and a Pareto view of quality against latency, so you can see at a glance which commits are on the efficient frontier and which paid latency for nothing. It's regenerated without re-running anything (make dashboard), because the history already holds every number.
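
The Pareto view boils down to a small computation over rows already in the history. The sketch below assumes each run has been reduced to one quality number and one latency number; the `Run` type and its field names are hypothetical.

```python
from typing import List, NamedTuple, Sequence


class Run(NamedTuple):
    sha: str
    quality: float  # higher is better, e.g. nDCG@10
    p95_ms: float   # lower is better


def pareto_frontier(runs: Sequence[Run]) -> List[Run]:
    """Keep runs that no other run beats on both quality and latency."""
    frontier: List[Run] = []
    best_quality = float("-inf")
    # Walk from fastest to slowest; at equal latency, best quality first.
    for run in sorted(runs, key=lambda r: (r.p95_ms, -r.quality)):
        # A slower run earns its place only by being strictly better.
        if run.quality > best_quality:
            frontier.append(run)
            best_quality = run.quality
    return frontier


# Hypothetical commits.
runs = [
    Run("a1", 0.80, 100.0),
    Run("b2", 0.82, 150.0),
    Run("c3", 0.79, 140.0),  # slower and worse than a1
    Run("d4", 0.85, 300.0),
    Run("e5", 0.82, 200.0),  # same quality as b2, but slower
]
print([r.sha for r in pareto_frontier(runs)])  # ['a1', 'b2', 'd4']
```

Runs off the frontier, like `c3` and `e5` above, are the commits that "paid latency for nothing".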

The regression gate

This is the part that gives the harness teeth. A pinned baseline.json records the numbers we've agreed are the floor. On each run, check_regression.py compares the latest result against that baseline and fails CI if quality dropped or latency rose beyond tolerance:

make eval          # run all suites → history.jsonl + latest.json + dashboard
make eval-gate     # FAIL the build if latest regressed vs baseline.json
make eval-baseline # bless the current numbers as the new floor

The workflow is deliberate: a regression is a red build, same as a failing unit test. Beating the baseline doesn't silently move the goalposts either — you pin a new floor explicitly with eval-baseline, after a clean run, as a conscious act. The whole thing runs in CI via eval.yml.
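
The comparison at the heart of a gate like this is short. The sketch below is not the actual check_regression.py: the JSON layout (`metrics` and `latency_ms` maps) and both tolerance defaults are assumptions chosen for illustration.

```python
from typing import Dict, List

Snapshot = Dict[str, Dict[str, float]]


def find_regressions(
    baseline: Snapshot,
    latest: Snapshot,
    quality_tol: float = 0.01,  # hypothetical: allowed absolute drop in a metric
    latency_tol: float = 0.10,  # hypothetical: allowed relative rise in latency
) -> List[str]:
    """Return one message per regression; an empty list means the gate passes."""
    failures: List[str] = []
    for name, floor in baseline["metrics"].items():
        value = latest["metrics"].get(name)
        if value is None:
            failures.append(f"{name}: missing from latest run")
        elif value < floor - quality_tol:
            failures.append(f"{name}: {value:.3f} is below floor {floor:.3f}")
    for name, ceiling in baseline["latency_ms"].items():
        value = latest["latency_ms"].get(name)
        if value is None:
            failures.append(f"{name}: missing from latest run")
        elif value > ceiling * (1 + latency_tol):
            failures.append(f"{name}: {value:.1f} ms exceeds {ceiling:.1f} ms")
    return failures


# Hypothetical snapshots; a CI wrapper would load these from baseline.json and
# latest.json, then call sys.exit(1) whenever the returned list is non-empty.
baseline = {"metrics": {"ndcg@10": 0.80, "mrr": 0.75}, "latency_ms": {"p95": 120.0}}
latest = {"metrics": {"ndcg@10": 0.70, "mrr": 0.78}, "latency_ms": {"p95": 200.0}}
for message in find_regressions(baseline, latest):
    print("REGRESSION", message)
```

Note that the improved `mrr` in this example changes nothing on its own: the floor only moves when someone re-pins the baseline.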

The gate turns a quality drop into a red build. Beating the baseline never auto-updates the floor — that's a deliberate make eval-baseline after a clean run.

Beyond ranking: a faithfulness suite for Ask

Retrieval metrics don't cover everything. Ask generates prose, and the failure mode there isn't a bad rank — it's a confident sentence the sources don't support. So the harness carries a separate faithfulness suite that checks generated answers against their cited segments: every claim should trace back to retrieved evidence. A RAG system that retrieves perfectly and then hallucinates the summary has still failed the user, and this is the suite that catches it.
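
To show the shape of such a check, here is a deliberately crude sketch: split the answer into sentences and flag any sentence whose content words barely overlap with the cited segments. Lexical overlap is a stand-in chosen for this example only; it is not the method the faithfulness suite uses, and the stopword list, threshold, and function names are all assumptions.

```python
import re
from typing import List, Sequence, Set

# Tiny illustrative stopword list; a real check would need something better.
_STOPWORDS = {
    "a", "an", "the", "is", "was", "were", "are", "in", "on", "of",
    "for", "it", "to", "and", "that", "this", "with",
}


def _content_tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens with stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in _STOPWORDS}


def unsupported_claims(
    answer: str,
    cited_segments: Sequence[str],
    min_overlap: float = 0.6,  # hypothetical threshold
) -> List[str]:
    """Return answer sentences that no single cited segment appears to support."""
    evidence = [_content_tokens(segment) for segment in cited_segments]
    flagged: List[str] = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        claim = _content_tokens(sentence)
        if not claim:
            continue
        # Share of the claim's content words found in the best-matching segment.
        best = max((len(claim & seg) / len(claim) for seg in evidence), default=0.0)
        if best < min_overlap:
            flagged.append(sentence)
    return flagged


# Hypothetical answer and evidence.
answer = "The drone footage was shot in Iceland. The director won an Oscar for it."
segments = ["Drone footage shot over Iceland in June."]
print(unsupported_claims(answer, segments))
# ['The director won an Oscar for it.']
```

A word-overlap check misses paraphrases and negations, so treat it as a picture of the suite's inputs and outputs (an answer, its citations, a list of unsupported sentences) rather than of its scoring.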

Why bother, for a local app?

It would be easy to argue a desktop app doesn't need this rigor. We think the opposite. MediaFind ships search, reranking, embeddings, a local LLM, and a dozen keyless channels that all interact — the surface for a silent regression is enormous, and there's no server-side A/B test to catch it after the fact. The eval harness is how a small team changes ranking code confidently: the build tells you, with statistics, whether the thing you just shipped made search better or worse.

The full design lives in the repo's docs/EVAL.md. The principle underneath it is the one we apply everywhere: if a quality claim can be measured, measure it on every commit — and let the build, not a vibe, decide whether it shipped.
