Is the search any good? How MediaFind measures it — and won't let it regress
“It feels good” is not a metric. A cloud app can A/B-test quality on live traffic; a private, on-device app has no traffic to watch. So MediaFind proves its search and Ask quality the old-fashioned, scientific way: a fixed set of questions with known answers, scored on every commit, with error bars — and a gate that fails the build when the numbers slip.
Search quality is slippery. A new embedding model, a tweaked score floor, a refactor that quietly swaps a dependency — any of them can make results better, worse, or silently dead, and the demo on your screen won't tell you which. The honest way to know is the same one the information-retrieval field has used for decades: assemble queries whose right answers you already know, run them through the real system, and count.
That's the whole idea behind MediaFind's evaluation harness. It lives in the repository, runs on the same clips that ship in the app, and on every commit it produces one row of numbers per capability — retrieval, Ask, faces, color, named entities, sounds-like, and more. Thirteen measured systems across nine capabilities, each scored, timed, and compared against a pinned baseline. Here's how it works, and why we built it to be brutally honest rather than flattering.
Two questions, two kinds of test
“Is the search good?” actually hides two different questions, and they need different tests.
Did we pick a good model? That's answered by public academic benchmarks: BEIR and MTEB for the text encoders, MSR-VTT and MS-COCO for video and image retrieval, LibriSpeech for transcription. They give absolute, comparable numbers (“this encoder scores X on a standard set”) and they're how you justify a model choice. You run them once when you swap a model, not on every commit.
Did this change break our pipeline? That's a different beast, and public benchmarks can't answer it. It's the question that catches the nasty, MediaFind-specific bugs: a packaging change that makes the app fall back to a useless encoder and silently return nothing; a dimension mismatch that drops every result on the floor with no error. For that you need your own regression set, run on your own code, every time.
MediaFind's harness is the second kind. It doesn't try to win a leaderboard; it tries to notice the day the search gets worse.
What “count” means: the metric stack
For retrieval, the harness reports the standard information-retrieval metrics, each a different lens on the same ranked list:
- Recall@k: was a right answer in the top k? The blunt “did it find it at all.”
- MRR: mean reciprocal rank. One over the rank of the first right answer, averaged over all queries. Rewards putting a correct hit high.
- MAP: mean average precision. Rewards ranking all the relevant clips high, not just one.
- nDCG: normalized discounted cumulative gain. Graded ranking quality with a position discount, normalized against the ideal ordering; the field's gold-standard ordering metric.
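As a rough illustration of the four metrics above, here is a minimal Python sketch of the per-query calculations. It is not the harness's own code: the function names, clip ids, and graded gains are made up for the example, and Recall@k is implemented in the "was any right answer in the top k" sense described in the list.

```python
import math

def recall_at_k(ranked, relevant, k):
    # "Was a right answer in the top k?" -> 1.0 or 0.0 for one query.
    return 1.0 if any(doc in relevant for doc in ranked[:k]) else 0.0

def reciprocal_rank(ranked, relevant):
    # 1 / rank of the first relevant result; 0.0 if none was returned.
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def average_precision(ranked, relevant):
    # Mean of precision@rank, taken at each rank where a relevant clip appears.
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def ndcg_at_k(ranked, gains, k):
    # gains maps clip id -> graded relevance; clips not listed count as 0.
    def dcg(values):
        return sum(g / math.log2(i + 1) for i, g in enumerate(values, start=1))
    actual = dcg([gains.get(doc, 0.0) for doc in ranked[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0

def mean(values):
    # MRR and MAP are the means of the per-query values over the query set.
    values = list(values)
    return sum(values) / len(values) if values else 0.0

# Hypothetical ranked list for one query; the clip ids are invented.
ranked = ["c3", "c1", "c7", "c2"]
relevant = {"c1", "c2"}
gains = {"c1": 2.0, "c2": 1.0}

print(recall_at_k(ranked, relevant, 3))      # 1.0
print(reciprocal_rank(ranked, relevant))     # 0.5
print(average_precision(ranked, relevant))   # 0.5
```

MRR and MAP for a whole test set would then be `mean(...)` over the per-query reciprocal ranks and average precisions.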
For Ask — the question-answering side — the metrics shift to trust. We measure grounding (do the cited sources actually contain the answer?), faithfulness (does every sentence of a generated answer trace back to its citations, or did the model embroider?), and an honesty signal: a handful of deliberately unanswerable questions, where the right behavior is low confidence, not a confident guess. On our set, Ask grounds its top citation correctly on every answerable question and separates answerable from unanswerable confidence by a healthy margin.
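Two of the Ask signals above, grounding of the top citation and the answerable/unanswerable confidence gap, can be sketched in a few lines of Python. This is an assumption-laden illustration rather than the harness itself: the `AskResult` record, its field names, and the sample values are all hypothetical, and faithfulness checking is left out because it depends on how answers are split and matched to citations.

```python
from dataclasses import dataclass
from statistics import mean
from typing import FrozenSet, Optional

@dataclass(frozen=True)
class AskResult:
    # Hypothetical record of one Ask test question.
    answerable: bool                 # does the library actually contain the answer?
    confidence: float                # confidence reported alongside the answer
    top_citation: Optional[str]      # id of the first cited source, if any
    gold_sources: FrozenSet[str] = frozenset()  # sources known to hold the answer

def grounding_rate(results):
    # Share of answerable questions whose top citation is a known-good source.
    answerable = [r for r in results if r.answerable]
    hits = sum(1 for r in answerable if r.top_citation in r.gold_sources)
    return hits / len(answerable) if answerable else 0.0

def confidence_margin(results):
    # Mean confidence on answerable questions minus mean on unanswerable ones.
    # Needs at least one question of each kind.
    yes = [r.confidence for r in results if r.answerable]
    no = [r.confidence for r in results if not r.answerable]
    return mean(yes) - mean(no)

# Invented sample: two answerable questions, two deliberately unanswerable ones.
results = [
    AskResult(True, 0.9, "clip_a", frozenset({"clip_a"})),
    AskResult(True, 0.8, "clip_b", frozenset({"clip_b", "clip_c"})),
    AskResult(False, 0.2, None),
    AskResult(False, 0.3, "clip_x"),
]
print(grounding_rate(results))   # 1.0
```

A larger margin means the system is visibly less sure of itself on questions it cannot answer, which is the behavior the honesty check is looking for.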
Small sets lie unless you show the error bars
Here's the trap most homegrown evals fall into: you run twenty queries, MRR goes up by 0.02, and you declare victory. But twenty queries is a tiny sample. That 0.02 could be real, or it could be noise that flips the other way on the next twenty.
So every number the harness reports carries a bootstrap confidence interval: resample the queries with replacement thousands of times, recompute the metric on each resample, and report the range that covers the bulk of those values (the middle 95%, for example). And when comparing two systems (say, with and without the cross-encoder re-ranker), it runs a paired permutation test, which yields a p-value: how often a difference at least this large would show up by chance if the two systems were really interchangeable. The discipline this enforces is humility. If two systems' intervals overlap and the paired test can't separate them, you have not shown one is better, and the harness won't let you pretend otherwise.
It even caught us being wrong in our own favor: on a denser test corpus, adding the cross-encoder re-ranker nudged MAP down slightly (it demotes same-topic neighbors), and the permutation test said the change wasn't significant anyway. A flattering eval would have hidden that. Ours printed it.
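For readers who want to see the two procedures concretely, here is a minimal sketch using only the Python standard library. It assumes a percentile bootstrap and a sign-flip permutation test on per-query scores; the resample counts, the 95% level, and the function names are illustrative choices, not the harness's actual settings.

```python
import random

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    # Percentile bootstrap: resample queries with replacement, recompute the
    # mean each time, and read the interval off the sorted resampled means.
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples))
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return means[lo_idx], means[hi_idx]

def paired_permutation_test(a, b, n_permutations=5000, seed=0):
    # a[i] and b[i] are the two systems' scores on the same query i.
    # Under the null hypothesis the labels are exchangeable, so each
    # per-query difference is equally likely to carry either sign.
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    extreme = 0
    for _ in range(n_permutations):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) >= observed - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return (extreme + 1) / (n_permutations + 1)

# Invented per-query average-precision scores for two configurations.
base = [1.0, 0.5, 0.75, 1.0, 0.25, 0.5, 1.0, 0.75]
rerank = [1.0, 0.5, 0.5, 1.0, 0.5, 0.5, 1.0, 0.75]
print(bootstrap_ci(base))
print(paired_permutation_test(base, rerank))
```

With only a handful of queries, the interval is wide and the p-value is large, which is exactly the "you have not shown one is better" outcome the section describes.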
Make the test hard enough to be interesting
An easy test is a useless test. Our first retrieval corpus had one clip per topic, and every metric pinned at a perfect 1.0 — which tells you nothing, because there's nowhere to go but down. So we built a denser one: many segments per topic, with deliberately confusable overlap — tomatoes that appear in both cooking and gardening, “interest rate” in both finance topics, night-sky language shared across rockets, astronomy, and photography. A search that keys on surface words gets these wrong.
The payoff is a real signal. Finding one right answer stays easy (MRR pins at 1.0), but ranking all of a topic's segments — MAP — drops to where it can move. And tagging each query easy / medium / hard reveals a clean gradient:
| Query difficulty | MAP (mean average precision) |
|---|---|
| Easy — literal wording | 0.86 |
| Medium — paraphrased | 0.74 |
| Hard — adversarial vocabulary | 0.71 |
Now you can see where quality lives, and a regression that only hurts the hard queries can't hide behind a healthy average.
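The per-difficulty breakdown is a plain group-and-average over per-query scores. A small sketch, assuming each query carries a difficulty tag and an average-precision score (the rows below are invented, not the harness's data):

```python
from collections import defaultdict

def map_by_difficulty(rows):
    # rows: iterable of (difficulty_tag, average_precision) pairs, one per query.
    buckets = defaultdict(list)
    for difficulty, ap in rows:
        buckets[difficulty].append(ap)
    return {tag: sum(values) / len(values) for tag, values in buckets.items()}

# Made-up per-query results for illustration only.
rows = [("easy", 1.0), ("easy", 0.5), ("medium", 0.75), ("hard", 0.25)]
print(map_by_difficulty(rows))
# {'easy': 0.75, 'medium': 0.75, 'hard': 0.25}
```

Reporting each bucket separately, rather than one pooled average, is what lets a hard-query regression show up on its own row.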
Quality is half the story — the other half is latency
On a local machine, a slightly better answer that takes four times as long is often the wrong trade. So the harness times every query too, and the dashboard plots the two against each other: quality up, latency across, every configuration a point, the best ones sitting up-and-to-the-left.
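One way to read that plot programmatically is to time each query and keep only the configurations that no other configuration beats on both axes (the Pareto front). The sketch below is an illustration under assumptions: the configuration names, scores, and latencies are invented, and the harness may compute its dashboard differently.

```python
import time

def timed(fn, *args):
    # Run fn once and return (result, elapsed milliseconds).
    start = time.perf_counter()
    result = fn(*args)
    return result, (time.perf_counter() - start) * 1000.0

def pareto_front(configs):
    # configs: list of (name, quality, latency_ms).
    # A config is dominated if another is at least as good on both axes
    # and strictly better on at least one.
    front = []
    for name, quality, latency in configs:
        dominated = any(
            q2 >= quality and l2 <= latency and (q2 > quality or l2 < latency)
            for _, q2, l2 in configs
        )
        if not dominated:
            front.append(name)
    return front

# Hypothetical configurations: (name, MAP, median latency in ms).
configs = [
    ("embed_only", 0.70, 10.0),
    ("embed_plus_rerank", 0.80, 40.0),
    ("slow_variant", 0.75, 60.0),
]
print(pareto_front(configs))   # ['embed_only', 'embed_plus_rerank']
```

Here the invented `slow_variant` drops out because `embed_plus_rerank` is both better and faster; the remaining two are a genuine trade-off.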
The gate: a regression fails the build
A dashboard nobody watches is a vanity project. So the numbers feed a gate. After each run, the latest results are compared to a pinned baseline, and the build fails if any quality metric drops more than a small tolerance, if latency balloons past a multiple of baseline, or — the subtle one — if the backend that produced a result has changed underneath it. That last check is what turns the silent-fallback bug from a mystery support ticket into a red CI run on the pull request that caused it.
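The three checks can be sketched as a single comparison function. Everything specific here is an assumption for illustration: the record layout, the `0.02` quality tolerance, the `2.0` latency multiple, and the backend names are hypothetical, not MediaFind's real thresholds or identifiers.

```python
def check_gate(baseline, latest, quality_tol=0.02, latency_factor=2.0):
    # baseline / latest map a system name to a record like:
    #   {"quality": {"map": 0.74}, "latency_ms": 35.0, "backend": "some-encoder"}
    # Returns a list of human-readable failures; an empty list means "pass".
    failures = []
    for name, base in baseline.items():
        new = latest.get(name)
        if new is None:
            failures.append(f"{name}: missing from latest run")
            continue
        for metric, base_val in base["quality"].items():
            new_val = new["quality"].get(metric, 0.0)
            if new_val < base_val - quality_tol:
                failures.append(f"{name}: {metric} fell {base_val:.3f} -> {new_val:.3f}")
        if new["latency_ms"] > base["latency_ms"] * latency_factor:
            failures.append(f"{name}: latency {new['latency_ms']:.1f} ms exceeds limit")
        if new["backend"] != base["backend"]:
            failures.append(f"{name}: backend changed {base['backend']} -> {new['backend']}")
    return failures

# Invented example: quality and latency are unchanged, but the backend
# silently fell back to something else.
baseline = {"retrieval": {"quality": {"map": 0.74}, "latency_ms": 35.0, "backend": "encoder-a"}}
latest = {"retrieval": {"quality": {"map": 0.74}, "latency_ms": 36.0, "backend": "fallback"}}
for failure in check_gate(baseline, latest):
    print(failure)
```

In CI, a wrapper script would exit non-zero whenever the returned list is non-empty, which is what turns the comparison into a failed build.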
Honesty is the feature
It would be easy to write an eval that always prints reassuring 1.0s. We did the opposite on purpose, because a test you can't fail can't protect you. A few examples of the harness telling on us:
- The faithfulness check, when no local language model is available, openly reports that Ask fell back to quoting verbatim — so the row never pretends to have tested fluent generation it didn't run.
- The color search test grounds on the measured dominant color of each image, not its filename — which is how we learned a red circle on a white field is, correctly, white-dominant.
- The named-entity test includes names that aren't in the library at all, and checks the channel returns nothing — an exact-match facet should never invent a match.
- The sounds-like (phonetic) test is deliberately aspirational: it asks for misspellings like “huricane” and “expresso,” and scores about 0.79 because the compact, dependency-free phonetic encoder genuinely misses a few tricky ones. The eval is allowed to be imperfect, because it's measuring real quality, not rubber-stamping a known-good answer.
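The color example in the list above is easy to reproduce with a toy image. This sketch assumes "dominant" means the most frequent exact pixel value, which may be simpler than what the harness measures; the image size and radii are arbitrary.

```python
from collections import Counter

RED, WHITE = (255, 0, 0), (255, 255, 255)

def dominant_color(pixels):
    # Most frequent pixel value in a flat list of RGB tuples.
    return Counter(pixels).most_common(1)[0][0]

def red_circle_on_white(size=100, radius=30):
    # Build a size x size image as a flat pixel list: a red disc centered
    # on a white background.
    center = (size - 1) / 2
    return [
        RED if (x - center) ** 2 + (y - center) ** 2 <= radius ** 2 else WHITE
        for y in range(size)
        for x in range(size)
    ]

print(dominant_color(red_circle_on_white(radius=30)))   # (255, 255, 255)
print(dominant_color(red_circle_on_white(radius=45)))   # (255, 0, 0)
```

A disc of radius 30 covers well under half of a 100 by 100 image, so white wins; only a much larger disc makes the image red-dominant. Grounding the test on the measured pixels rather than on a filename is what surfaces that.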
Why this is in the open
The harness, the test sets, and the exact numbers ship in the repository and run on the clips bundled with the app. That's deliberate. MediaFind's pitch is that a private, on-device tool can be verifiable in ways a cloud black box can't — and “our search is good, trust us” is exactly the kind of unverifiable claim we'd rather not make. You can run the eval yourself, read the queries, and check the numbers. The measurement is part of the product, not marketing around it.
One honest limit, stated plainly: because MediaFind keeps everything on your machine, there's no stream of real user queries to learn from. “Continuous” here means every commit in CI, not every user in production — and the test sets are dozens of queries, not thousands, so the confidence intervals are wider than we'd like. Growing them is ongoing work. But the direction is the point: search quality you can see a number for, watch over time, and stop from quietly rotting.
Search you can actually trust — and check
Private, on-device, and measured. Point it at your library and see for yourself.