One moment, five channels, one result: deduping search with rank fusion
Type one query and MediaFind asks it of a dozen different search channels at once. That's the point — but it means the same instant in the same file can come back from several of them: the visual channel sees the frame, OCR reads the caption burned into it, the logo channel spots a brand in the corner. Show all three and you've got one moment masquerading as three results. Here's how MediaFind collapses them into one — and why it merges by rank, not score.
MediaFind's search bar fans a single query out across many channels: transcript, semantic, visual/CLIP, OCR, logos, actions, scenes, faces, entities, and more. Each one returns its own ranked list of hits. Run them all and you get broad recall: whatever modality your memory of a moment lives in, some channel finds it. (We wrote about the channels themselves in “One search bar, fourteen channels.”)
The catch is overlap. A single video frame is fair game for the visual channel, the OCR channel, the logo channel, the action channel and the scene channel simultaneously. A single spoken sentence can surface from both the transcript channel and the phonetic channel. Concatenate the lists and the same moment appears two, three, five times — pushing genuinely different results off the first page. So before anything reaches your screen, the lists are fused: duplicates of the same moment merge into one result. Two questions decide how.
1. When are two hits “the same moment”?
Before you can merge duplicates you need a definition of duplicate — an identity that's stable no matter which channel produced the hit. MediaFind computes a fusion key per hit, and it takes one of three shapes depending on what the hit points at:
- A frame. The visual, OCR, logo, action, scene, object, and color channels all point at a specific video frame, so they key on (media_path, frame_index). Same file, same frame number → same moment, regardless of which channel got there.
- A transcript clip. The transcript, emotion, and phonetic channels point at a span of speech, so they key on (media_path, start, end), with the timestamps rounded to two decimals: close enough that the same sentence from two channels lands on the same key.
- Everything else. File-level summaries, plus face and transcript-sourced entity appearances that only carry a synthesized id, keep their own per-channel (modality, id) identity, so unrelated hits never accidentally collapse together.
The first two are the whole game: they let a frame found five ways, or a clip found two ways, resolve to one key. The third is a deliberate safety floor — when there's no physical anchor to agree on, hits stay separate rather than risk a wrong merge.
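To make the three key shapes concrete, here is a minimal sketch in Python. It is not MediaFind's code: the hit dictionary layout, the channel name strings, and the leading "frame" / "clip" / "other" tag (added so keys of different shapes can never compare equal) are all assumptions made for illustration. Only the grouping of channels and the two-decimal rounding come from the description above.

```python
from typing import Any, Hashable

# Channel groupings follow the article; the exact name strings are assumed.
FRAME_CHANNELS = {"visual", "ocr", "logo", "action", "scene", "object", "color"}
CLIP_CHANNELS = {"transcript", "emotion", "phonetic"}


def fusion_key(hit: dict[str, Any]) -> tuple[Hashable, ...]:
    """Return a channel-independent identity for the moment a hit points at.

    `hit` is a hypothetical dict with a "modality" field plus whatever
    anchor fields that modality carries.
    """
    modality = hit["modality"]
    if modality in FRAME_CHANNELS:
        # Same file + same frame number means the same moment.
        return ("frame", hit["media_path"], hit["frame_index"])
    if modality in CLIP_CHANNELS:
        # Round timestamps so near-identical spans land on one key.
        return ("clip", hit["media_path"],
                round(hit["start"], 2), round(hit["end"], 2))
    # No physical anchor: keep a per-channel identity so nothing merges by accident.
    return ("other", modality, hit["id"])


visual_hit = {"modality": "visual", "media_path": "/lib/a.mp4", "frame_index": 120}
ocr_hit = {"modality": "ocr", "media_path": "/lib/a.mp4", "frame_index": 120}
transcript_hit = {"modality": "transcript", "media_path": "/lib/a.mp4",
                  "start": 12.341, "end": 15.0}
phonetic_hit = {"modality": "phonetic", "media_path": "/lib/a.mp4",
                "start": 12.3449, "end": 15.004}
face_hit = {"modality": "face", "id": "x1"}
entity_hit = {"modality": "entity", "id": "x1"}
```

In this sketch the visual and OCR hits share a key, the two speech hits share a key after rounding, and the face and entity hits stay distinct even though their ids happen to match.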
2. How do you rank the merged list?
Here's the trap. Each channel scores its hits on its own scale — a CLIP cosine similarity, a BM25-style transcript score, a logo-margin number — and those scales are not comparable. You cannot just take the max score across channels and sort by it; you'd be comparing a temperature to a shoe size.
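So MediaFind sorts by rank, not score, using Reciprocal Rank Fusion (RRF). For each list a hit appears in, it contributes 1 / (K + rank), where rank is the hit's position in that list and K is a fixed constant that softens the gap between the top few positions. Those contributions sum. Two consequences fall out of that one line: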
- Corroboration is rewarded. A frame that ranked well in three channels accumulates three contributions and floats to the top — exactly the moment most likely to be what you meant. Agreement across channels is evidence, and RRF spends it.
- Scales never collide. Only the position in each list matters, so a channel with tiny scores and one with huge scores combine cleanly. No normalization, no per-channel tuning.
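The scoring rule itself is short enough to sketch. The example below assumes 1-based ranks, K = 60 (a commonly used default for RRF; the value MediaFind actually uses is not stated here), and that each fusion key appears at most once per channel list. The keys and channel lists are made up.

```python
from collections import defaultdict

K = 60  # assumed constant; MediaFind's actual value is not given in the article


def rrf_scores(ranked_lists: dict[str, list[str]], k: int = K) -> dict[str, float]:
    """Sum 1 / (k + rank) over every list a key appears in (ranks start at 1)."""
    scores: defaultdict[str, float] = defaultdict(float)
    for keys in ranked_lists.values():
        for rank, key in enumerate(keys, start=1):
            scores[key] += 1.0 / (k + rank)
    return dict(scores)


# Hypothetical per-channel results, best hit first. Raw scores are never consulted.
lists = {
    "visual": ["frame_A", "frame_B", "frame_C"],
    "ocr": ["frame_C", "frame_A"],
    "logo": ["frame_A"],
}
scores = rrf_scores(lists)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Here frame_A is found by all three channels and collects 1/61 + 1/62 + 1/61, frame_C collects 1/63 + 1/61 from two channels, and frame_B gets a single 1/62, so the fused order is frame_A, frame_C, frame_B. Note that frame_B outranks frame_C in the visual list alone; corroboration from OCR is what lifts frame_C above it.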
When two hits share a fusion key, the one that ranked best in its own channel becomes the representative — the card you actually see, and the modality badge it wears. The loser's RRF contribution still counts toward the merged score; it just doesn't drive the display.
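A sketch of how representative selection could sit alongside the score accumulation, under stated assumptions: hits are plain dicts carrying a precomputed "key" and a "modality", ranks are 1-based, K = 60 is an assumed constant, and a tie on rank keeps whichever hit was seen first (the article does not say how ties are broken).

```python
from dataclasses import dataclass

K = 60  # assumed RRF constant, not taken from MediaFind


@dataclass
class Fused:
    representative: dict  # the hit that drives the card and its modality badge
    best_rank: int        # best in-channel rank seen so far for this key
    score: float = 0.0    # summed RRF contributions from every channel


def fuse(ranked_lists: dict[str, list[dict]]) -> list[Fused]:
    """Merge per-channel ranked hits by key and order the result by RRF score."""
    fused: dict[str, Fused] = {}
    for hits in ranked_lists.values():
        for rank, hit in enumerate(hits, start=1):
            entry = fused.get(hit["key"])
            if entry is None:
                entry = fused[hit["key"]] = Fused(representative=hit, best_rank=rank)
            elif rank < entry.best_rank:
                # A better in-channel rank takes over the display.
                entry.representative, entry.best_rank = hit, rank
            # Every matching hit still contributes, representative or not.
            entry.score += 1.0 / (K + rank)
    return sorted(fused.values(), key=lambda f: f.score, reverse=True)


results = fuse({
    "visual": [{"key": "A", "modality": "visual"}, {"key": "B", "modality": "visual"}],
    "ocr": [{"key": "B", "modality": "ocr"}],
})
```

In this toy run, moment B is second in the visual list and first in the OCR list, so the OCR hit becomes the representative while the visual hit's 1/62 still counts toward B's score, which puts B ahead of A.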
Don't throw away the other channels' work
Picking one representative would be lossy if it stopped there. The visual channel knows the frame is visually relevant, but it's the OCR channel that read the caption and the logo channel that named the brand. Drop those and the result card gets thinner the more channels agreed — backwards.
So merging is additive. As hits fold together, the representative inherits any supplementary field it's missing from the others — on-screen text, detected brand, action and scene labels, object lists, thumbnails, the person's name, and so on. The rule is simple: a value already present on the representative always wins; the merge only ever fills gaps, never overwrites. The result also records which channels matched it, so the UI can show a “matched in N channels” badge — but only when more than one actually agreed, so a plain single-channel hit stays plain.
All of it, on your Mac
None of this is a re-ranking service or a cloud call. The channels run locally over the index MediaFind already built; the fusion is a few dozen lines of arithmetic over their ranked lists. Your query, the per-channel scores, and the merged results never leave the device — fusion is just bookkeeping on data you already own.
So when a single result quietly tells you it “matched in 3 channels,” that's the fusion step showing its work: several independent views of your library agreed on the same moment, and you got it once.
Search your library every way at once
A dozen channels, one ranked list, all on-device. Free trial.