One search bar, fourteen channels: how MediaFind finds anything

Type “a rocket on the launchpad,” “every clip that names Acme,” or “the part about budget cuts” into the same box and the right moments come back. They come back because behind that one bar are fourteen independent retrievers — each good at a different kind of question — blended into a single ranked list. Here's every one of them.

Most search products have one notion of a match: the document contains your words, or its embedding is near your query. MediaFind has fourteen, because a media library isn't one kind of haystack. The thing you're looking for might be a sentence somebody said, a logo that flashed on screen for two frames, a face, a kitchen, the word “QUARTERLY” on a slide, or a dog barking. No single retriever is good at all of those — so MediaFind doesn't use one.

Instead, search is a channel registry. Each channel is a self-contained retriever that takes your query, searches its own slice of the index, and returns a ranked list of moments. A dispatcher decides which channels to run for a given query, runs them in parallel, and fuses their lists into one. Adding a new way to find things — entities, scenes, song recognition — means adding a channel, not rewriting search.
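As a rough picture of that contract, here is a minimal sketch of a registry plus a parallel dispatcher. It is not MediaFind's code: the names `REGISTRY`, `register`, and `dispatch`, the `(file, second)` shape of a moment, and the two toy channels are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Assumed shapes: a moment is (file, start_second); a channel maps a query
# to a ranked list of moments, best first.
Moment = tuple[str, int]
Channel = Callable[[str], list[Moment]]

REGISTRY: dict[str, Channel] = {}  # hypothetical name -> retriever table


def register(name: str) -> Callable[[Channel], Channel]:
    """Decorator that adds a retriever to the registry under `name`."""
    def wrap(fn: Channel) -> Channel:
        REGISTRY[name] = fn
        return fn
    return wrap


@register("transcript")
def transcript_channel(query: str) -> list[Moment]:
    # Toy stand-in for a real retriever over spoken segments.
    return [("launch.mp4", 12)] if "rocket" in query.lower() else []


@register("ocr")
def ocr_channel(query: str) -> list[Moment]:
    # Toy stand-in for a real retriever over on-screen text.
    return [("slides.mp4", 3)] if "quarterly" in query.lower() else []


def dispatch(query: str, selected: list[str]) -> dict[str, list[Moment]]:
    """Run the selected channels in parallel and collect each ranked list."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(REGISTRY[name], query) for name in selected}
        return {name: future.result() for name, future in futures.items()}
```

In this sketch, adding a way to search is one more `@register(...)` function; `dispatch` and the existing channels do not change.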

The whole idea in one line. Different questions want different retrievers. So MediaFind runs many narrow, honest channels and blends them — rather than one fuzzy channel that's mediocre at everything.
[Figure: the query “rocket launch” fans out to the channels (transcript, visual · CLIP, OCR, people · entity, action · scene · logo · …), then “RRF fuse + dedup” produces one ranked list.]
One query fans out to every selected channel — each runs on its own thread against its own slice of the index — then reciprocal-rank fusion blends the lists and collapses the same physical moment found by several channels into a single, higher-ranked result.

The six default channels

Type a plain query and hit search, and six channels run — chosen because they're either free to run on every query or return nothing unless they're relevant, so they never add noise or cost you don't want.

1 · Transcript — what was said

The flagship. A bi-encoder embeds your query and retrieves the nearest spoken segments by meaning; a cross-encoder then reranks the top candidates for precision. “The part about budget cuts” finds a clip that says “we're trimming spend” — no shared words required. If the stored embeddings are ever stale, it falls back to literal keyword matching so a query never silently returns nothing. You can also scope it to a single speaker (“only what Dana said”).
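The two-stage shape, with its keyword fallback, can be sketched as follows. This is an illustration under assumptions, not the shipped pipeline: `embed` and `rerank_score` are hypothetical callables standing in for the bi-encoder and cross-encoder, and "stale" is modelled simply as missing or mismatched stored vectors.

```python
import math
from typing import Callable, Optional, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_transcript(
    query: str,
    segments: list[str],
    embeddings: Optional[list[Vector]],          # None models stale/missing vectors
    embed: Callable[[str], Vector],              # hypothetical bi-encoder
    rerank_score: Callable[[str, str], float],   # hypothetical cross-encoder
    top_k: int = 50,
) -> list[str]:
    if embeddings is None or len(embeddings) != len(segments):
        # Fallback: literal keyword match instead of silently returning nothing.
        words = query.lower().split()
        return [s for s in segments if any(w in s.lower() for w in words)]
    q = embed(query)
    # Stage 1: cheap nearest-neighbour recall over the stored vectors.
    order = sorted(range(len(segments)),
                   key=lambda i: cosine(q, embeddings[i]), reverse=True)
    shortlist = [segments[i] for i in order[:top_k]]
    # Stage 2: slower, more precise pairwise scoring of the shortlist only.
    return sorted(shortlist, key=lambda s: rerank_score(query, s), reverse=True)
```

The point of the split is cost: the bi-encoder compares one query vector against many precomputed vectors, while the cross-encoder reads query and segment together and so is only affordable on a short list.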

2 · Visual — what's on screen

CLIP embeds sampled video frames and your query into the same space, so “a rocket blasting off” matches the footage even if nobody narrates it. This is image understanding, not text — it finds scenes by their look. (It needs the visual model; without it the channel honestly returns nothing rather than guessing.)

3 · OCR — text on screen

Every frame is read at index time, so printed text is searchable: lower-thirds, slide titles, product labels, street signs, chyrons. A frame found by its caption reads as an honest “on-screen text” hit, distinct from a generic visual match.

4 · Summary — which file is about this

Coarser than the rest: each file gets a short summary at index time, and this channel searches those. It answers “which video is about the merger,” not “which second.” Because summaries are file-level and a speaker filter is segment-level, this channel sits out speaker-scoped searches.

5 · People — who's in it

Rides along on every search for free, and returns nothing unless your query names a person MediaFind knows: a face cluster you've named, or a recognized public figure. Name one and you get every clip they appear in, linked back to their face.

6 · Entity — exact names

The one thing semantic search is worst at. Embeddings blur “Acme Corp,” “Acme Inc,” and “a generic widget company” together, so when you want every clip that names exactly Acme, you need a literal index. The entity channel matches transcript and on-screen text against names MediaFind already trusts: keyless, exact, high-precision.
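A literal matcher of this kind is small. The sketch below assumes case-insensitive, whole-phrase matching and a plain list of trusted names; the function name and both of those choices are illustrative, not a description of MediaFind's index.

```python
import re


def find_entity_mentions(trusted_names: list[str], segments: list[str]) -> list[tuple[str, int]]:
    """Return (name, segment_index) for every whole-phrase match."""
    hits = []
    for name in trusted_names:
        # Lookarounds instead of \b so names ending in punctuation still
        # match, and "Acme" does not match inside "Acmetech".
        pattern = re.compile(r"(?<!\w)" + re.escape(name) + r"(?!\w)", re.IGNORECASE)
        for index, text in enumerate(segments):
            if pattern.search(text):
                hits.append((name, index))
    return hits


segments = [
    "Acme Corp beat estimates",
    "a generic widget company",
    "Acmetech is unrelated",
    "welcome to ACME CORP",
]
mentions = find_entity_mentions(["Acme Corp"], segments)
```

Here `mentions` contains segments 0 and 3 only: the paraphrase and the lookalike name are not matched, which is the precision an embedding search cannot promise.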

The opt-in channels

These run a detector or cost real compute, so they're off by default and you switch them on when you want them — usually by picking that mode in the search bar. Keeping them opt-in is what keeps a default query fast.

7 · Action — what's happening

Zero-shot CLIP, but for verbs: “running,” “a handshake,” “someone cooking.” It's free and keyless, but it's opt-in so MediaFind doesn't run CLIP over every general text query.

8 · Scene — where it is

The setting of a frame — kitchen, beach, office, courtroom, stadium, forest. CLIP scores each frame against a curated vocabulary of places, and a scene only “fires” when it clearly beats generic non-place anchors, so close-up talking-head shots correctly get no scene tag. You can also free-text any place (“rooftop bar”) and it searches frames directly, beyond the fixed list.
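The "only fires when it clearly beats the anchors" rule amounts to a margin test. A minimal sketch, assuming each frame already has similarity scores against place prompts and against generic non-place prompts; the `margin` value and the prompt wording are made up for the example.

```python
from typing import Optional


def pick_scene(
    place_scores: dict[str, float],
    anchor_scores: dict[str, float],
    margin: float = 0.05,  # illustrative threshold, not a documented value
) -> Optional[str]:
    """Return the best place label, or None if no place clearly beats the anchors."""
    best_place, best_score = max(place_scores.items(), key=lambda kv: kv[1])
    strongest_anchor = max(anchor_scores.values())
    return best_place if best_score - strongest_anchor >= margin else None


# A wide shot of a kitchen: the place prompt wins by a clear margin.
wide = pick_scene({"kitchen": 0.31, "beach": 0.12}, {"a close-up of a face": 0.20})

# A talking head: the generic anchor wins, so no scene tag is assigned.
close_up = pick_scene({"kitchen": 0.22, "beach": 0.12}, {"a close-up of a face": 0.27})
```

Without the anchors, the best-scoring place would always win, and every close-up would be tagged with whichever place happened to score least badly.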

9 · Logo — which brands appear

Brand-logo detection by name or by a sample image, again via keyless zero-shot CLIP reusing the frame embeddings. This one is a Pro channel and never folds into the default mix.

10 · Audio — non-speech sound

Sounds that aren't words: applause, a dog barking, music, a doorbell — plus on-device song recognition. It's opt-in and stays quiet (returns nothing) until a detector has populated the audio-event index for your library.

11 · Object — things, and how many

Detected objects with bounding boxes and counts, which powers count-filtered queries like “frames with three or more people.” Like audio, it's opt-in and waits until an object detector has run over your frames.
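A count filter is a group-and-threshold over stored detections. The sketch below assumes a simple `Detection` record (frame index, label, box); the record layout and function name are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    frame: int                                 # frame index within a file
    label: str                                 # e.g. "person"
    box: tuple[float, float, float, float]     # x, y, width, height


def frames_with_at_least(detections: list[Detection], label: str, minimum: int) -> list[int]:
    """Frames containing `minimum` or more detections of `label`."""
    per_frame = Counter(d.frame for d in detections if d.label == label)
    return sorted(frame for frame, count in per_frame.items() if count >= minimum)


detections = [
    Detection(0, "person", (0.1, 0.1, 0.2, 0.5)),
    Detection(0, "person", (0.4, 0.1, 0.2, 0.5)),
    Detection(0, "person", (0.7, 0.1, 0.2, 0.5)),
    Detection(1, "person", (0.3, 0.2, 0.2, 0.5)),
    Detection(1, "dog", (0.6, 0.6, 0.2, 0.2)),
]
crowded = frames_with_at_least(detections, "person", 3)
```

`crowded` holds only frame 0, the one frame with three people in it.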

Three refinement modes

The last three live in the registry but are never part of the default blend — they're standalone modes you select on their own, to slice an existing search a different way.

Mode | What it matches
12 · Color | The dominant palette of a frame: find shots by their look (“mostly teal,” “warm golden-hour tones”)
13 · Emotion | The affect of what's said (the angry exchange, the excited pitch), over the transcript
14 · Phonetic | Sounds-like recall: catches names and words the transcript spelled differently than you typed
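To make "sounds-like" concrete, here is classic American Soundex, a well-known phonetic key. The article does not say which algorithm the Phonetic mode uses, so treat this purely as a stand-in that shows the idea: differently spelled words collapse to the same short code.

```python
# Consonant groups that Soundex treats as sounding alike.
_GROUPS = {"1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r"}
_CODES = {letter: digit for digit, letters in _GROUPS.items() for letter in letters}


def soundex(word: str) -> str:
    """Four-character Soundex key, e.g. a letter followed by three digits."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    previous = _CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not break a run of same-coded letters
        code = _CODES.get(c, "")
        if code and code != previous:
            key += code
        previous = code  # a vowel resets the run, so a repeat is coded again
    return (key + "000")[:4]


def sounds_alike(a: str, b: str) -> bool:
    return soundex(a) == soundex(b)
```

With this key, "Smith" and "Smyth" match, as do "Jon" and "John". Soundex keeps the first letter as-is, so "Katherine" and "Catherine" do not match; a production phonetic matcher would likely need something more forgiving.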

How the channels become one list

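Running many retrievers is only half the design; the other half is blending them well. Three things make the fan-out work:

Reciprocal-rank fusion blends the lists. Each channel contributes according to the rank it gave a moment rather than its raw score, so channels whose scores aren't comparable can still be combined.

The same moment is counted once. When several channels find the same physical moment, those hits collapse into a single result that ranks higher for the agreement.

Every channel has a relevance floor. A channel that finds nothing genuinely relevant returns nothing, rather than its nearest-but-irrelevant guess. So switching more channels on widens what you can find without flooding clean queries with junk.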
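A relevance floor followed by reciprocal-rank fusion can be sketched in a few lines. Assumptions here: moments are plain string ids, the floor value is invented, and `k = 60` is the constant conventionally used with RRF rather than anything the article states. Each moment scores the sum of 1 / (k + rank) over the channels that returned it.

```python
from collections import defaultdict


def apply_floor(scored_hits: list[tuple[str, float]], floor: float) -> list[str]:
    """Drop hits below the channel's floor; return the rest, best first."""
    kept = [(moment, score) for moment, score in scored_hits if score >= floor]
    return [moment for moment, _ in sorted(kept, key=lambda ms: ms[1], reverse=True)]


def rrf_fuse(ranked_lists: dict[str, list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse per-channel rankings; a moment found by several channels sums its votes."""
    scores: dict[str, float] = defaultdict(float)
    for moments in ranked_lists.values():
        for rank, moment in enumerate(moments, start=1):
            scores[moment] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


lists = {
    "transcript": apply_floor([("a", 0.9), ("b", 0.6), ("z", 0.1)], floor=0.3),
    "visual": apply_floor([("b", 0.8), ("c", 0.5)], floor=0.3),
    "audio": apply_floor([("q", 0.05)], floor=0.3),  # nothing relevant: empty list
}
fused = [moment for moment, _ in rrf_fuse(lists)]
```

Keying the scores by moment id is what performs the dedup: "b" appears in two lists, is counted once, and outranks "a" even though "a" was the transcript channel's top hit. The audio channel's weak guess never reaches the fused list.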

Why a registry, and not one big retriever

The channel design is what lets MediaFind grow sideways. Song recognition, scene detection, the entity index, count-filtered object search — each shipped as a new channel slotted behind the same dispatch-and-fuse contract, with the older channels untouched. It's also what keeps the product honest about its dependencies: a channel whose model isn't installed degrades to returning nothing, while every other channel keeps working. You lose a way to search, never the search.
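That degradation rule is easy to picture as a wrapper around each channel. The sketch below is an assumption about shape only: `ModelUnavailable` and `degrade_to_empty` are hypothetical names, and the two channels are toys.

```python
from typing import Callable

Channel = Callable[[str], list[str]]


class ModelUnavailable(Exception):
    """Hypothetical error a channel raises when its model is not installed."""


def degrade_to_empty(channel: Channel) -> Channel:
    """Wrap a channel so a missing model yields no results instead of an error."""
    def run(query: str) -> list[str]:
        try:
            return channel(query)
        except ModelUnavailable:
            return []
    return run


def visual(query: str) -> list[str]:
    raise ModelUnavailable("visual model is not installed")


def transcript(query: str) -> list[str]:
    return ["launch.mp4@12s"]


channels = {"visual": degrade_to_empty(visual), "transcript": degrade_to_empty(transcript)}
results = {name: run("rocket launch") for name, run in channels.items()}
```

The visual channel contributes an empty list and the transcript hit still comes back, so the search as a whole keeps working.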

And it's why the single search bar can answer such different questions. You're not choosing a mode for every query — you're aiming a fleet of specialists at your library at once, and reading back the handful of moments they agree on.


Said, shown, written, who's in it, what's happening, where, which brand, what sound. Eight kinds of question, fourteen channels, one ranked list — all on your Mac, with nothing uploaded.

Point every channel at your own library

Free trial. No account, no API keys, nothing leaves your Mac.
