One search bar, fourteen channels: how MediaFind finds anything
Type “a rocket on the launchpad,” “every clip that names Acme,” or “the part about budget cuts” into the same box and the right moments come back. They come back because behind that one bar are fourteen independent retrievers — each good at a different kind of question — blended into a single ranked list. Here's every one of them.
Most search products have one notion of a match: the document contains your words, or its embedding is near your query. MediaFind has fourteen, because a media library isn't one kind of haystack. The thing you're looking for might be a sentence somebody said, a logo that flashed on screen for two frames, a face, a kitchen, the word “QUARTERLY” on a slide, or a dog barking. No single retriever is good at all of those — so MediaFind doesn't use one.
Instead, search is a channel registry. Each channel is a self-contained retriever that takes your query, searches its own slice of the index, and returns a ranked list of moments. A dispatcher decides which channels to run for a given query, runs them in parallel, and fuses their lists into one. Adding a new way to find things — entities, scenes, song recognition — means adding a channel, not rewriting search.
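To make the registry idea concrete, here is a minimal sketch rather than MediaFind's actual code. The names `Hit`, `REGISTRY`, `register`, and `dispatch` are hypothetical, and the two channels are placeholders that stand in for real index lookups.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Hit:
    file_id: str
    start_s: float  # where the moment starts, in seconds
    score: float    # channel-local score, not comparable across channels


# A channel is any callable that maps a query to a ranked list of hits.
Channel = Callable[[str], List[Hit]]

REGISTRY: Dict[str, Channel] = {}  # hypothetical global registry


def register(name: str) -> Callable[[Channel], Channel]:
    """Decorator that adds a retriever to the registry under `name`."""
    def wrap(fn: Channel) -> Channel:
        REGISTRY[name] = fn
        return fn
    return wrap


@register("transcript")
def transcript_channel(query: str) -> List[Hit]:
    # Placeholder logic: a real channel would search its own slice of the index.
    return [Hit("demo.mp4", 12.0, 0.91)] if "budget" in query.lower() else []


@register("ocr")
def ocr_channel(query: str) -> List[Hit]:
    # Placeholder logic for on-screen text.
    return [Hit("demo.mp4", 40.0, 0.80)] if "quarterly" in query.lower() else []


def dispatch(query: str, selected: List[str]) -> Dict[str, List[Hit]]:
    """Run each selected channel and return its ranked list, keyed by channel name."""
    return {name: REGISTRY[name](query) for name in selected}
```

In this sketch, adding a new way to search is one more decorated function; `dispatch` and the existing channels do not change.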
The six default channels
Type a plain query and hit search, and six channels run — chosen because they're either free to run on every query or return nothing unless they're relevant, so they never add noise or cost you don't want.
1 · Transcript — what was said
The flagship. A bi-encoder embeds your query and retrieves the nearest spoken segments by meaning; a cross-encoder then reranks the top candidates for precision. “The part about budget cuts” finds a clip that says “we're trimming spend” — no shared words required. If the stored embeddings are ever stale, it falls back to literal keyword matching so a query never silently returns nothing. You can also scope it to a single speaker (“only what Dana said”).
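The retrieve-then-rerank flow can be sketched as follows. This is an illustration under assumptions, not the shipped pipeline: `embed` and `rerank_score` are caller-supplied stand-ins for the bi-encoder and cross-encoder, `Segment` is a hypothetical record, and a `vector` of `None` stands in for a stale embedding.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class Segment:
    text: str
    speaker: str
    vector: Optional[List[float]]  # None stands in for a stale or missing embedding


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search_transcript(
    query: str,
    segments: List[Segment],
    embed: Callable[[str], List[float]],         # stand-in for the bi-encoder
    rerank_score: Callable[[str, str], float],   # stand-in for the cross-encoder
    top_k: int = 3,
    speaker: Optional[str] = None,
) -> List[Segment]:
    # Optional speaker scope ("only what Dana said").
    pool = [s for s in segments if speaker is None or s.speaker == speaker]
    fresh = [s for s in pool if s.vector is not None]
    if not fresh:
        # Fallback (assumed trigger): no usable embeddings, so match keywords literally.
        words = query.lower().split()
        return [s for s in pool if any(w in s.text.lower() for w in words)]
    # Stage 1: nearest segments by embedding similarity.
    q = embed(query)
    candidates = sorted(fresh, key=lambda s: cosine(q, s.vector), reverse=True)[:top_k]
    # Stage 2: rerank only the shortlisted candidates with the more precise scorer.
    return sorted(candidates, key=lambda s: rerank_score(query, s.text), reverse=True)
```

The design point is that the expensive scorer only ever sees `top_k` candidates, while the cheap vector comparison covers the whole pool.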
2 · Visual — what's on screen
CLIP embeds sampled video frames and your query into the same space, so “a rocket blasting off” matches the footage even if nobody narrates it. This is image understanding, not text — it finds scenes by their look. (It needs the visual model; without it the channel honestly returns nothing rather than guessing.)
3 · OCR — text on screen
Every frame is read at index time, so printed text is searchable: lower-thirds, slide titles, product labels, street signs, chyrons. A frame found by its caption reads as an honest “on-screen text” hit, distinct from a generic visual match.
4 · Summary — which file is about this
Coarser than the rest: each file gets a short summary at index time, and this channel searches those. It answers “which video is about the merger,” not “which second.” Because summaries are file-level and a speaker filter is segment-level, this channel sits out speaker-scoped searches.
5 · People — who's in it
Rides along on every search for free, and returns nothing unless your query names a person MediaFind knows: a face cluster you've named, or a recognized public figure. Name one and you get every clip they appear in, linked back to their face.
6 · Entity — exact names
The one thing semantic search is worst at. Embeddings blur “Acme Corp,” “Acme Inc,” and “a generic widget company” together, so when you want every clip that names exactly Acme, you need a literal index. The entity channel matches transcript and on-screen text against names MediaFind already trusts: keyless, exact, high-precision.
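A literal index of this kind might look like the sketch below. The function name, the whole-word rule, and the case-insensitive comparison are assumptions made for illustration; the article only says matching is exact.

```python
import re
from typing import Dict, List


def build_entity_index(segments: Dict[str, str], trusted_names: List[str]) -> Dict[str, List[str]]:
    """Map each trusted name to the ids of segments whose text contains it as a whole word."""
    index: Dict[str, List[str]] = {name: [] for name in trusted_names}
    for seg_id, text in segments.items():
        for name in trusted_names:
            # re.escape keeps the name literal; the lookarounds reject matches inside longer words.
            pattern = rf"(?<!\w){re.escape(name)}(?!\w)"
            if re.search(pattern, text, re.IGNORECASE):
                index[name].append(seg_id)
    return index
```

Because nothing here is embedded or scored, “a generic widget company” can never be returned for “Acme”, which is the precision the channel exists for.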
The opt-in channels
These run a detector or cost real compute, so they're off by default and you switch them on when you want them — usually by picking that mode in the search bar. Keeping them opt-in is what keeps a default query fast.
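One way to picture the default-versus-opt-in split is a small selection function. This is a sketch under assumptions: the function name is hypothetical, and it assumes an opted-in channel runs alongside the defaults, which the article does not spell out.

```python
from typing import Iterable, List, Optional

DEFAULT_CHANNELS = ("transcript", "visual", "ocr", "summary", "people", "entity")
OPT_IN_CHANNELS = ("action", "scene", "logo", "audio", "object")


def select_channels(opt_in: Iterable[str] = (), speaker: Optional[str] = None) -> List[str]:
    """Pick the channels to run for one query (hypothetical policy)."""
    requested = set(opt_in)
    unknown = requested - set(OPT_IN_CHANNELS)
    if unknown:
        raise ValueError(f"not an opt-in channel: {sorted(unknown)}")
    # Assumption: opted-in channels are appended to the defaults.
    chosen = list(DEFAULT_CHANNELS) + [c for c in OPT_IN_CHANNELS if c in requested]
    if speaker is not None:
        # Summaries are file-level, so the summary channel sits out speaker-scoped searches.
        chosen.remove("summary")
    return chosen
```

A plain query costs exactly the six default channels; anything heavier has to be asked for by name.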
7 · Action — what's happening
Zero-shot CLIP, but for verbs: “running,” “a handshake,” “someone cooking.” It's free and keyless, but it's opt-in so MediaFind doesn't run CLIP over every general text query.
8 · Scene — where it is
The setting of a frame — kitchen, beach, office, courtroom, stadium, forest. CLIP scores each frame against a curated vocabulary of places, and a scene only “fires” when it clearly beats generic non-place anchors, so close-up talking-head shots correctly get no scene tag. You can also free-text any place (“rooftop bar”) and it searches frames directly, beyond the fixed list.
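The “only fires when it clearly beats the anchors” rule can be sketched as a margin test. The function name, the example labels, and the `margin` value of 0.05 are assumptions; the scores are taken to be per-frame similarities that some upstream model has already computed.

```python
from typing import Dict, Optional


def scene_tag(
    place_scores: Dict[str, float],
    anchor_scores: Dict[str, float],
    margin: float = 0.05,  # assumed threshold, not a documented value
) -> Optional[str]:
    """Return the best-scoring place only if it beats every non-place anchor by `margin`."""
    if not place_scores:
        return None
    best_place = max(place_scores, key=lambda name: place_scores[name])
    best_anchor = max(anchor_scores.values(), default=float("-inf"))
    if place_scores[best_place] - best_anchor >= margin:
        return best_place
    return None  # e.g. a close-up talking head, where no place stands out
```

Without the anchors, every frame would get its least-bad place label; with them, “no scene” becomes a valid answer.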
9 · Logo — which brands appear
Brand-logo detection by name or by a sample image, again via keyless zero-shot CLIP reusing the frame embeddings. This one is a Pro channel and never folds into the default mix.
10 · Audio — non-speech sound
Sounds that aren't words: applause, a dog barking, music, a doorbell — plus on-device song recognition. It's opt-in and stays quiet (returns nothing) until a detector has populated the audio-event index for your library.
11 · Object — things, and how many
Detected objects with bounding boxes and counts, which powers count-filtered queries like “frames with three or more people.” Like audio, it's opt-in and waits until an object detector has run over your frames.
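A count-filtered query reduces to counting labels per frame. The sketch below assumes a hypothetical `Detection` record and box layout; it is not the product's schema.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Detection:
    label: str
    box: Tuple[float, float, float, float]  # (x, y, width, height); layout is an assumption


def frames_with_at_least(
    detections_by_frame: Dict[str, List[Detection]],
    label: str,
    minimum: int,
) -> List[str]:
    """Return ids of frames containing at least `minimum` detections of `label`."""
    matches = []
    for frame_id, detections in detections_by_frame.items():
        counts = Counter(d.label for d in detections)
        if counts[label] >= minimum:  # Counter returns 0 for labels it never saw
            matches.append(frame_id)
    return matches
```

With this shape, “frames with three or more people” is `frames_with_at_least(index, "person", 3)`.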
Three refinement modes
The last three live in the registry but are never part of the default blend — they're standalone modes you select on their own, to slice an existing search a different way.
| Mode | What it matches |
|---|---|
| 12 · Color | The dominant palette of a frame — find shots by their look (“mostly teal,” “warm golden-hour tones”) |
| 13 · Emotion | The affect of what's said — the angry exchange, the excited pitch — over the transcript |
| 14 · Phonetic | Sounds-like recall — catches names and words the transcript spelled differently than you typed |
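
To show what sounds-like matching means in practice, here is classic American Soundex. It is used purely as an illustration; the article does not say which phonetic algorithm MediaFind uses.

```python
# Soundex digit for each consonant; vowels, H, W, and Y have no digit.
_CODES = {
    letter: digit
    for digit, letters in {
        "1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R",
    }.items()
    for letter in letters
}


def soundex(word: str) -> str:
    """Return the four-character American Soundex key for `word`."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    key = [letters[0]]
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate two letters with the same digit
        code = _CODES.get(c, "")
        if code and code != prev:
            key.append(code)
        prev = code  # a vowel resets prev, so the same digit may repeat after it
    return ("".join(key) + "000")[:4]
```

Two spellings with the same key, such as “Smith” and “Smyth”, are treated as the same sound, which is how a name the transcript spelled differently can still be recalled.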
How the channels become one list
Running many retrievers is only half the design; the other half is blending them well. Three things make the fan-out work:
- Parallel, isolated execution. Each selected channel runs on its own thread with its own database connection, so a five-channel search isn't five times slower than one. The query is encoded once and the read-only vector is shared across the text channels; encoding is the dominant per-query cost, so this removes the redundant passes.
- Reciprocal-rank fusion. Channels return scores that aren't comparable — a CLIP cosine and a cross-encoder logit live on different scales — so MediaFind fuses by rank, not raw score. A moment that several channels each rank highly rises to the top.
- Corroboration, deduplicated. If the visual, on-screen-text, and logo channels all surface the same two-second moment, it appears once — ranked higher for the agreement — and the result lists which channels matched it. One moment, not three rows.
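
The three points above can be sketched together in a few lines. Treat this as an illustration under assumptions: moments are represented by plain string ids, the function names are hypothetical, and `k = 60` is the constant commonly used for reciprocal-rank fusion, not a documented MediaFind setting.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple


def run_channels(query: str, channels: Dict[str, Callable[[str], List[str]]]) -> Dict[str, List[str]]:
    """Run every channel on its own worker thread and collect the ranked lists."""
    with ThreadPoolExecutor(max_workers=max(1, len(channels))) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in channels.items()}
        return {name: future.result() for name, future in futures.items()}


def rrf_fuse(ranked: Dict[str, List[str]], k: int = 60) -> List[Tuple[str, float, List[str]]]:
    """Fuse per-channel rankings by rank position, ignoring raw scores."""
    scores: Dict[str, float] = defaultdict(float)
    matched: Dict[str, List[str]] = defaultdict(list)
    for channel, moments in ranked.items():
        for rank, moment in enumerate(moments, start=1):
            scores[moment] += 1.0 / (k + rank)  # each channel adds 1 / (k + rank)
            matched[moment].append(channel)     # remember which channels agreed
    order = sorted(scores, key=lambda m: scores[m], reverse=True)
    # One row per moment, with its fused score and the channels that surfaced it.
    return [(m, scores[m], matched[m]) for m in order]
```

Because the fused score is a sum over channels, a moment that three channels each rank near the top outranks one that a single channel ranks first, and it still appears as a single row.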
Why a registry, and not one big retriever
The channel design is what lets MediaFind grow sideways. Song recognition, scene detection, the entity index, count-filtered object search — each shipped as a new channel slotted behind the same dispatch-and-fuse contract, with the older channels untouched. It's also what keeps the product honest about its dependencies: a channel whose model isn't installed degrades to returning nothing, while every other channel keeps working. You lose a way to search, never the search.
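The “degrades to returning nothing” behavior might be implemented along these lines. `ChannelUnavailable` and `run_all` are hypothetical names, and signalling a missing model by raising an exception is an assumption of this sketch.

```python
from typing import Callable, Dict, List


class ChannelUnavailable(Exception):
    """Hypothetical signal that a channel's model or index is not installed."""


def run_all(query: str, channels: Dict[str, Callable[[str], List[str]]]) -> Dict[str, List[str]]:
    """Run every channel; an unavailable one contributes an empty list instead of failing the search."""
    results: Dict[str, List[str]] = {}
    for name, channel in channels.items():
        try:
            results[name] = channel(query)
        except ChannelUnavailable:
            results[name] = []  # lose one way to search, never the search itself
    return results
```

Only the dedicated exception is swallowed here, so a genuine bug in a channel would still surface rather than hide behind an empty result.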
And it's why the single search bar can answer such different questions. You're not choosing a mode for every query — you're aiming a fleet of specialists at your library at once, and reading back the handful of moments they agree on.
Said, shown, written, who's in it, what's happening, where, which brand, what sound. Eight kinds of question, fourteen channels, one ranked list — all on your Mac, with nothing uploaded.
Point every channel at your own library
Free trial. No account, no API keys, nothing leaves your Mac.