Every search channel in MediaFind, and what each one does

Most search products have one notion of a match: the document contains your words, or its embedding is near your query. MediaFind has fourteen, because a media library isn't one kind of haystack. The thing you're looking for might be a sentence somebody said, a logo that flashed on screen for two frames, a face, a kitchen, the word “QUARTERLY” on a slide, or a dog barking. No single retriever is good at all of those, so MediaFind doesn't rely on just one.

Instead, search is a channel registry. Each channel is a self-contained retriever that takes your query, searches its own slice of the index, and returns a ranked list of moments. A dispatcher decides which channels to run for a given query, runs them in parallel, and fuses their lists into one. Adding a new way to find things — entities, scenes, song recognition — means adding a channel, not rewriting search.
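As a rough illustration of the registry-and-dispatcher shape described above, here is a minimal Python sketch. Every name in it (`Moment`, `REGISTRY`, `register`, `dispatch`, and the two toy channels) is hypothetical and invented for this example; it is not MediaFind's actual code, and fusion is deliberately left out.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Moment:
    file_id: str    # which media file the hit is in
    start_s: float  # where the moment begins, in seconds


# A channel is any callable that maps a query to a ranked list of moments.
Channel = Callable[[str], list[Moment]]

REGISTRY: dict[str, Channel] = {}


def register(name: str):
    """Decorator that adds a retriever to the registry under `name`."""
    def wrap(fn: Channel) -> Channel:
        REGISTRY[name] = fn
        return fn
    return wrap


@register("transcript")
def transcript_channel(query: str) -> list[Moment]:
    # Toy stand-in: a real channel would search its own slice of the index.
    return [Moment("talk.mp4", 12.0)] if "budget" in query.lower() else []


@register("ocr")
def ocr_channel(query: str) -> list[Moment]:
    # Toy stand-in for on-screen text search.
    return [Moment("talk.mp4", 40.0)] if "quarterly" in query.lower() else []


def dispatch(query: str, selected: list[str]) -> dict[str, list[Moment]]:
    """Run the selected channels in parallel; return one ranked list per channel."""
    with ThreadPoolExecutor(max_workers=max(1, len(selected))) as pool:
        futures = {name: pool.submit(REGISTRY[name], query) for name in selected}
        return {name: future.result() for name, future in futures.items()}
```

Under this shape, adding a new way to search is one more decorated function; `dispatch` itself does not change.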

The whole idea in one line. Different questions want different retrievers. So MediaFind runs many narrow, honest channels and blends them — rather than one fuzzy channel that's mediocre at everything.

One query fans out to every selected channel — each runs on its own thread against its own slice of the index — then reciprocal-rank fusion blends the lists and collapses the same physical moment found by several channels into a single, higher-ranked result.

The six default channels

Type a plain query and hit search, and six channels run — chosen because they're either free to run on every query or return nothing unless they're relevant, so they never add noise or cost you don't want.

1 · Transcript — what was said

The flagship. A bi-encoder embeds your query and retrieves the nearest spoken segments by meaning; a cross-encoder then reranks the top candidates for precision. “The part about budget cuts” finds a clip that says “we're trimming spend” — no shared words required. If the stored embeddings are ever stale, it falls back to literal keyword matching so a query never silently returns nothing. You can also scope it to a single speaker (“only what Dana said”).
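The retrieve-then-rerank flow with its keyword fallback can be sketched as follows. This is an assumption-laden outline, not MediaFind's implementation: `embed` and `pair_score` are hypothetical stand-ins for the bi-encoder and cross-encoder, `vectors=None` is just this sketch's way of modeling stale embeddings, and the `shortlist` and `top_k` defaults are invented.

```python
import math
from typing import Callable, Optional, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_transcript(
    query: str,
    segments: list[str],
    vectors: Optional[list[Vector]],          # None models "embeddings are stale"
    embed: Callable[[str], Vector],           # hypothetical bi-encoder
    pair_score: Callable[[str, str], float],  # hypothetical cross-encoder
    shortlist: int = 50,                      # assumed candidate pool size
    top_k: int = 10,                          # assumed result count
) -> list[str]:
    if vectors is None:
        # Fallback: literal keyword matching, so the query still returns something.
        words = query.lower().split()
        return [s for s in segments if any(w in s.lower() for w in words)][:top_k]

    # Stage 1: cheap nearest-neighbour retrieval by embedding similarity.
    q = embed(query)
    order = sorted(range(len(segments)),
                   key=lambda i: cosine(q, vectors[i]), reverse=True)
    candidates = [segments[i] for i in order[:shortlist]]

    # Stage 2: rerank only the shortlist with the more precise pairwise scorer.
    candidates.sort(key=lambda s: pair_score(query, s), reverse=True)
    return candidates[:top_k]
```

The split matters because the pairwise scorer sees query and segment together, which is more precise but too slow to run over every segment; the bi-encoder stage keeps the expensive step to a short list.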

2 · Visual — what's on screen

CLIP embeds sampled video frames and your query into the same space, so “a rocket blasting off” matches the footage even if nobody narrates it. This is image understanding, not text — it finds scenes by their look. (It needs the visual model; without it the channel honestly returns nothing rather than guessing.)

3 · OCR — text on screen

Every frame is read at index time, so printed text is searchable: lower-thirds, slide titles, product labels, street signs, chyrons. A frame found by its caption reads as an honest “on-screen text” hit, distinct from a generic visual match.

4 · Summary — which file is about this

Coarser than the rest: each file gets a short summary at index time, and this channel searches those. It answers “which video is about the merger,” not “which second.” Because summaries are file-level and a speaker filter is segment-level, this channel sits out speaker-scoped searches.

5 · People — who's in it

Rides along on every search for free, and returns nothing unless your query names a person MediaFind knows: a face cluster you've named, or a recognized public figure. Name one and you get every clip they appear in, linked back to their face.

6 · Entity — exact names

The one thing semantic search is worst at. Embeddings blur “Acme Corp,” “Acme Inc,” and “a generic widget company” together, so when you want every clip that names exactly Acme, you need a literal index. The entity channel matches transcript and on-screen text against names MediaFind already trusts. It is keyless, exact, and high-precision.
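To make the contrast with embeddings concrete, here is a small sketch of literal name matching. The function names are hypothetical, and matching case-insensitively on whole-name boundaries is a choice made for this example, not something the source specifies.

```python
import re


def build_entity_matcher(names: list[str]) -> re.Pattern:
    """Compile trusted names into one pattern that matches them only as whole names."""
    if not names:
        return re.compile(r"(?!)")  # a pattern that never matches
    # Longest names first, so a longer name wins over a shorter prefix of it.
    ordered = sorted(names, key=len, reverse=True)
    alternation = "|".join(re.escape(name) for name in ordered)
    # The lookarounds stop a name from matching inside a longer word.
    return re.compile(rf"(?<!\w)(?:{alternation})(?!\w)", re.IGNORECASE)


def find_entities(text: str, matcher: re.Pattern) -> list[str]:
    """Return every literal occurrence of a trusted name in `text`."""
    return [m.group(0) for m in matcher.finditer(text)]
```

With only “Acme Corp” in the trusted list, a transcript line mentioning “Acme Inc” or “AcmeCorp” produces no hit, which is exactly the precision an embedding cannot promise.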

The opt-in channels

These run a detector or cost real compute, so they're off by default and you switch them on when you want them — usually by picking that mode in the search bar. Keeping them opt-in is what keeps a default query fast.

7 · Action — what's happening

Zero-shot CLIP, but for verbs: “running,” “a handshake,” “someone cooking.” It's free and keyless, but it's opt-in so MediaFind doesn't run CLIP over every general text query.

8 · Scene — where it is

The setting of a frame — kitchen, beach, office, courtroom, stadium, forest. CLIP scores each frame against a curated vocabulary of places, and a scene only “fires” when it clearly beats generic non-place anchors, so close-up talking-head shots correctly get no scene tag. You can also free-text any place (“rooftop bar”) and it searches frames directly, beyond the fixed list.
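The “only fires when it clearly beats the anchors” rule can be sketched as a margin test over similarity scores. The function name, the input shapes, and the `margin` value of 0.05 are all assumptions for illustration; the source does not state how large the gap has to be.

```python
from typing import Optional


def best_scene(
    place_scores: dict[str, float],  # similarity of the frame to each place label
    anchor_scores: list[float],      # similarity to generic non-place anchors
    margin: float = 0.05,            # assumed gap required for a scene to "fire"
) -> Optional[str]:
    """Return the top place label only if it beats every anchor by `margin`."""
    if not place_scores:
        return None
    label, score = max(place_scores.items(), key=lambda item: item[1])
    strongest_anchor = max(anchor_scores, default=float("-inf"))
    return label if score - strongest_anchor >= margin else None
```

A close-up talking head scores about the same against “kitchen” as against the anchors, so the margin test returns `None` and the frame gets no scene tag.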

9 · Logo — which brands appear

Brand-logo detection by name or by a sample image, again via keyless zero-shot CLIP reusing the frame embeddings. This one is a Pro channel and never folds into the default mix.

10 · Audio — non-speech sound

Sounds that aren't words: applause, a dog barking, music, a doorbell — plus on-device song recognition. It's opt-in and stays quiet (returns nothing) until a detector has populated the audio-event index for your library.

11 · Object — things, and how many

Detected objects with bounding boxes and counts, which powers count-filtered queries like “frames with three or more people.” Like audio, it's opt-in and waits until an object detector has run over your frames.
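A count-filtered query such as “frames with three or more people” reduces to grouping detections by frame. The sketch below assumes a hypothetical `Detection` record; the real index schema is not described in the source.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    frame_id: int
    label: str
    box: tuple[float, float, float, float]  # x, y, width, height (assumed layout)


def frames_with_at_least(
    detections: list[Detection], label: str, minimum: int
) -> list[int]:
    """Return the ids of frames containing at least `minimum` objects of `label`."""
    counts = Counter(d.frame_id for d in detections if d.label == label)
    return sorted(frame for frame, n in counts.items() if n >= minimum)
```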

Three refinement modes

The last three live in the registry but are never part of the default blend — they're standalone modes you select on their own, to slice an existing search a different way.

Mode	What it matches
12 · Color	The dominant palette of a frame — find shots by their look (“mostly teal,” “warm golden-hour tones”)
13 · Emotion	The affect of what's said — the angry exchange, the excited pitch — over the transcript
14 · Phonetic	Sounds-like recall — catches names and words the transcript spelled differently than you typed
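The source does not say which phonetic algorithm the Phonetic mode uses. Purely as a stand-in to show what “sounds-like” matching means, here is classic American Soundex, which maps words that sound alike to the same four-character code.

```python
# Consonant groups used by American Soundex; vowels, h, w, and y have no code.
_GROUPS = {"1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r"}
_CODES = {letter: digit for digit, letters in _GROUPS.items() for letter in letters}


def soundex(word: str) -> str:
    """Return the 4-character Soundex code for `word`, or '' if it has no letters."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = _CODES.get(first, "")
    for c in letters[1:]:
        code = _CODES.get(c, "")
        if code and code != prev:
            digits.append(code)
        if c not in "hw":
            # h and w do not separate repeated codes; vowels reset `prev` to "".
            prev = code
    return (first.upper() + "".join(digits) + "000")[:4]
```

Here `soundex("Smith")` and `soundex("Smyth")` both give `"S530"`, so a query typed one way can still recall a transcript that spelled the name the other way.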

How the channels become one list

Running many retrievers is only half the design; the other half is blending them well. Three things make the fan-out work:

Parallel, isolated execution. Each selected channel runs on its own thread with its own database connection, so a five-channel search isn't five times slower than one. The query is encoded once and the read-only vector is shared across the text channels. Encoding is the dominant per-query cost, so sharing it removes the redundant passes.
Reciprocal-rank fusion. Channels return scores that aren't comparable — a CLIP cosine and a cross-encoder logit live on different scales — so MediaFind fuses by rank, not raw score. A moment that several channels each rank highly rises to the top.
Corroboration, deduplicated. If the visual, on-screen-text, and logo channels all surface the same two-second moment, it appears once — ranked higher for the agreement — and the result lists which channels matched it. One moment, not three rows.
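Reciprocal-rank fusion and the deduplication step can be sketched together. In standard RRF, a moment's fused score is the sum over channels of 1 / (k + rank), where rank is its 1-based position in that channel's list. The constant k = 60 below is the value commonly used in the RRF literature and is an assumption here, as is identifying a moment by a `(file_id, second)` key.

```python
from collections import defaultdict

MomentKey = tuple[str, int]  # (file_id, start second); an assumed identity for a moment


def rrf_fuse(
    ranked: dict[str, list[MomentKey]], k: int = 60
) -> list[tuple[MomentKey, float, list[str]]]:
    """Fuse per-channel ranked lists into (moment, score, matching channels) rows."""
    scores: dict[MomentKey, float] = defaultdict(float)
    matched: dict[MomentKey, list[str]] = defaultdict(list)
    for channel, moments in ranked.items():
        seen: set[MomentKey] = set()
        for rank, key in enumerate(moments, start=1):
            if key in seen:
                continue  # count each moment at most once per channel
            seen.add(key)
            scores[key] += 1.0 / (k + rank)  # rank-based, so raw scores never mix
            matched[key].append(channel)
    order = sorted(scores, key=lambda key: scores[key], reverse=True)
    return [(key, scores[key], matched[key]) for key in order]
```

Because every channel that found a moment adds to the same key, agreement raises the score and the moment still appears as a single row, with the list of channels that matched it attached.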

Every channel has a relevance floor. A channel that finds nothing genuinely relevant returns nothing, rather than its nearest-but-irrelevant guess. So switching more channels on widens what you can find without flooding clean queries with junk.
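A relevance floor is a simple threshold applied before a channel hands back its list. The sketch below is illustrative only: the function name is hypothetical, and the source does not give floor values or say how they are set per channel.

```python
def apply_floor(scored: list[tuple[str, float]], floor: float) -> list[str]:
    """Keep only moments scoring at or above `floor`, best first.

    Returns an empty list, rather than the nearest miss, when nothing clears it.
    """
    kept = [(moment, score) for moment, score in scored if score >= floor]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [moment for moment, _ in kept]
```

The important behaviour is the empty case: a channel with only weak matches contributes nothing to fusion, so it cannot push an irrelevant moment into the blended list.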

Why a registry, and not one big retriever

The channel design is what lets MediaFind grow sideways. Song recognition, scene detection, the entity index, count-filtered object search — each shipped as a new channel slotted behind the same dispatch-and-fuse contract, with the older channels untouched. It's also what keeps the product honest about its dependencies: a channel whose model isn't installed degrades to returning nothing, while every other channel keeps working. You lose a way to search, never the search.
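The “degrades to returning nothing” behaviour can be sketched as a wrapper around each channel. `ModelUnavailable`, `degrade_to_empty`, and the two toy channels are hypothetical names for this example; how MediaFind actually detects a missing model is not described in the source.

```python
from typing import Callable

Channel = Callable[[str], list[str]]


class ModelUnavailable(Exception):
    """Hypothetical error a channel raises when its model is not installed."""


def degrade_to_empty(channel: Channel) -> Channel:
    """Wrap a channel so a missing model yields no results instead of a failure."""
    def run(query: str) -> list[str]:
        try:
            return channel(query)
        except ModelUnavailable:
            return []  # this way to search is lost; the search itself is not
    return run


def visual_without_model(query: str) -> list[str]:
    raise ModelUnavailable("visual model not installed")


def transcript(query: str) -> list[str]:
    return ["talk.mp4@12s"]


channels = {
    "visual": degrade_to_empty(visual_without_model),
    "transcript": degrade_to_empty(transcript),
}
results = {name: run("rocket launch") for name, run in channels.items()}
```

Only the specific unavailable-model error is swallowed; any other exception still surfaces, so genuine bugs in a channel are not silently hidden.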

And it's why the single search bar can answer such different questions. You're not choosing a mode for every query — you're aiming a fleet of specialists at your library at once, and reading back the handful of moments they agree on.

Said, shown, written, who's in it, what's happening, where, which brand, what sound. Eight kinds of question, fourteen channels, one ranked list — all on your Mac, with nothing uploaded.

Point every channel at your own library

Free trial. No account, no API keys, nothing leaves your Mac.

