Recognizing logos, actions & famous faces with zero training data

“Find the shots with our logo.” “Where's the part where someone's cooking?” “Which clips have a senator in them?” MediaFind answers all three without ever being trained on your brand, your activities, or your people — here's the zero-shot machinery that makes that possible.

Traditional recognition is a trap: to find “your logo,” you'd collect a few hundred labelled examples, fine-tune a detector, and repeat for every new brand, object, or activity. That doesn't scale to a personal library, and it certainly doesn't run on a laptop the moment you install an app.

MediaFind takes the open-vocabulary route instead. The category isn't baked into a model — it's a phrase you type, compared against the picture at search time. The same trick powers three different features that look unrelated on the surface: brand-logo search, action search, and notable-face matching. All three are keyless, and none required us to train anything per category.

One idea: put images and words in the same space

The engine is CLIP — a model trained so that a picture and its description land at nearly the same point in a shared vector space. We already embed every sampled video frame with CLIP for visual search. Open-vocabulary recognition reuses those exact frame vectors; the only new thing is what we compare them against.

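[Figure: a frame's CLIP vector is compared by cosine similarity against candidate prompts, including negatives (“a photo of a coffee logo,” “a photo of cooking,” “a photo of dancing,” “a photo of nothing”), then passed through a confidence gate. In the example, “cooking” scores 0.31, above the floor.]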
The same frame vector is scored against several text prompts — including a negative like “a photo of nothing.” The highest score wins only if it clears a confidence floor, which is what keeps a blank wall from becoming “a logo.”
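
To make the gate concrete, here is a minimal sketch of scoring one frame against several prompts. It is an illustration under stated assumptions, not MediaFind's actual code: the function names, the `floor` values, and the tiny hand-made vectors are all hypothetical stand-ins for real CLIP embeddings, whose cosine scores are typically much lower than these toy ones.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def gated_label(frame_vec, prompt_vecs, prompts, negatives, floor):
    """Return (label, score) for the best prompt, or (None, score) if gated out.

    A frame is rejected when its best match is a negative anchor
    or when the best cosine similarity is below the floor.
    """
    sims = l2_normalize(prompt_vecs) @ l2_normalize(frame_vec)
    best = int(np.argmax(sims))
    score = float(sims[best])
    if prompts[best] in negatives or score < floor:
        return None, score
    return prompts[best], score

# Toy 4-d vectors standing in for CLIP text embeddings (assumption).
prompts = ["a photo of a coffee logo", "a photo of cooking",
           "a photo of dancing", "a photo of nothing"]
negatives = {"a photo of nothing"}
prompt_vecs = np.eye(4)

cooking_frame = np.array([0.2, 0.9, 0.1, 0.3])
blank_frame = np.array([0.1, 0.1, 0.1, 0.9])
vague_frame = np.array([0.5, 0.5, 0.5, 0.4])

print(gated_label(cooking_frame, prompt_vecs, prompts, negatives, floor=0.6)[0])
print(gated_label(blank_frame, prompt_vecs, prompts, negatives, floor=0.6)[0])
print(gated_label(vague_frame, prompt_vecs, prompts, negatives, floor=0.6)[0])
```

The first frame is labelled as cooking. The second is rejected because its best match is the negative anchor, and the third because no prompt clears the floor.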

Brand logos: a phrase, not a trained detector

To find your logo we don't need a sample of it — we embed a description (“a photo of the Acme logo,” plus a few neutral variants) and rank every frame by cosine similarity to it. If you do have a reference image, even better: we embed the image instead of the text and search by that vector. Either way it's a nearest-neighbour query over frame embeddings you already have, so a new brand costs nothing but a phrase.
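
The ranking step can be sketched in a few lines. This is a hedged illustration: `top_k_frames`, the timestamps, and the 2-d toy vectors are hypothetical, and the sketch uses a brute-force scan where a real index might do something smarter. The query vector can come from either a text prompt or a reference image, since both live in the same space.

```python
import numpy as np

def top_k_frames(query_vec, frame_vecs, timestamps, k=3):
    """Rank frames by cosine similarity to the query; return (timestamp, score) pairs."""
    q = query_vec / np.linalg.norm(query_vec)
    f = frame_vecs / np.linalg.norm(frame_vecs, axis=1, keepdims=True)
    sims = f @ q                      # one cosine score per frame
    order = np.argsort(-sims)[:k]     # indices of the k highest scores
    return [(timestamps[i], float(sims[i])) for i in order]

# Toy frame embeddings already computed at indexing time (assumption).
frame_vecs = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [0.6, 0.8]])
timestamps = [0.0, 2.0, 4.0, 6.0]     # seconds into the video

# Stand-in for the embedding of "a photo of the Acme logo" or of a reference image.
logo_query = np.array([1.0, 0.0])

hits = top_k_frames(logo_query, frame_vecs, timestamps, k=2)
print([t for t, _ in hits])
```

Adding a new brand changes only `logo_query`. The frame vectors and the ranking code stay the same.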

Actions: the same machine, no Pro gate

Searching for “dancing,” “cooking,” or “a presentation on stage” works identically — different prompts, same CLIP frame vectors, same cosine ranking. We score a representative frame per moment rather than modelling motion over time, so this is single-frame recognition (true temporal action understanding is a Phase-2 problem we deliberately left a seam for). It's accurate enough to find the cooking segment in a two-hour stream, and because it's just another text query, action search ships free.
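
Because recognition is per frame, turning hits into a "segment" is a grouping step rather than a modelling one. The sketch below shows one plausible way to do it, merging hit timestamps that sit close together. The `merge_hits` helper and the `max_gap` value are assumptions for illustration only; the article does not describe how MediaFind groups hits.

```python
def merge_hits(hit_times, max_gap=5.0):
    """Group hit timestamps (seconds) into (start, end) segments.

    Two hits belong to the same segment when they are at most max_gap apart.
    """
    segments = []
    for t in sorted(hit_times):
        if segments and t - segments[-1][1] <= max_gap:
            segments[-1][1] = t          # extend the current segment
        else:
            segments.append([t, t])      # start a new segment
    return [tuple(s) for s in segments]

# Timestamps of frames that passed the gate for "a photo of cooking" (toy data).
cooking_hits = [302, 10, 12, 14, 300, 900]
print(merge_hits(cooking_hits))
# [(10, 14), (300, 302), (900, 900)]
```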

The confidence gate earns its keep. Cosine similarity always returns a “best” match, even for a frame that contains none of your candidates. Without a floor, every black frame becomes the nearest logo. We borrow negative anchors and a minimum-score threshold from the scene-tagging work so a weak, lonely cue never gets promoted into a confident label.

Famous faces: matching against a bundled gallery

Notable-face recognition is the one piece that isn't CLIP. It reuses the face-recognition pipeline (the same embeddings that cluster “the same person” across your library) and adds one comparison: each face cluster is matched against a bundled gallery of roughly a thousand public figures. Openly licensed reference photos are used at build time to compute that gallery, but only the derived face embeddings ship inside the app; no photos are bundled.

If a cluster's embedding is close enough to a gallery entry, that cluster gets a name and a ⭐ badge; if not, it stays an anonymous “Person 4.” The gallery ships with the download, so this is keyless and offline — no face-search API, no third-party lookup, nothing uploaded. One caveat: because this builds on the on-device face pipeline, notable-face naming is a Pro feature — it's on by default for Pro users, but free users get no face clusters, so the celebrity match never runs. (Logo and action search, above, are free.)

[Figure: a face cluster from your library is compared against the bundled gallery of ≈1,000 public figures (512-d face embeddings ship, not photos). A distance below the threshold yields a ⭐ named match; no confident match leaves the cluster as an anonymous “Person 4.”]
One extra comparison turns the private face pipeline into celebrity recognition: match a cluster against the bundled gallery, name it only when the distance falls below a strict threshold, and leave everyone else anonymous.
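
The gallery comparison can be sketched as follows. Several details here are assumptions made for the sake of a runnable example: representing a cluster by the mean of its face embeddings, using cosine distance, the `max_distance` value, and the placeholder gallery names. The 3-d toy vectors stand in for the 512-d embeddings.

```python
import numpy as np

def match_cluster(cluster_vecs, gallery_vecs, gallery_names, max_distance=0.4):
    """Return (name, distance) for the nearest gallery entry, or (None, distance).

    The cluster is summarised by its normalised mean embedding (assumption),
    and a name is assigned only when the cosine distance is below max_distance.
    """
    centroid = cluster_vecs.mean(axis=0)
    centroid = centroid / np.linalg.norm(centroid)
    gallery = gallery_vecs / np.linalg.norm(gallery_vecs, axis=1, keepdims=True)
    dists = 1.0 - gallery @ centroid          # cosine distance to each entry
    best = int(np.argmin(dists))
    if dists[best] < max_distance:
        return gallery_names[best], float(dists[best])
    return None, float(dists[best])           # stays an anonymous cluster

# Toy gallery of precomputed embeddings; no photos involved (names are placeholders).
gallery_names = ["Public Figure A", "Public Figure B"]
gallery_vecs = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])

known_cluster = np.array([[0.9, 0.1, 0.0], [1.0, 0.0, 0.1]])
unknown_cluster = np.array([[0.0, 0.0, 1.0]])

print(match_cluster(known_cluster, gallery_vecs, gallery_names)[0])
print(match_cluster(unknown_cluster, gallery_vecs, gallery_names)[0])
```

The first cluster is named after the closest gallery entry. The second has no entry within the threshold, so it would stay an anonymous “Person N.”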

Click a result, see every appearance

Recognition isn't a dead-end label. Click a matched face, whether famous or one of your own clusters, and MediaFind shows every moment in which that person appears, the same way clicking a phrase shows every time it was said. Logos and actions behave identically: the “match” is a first-class search result that links to the exact frame, at the exact second.

Why open-vocabulary is the right call for a private library

A trained detector knows only the classes someone paid to label. An open-vocabulary system answers questions nobody anticipated — your obscure local brand, a niche activity, a one-off prop — because the “category” is decided at query time, by you. Pair that with a bundled gallery and a confidence gate, and you get broad recognition that needs no training set, no API key, and no upload.


All three features stand on the CLIP frame embeddings and face embeddings computed during indexing. Recognize once, reuse everywhere — the recurring theme of the whole pipeline.
