Auto-organizing a messy library: zero-shot categories & a knowledge map

Nobody tags their footage. So MediaFind does it for you — assigning categories with no training data, detecting brand logos by name, and drawing a knowledge map of who and what your library is about. The trick is open-vocabulary models and a strict confidence gate.

Search finds the thing you're looking for. But sometimes you don't know what you're looking for — you want to browse: “show me the finance clips,” “which videos have a product demo,” “who shows up the most.” That needs structure: categories, entities, relationships. The old way to get structure is to label thousands of files by hand. MediaFind's way is to derive it, on-device, from embeddings you already computed.

Zero-shot categories: classifying with no training data

Recall from the search post that CLIP puts images and text in the same space. That's not just good for search — it's a classifier you never trained. To ask “is this frame about finance?” we embed a few descriptive prompts for the finance category, embed the keyframe, and compare. The closest category wins. No labeled dataset, no fine-tuning — the category list is just words, so adding a new one is editing a config, not retraining a model.
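
As a rough illustration of that comparison, here is a minimal Python sketch. It is not MediaFind's code: the three-dimensional vectors stand in for real CLIP embeddings, the names `classify` and `CATEGORY_PROMPTS` are made up for this example, and scoring a category by its best-matching prompt is an assumption (averaging the prompts would be another reasonable choice).

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def classify(frame_vec, category_prompts):
    """Return (best_category, score) for one keyframe embedding.

    category_prompts maps a category name to a list of prompt vectors.
    Assumption: a category's score is that of its best-matching prompt.
    """
    scores = {
        name: max(cosine(frame_vec, prompt) for prompt in prompts)
        for name, prompts in category_prompts.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]

# Toy vectors standing in for embedded prompts (hypothetical values).
CATEGORY_PROMPTS = {
    "finance": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "product demo": [[0.1, 0.9, 0.1]],
    "nature": [[0.0, 0.1, 0.9]],
}

frame = [0.85, 0.2, 0.05]  # toy keyframe embedding
label, score = classify(frame, CATEGORY_PROMPTS)
print(label, round(score, 2))
```

Adding a category in this sketch means adding one more entry to `CATEGORY_PROMPTS`, which mirrors the "editing a config" point above.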

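[Figure: a keyframe's CLIP vector is scored by cosine similarity against category anchor prompts (finance 0.83, product demo 0.42, nature 0.21). A confidence gate then checks two things: does the top score beat the score floor (yes), and does it beat the negative anchors for logo, abstract, and UI (yes)? Result: the tag Finance is assigned with confidence.]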
Classification is just cosine similarity against category prompts. The catch: a lone weak cue shouldn't win. So a tag must clear a score floor and beat negative anchors (logo, UI, abstract) — otherwise the frame stays uncategorized.
The bug that motivated the gate. Early on, a slide of a logo or an abstract gradient would get confidently mislabeled, because something always scores highest even when nothing fits. The fix was twofold: a minimum-score floor so a faint cue (the word “rate” nudging toward “finance”) no longer assigns a category, and a set of negative anchors — “a brand logo,” “a UI screenshot,” “an abstract texture” — that a real category has to out-score. Confidence, not just the argmax.
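
The gate itself can be sketched in a few lines of Python. This is a hedged illustration, not the shipped logic: the function name `gated_tag`, the floor value of 0.25, and the negative-anchor scores are assumptions chosen for the example. Only the category scores 0.83, 0.42, and 0.21 come from the figure.

```python
def gated_tag(category_scores, negative_scores, floor=0.25):
    """Return the winning category, or None to leave the frame uncategorized.

    category_scores: category name -> cosine similarity with the frame.
    negative_scores: negative anchor prompt -> cosine similarity.
    floor: minimum score a category needs (0.25 is an assumed value).
    """
    if not category_scores:
        return None
    best = max(category_scores, key=category_scores.get)
    best_score = category_scores[best]
    if best_score < floor:
        return None  # a faint cue is not enough
    if negative_scores and best_score <= max(negative_scores.values()):
        return None  # the frame looks more like a logo, UI, or texture
    return best

# A real finance frame: the category scores are the ones in the figure,
# the negative-anchor scores are invented for this example.
finance_frame = gated_tag(
    {"finance": 0.83, "product demo": 0.42, "nature": 0.21},
    {"a brand logo": 0.30, "a UI screenshot": 0.25, "an abstract texture": 0.20},
)

# A logo slide: "finance" is still the argmax, but a negative anchor wins.
logo_slide = gated_tag(
    {"finance": 0.31, "product demo": 0.28, "nature": 0.15},
    {"a brand logo": 0.62, "a UI screenshot": 0.22, "an abstract texture": 0.18},
)

print(finance_frame, logo_slide)
```

Returning `None` rather than the argmax is the whole point: an uncategorized frame is a better outcome than a confidently wrong tag.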

Logos: zero-shot, but pointed

Brand-logo detection reuses the very same keyframe embeddings, just with a different question: instead of broad categories, it compares against the name or a sample image of a specific logo. Because it rides on embeddings already computed during indexing, turning it on costs almost nothing — and like everything else, it's keyless and local. Detecting “our logo appears at 3:40” never involves a vision API.
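
One way to picture this: scan the stored keyframe embeddings for a single logo vector and report the timestamps that match. The Python sketch below assumes the index holds `(timestamp_seconds, vector)` pairs and that a fixed similarity threshold decides a hit. The names `find_logo` and `format_ts`, the 0.8 threshold, and the toy vectors are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def find_logo(keyframes, logo_vec, threshold=0.8):
    """Return timestamps (in seconds) of keyframes that match the logo.

    keyframes: list of (timestamp_seconds, embedding) pairs from indexing.
    logo_vec: embedding of the logo's name or of a sample image.
    threshold: assumed cut-off; a real one would need tuning.
    """
    return [ts for ts, vec in keyframes if cosine(vec, logo_vec) >= threshold]

def format_ts(seconds):
    """Format whole seconds as m:ss."""
    return f"{seconds // 60}:{seconds % 60:02d}"

# Toy embeddings; the keyframe at 220 seconds points the same way as the logo.
KEYFRAMES = [
    (0, [1.0, 0.1, 0.0]),
    (220, [0.1, 0.95, 0.05]),
    (400, [0.2, 0.1, 0.9]),
]
LOGO_VEC = [0.0, 1.0, 0.0]

hits = find_logo(KEYFRAMES, LOGO_VEC)
print([format_ts(ts) for ts in hits])
```

Nothing is re-embedded here except the logo query itself, which is why the feature is cheap to switch on.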

The knowledge map: a library as a graph

Categories tag files. The knowledge map connects them. Each file becomes a node, and two files are linked by an edge when they share a signal — a category, a salient keyword, a speaker, a clustered person, or a named entity. The entities MediaFind already extracts — people from faces and speakers, topics from transcripts, places and organizations from text and OCR — are the threads that tie recordings together. Lay that out with a force-directed graph and your library stops being a list and becomes a map that reveals which recordings relate to each other, and why.

[Figure: a knowledge map. Nodes are recordings (Q3 review, standup, 1:1, all-hands, demo, webinar, pitch); edges are labeled with the signal two files share (Alice; Bob; Acme, Berlin; pricing). Legend: recording (a file), shared signal.]
Each recording is a node; an edge appears when two files share a signal — a speaker, a clustered person, or a named entity. The same data that powers click-a-face and click-a-voice, drawn as a graph — so “which recordings relate to the Q3 review, and why?” is something you can see.
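
The edge rule can be sketched as a set intersection. In the Python below, each file carries a set of signals and two files are linked when the sets overlap. The shared signals double as the "why" on the edge. The `kind:value` signal strings, the function name `build_edges`, and which recording carries which signal are invented for illustration. The figure's names are borrowed only loosely, and the force-directed layout step is left out.

```python
from itertools import combinations

def build_edges(files):
    """Map each linked pair of files to the sorted signals they share.

    files: file name -> set of signal strings (people, topics, places, ...).
    Pairs with no signal in common get no edge.
    """
    edges = {}
    for a, b in combinations(sorted(files), 2):
        shared = files[a] & files[b]
        if shared:
            edges[(a, b)] = sorted(shared)
    return edges

# Toy library; signal assignments are hypothetical.
LIBRARY = {
    "Q3 review": {"person:Alice", "person:Bob", "org:Acme",
                  "place:Berlin", "topic:pricing"},
    "standup": {"person:Alice"},
    "1:1": {"person:Bob"},
    "pitch": {"org:Acme", "place:Berlin"},
    "webinar": {"topic:pricing"},
    "all-hands": set(),
}

edges = build_edges(LIBRARY)
for (a, b), why in edges.items():
    print(f"{a} <-> {b}: {', '.join(why)}")
```

Comparing every pair is fine for a toy example, though a large library would more likely group files by signal first and link within each group.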

Because the graph is built from links you can click, it doubles as navigation: tap a person to see their moments, tap a topic to pull every file that touches it. It's the browse counterpart to search — and a way to discover connections you didn't know to look for.
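
Click-to-browse amounts to a reverse lookup from a signal to the files that carry it. A minimal Python sketch, assuming the same hypothetical `kind:value` signal strings, with `index_by_signal` as an illustrative name:

```python
from collections import defaultdict

def index_by_signal(files):
    """Invert file -> signals into signal -> set of files."""
    index = defaultdict(set)
    for name, signals in files.items():
        for signal in signals:
            index[signal].add(name)
    return index

# Toy data; which file carries which signal is made up for the example.
FILES = {
    "Q3 review": {"person:Alice", "topic:pricing"},
    "standup": {"person:Alice"},
    "webinar": {"topic:pricing"},
}

index = index_by_signal(FILES)
alice_files = sorted(index["person:Alice"])    # what "tap Alice" would list
pricing_files = sorted(index["topic:pricing"])  # what "tap pricing" would list
print(alice_files, pricing_files)
```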

The same principle, one more time

Categories, logos, and the knowledge map look like three separate features, but they share a spine with everything else MediaFind does: reuse the embeddings computed once during indexing, ask a new question of them, and keep every answer on the device. No labels to maintain, no classifier to train, no API to call — just geometry over vectors you already own.


That closes the loop. From a raw folder you get transcripts, meaning-based search, people, editable highlights, a fresh index, and now a self-organizing map — and not one byte of it left your Mac.

Let your library organize itself

Categories and the knowledge map come from the same on-device indexing. Free trial.
