Auto-organizing a messy library: zero-shot categories & a knowledge map
Nobody tags their footage. So MediaFind does it for you — assigning categories with no training data, detecting brand logos by name, and drawing a knowledge map of who and what your library is about. The trick is open-vocabulary models and a strict confidence gate.
Search finds the thing you're looking for. But sometimes you don't know what you're looking for — you want to browse: “show me the finance clips,” “which videos have a product demo,” “who shows up the most.” That needs structure: categories, entities, relationships. The old way to get structure is to label thousands of files by hand. MediaFind's way is to derive it, on-device, from embeddings you already computed.
Zero-shot categories: classifying with no training data
Recall from the search post that CLIP puts images and text in the same space. That's not just good for search: it's a classifier you never trained. To ask “is this frame about finance?” we embed a few descriptive prompts for each category, embed the keyframe, and compare. The closest category wins, but only if its similarity clears the confidence gate; a frame that matches nothing well stays untagged rather than getting a wrong label. No labeled dataset, no fine-tuning. The category list is just words, so adding a new one is editing a config, not retraining a model.
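Here is a minimal sketch of that comparison, not MediaFind's actual code. The vectors are toy 3-dimensional stand-ins for real CLIP embeddings, and the names `classify` and `min_score`, the 0.25 threshold, and the choice to score a category by its best-matching prompt are all assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def classify(frame_vec, category_prompts, min_score=0.25):
    """Return (category, score), or (None, score) when nothing clears the gate.

    category_prompts maps a category name to a non-empty list of prompt
    embeddings. min_score is a hypothetical confidence gate.
    """
    best_name, best_score = None, -1.0
    for name, prompt_vecs in category_prompts.items():
        # Assumption: a category scores as well as its best-matching prompt.
        score = max(cosine(frame_vec, p) for p in prompt_vecs)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < min_score:
        return None, best_score  # below the gate: leave the frame untagged
    return best_name, best_score

# Toy embeddings; a real pipeline would get these from the CLIP encoders.
categories = {
    "finance": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "product demo": [[0.1, 0.9, 0.1]],
}
frame = [0.85, 0.15, 0.05]
label, score = classify(frame, categories)
print(label)
```

Adding a category here means adding one more key to `categories`, which is the "editing a config" part. Raising `min_score` trades coverage for fewer wrong tags.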
Logos: zero-shot, but pointed
Brand-logo detection reuses the very same keyframe embeddings, just with a different question: instead of scoring broad categories, it compares each keyframe against the embedding of a specific logo's name or of a sample image of that logo. Because it rides on embeddings already computed during indexing, turning it on costs almost nothing. And like everything else, it's keyless and local: detecting “our logo appears at 3:40” never involves a vision API.
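As a rough sketch of how a "logo appears at 3:40" answer could fall out of stored keyframe embeddings: scan the timestamped vectors and keep the ones close enough to a logo reference vector. The function names, the 0.8 threshold, and the toy vectors are hypothetical, not taken from MediaFind.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def find_logo(keyframes, logo_vec, min_score=0.8):
    """Return (timestamp, score) for keyframes that match the logo reference.

    keyframes is a list of (timestamp_seconds, embedding) pairs that were
    already computed at indexing time. min_score is an assumed threshold.
    """
    hits = []
    for ts, vec in keyframes:
        score = cosine(vec, logo_vec)
        if score >= min_score:
            hits.append((ts, score))
    return hits

def fmt(seconds):
    """Format a second count as m:ss."""
    total = int(seconds)
    return f"{total // 60}:{total % 60:02d}"

# Toy data: three keyframes, one of which resembles the logo reference.
keyframes = [
    (10.0, [0.1, 0.9, 0.2]),
    (220.0, [0.9, 0.1, 0.1]),
    (400.0, [0.2, 0.2, 0.9]),
]
logo = [1.0, 0.0, 0.1]  # embedding of the logo's name or a sample image
hits = find_logo(keyframes, logo)
timestamps = [fmt(ts) for ts, _ in hits]
print(timestamps)
```

The only new work per logo is embedding one reference; the keyframe side of the comparison is read from the index.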
The knowledge map: a library as a graph
Categories tag files. The knowledge map connects them. Each file becomes a node, and two files are linked by an edge when they share a signal — a category, a salient keyword, a speaker, a clustered person, or a named entity. The entities MediaFind already extracts — people from faces and speakers, topics from transcripts, places and organizations from text and OCR — are the threads that tie recordings together. Lay that out with a force-directed graph and your library stops being a list and becomes a map that reveals which recordings relate to each other, and why.
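The linking rule can be sketched in a few lines: index files by signal, then connect every pair of files that share one. The `(kind, value)` signal shape, the `build_edges` name, and the sample file names are assumptions for illustration; layout with a force-directed algorithm would be a separate step.

```python
from collections import defaultdict
from itertools import combinations

def build_edges(file_signals):
    """Map each linked file pair to the set of signals the two files share.

    file_signals maps a file name to a set of (kind, value) signals, such
    as ("category", "finance") or ("person", "Speaker 1").
    """
    # Inverted index: signal -> files carrying it.
    by_signal = defaultdict(set)
    for name, signals in file_signals.items():
        for sig in signals:
            by_signal[sig].add(name)

    # Every pair of files under the same signal gets an edge, and the
    # edge remembers why the two files are related.
    edges = defaultdict(set)
    for sig, files in by_signal.items():
        for a, b in combinations(sorted(files), 2):
            edges[(a, b)].add(sig)
    return dict(edges)

# Hypothetical library with three files.
library = {
    "q3-review.mp4": {("category", "finance"), ("person", "Speaker 1")},
    "budget-call.mov": {("category", "finance"), ("topic", "hiring")},
    "demo-day.mp4": {("category", "product demo"), ("person", "Speaker 1")},
}
edges = build_edges(library)
for pair, shared in sorted(edges.items()):
    print(pair, sorted(shared))
```

Storing the shared signals on each edge is what lets the map answer "why" two recordings relate, not just "that" they do. A signal that appears in nearly every file would link everything to everything, so a real graph would likely cap or down-weight such signals.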
Because the graph is built from links you can click, it doubles as navigation: tap a person to see their moments, tap a topic to pull every file that touches it. It's the browse counterpart to search — and a way to discover connections you didn't know to look for.
The same principle, one more time
Categories, logos, and the knowledge map look like three separate features, but they share a spine with everything else MediaFind does: reuse the embeddings computed once during indexing, ask a new question of them, and keep every answer on the device. No labels to maintain, no classifier to train, no API to call — just geometry over vectors you already own.
That closes the loop. From a raw folder you get transcripts, meaning-based search, people, editable highlights, a fresh index, and now a self-organizing map — and not one byte of it left your Mac.
Let your library organize itself
Categories and the knowledge map come from the same on-device indexing. Free trial.