How a folder becomes a searchable index

The individual models in MediaFind are the easy headline. The hard, boring engineering is everything around them: how thousands of files get processed without melting your laptop, where the results live, and — the part most tools get wrong — how the index stays correct when you add, edit, or delete files. This post is about that machinery.

The shape of the system

At a high level MediaFind is a pipeline with three layers: an intake that discovers work, a per-file pipeline that does the expensive ML, and a local index that everything queries. A small web UI and the CLI sit on top, talking to the same index.

Three layers joined by a job queue: intake → job queue → per-file pipeline → local index. The UI and CLI are thin readers over the same on-disk store. Every box runs on your Mac; nothing in this diagram makes a network call.

The job queue: parallelism without the meltdown

Indexing is embarrassingly parallel across files but expensive per file. So intake turns each discovered file into a job, and a bounded pool of workers drains the queue: enough workers to saturate your cores, not so many that the machine starts swapping. Jobs run asynchronously, so the UI stays responsive and shows live progress while a backlog churns in the background.
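To make the bounded-pool idea concrete, here is a minimal sketch in Python. It is an illustration under assumptions, not MediaFind's actual code: `process_file`, `run_jobs`, and the `on_progress` callback are hypothetical names, and the stand-in job just uppercases the path. A real ML pipeline would more likely use worker processes or native code that releases the GIL; threads keep the example short.

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_file(path: str) -> str:
    """Hypothetical stand-in for the expensive per-file pipeline."""
    return path.upper()


def run_jobs(paths, max_workers=None, on_progress=None):
    """Drain one job per file through a bounded pool of workers."""
    # Cap the pool near the core count so a large backlog cannot
    # oversubscribe the machine, however many files are queued.
    workers = max_workers or max(1, (os.cpu_count() or 2) - 1)
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(process_file, p): p for p in paths}
        # as_completed yields jobs as they finish, which is what lets a
        # caller report live progress while the backlog is still running.
        for done, future in enumerate(as_completed(futures), start=1):
            results[futures[future]] = future.result()
            if on_progress:
                on_progress(done, len(futures))
    return results
```

The important knob is `max_workers`: the queue can hold thousands of jobs, but only a fixed number ever run at once.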

A subtlety we learned the hard way: the background workers write to the real on-disk index, not to whatever object a request is holding. Conflating the two produced tests that passed against a mocked index while the actual job wrote elsewhere. The fix — and the rule — is that the index path is the single source of truth, and jobs always target it directly.
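The rule can be sketched as a job that receives the index path and opens its own connection, rather than borrowing an object from whoever enqueued it. This is an assumed shape for illustration only: the `files` table and the `record_done` and `read_status` helpers are hypothetical, not MediaFind's schema or API.

```python
import sqlite3


def record_done(index_path: str, file_path: str) -> None:
    """Background job: open the on-disk index by path and write to it."""
    conn = sqlite3.connect(index_path)
    try:
        with conn:  # commits on success, rolls back on an exception
            conn.execute(
                "CREATE TABLE IF NOT EXISTS files "
                "(path TEXT PRIMARY KEY, status TEXT)"
            )
            conn.execute(
                "INSERT OR REPLACE INTO files (path, status) "
                "VALUES (?, 'indexed')",
                (file_path,),
            )
    finally:
        conn.close()


def read_status(index_path: str, file_path: str):
    """Reader (UI, CLI, or a test): a separate connection to the same path."""
    conn = sqlite3.connect(index_path)
    try:
        row = conn.execute(
            "SELECT status FROM files WHERE path = ?", (file_path,)
        ).fetchone()
        return row[0] if row else None
    finally:
        conn.close()
```

Because both sides agree only on the path, a test that wants to observe what a job wrote has to read the same file on disk, which is exactly the behavior a mocked index object hides.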

The storage layer: metadata, vectors, and frames

The index isn't one thing; it's a few stores that play to their strengths:

Structured metadata — files, segments, speakers, chapters, categories — lives in SQLite. It's transactional, file-based, and needs no server.
Vectors — segment, CLIP, and face embeddings — are stored as compact float32 blobs in that same SQLite index. For a typical library a brute-force scan over them is plenty; past roughly ten thousand segments an optional on-disk ANN index (hnswlib) kicks in to keep nearest-neighbor lookups sublinear.
Keyframes & thumbnails — extracted images — are cached on disk so the UI is instant and re-tagging never re-decodes video.
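As a rough illustration of the first two stores, the sketch below keeps metadata and float32 vector blobs in one SQLite file and answers a query with a brute-force cosine scan. The table names, column names, and helper functions are assumptions made for the example, not MediaFind's real schema, and it uses only the standard library where a production build would likely use NumPy.

```python
import math
import sqlite3
from array import array


def open_index(path: str) -> sqlite3.Connection:
    """Create (or open) a toy index with metadata and vector tables."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS segments "
                 "(id INTEGER PRIMARY KEY, file TEXT, text TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS vectors "
                 "(segment_id INTEGER REFERENCES segments(id), embedding BLOB)")
    return conn


def to_blob(vec) -> bytes:
    # array typecode "f" is a C float: 4 bytes on mainstream platforms.
    return array("f", vec).tobytes()


def from_blob(blob: bytes) -> list:
    values = array("f")
    values.frombytes(blob)
    return values.tolist()


def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(conn, query, k=3):
    """Brute force: score every stored vector and keep the top k ids."""
    rows = conn.execute("SELECT segment_id, embedding FROM vectors")
    scored = [(cosine(query, from_blob(blob)), seg_id) for seg_id, blob in rows]
    scored.sort(reverse=True)
    return [seg_id for _, seg_id in scored[:k]]
```

The scan is linear in the number of stored vectors, which is why it stays comfortable for a typical library and why an ANN index becomes attractive as the segment count grows.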

It all lives in one application-data directory you own. There is no cloud database and no embeddings API — a property you can verify, not just take on faith.

Staying fresh: the part most tools skip

A library is never static. You drop in new footage, re-export an edit, rename a folder. A naïve indexer either re-processes everything (wasteful) or trusts file paths (wrong the moment you edit a file in place). MediaFind instead keys on a cheap fingerprint of size plus modified-time: if both are unchanged, the work is already done; if either has changed, only that file is reprocessed.

Reindex is a diff, not a rebuild. Fingerprint each file by its size and modified-time, compare to the index, and act only on the delta: skip unchanged, reprocess changed, ingest new, prune deleted. A 3,000-file library with three new clips costs three jobs, not three thousand.
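The diff described above can be sketched in a few lines. This is a minimal illustration under assumptions: `fingerprint`, `scan`, and `plan_reindex` are hypothetical names, and the index is modeled as a plain dict from path to `(size, mtime_ns)`. Keyed this way, a renamed file shows up as one prune plus one ingest; how MediaFind itself treats renames is not covered here.

```python
import os


def fingerprint(path: str) -> tuple:
    """Cheap fingerprint: file size plus modified-time in nanoseconds."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns)


def scan(root: str) -> dict:
    """Walk a folder and fingerprint every file under it."""
    found = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            found[full] = fingerprint(full)
    return found


def plan_reindex(indexed: dict, on_disk: dict) -> dict:
    """Compare stored fingerprints to the disk and return only the delta."""
    return {
        "skip": sorted(p for p in on_disk
                       if p in indexed and indexed[p] == on_disk[p]),
        "reprocess": sorted(p for p in on_disk
                            if p in indexed and indexed[p] != on_disk[p]),
        "ingest": sorted(p for p in on_disk if p not in indexed),
        "prune": sorted(p for p in indexed if p not in on_disk),
    }
```

Only the `reprocess` and `ingest` lists turn into jobs, so the cost of a reindex tracks the number of files that changed rather than the size of the library.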

Staleness can still arise: if embeddings were built by an older model, or a file changed while a worker was busy, the index flags those entries and offers a one-click refresh rather than silently serving outdated results. It is honest about what it knows and explicit about what needs redoing.
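One way to picture the stale check is shown below. It is a sketch under assumptions, not the shipped logic: the `Entry` record, the `model_version` field, and `mark_stale` are all hypothetical, and fingerprints are the same `(size, mtime_ns)` pairs used for reindexing.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    path: str
    fingerprint: tuple       # (size, mtime_ns) captured when the job started
    model_version: str       # hypothetical tag for the model that built it
    stale: bool = False


def mark_stale(entries, current_model: str, on_disk: dict) -> list:
    """Flag entries whose model is outdated or whose file has since changed."""
    flagged = []
    for entry in entries:
        outdated_model = entry.model_version != current_model
        current = on_disk.get(entry.path)
        # A mismatch means the file was edited after the job fingerprinted it.
        changed_since = current is not None and current != entry.fingerprint
        entry.stale = outdated_model or changed_since
        if entry.stale:
            flagged.append(entry.path)
    return flagged
```

The flagged paths are what a refresh action would re-enqueue; everything else keeps serving results as before.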

Why build it this way

Every architectural choice here bends toward the same two goals: scale on a laptop and never phone home. A bounded job queue keeps a big library tractable on consumer hardware. A file-based index keeps the whole thing portable and serverless. A cheap size-and-modified-time fingerprint keeps it correct over months of edits. None of it requires — or permits — a backend.

With a fresh, queryable index in place, MediaFind can do more than find things — it can organize them. Next up: zero-shot categories and a knowledge map that turns your library into a browsable graph.

Index a real library and see

Point it at a folder and watch it work — locally, with live progress. Free trial.

Download for macOS
