Ask your library with local RAG and citations

Once your library is a pile of timestamped, embedded segments, two different questions become possible. Search answers “where is this?”: it ranks segments and links you to them. Ask answers “what happened?”: it reads the relevant segments and builds a reply from the lines that answer your question, with the receipts attached. (The bundled on-device model writes that reply by default; an optional higher-quality model is one download away. More on that below.)

The temptation is to pipe a transcript into a cloud model and call it a day. We don't, for the same reason we don't upload audio: the transcript of a deposition, a 1:1, or an unreleased cut is exactly the thing you don't want leaving the machine. So Ask is retrieval-grounded answering that runs locally. Retrieval does the heavy lifting: it finds the source segments first, then the bundled Mini model answers only from that local context. The tight retrieval is exactly what lets a small local model punch above its weight while citations keep every answer checkable.

Ask is retrieve-then-read. The question and your transcript live in the same vector space, so finding the right segments is a nearest-neighbour lookup — and the answer is built only from evidence that retrieval handed it.

Step 1: The question becomes a vector

Every segment in your library was already embedded when it was indexed (see the search deep-dive). Ask embeds your question with the same model, into the same space. That's the whole trick behind retrieval: a question and the answer that satisfies it land near each other geometrically, even when they share no words. “What's our refund window?” sits close to “…customers can return it within thirty days…” without either string mentioning the other.
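To make "near each other geometrically" concrete, here is a minimal sketch using cosine similarity, a common closeness measure for embeddings. The three-dimensional vectors are made up for illustration (real embeddings have hundreds of dimensions), and nothing here is MediaFind's actual model or code.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings, hand-picked for the example.
question = [0.9, 0.1, 0.2]        # "What's our refund window?"
refund_segment = [0.8, 0.2, 0.1]  # "...customers can return it within thirty days..."
lunch_segment = [0.1, 0.9, 0.3]   # an unrelated segment

sim_refund = cosine_similarity(question, refund_segment)
sim_lunch = cosine_similarity(question, lunch_segment)

# The refund segment scores far higher even though it shares no words
# with the question: closeness is measured on the vectors, not the strings.
print(f"refund: {sim_refund:.2f}  lunch: {sim_lunch:.2f}")
```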

Step 2: Retrieve the few segments that matter

We pull the top handful of nearest segments, not the whole transcript. This is the part people skip, and it's the part that makes a faithful answer possible, whichever local model writes the reply. A small model handed three precise, relevant passages tends to write a faster, more faithful answer than a large model drowning in a two-hour transcript.

Retrieval also crosses file boundaries for free. Ask a question and the best evidence might come from three different recordings made months apart — the index doesn't care which file a segment came from, only how close its vector is.
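A rough sketch of that lookup, assuming a brute-force scan over an in-memory list (a real index would likely use an approximate nearest-neighbour structure). The `Segment` type, the vectors, and every file name other than mtg-q3.mp4 are hypothetical.

```python
import heapq
import math
from dataclasses import dataclass

@dataclass
class Segment:
    source: str    # file the segment came from
    start: float   # start time in seconds
    text: str
    vector: list   # embedding (toy 3-d values here)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def retrieve(query_vector, index, k=3):
    """Return the k segments closest to the query, best first.

    The source file is never consulted: only vector closeness counts.
    """
    return heapq.nlargest(k, index, key=lambda seg: cosine(query_vector, seg.vector))

index = [
    Segment("mtg-q3.mp4", 708.0, "push the price increase to January", [0.9, 0.1, 0.1]),
    Segment("mtg-q3.mp4", 1623.0, "grandfather existing customers", [0.8, 0.3, 0.1]),
    Segment("standup-may.mp4", 95.0, "pricing page copy needs review", [0.7, 0.1, 0.4]),
    Segment("offsite.mov", 3020.0, "lunch is at noon", [0.1, 0.2, 0.9]),
]

top = retrieve([1.0, 0.2, 0.1], index, k=3)
for seg in top:
    print(seg.source, seg.start, seg.text)
```

The three hits span two recordings, and the off-topic segment never reaches the model.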

Why retrieval beats a bigger context window. Stuffing an entire transcript into a model is slow, expensive, and, ironically, often less accurate: relevant lines get lost in the noise (the “needle in a haystack” problem). Retrieving the right 1% first means the model reasons over signal, not filler.

Step 3: Build the answer with the bundled model, or an optional upgrade

The default answer uses the bundled Mini on-device model: MediaFind retrieves the segments that best answer your question, sends only that local context to the model, and keeps the citations pinned to the source moments. No cloud, no account, no API key, and no first-run model download in the packaged app.

If you want higher-quality synthesis, you can turn on the optional Small model (it downloads once, then runs entirely on-device). Either way, the model gets the same retrieved segments under a strict instruction: answer using only what's provided, and if the evidence doesn't contain the answer, say so. Each passage carries its timestamp and source, so the model attributes claims rather than inventing them. The prompt looks something like this:

Answer the question using ONLY the context below.
If the context doesn't contain the answer, say you don't know.
Cite the segments you used by their [id].

[1] (mtg-q3.mp4 · 11:48) …we agreed to push the price increase to January…
[2] (mtg-q3.mp4 · 27:03) …grandfather existing customers at the old rate…

Question: what did we decide about the pricing change?

That grounding is the single biggest lever against hallucination. The model isn't asked to recall facts from training; it's asked to read and summarize a small, supplied document. That's a task small local models are genuinely good at.
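Assembling a prompt like the one above takes only a few lines. This is a sketch of one way to do it, not MediaFind's code: `build_prompt`, `format_timestamp`, and the `(source, start_seconds, text)` tuple layout are names and shapes assumed for the example.

```python
def format_timestamp(seconds):
    """Render seconds as M:SS (minutes keep counting past 59 in this sketch)."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes}:{secs:02d}"

def build_prompt(question, passages):
    """Number each retrieved passage and wrap it in the grounding instructions."""
    lines = [
        "Answer the question using ONLY the context below.",
        "If the context doesn't contain the answer, say you don't know.",
        "Cite the segments you used by their [id].",
        "",
    ]
    for i, (source, start, text) in enumerate(passages, start=1):
        lines.append(f"[{i}] ({source} · {format_timestamp(start)}) {text}")
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)

# Passages as (source file, start time in seconds, text).
passages = [
    ("mtg-q3.mp4", 708, "…we agreed to push the price increase to January…"),
    ("mtg-q3.mp4", 1623, "…grandfather existing customers at the old rate…"),
]

prompt = build_prompt("what did we decide about the pricing change?", passages)
print(prompt)
```

Because the ids are assigned at prompt time, the same list can later turn a cited [id] back into a file and a timestamp.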

Step 4: An answer you can click

Either way, the result isn't a wall of text you have to trust. Every claim is tied back to the segments that produced it, and each citation is a link to the exact second in the source file. With the optional model on, the answer reads like prose:

“You decided to push the price increase to January [1] and grandfather existing customers at the old rate [2].” — and [1] jumps to mtg-q3.mp4 at 11:48.

With the default local answer, you get synthesized prose plus the same two citations. The link back to the tape is the grounding contract.

If the answer looks off, you don't argue with a chatbot — you click the citation and check the tape. Grounded answers with a path back to the source are the difference between a search tool you can rely on and a party trick.
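Turning a citation into a jump target can be sketched as a lookup from each [id] in the answer back to the passage list that was sent to the model. The function name and the tuple layout below are assumptions for illustration. Dropping ids that were never supplied is one possible guard against a model citing something that isn't there.

```python
import re

def resolve_citations(answer, passages):
    """Map each [n] marker in the answer to (source file, start time in seconds).

    Ids outside the supplied passage list are ignored rather than guessed at.
    """
    links = {}
    for match in re.finditer(r"\[(\d+)\]", answer):
        n = int(match.group(1))
        if 1 <= n <= len(passages):
            source, start, _text = passages[n - 1]
            links[n] = (source, start)
    return links

passages = [
    ("mtg-q3.mp4", 708, "…we agreed to push the price increase to January…"),
    ("mtg-q3.mp4", 1623, "…grandfather existing customers at the old rate…"),
]
answer = (
    "You decided to push the price increase to January [1] and "
    "grandfather existing customers at the old rate [2]."
)

links = resolve_citations(answer, passages)
print(links)  # [1] resolves to mtg-q3.mp4 at 708 seconds, i.e. 11:48
```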

What it costs you: nothing, and no one

There's no API key, no per-query meter, and no upload. The embedding model and bundled Mini LLM run on-device. If you want higher-quality synthesis, Small is a one-tap download and then runs offline too. You can confirm the boundary the same way you can for transcription:

$ mediafind audit
✓ core path opened 0 external sockets.

Your questions — which are often more revealing than the recordings themselves — stay on your machine.

Ask is what you build once retrieval is solid. It leans entirely on the embedded-segment index from the search post, and on the timestamped transcript from the transcription post — every feature in MediaFind keeps cashing in on that one good artifact.

