Which local LLM should you pick?

When you turn on MediaFind's optional on-device LLM, you're asked to choose a tier — from Mini up to Ultra. Most "which model?" advice online is written for people building AI systems, full of benchmark tables and quantization jargon. This isn't that. The only question that matters here is practical: what's the largest model that will run comfortably on your Mac — and do you even need it?

The honest answer is that the recommended middle option is right for most people, and the biggest model is right for fewer people than you'd think. Let's see why.

The four dials behind one choice

Every tier is one setting of four dials that all move together. Understand these and the whole decision collapses into common sense:

Download size — a one-time fetch, from ~0.4 GB to ~9 GB. You pay it once; afterwards the model runs fully offline.
RAM — the recurring cost. The model has to live in memory while it answers, competing with everything else open on your Mac. This is the dial that actually decides what you can run.
Speed — bigger models think more slowly. On Apple Silicon with Metal acceleration the gap is tolerable; on an older CPU, a large model can crawl.
Answer quality — bigger models write more fluently and, crucially, are better at synthesis: weaving several clips into one coherent answer rather than paraphrasing a single segment.

Quality has a ceiling here — and that's by design. In MediaFind the model is a rephraser sitting after retrieval, not an oracle answering from memory. The keyless pipeline finds the segments that answer your question first; the LLM only makes them read nicely. So a bigger model gives you smoother prose and better cross-clip synthesis — but it can't invent facts your media doesn't contain, and it won't make a small library "smarter." The grounding is what keeps answers true; the model size only changes the polish.
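To make "rephraser, not oracle" concrete, here is a minimal sketch of how a grounded prompt could be assembled once retrieval has run. It is an illustration, not MediaFind's actual code: the Segment shape, the build_grounded_prompt name, the prompt wording and the sample clip are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One retrieved piece of media (hypothetical shape, for illustration)."""
    source: str   # file name of the clip
    start_s: int  # offset in seconds where the segment begins
    text: str     # transcript text of the segment


def build_grounded_prompt(question: str, segments: list[Segment]) -> str:
    """Build a prompt that asks the model to rephrase only what was retrieved."""
    if not segments:
        # Nothing was retrieved, so there is nothing for the model to rephrase.
        return ""
    lines = [
        "Answer the question using only the numbered segments below.",
        "If they do not contain the answer, say so.",
        "",
    ]
    for i, seg in enumerate(segments, start=1):
        lines.append(f"[{i}] {seg.source} @ {seg.start_s}s: {seg.text}")
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)


# Made-up sample data, not taken from any real library.
sample = [Segment("standup.m4a", 95, "Dana said the budget is frozen until Q3.")]
print(build_grounded_prompt("What did Dana say about the budget?", sample))
```

Whatever the model size, the only facts on the page are the ones in the numbered segments. A larger model changes how well they are merged and worded, not what they say.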

The ladder, with real numbers

MediaFind's tiers all come from the same well-regarded open model family (Qwen2.5-Instruct), just at different parameter counts, each quantized to a compact Q4_K_M GGUF file and run with llama.cpp. Here's the actual trade-off:

A ladder, not a catalogue. Mini is bundled and keyless; Small is the recommended upgrade. Higher tiers buy fluency and synthesis at a real cost in disk, RAM and speed.

Tier	Parameters · download size	RAM to keep free	Best for
Mini	0.5B · ~0.4 GB bundled	~1 GB	Old or low-RAM Macs. Works out of the box; fine for short, single-clip answers.
Small	1.5B · ~1 GB	~2 GB	Most people. Fluent, faithful answers; fast on nearly any machine. Start here.
Medium	3B · ~2 GB	~3 GB	8 GB+ Macs that ask questions spanning several clips and want better synthesis.
Large	7B · ~4.7 GB	~6 GB	16 GB Apple Silicon. Nuanced answers; noticeably slow on a plain CPU.
Ultra	14B · ~9 GB	11 GB+	16–32 GB Macs with Metal. The most demanding questions, at the highest cost.

A rule that fits on one line

If you don't want to think about it: start on Small. It's the recommended tier because it's the point where answers become genuinely fluent without asking much of your machine. If answers feel thin when you're asking questions that span many clips, step up one tier. If your Mac feels sluggish while answering, step down one. You can change your mind at any time — switching tiers just downloads (or reuses) a different file.
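The step-up, step-down rule is simple enough to write as a few lines of code. This is only a sketch of the rule as stated: the next_tier function and its two flags are hypothetical names, and letting "sluggish" win when both complaints apply is an assumption of this example, not something the app is known to do.

```python
# Tier names in ladder order, smallest first.
TIERS = ["Mini", "Small", "Medium", "Large", "Ultra"]


def next_tier(current: str, answers_feel_thin: bool, mac_feels_sluggish: bool) -> str:
    """Move one rung up or down the ladder, staying inside its ends."""
    i = TIERS.index(current)
    if mac_feels_sluggish:
        # Assumption: a struggling machine takes priority, so step down first.
        i = max(i - 1, 0)
    elif answers_feel_thin:
        i = min(i + 1, len(TIERS) - 1)
    return TIERS[i]


print(next_tier("Small", answers_feel_thin=True, mac_feels_sluggish=False))  # Medium
print(next_tier("Small", answers_feel_thin=False, mac_feels_sluggish=True))  # Mini
```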

Two longer rules of thumb behind that:

Match the model to your RAM, not your ambition. The headline number to respect is memory. A 7B or 14B model that has to swap because your RAM is full will be slower and worse than a 1.5B model that fits — bigger is only better when it actually fits. Check how much memory you typically have free, and leave the model room to breathe alongside your browser and editor.
Bigger helps synthesis, not facts. Because the model only rephrases retrieved segments, you feel its size most on questions like "summarize what these five meetings decided" — where it has to merge many sources. For "what did Dana say about the budget?", even Mini does fine. If your questions are mostly look-ups, don't pay for Ultra.
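Both rules of thumb can be combined into one small decision, sketched below. The RAM figures are the "RAM to keep free" values from the tier table (Ultra's 11 GB is a minimum). The recommend_tier function, the mostly_lookups flag, and the choice to stop at Small for look-up questions are assumptions made for this illustration rather than MediaFind's own logic.

```python
# "RAM to keep free" per tier in GB, smallest tier first (Ultra's value is a minimum).
RAM_TO_KEEP_FREE_GB = {"Mini": 1, "Small": 2, "Medium": 3, "Large": 6, "Ultra": 11}


def recommend_tier(free_ram_gb: float, mostly_lookups: bool) -> str:
    """Pick a tier from typical free RAM and the kind of questions asked."""
    # Rule 1: only consider tiers whose RAM need fits in what is usually free.
    fitting = [t for t, need in RAM_TO_KEEP_FREE_GB.items() if need <= free_ram_gb]
    if not fitting:
        return "Mini"  # the bundled tier is the fallback
    # Rule 2: look-ups do not benefit much from size, so stop at Small.
    if mostly_lookups:
        return "Small" if "Small" in fitting else "Mini"
    # Synthesis-heavy questions: take the largest tier that still fits.
    return fitting[-1]


print(recommend_tier(free_ram_gb=8, mostly_lookups=False))  # Large
print(recommend_tier(free_ram_gb=8, mostly_lookups=True))   # Small
```

The input that matters is memory you typically have free with your usual apps open, not the total installed in the machine.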

What you're not trading away

Whichever tier you choose, three things never change: it runs entirely on your Mac, it needs no API key or account, and downloaded tiers fetch once and then work offline. Mini remains the bundled fallback when you want the smallest footprint; larger tiers only change answer quality and speed.

Where to choose: open Settings → Ask & summaries model in MediaFind. Each tier shows its size, RAM hint and a one-line description right next to the button, and marks which ones you've already downloaded — so you're picking against real numbers, not guessing. Selecting a tier re-points both Ask and per-file summaries to it.

Picking the model is half the story; the other half is what it's grounded on. Next: how Ask retrieves the right few segments and ties every claim to a link at the exact second — the part that makes a fluent answer trustworthy.

Try it — pick a tier

Keyless on-device answers out of the box, larger local models when you want more synthesis. Everything runs on your Mac. Free trial.

Download for macOS

Keep reading

A local LLM, downloaded once: fluent Ask & summaries that stay keyless · Local AI
Ask your library: local RAG over your own media, with citations · Ask
Pick your Whisper: model tiers and a CoreML engine for Apple Silicon · Transcription
