Which local LLM should you pick? A plain-language guide to the trade-offs
MediaFind lets you run an on-device model to make Ask and summaries read more fluently — but it asks you to pick a size first. Bigger looks better on paper and isn't always the right call. Here's what each tier actually costs, what it buys, and a one-line rule for choosing.
When you turn on MediaFind's optional on-device LLM, you're asked to choose a tier — from Mini up to Ultra. Most "which model?" advice online is written for people building AI systems, full of benchmark tables and quantization jargon. This isn't that. The only question that matters here is practical: what's the largest model that will run comfortably on your Mac — and do you even need it?
The honest answer is that the recommended middle option is right for most people, and the biggest model is right for fewer people than you'd think. Let's see why.
The four dials behind one choice
Every tier is a single point on four sliders that all move together. Understand these and the whole decision collapses into common sense:
- Download size — a one-time fetch, from ~0.4 GB to ~9 GB. You pay it once; afterwards the model runs fully offline.
- RAM — the recurring cost. The model has to live in memory while it answers, competing with everything else open on your Mac. This is the dial that actually decides what you can run.
- Speed — bigger models think more slowly. On Apple Silicon with Metal acceleration the gap is tolerable; on an older CPU, a large model can crawl.
- Answer quality — bigger models write more fluently and, crucially, are better at synthesis: weaving several clips into one coherent answer rather than paraphrasing a single segment.
The ladder, with real numbers
MediaFind's tiers all come from the same well-regarded open model family (Qwen2.5-Instruct), just at different parameter counts, quantized to a compact Q4_K_M GGUF and run with llama.cpp. Here's the actual trade-off:
| Tier | Parameters · download | RAM to keep free | Best for |
|---|---|---|---|
| Mini | 0.5B · ~0.4 GB bundled | ~1 GB | Old or low-RAM Macs. Works out of the box; fine for short, single-clip answers. |
| Small | 1.5B · ~1 GB | ~2 GB | Most people. Fluent, faithful answers; fast on nearly any machine. Start here. |
| Medium | 3B · ~2 GB | ~3 GB | Macs with 8 GB or more, when your questions span several clips and you want better synthesis. |
| Large | 7B · ~4.7 GB | ~6 GB | Apple Silicon Macs with 16 GB. Nuanced answers; noticeably slow on a plain CPU. |
| Ultra | 14B · ~9 GB | 11 GB+ | 16–32 GB Macs with Metal. The most demanding questions, at the highest cost. |
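If it helps to see the table as a lookup, here is a minimal sketch in Python. The figures are copied from the table above; the `TIERS` list and the `largest_tier_that_fits` function are illustrative names, not part of MediaFind, and the result is only a ceiling on what fits, not a recommendation.

```python
# Tier figures copied from the table above.
# The data layout and function name are hypothetical, for illustration only.
TIERS = [
    # (name, parameters, download_gb, ram_to_keep_free_gb)
    ("Mini", "0.5B", 0.4, 1.0),
    ("Small", "1.5B", 1.0, 2.0),
    ("Medium", "3B", 2.0, 3.0),
    ("Large", "7B", 4.7, 6.0),
    ("Ultra", "14B", 9.0, 11.0),
]


def largest_tier_that_fits(free_ram_gb: float) -> str:
    """Return the biggest tier whose RAM figure fits in free_ram_gb.

    Falls back to Mini (the bundled tier) when nothing else fits.
    """
    best = TIERS[0][0]
    for name, _params, _download_gb, ram_gb in TIERS:
        if ram_gb <= free_ram_gb:
            best = name
    return best


print(largest_tier_that_fits(4.0))   # Medium
print(largest_tier_that_fits(12.0))  # Ultra
```

Treat the answer as an upper bound: a Mac with 12 GB free can run Ultra, but that does not mean your questions need it.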
A rule that fits on one line
If you don't want to think about it: start on Small. It's the recommended tier because it's the point where answers become genuinely fluent without asking much of your machine. If answers feel thin when you're asking questions that span many clips, step up one tier. If your Mac feels sluggish while answering, step down one. You can change your mind at any time — switching tiers just downloads (or reuses) a different file.
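The step-up, step-down rule can be written as a few lines of Python. This is a sketch under assumptions: `ORDER` and `adjust_tier` are hypothetical names, and letting "sluggish" win when both complaints apply is our reading of the rule, not something MediaFind enforces.

```python
# Tier names in ascending size, as listed in this guide.
ORDER = ["Mini", "Small", "Medium", "Large", "Ultra"]


def adjust_tier(current: str, answers_feel_thin: bool, mac_feels_sluggish: bool) -> str:
    """Move one tier up or down from `current`, staying within the ladder.

    Assumption: if the Mac feels sluggish, stepping down takes priority,
    because a model that does not fit comfortably is not worth keeping.
    """
    i = ORDER.index(current)
    if mac_feels_sluggish:
        i -= 1  # step down one tier
    elif answers_feel_thin:
        i += 1  # step up one tier
    # Clamp so Mini cannot step down and Ultra cannot step up.
    return ORDER[max(0, min(i, len(ORDER) - 1))]


print(adjust_tier("Small", answers_feel_thin=True, mac_feels_sluggish=False))  # Medium
print(adjust_tier("Large", answers_feel_thin=False, mac_feels_sluggish=True))  # Medium
```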
Two longer rules of thumb behind that:
- Match the model to your RAM, not your ambition. The headline number to respect is memory. A 7B or 14B model that has to swap because your RAM is full will be slower, and a worse experience, than a 1.5B model that fits. Bigger is only better when it actually fits. Check how much memory you typically have free (Activity Monitor's Memory tab shows it), and leave the model room to breathe alongside your browser and editor.
- Bigger helps synthesis, not facts. Because the model only rephrases retrieved segments, you feel its size most on questions like "summarize what these five meetings decided" — where it has to merge many sources. For "what did Dana say about the budget?", even Mini does fine. If your questions are mostly look-ups, don't pay for Ultra.
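Taken together, the two rules say: memory sets the ceiling, and the kind of question decides how close to it you need to go. A rough Python sketch follows. The names are hypothetical, and capping look-up use at Small is an assumption drawn from "start on Small" and "even Mini does fine", not a MediaFind setting.

```python
# Tier names in ascending size, as listed in this guide.
ORDER = ["Mini", "Small", "Medium", "Large", "Ultra"]


def recommend(ram_ceiling: str, mostly_lookups: bool) -> str:
    """Pick a tier from the RAM ceiling and the kind of questions asked.

    ram_ceiling: the largest tier that fits in memory, however you worked it out.
    mostly_lookups: True for single-clip questions, False for multi-clip synthesis.
    Assumption: look-ups never need more than Small, the recommended tier.
    """
    need_cap = "Small" if mostly_lookups else "Ultra"
    # The smaller of the two limits wins: never exceed RAM, never overbuy.
    return ORDER[min(ORDER.index(ram_ceiling), ORDER.index(need_cap))]


print(recommend("Ultra", mostly_lookups=True))   # Small
print(recommend("Large", mostly_lookups=False))  # Large
```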
What you're not trading away
Whichever tier you choose, three things never change: it runs entirely on your Mac, it needs no API key or account, and any tier you download is fetched once and then works offline. Mini remains the bundled fallback when you want the smallest footprint; larger tiers change answer quality, speed, and resource use, and nothing else.
Picking the model is half the story; the other half is what it's grounded on. Next: how Ask retrieves the right few segments and ties every claim to a link at the exact second — the part that makes a fluent answer trustworthy.
Try it — pick a tier
Keyless on-device answers out of the box, larger local models when you want more synthesis. Everything runs on your Mac. Free trial.