
Pick your Whisper: model tiers and a CoreML engine for Apple Silicon

How accurate your transcripts are — and how long they take — comes down to two choices most apps hide: which size of Whisper model, and which engine runs it. MediaFind surfaces both as a one-time, plain-language picker, and adds a whisper.cpp path that puts your Mac's GPU and Neural Engine to work.

Our earlier post on transcription walked the pipeline: decode with ffmpeg, run Whisper locally, get word-level timestamps, never touch the cloud. This one is about the two dials that decide how good and how fast that transcription is — and why MediaFind shows them to you once, in plain language, instead of burying them or pretending there's only one answer.

Dial one: model size is the real accuracy knob

Whisper ships in sizes, and the size is the trade-off. tiny is quick and light but error-prone; small, medium, and large-v3 get progressively more accurate — and are the documented fix for non-English audio — at the cost of more download, more RAM, and more time per file. There's no universally right pick: a podcaster batch-transcribing clean English speech wants something different from someone captioning noisy multilingual field recordings.

So MediaFind asks, once, with a first-run picker that frames each tier as an honest one-liner rather than a parameter count:

Model tiers (more download, RAM, and time per file → more accuracy):

tiny: fastest, lightest; rough on noise
small: good default
medium: more accurate
large-v3: most accurate; best non-English

Chosen once on first transcription · weights fetched once into the local cache · switch anytime in Settings.

The picker is a ladder with honest rungs. Bigger models read messy and non-English audio far better, but cost download, memory, and time. MediaFind states the trade-off in words and lets you choose, rather than silently defaulting you to one extreme or the other.

Like every model in MediaFind, the weights are open and fetched once on first use into a local cache, then run fully offline. Changed your mind? Switch tiers in Settings; the next file uses the new one.
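
To make the ladder concrete, here is a minimal Python sketch of how a tier picker like this could be modeled. It is not MediaFind's code: the `Tier` class, `resolve_tier`, `picker_lines`, and the choice to fall back to `small` when no valid setting is saved are all assumptions made for illustration. Only the tier names and one-liners come from the picker itself.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Tier:
    name: str   # Whisper model size
    blurb: str  # plain-language one-liner shown in the picker


# Ordered from lightest to most accurate, mirroring the picker's ladder.
TIERS = (
    Tier("tiny", "fastest, lightest; rough on noise"),
    Tier("small", "good default"),
    Tier("medium", "more accurate"),
    Tier("large-v3", "most accurate; best non-English"),
)

# Assumption: an unset or unrecognized setting resolves to the tier
# the picker labels "good default".
FALLBACK_TIER = "small"


def resolve_tier(saved_choice: Optional[str]) -> Tier:
    """Return the Tier for a saved setting, or the fallback tier."""
    by_name = {t.name: t for t in TIERS}
    return by_name.get(saved_choice, by_name[FALLBACK_TIER])


def picker_lines() -> List[str]:
    """Render one aligned line per tier for a text-only picker."""
    return [f"{t.name:<9} {t.blurb}" for t in TIERS]


if __name__ == "__main__":
    print("\n".join(picker_lines()))
    print("next file uses:", resolve_tier("large-v3").name)
```

Because the tier is looked up each time a file is queued, changing the saved setting affects the next file only, which matches the "switch anytime" behavior described above.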

Dial two: the engine that runs it

A model is just weights — something has to execute them. MediaFind offers two engines, because the right one depends on your hardware:

Whisper weights (your chosen tier) run on one of two engines:

faster-whisper: CPU · NVIDIA CUDA · default
whisper.cpp: Apple Metal GPU + CoreML → Apple Silicon Metal GPU · Neural Engine (idle on CPU-only paths)
whisper.cpp unavailable → falls back to faster-whisper

One model, two engines. whisper.cpp lights up the Metal GPU and the Neural Engine on Apple Silicon; if that path isn't available, transcription quietly falls back to faster-whisper. You never end up with no transcription because an accelerator was missing.
Nothing here can hard-fail. The whisper.cpp engine and its CoreML encoder are both optional. A missing engine, uncached weights, or a failed load isn't an exception that stops a batch — it's reported as a status the UI can show, and transcription proceeds on the engine that is available. Choosing the fast path is a free upgrade, never a risk.
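
The "status, not exception" idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not MediaFind's implementation: `EngineStatus`, `choose_engine`, and the `load_whisper_cpp` callable are hypothetical names, and the loader stands in for whatever actually initializes whisper.cpp and its CoreML encoder.

```python
from dataclasses import dataclass
from typing import Callable

FALLBACK = "faster-whisper"


@dataclass(frozen=True)
class EngineStatus:
    requested: str  # engine the user picked
    active: str     # engine that will actually run
    note: str       # human-readable status the UI can show


def choose_engine(requested: str, load_whisper_cpp: Callable[[], None]) -> EngineStatus:
    """Pick an engine without raising; problems become a status note."""
    if requested != "whisper.cpp":
        return EngineStatus(requested, FALLBACK, "using the default engine")
    try:
        # Hypothetical loader: may raise for a missing engine,
        # uncached weights, or a failed load.
        load_whisper_cpp()
    except Exception as exc:  # deliberately broad: a batch must never stop here
        note = f"whisper.cpp unavailable ({exc}); fell back to {FALLBACK}"
        return EngineStatus(requested, FALLBACK, note)
    return EngineStatus(requested, "whisper.cpp", "whisper.cpp loaded")


def broken_loader() -> None:
    """Simulates a failed accelerator load for the example below."""
    raise RuntimeError("CoreML encoder not found")


status = choose_engine("whisper.cpp", broken_loader)
print(status.active)
print(status.note)
```

The caller always gets back an engine it can run, plus a note explaining what happened, so a missing accelerator degrades speed rather than stopping the batch.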

What you get for the trouble

Beyond the words themselves, the bigger models and the right engine buy you better metadata: word-level timestamps and per-segment confidence that get persisted alongside the text. That's what makes "jump to the exact second someone said this" precise rather than approximate, and what lets the UI flag a shaky transcript instead of presenting every guess as gospel. Accuracy isn't just fewer typos — it's timestamps you can trust to cut on.
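
As a rough picture of what that persisted metadata enables, the Python sketch below models words with timestamps and segments with a confidence score, then uses them to jump to a spoken word and to flag shaky segments. The schema, the 0-to-1 confidence scale, the 0.6 threshold, and the sample data are all assumptions for illustration; the post does not specify how MediaFind stores or scores these values.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the file
    end: float


@dataclass
class Segment:
    text: str
    confidence: float  # assumed scale: 0.0 (guess) to 1.0 (certain)
    words: List[Word] = field(default_factory=list)


def find_word(segments: List[Segment], query: str) -> Optional[float]:
    """Return the start time of the first word matching query, else None."""
    needle = query.lower()
    for seg in segments:
        for w in seg.words:
            # Ignore trailing punctuation and case when comparing.
            if w.text.strip(".,!?").lower() == needle:
                return w.start
    return None


def shaky_segments(segments: List[Segment], threshold: float = 0.6) -> List[Segment]:
    """Return segments whose confidence falls below the threshold."""
    return [s for s in segments if s.confidence < threshold]


# Made-up sample data, not real transcription output.
segments = [
    Segment("Ship it on Friday.", 0.93, [
        Word("Ship", 12.40, 12.71), Word("it", 12.71, 12.80),
        Word("on", 12.80, 12.95), Word("Friday.", 12.95, 13.50),
    ]),
    Segment("uh the thing", 0.41, [
        Word("uh", 14.10, 14.30), Word("the", 14.30, 14.42),
        Word("thing", 14.42, 14.80),
    ]),
]

print(find_word(segments, "friday"))               # 12.95
print([s.text for s in shaky_segments(segments)])  # ['uh the thing']
```

With word-level times, a search hit maps to a seek position instead of a segment-wide range, and the confidence field gives the UI something concrete to base a "low confidence" marker on.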

Why expose any of this

The easy product decision is to hide both dials and ship one model on one engine. We didn't, because the honest answer to "which model should I use?" really is "it depends" — on your audio, your hardware, and how much time you'll trade for accuracy. A one-time picker with plain-language trade-offs respects that without turning the app into a control panel: most people pick once and forget it, and the people with noisy multilingual footage or a Neural Engine to exploit get to make the call that's right for them. All of it stays keyless, on-device, and offline after the first download.


Accurate, well-timestamped transcripts are the raw material everything else reads from — search, Ask, chapters, and the people who said it. Next: how those words become a searchable semantic index.
