Pick your Whisper model and engine

Our earlier post on transcription walked through the pipeline: decode with ffmpeg, run Whisper locally, get word-level timestamps, never touch the cloud. This one is about the two dials that decide how good and how fast that transcription is, and why MediaFind shows them to you once, in plain language, instead of burying them or pretending there's only one answer.

Dial one: model size is the real accuracy knob

Whisper ships in several sizes, and the size is the trade-off. tiny is quick and light but error-prone; small, medium, and large-v3 get progressively more accurate (and are the documented fix for non-English audio) at the cost of a larger download, more RAM, and more time per file. There's no universally right pick: a podcaster batch-transcribing clean English speech wants something different from someone captioning noisy multilingual field recordings.

So MediaFind asks once, with a first-run picker that frames each tier as an honest one-liner rather than a parameter count.

The picker is a ladder with honest rungs. Bigger models read messy and non-English audio far better, but cost download, memory, and time. MediaFind states the trade-off in words and lets you choose, rather than silently defaulting you to one extreme or the other.
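To make the "ladder" idea concrete, here is a minimal sketch of how such a picker could be modelled. The `ModelTier` class, the helper names, and the blurb wording are all hypothetical illustrations, not MediaFind's actual code or copy; only the tier names come from the post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str   # Whisper model size, as named in the post
    blurb: str  # plain-language trade-off (illustrative wording only)


# Ordered from lightest to heaviest; only sizes the post mentions are listed.
TIERS = [
    ModelTier("tiny", "Fastest and lightest; expect more errors."),
    ModelTier("small", "A step up in accuracy for a modest cost."),
    ModelTier("medium", "Better on messy or non-English audio; slower."),
    ModelTier("large-v3", "Most accurate; largest download, most RAM and time."),
]


def render_picker(tiers=TIERS):
    """Return the picker as numbered plain-text lines, lightest tier first."""
    return [f"{i}. {t.name}: {t.blurb}" for i, t in enumerate(tiers, start=1)]


def choose(index, tiers=TIERS):
    """Map a 1-based menu choice to a tier; reject anything off the ladder."""
    if not 1 <= index <= len(tiers):
        raise ValueError(f"choice must be between 1 and {len(tiers)}")
    return tiers[index - 1]
```

Keeping the blurb next to the tier name means the UI never has to translate a parameter count into advice; the trade-off travels with the choice.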

As with every model in MediaFind, the Whisper weights are open, fetched once on first use into a local cache, and then run fully offline. Changed your mind? Switch tiers in Settings; the next file uses the new one.
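The fetch-once behaviour can be sketched as a small cache check. This is an assumption-laden illustration: `ensure_weights`, the `whisper-<tier>.bin` file naming, and the injected `fetch` callable are hypothetical and stand in for whatever downloader and cache layout MediaFind really uses.

```python
from pathlib import Path
from typing import Callable


def ensure_weights(tier: str, cache_dir: Path,
                   fetch: Callable[[str, Path], None]) -> Path:
    """Return the local path for a tier's weights, fetching only if absent."""
    target = cache_dir / f"whisper-{tier}.bin"  # hypothetical file naming
    if target.exists():
        return target  # already cached: no network access needed
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Download to a temporary name, then rename, so an interrupted fetch
    # never leaves a half-written file that looks like a valid cache entry.
    partial = target.with_name(target.name + ".part")
    fetch(tier, partial)
    partial.replace(target)
    return target
```

Because the cache is keyed by tier, switching tiers simply resolves a different path on the next file; previously downloaded tiers stay available offline.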

Dial two: the engine that runs it

A model is just weights — something has to execute them. MediaFind offers two engines, because the right one depends on your hardware:

faster-whisper: the default. Built on CTranslate2, it's fast and reliable on CPU and, where present, NVIDIA CUDA. It's the safe engine that works everywhere, and it's what you get unless you choose otherwise.
whisper.cpp: optional, via pywhispercpp. On Apple Silicon it runs the model on the Metal GPU by default and, when available, uses a CoreML encoder that offloads the heaviest stage onto the Neural Engine, the dedicated ML silicon in every Apple Silicon Mac that otherwise sits idle during transcription.

One model, two engines. whisper.cpp lights up the Metal GPU and the Neural Engine on Apple Silicon; if that path isn't available, transcription quietly falls back to faster-whisper. You never end up with no transcription because an accelerator was missing.

Nothing here can hard-fail. The whisper.cpp engine and its CoreML encoder are both optional. A missing engine, uncached weights, or a failed load isn't an exception that stops a batch — it's reported as a status the UI can show, and transcription proceeds on the engine that is available. Choosing the fast path is a free upgrade, never a risk.
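The "status, not exception" behaviour described above might look something like the following sketch. `EngineStatus` and `resolve_engine` are hypothetical names, and this version only checks whether the Python packages are importable (`faster_whisper` and `pywhispercpp` are the import names of the two libraries); a real implementation would presumably also cover uncached weights and failed loads.

```python
import importlib.util
from dataclasses import dataclass
from typing import Callable, Optional

DEFAULT_ENGINE = "faster-whisper"
# Import names of the Python packages behind each engine.
ENGINE_MODULES = {"faster-whisper": "faster_whisper", "whisper.cpp": "pywhispercpp"}


@dataclass(frozen=True)
class EngineStatus:
    engine: Optional[str]  # engine that will actually run, or None
    note: str              # message the UI can display


def _installed(module: str) -> bool:
    return importlib.util.find_spec(module) is not None


def resolve_engine(preferred: str,
                   installed: Callable[[str], bool] = _installed) -> EngineStatus:
    """Pick an engine without raising: preferred first, then the default."""
    # dict.fromkeys keeps order and drops the duplicate when preferred is the default.
    for name in dict.fromkeys([preferred, DEFAULT_ENGINE]):
        module = ENGINE_MODULES.get(name)
        if module and installed(module):
            note = "ok" if name == preferred else f"{preferred} unavailable; using {name}"
            return EngineStatus(name, note)
    return EngineStatus(None, "no transcription engine is installed")
```

For example, `resolve_engine("whisper.cpp")` on a machine without pywhispercpp returns a status naming faster-whisper plus a note for the UI. The final `None` branch is purely defensive in this sketch; the post's design assumes the default engine is always present.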

What you get for the trouble

Beyond the words themselves, the bigger models and the right engine buy you better metadata: word-level timestamps and per-segment confidence that get persisted alongside the text. That's what makes "jump to the exact second someone said this" precise rather than approximate, and what lets the UI flag a shaky transcript instead of presenting every guess as gospel. Accuracy isn't just fewer typos — it's timestamps you can trust to cut on.
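As an illustration of what persisting that metadata could involve, the sketch below converts engine output into plain records. It assumes segments expose faster-whisper-style attributes (`start`, `end`, `text`, `avg_logprob`, and `words` with `start`, `end`, `word`). The record classes, the use of `exp(avg_logprob)` as a confidence score, and the 0.5 threshold are assumptions made for the example, not MediaFind's documented scheme.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Word:
    start: float  # seconds
    end: float    # seconds
    text: str


@dataclass
class SegmentRecord:
    start: float
    end: float
    text: str
    confidence: float  # exp(avg_logprob): an assumed mapping into (0, 1]
    words: list = field(default_factory=list)


def to_record(seg) -> SegmentRecord:
    """Convert an engine segment (faster-whisper-style attributes) to a record."""
    words = [Word(w.start, w.end, w.word.strip()) for w in (seg.words or [])]
    return SegmentRecord(seg.start, seg.end, seg.text.strip(),
                         math.exp(seg.avg_logprob), words)


def shaky(records, threshold=0.5):
    """Segments whose confidence falls below an illustrative threshold."""
    return [r for r in records if r.confidence < threshold]


def first_mention(records, term):
    """Start time in seconds of the first word matching `term`, or None."""
    needle = term.lower()
    for r in records:
        for w in r.words:
            if w.text.lower().strip(".,!?") == needle:
                return w.start
    return None
```

With records like these, "jump to the exact second" becomes a lookup over word start times, and flagging a shaky transcript becomes a filter over segment confidence.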

Why expose any of this

The easy product decision is to hide both dials and ship one model on one engine. We didn't, because the honest answer to "which model should I use?" really is "it depends" — on your audio, your hardware, and how much time you'll trade for accuracy. A one-time picker with plain-language trade-offs respects that without turning the app into a control panel: most people pick once and forget it, and the people with noisy multilingual footage or a Neural Engine to exploit get to make the call that's right for them. All of it stays keyless, on-device, and offline after the first download.

Accurate, well-timestamped transcripts are the raw material everything else reads from: search, Ask, chapters, and identifying who said what. Next: how those words become a searchable semantic index.

Transcribe on your terms

Pick a model, pick an engine, keep it all on your Mac. Free trial.


Keep reading

Which Whisper model should you pick? An ASR model-selection guide · Guide
How MediaFind transcribes your media entirely on-device with Whisper · Transcription
Who said it, who's in it: diarization & face recognition, privately · People & privacy
Search by meaning: embeddings, CLIP and a local vector index · Search
