Pick your Whisper: model tiers and a CoreML engine for Apple Silicon
How accurate your transcripts are — and how long they take — comes down to two choices most apps hide: which size of Whisper model, and which engine runs it. MediaFind surfaces both as a one-time, plain-language picker, and adds a whisper.cpp path that puts your Mac's GPU and Neural Engine to work.
Our earlier post on transcription walked the pipeline: decode with ffmpeg, run Whisper locally, get word-level timestamps, never touch the cloud. This one is about the two dials that decide how good and how fast that transcription is — and why MediaFind shows them to you once, in plain language, instead of burying them or pretending there's only one answer.
Dial one: model size is the real accuracy knob
Whisper ships in sizes, and the size is the trade-off. tiny is quick and light but error-prone; small, medium, and large-v3 get progressively more accurate — and are the documented fix for non-English audio — at the cost of more download, more RAM, and more time per file. There's no universally right pick: a podcaster batch-transcribing clean English speech wants something different from someone captioning noisy multilingual field recordings.
So MediaFind asks once, in a first-run picker that describes each tier with an honest one-liner rather than a parameter count.
Like every model in MediaFind, the weights are open and fetched once on first use into a local cache, then run fully offline. Changed your mind? Switch tiers in Settings; the next file uses the new one.
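The fetch-once, then-offline behavior can be sketched with faster-whisper's own loader. This is a minimal sketch under stated assumptions, not MediaFind's implementation: the cache path, the tier tuple, and the helpers `loader_kwargs` and `load_model` are hypothetical, and the "is it cached?" check is a simple heuristic. `download_root` and `local_files_only` are parameters of `faster_whisper.WhisperModel`.

```python
from pathlib import Path

# Hypothetical tier list for illustration; the names are Whisper model sizes.
TIERS = ("tiny", "small", "medium", "large-v3")

# Hypothetical cache location, not MediaFind's actual path.
DEFAULT_CACHE = Path.home() / ".cache" / "whisper-models-example"


def loader_kwargs(tier, cache_dir=DEFAULT_CACHE):
    """Build WhisperModel arguments for one tier.

    If the tier's weights look cached already, tell the loader to stay
    offline; otherwise allow the one-time download.
    """
    if tier not in TIERS:
        raise ValueError(f"unknown tier: {tier!r}")
    cache_dir = Path(cache_dir)
    # Assumption: a cache entry whose name contains the tier means it was fetched.
    cached = cache_dir.is_dir() and any(tier in p.name for p in cache_dir.iterdir())
    return {
        "model_size_or_path": tier,
        "download_root": str(cache_dir),
        "local_files_only": cached,
    }


def load_model(tier, cache_dir=DEFAULT_CACHE):
    """Load a tier with faster-whisper (lazy import; needs the faster-whisper package)."""
    from faster_whisper import WhisperModel

    kwargs = loader_kwargs(tier, cache_dir)
    return WhisperModel(kwargs.pop("model_size_or_path"), **kwargs)
```

Switching tiers then amounts to calling `load_model` with a different name: the first call for a new tier downloads it, later calls read from the cache.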
Dial two: the engine that runs it
A model is just weights — something has to execute them. MediaFind offers two engines, because the right one depends on your hardware:
- faster-whisper: the default. Built on CTranslate2, it's fast and reliable on CPU and, where present, NVIDIA CUDA. It's the safe run-anywhere engine and what you get unless you choose otherwise.
- whisper.cpp: optional, via pywhispercpp. On Apple Silicon it runs the model on the Metal GPU by default and, when available, uses a CoreML encoder that offloads the heaviest stage onto the Neural Engine, the dedicated ML silicon in every Apple Silicon Mac that otherwise sits idle during transcription.
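The choice between the two engines can be sketched as a small dispatcher. This is an illustrative sketch, not MediaFind's code: `is_apple_silicon`, `choose_engine`, and `transcribe` are hypothetical helpers, and falling back to faster-whisper when whisper.cpp is requested off Apple Silicon is an assumed policy. The library calls are imported lazily, and the 10 ms unit for pywhispercpp's `t0`/`t1` is assumed to follow whisper.cpp's timestamp ticks.

```python
import platform


def is_apple_silicon(system=None, machine=None):
    """True on an arm64 Mac. Arguments default to the running machine."""
    system = platform.system() if system is None else system
    machine = platform.machine() if machine is None else machine
    return system == "Darwin" and machine == "arm64"


def choose_engine(preferred=None, system=None, machine=None):
    """Use whisper.cpp only when requested and on Apple Silicon; else the default."""
    if preferred == "whisper.cpp" and is_apple_silicon(system, machine):
        return "whisper.cpp"
    return "faster-whisper"


def transcribe(path, tier, engine):
    """Yield (start_seconds, end_seconds, text) from whichever engine was chosen."""
    if engine == "whisper.cpp":
        from pywhispercpp.model import Model  # optional dependency

        for seg in Model(tier).transcribe(path):
            # Assumption: t0/t1 are whisper.cpp ticks of 10 ms each.
            yield seg.t0 / 100, seg.t1 / 100, seg.text
    else:
        from faster_whisper import WhisperModel

        segments, _info = WhisperModel(tier).transcribe(path)
        for seg in segments:
            yield seg.start, seg.end, seg.text
```

Keeping the decision in a pure function like `choose_engine` makes the default explicit and keeps the rest of the pipeline unaware of which engine produced the segments.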
What you get for the trouble
Beyond the words themselves, the bigger models and the right engine buy you better metadata: word-level timestamps and per-segment confidence that get persisted alongside the text. That's what makes "jump to the exact second someone said this" precise rather than approximate, and what lets the UI flag a shaky transcript instead of presenting every guess as gospel. Accuracy isn't just fewer typos — it's timestamps you can trust to cut on.
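To make the metadata concrete, here is a minimal sketch of how word timestamps and per-segment confidence might be stored and queried. The `Word` and `Segment` classes, the `first_mention` and `shaky` helpers, and the 0.5 threshold are all hypothetical. The `avg_logprob` field mirrors the per-segment value faster-whisper reports, and turning it into a 0..1 score with `exp` is one possible convention, not necessarily MediaFind's.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the file
    end: float    # seconds from the start of the file


@dataclass
class Segment:
    text: str
    avg_logprob: float  # mean token log-probability for the segment
    words: list = field(default_factory=list)

    @property
    def confidence(self):
        """Map the mean log-probability to a 0..1 score."""
        return math.exp(self.avg_logprob)


def first_mention(segments, term):
    """Return the start time (seconds) of the first word equal to `term`, or None."""
    needle = term.lower()
    for seg in segments:
        for word in seg.words:
            # Whisper words often carry a leading space and trailing punctuation.
            if word.text.strip(" .,!?").lower() == needle:
                return word.start
    return None


def shaky(segments, threshold=0.5):
    """Return segments whose confidence falls below a (hypothetical) threshold."""
    return [seg for seg in segments if seg.confidence < threshold]
```

A player could seek to `first_mention(segments, "budget")` to jump to the exact second, while `shaky(segments)` gives the UI the list of segments worth flagging.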
Why expose any of this
The easy product decision is to hide both dials and ship one model on one engine. We didn't, because the honest answer to "which model should I use?" really is "it depends" — on your audio, your hardware, and how much time you'll trade for accuracy. A one-time picker with plain-language trade-offs respects that without turning the app into a control panel: most people pick once and forget it, and the people with noisy multilingual footage or a Neural Engine to exploit get to make the call that's right for them. All of it stays keyless, on-device, and offline after the first download.
Accurate, well-timestamped transcripts are the raw material everything else reads from: search, Ask, chapters, and knowing who said what. Next: how those words become a searchable semantic index.