Which Whisper model should you pick? An ASR model-selection guide
MediaFind asks you once: tiny, base, small, medium, or large-v3? It's a real trade-off — accuracy against speed, memory and disk — and the right answer depends on your audio, your language, and your Mac. Here's how to choose without guessing.
Every transcript MediaFind makes starts with one decision: which speech-to-text model to run. We don't hide it behind a parameter count, and we don't silently pick an extreme for you. The first-run picker lays out five tiers in plain language — but if you want to understand why one fits your library better than another, this is the guide.
Two companion posts cover the surrounding machinery: how the on-device pipeline works end to end, and the two engines (faster-whisper everywhere, or whisper.cpp with Apple's GPU and Neural Engine) that execute whichever model you choose. This one is narrowly about picking the model.
The one thing to understand first
Model size is the accuracy knob, and it's not free. A bigger model transcribes messy, accented, noisy, and non-English audio far more reliably — but it downloads more, holds more in RAM, and takes longer per file. There is no universally correct pick. A podcaster batch-captioning clean studio English wants something very different from someone transcribing noisy multilingual field recordings on a 4-year-old laptop.
The five tiers, with honest numbers
These are the actual figures MediaFind's picker is built on — download is the on-disk footprint of the quantized weights; RAM is a rough working-set estimate while transcribing.
| Model | Params | Download | RAM | Speed | Accuracy |
|---|---|---|---|---|---|
| tiny | 39M | ~75 MB | ~0.5 GB | Very fast (real-time on any CPU) | Basic: frequent errors, weak on non-English |
| base | 74M | ~145 MB | ~0.7 GB | Fast | Fair: noticeably better than tiny |
| small (recommended) | 244M | ~480 MB | ~1.5 GB | Moderate | Good: reliable for most audio & languages |
| medium | 769M | ~1.5 GB | ~4 GB | Slower | Strong: handles hard audio & languages well |
| large-v3 | 1550M | ~3.1 GB | ~8 GB | Slow on CPU (GPU/Metal recommended) | Best: state-of-the-art accuracy |
The jumps aren't linear. The biggest accuracy gain for the least cost is the step from tiny/base up to small — which is exactly why small is the default recommendation. Going small → medium → large-v3 buys progressively more on the hard cases (accents, names, overlapping speech, other languages) while costing a lot more time and memory on the easy ones.
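To see the cost side of those jumps, here is a small sketch that computes how much extra download and working memory each step up the ladder adds. The figures are copied from the table above. The names `TIERS` and `step_costs` are illustrative and not part of MediaFind. It only measures cost, since the accuracy column is qualitative.

```python
# Figures copied from the tier table, ordered smallest to largest.
# Each entry: (name, params in millions, download in MB, RAM in GB).
TIERS = [
    ("tiny", 39, 75, 0.5),
    ("base", 74, 145, 0.7),
    ("small", 244, 480, 1.5),
    ("medium", 769, 1500, 4.0),
    ("large-v3", 1550, 3100, 8.0),
]


def step_costs(tiers=TIERS):
    """Return (from, to, extra download MB, extra RAM GB) for each adjacent step."""
    steps = []
    for (a, _, a_mb, a_ram), (b, _, b_mb, b_ram) in zip(tiers, tiers[1:]):
        steps.append((a, b, b_mb - a_mb, round(b_ram - a_ram, 1)))
    return steps


if __name__ == "__main__":
    for a, b, extra_mb, extra_ram in step_costs():
        print(f"{a:>6} -> {b:<8} +{extra_mb} MB download, +{extra_ram} GB RAM")
```

On these numbers, stepping from base to small adds about 335 MB and 0.8 GB, while each later step adds over a gigabyte of download and several gigabytes of RAM.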
Rule of thumb: start with small; hard or non-English audio rewards medium, or large-v3 if you have the GPU/Metal and memory to run it at a usable speed.
Match the model to your situation
The four factors that should drive your choice — in roughly this order:
1. How hard is the audio?
This matters most. Clean, close-mic, single-speaker English is forgiving — even tiny is usable, and small is genuinely good. The moment you add accents, crosstalk, background noise, music beds, or far-field mics, the smaller models start dropping and inventing words. That's where medium and large-v3 earn their cost.
2. What language is it?
Non-English audio is the single clearest reason to size up. The bigger models are the documented fix for other languages: tiny and base are explicitly weak here, while medium and large-v3 handle multilingual and code-switching content far better. If your library isn't English, start no lower than small, and reach for medium/large-v3 if the transcripts read poorly.
3. What hardware are you on?
large-v3 is slow on a bare CPU: it really wants NVIDIA CUDA or, on a Mac, the whisper.cpp engine driving the Metal GPU and Neural Engine, on a machine with around 16 GB of RAM so its roughly 8 GB working set has headroom. medium wants 8 GB+ and benefits from acceleration. small and below run comfortably on any modern machine. Picking a model your machine can't feed just means watching a progress bar crawl.
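As a rough sketch, the hardware guidance above can be written as a ceiling function: given installed RAM and whether an accelerator (CUDA or Metal) is available, it returns the largest tier worth trying. The function name and the hard cutoffs are assumptions made for illustration. The thresholds come from this section, but real machines will not fall neatly on either side of them, and this is not how MediaFind's picker is implemented.

```python
def largest_tier_for(ram_gb: float, has_accelerator: bool) -> str:
    """Largest tier likely to run at a usable speed (illustrative cutoffs)."""
    # large-v3: wants CUDA or Metal acceleration plus around 16 GB of RAM.
    if has_accelerator and ram_gb >= 16:
        return "large-v3"
    # medium: wants 8 GB or more; acceleration helps but is not required here.
    if ram_gb >= 8:
        return "medium"
    # small and below run comfortably on any modern machine.
    return "small"


if __name__ == "__main__":
    print(largest_tier_for(16, True))   # accelerated 16 GB machine
    print(largest_tier_for(8, False))   # CPU-only 8 GB machine
```

The result is a ceiling, not a recommendation: clean English audio may still be best served by a tier below it.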
4. How big is the library, and how soon do you need it?
Transcription is a one-time cost per file, but it adds up across a backlog. For thousands of clips you want searchable today, a faster model (small, or even base) gets the whole library indexed sooner; you can always re-transcribe the handful of files that matter most at higher quality later. For a small set of high-value recordings, just use the best model your hardware allows.
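The trade-off is simple arithmetic once you know how fast each model runs on your own machine. The sketch below compares transcribing everything with a slow model against a fast first pass plus a high-quality re-run on a subset. The speed figures in the example are made-up placeholders, not measured MediaFind numbers. Measure your own by timing one representative file per model.

```python
def backlog_hours(audio_hours: float, realtime_factor: float) -> float:
    """Wall-clock hours needed to transcribe `audio_hours` of audio.

    `realtime_factor` is hours of audio processed per wall-clock hour
    (10.0 means ten times faster than real time).
    """
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_hours / realtime_factor


def two_pass_hours(audio_hours: float, key_fraction: float,
                   fast_factor: float, slow_factor: float) -> float:
    """Fast pass over everything, then a slow pass over the key fraction."""
    fast_pass = backlog_hours(audio_hours, fast_factor)
    slow_pass = backlog_hours(audio_hours * key_fraction, slow_factor)
    return fast_pass + slow_pass


if __name__ == "__main__":
    # Placeholder inputs: 500 hours of audio, a fast model at 10x real time,
    # a slow model at 2x real time, and 5% of files worth re-running.
    print("slow model only:", backlog_hours(500, 2.0), "hours")
    print("two-pass plan:  ", two_pass_hours(500, 0.05, 10.0, 2.0), "hours")
```

With those placeholder speeds the two-pass plan finishes in about a quarter of the time, and the whole library becomes searchable after the first pass.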
Quick recipes
| If you're… | Start with |
|---|---|
| Not sure / general use | small — the recommended sweet spot |
| Captioning clean English podcasts or screen recordings | small (or base for a huge backlog) |
| Indexing thousands of clips fast on a modest laptop | base → re-run key files at small/medium |
| Transcribing accented, noisy, or multilingual audio | medium, or large-v3 if you can run it |
| Producing publication-grade transcripts on a capable Mac/GPU | large-v3 with the Metal / CUDA engine |
| On a low-RAM or older machine | tiny/base; avoid medium+ |
Why accuracy is worth more than the words
A better model doesn't just mean fewer typos. MediaFind keeps word-level timestamps and per-segment confidence from every transcription, and those feed everything downstream — semantic search, Ask, chapters, and speaker labels. Accurate timing is what makes "jump to the exact second this was said" precise instead of approximate, and reliable text is what keeps search and Ask from chasing words that were never spoken. When you size up, you're upgrading the foundation the whole app reads from.
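To make the "jump to the exact second" idea concrete, here is a minimal sketch of looking up a spoken word by its timestamp. The `Word` and `Segment` classes, the 0-to-1 confidence scale, and `first_mention` are hypothetical stand-ins invented for this example. MediaFind's internal data format is not described here and may look quite different.

```python
import string
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the file
    end: float


@dataclass
class Segment:
    confidence: float  # hypothetical per-segment score between 0 and 1
    words: list


def first_mention(segments, term, min_confidence=0.0):
    """Return the start time of the first word matching `term`, or None.

    Segments scoring below `min_confidence` are skipped, so a word the
    model was unsure about does not produce a jump target.
    """
    target = term.lower()
    for segment in segments:
        if segment.confidence < min_confidence:
            continue
        for word in segment.words:
            cleaned = word.text.strip().strip(string.punctuation).lower()
            if cleaned == target:
                return word.start
    return None


if __name__ == "__main__":
    transcript = [
        Segment(0.4, [Word(" Metal", 1.0, 1.3)]),
        Segment(0.9, [Word(" the", 12.0, 12.1), Word(" Metal,", 12.1, 12.6)]),
    ]
    print(first_mention(transcript, "metal"))
    print(first_mention(transcript, "metal", min_confidence=0.5))
```

The second call skips the low-confidence segment. That is the practical effect of a weaker model: more segments that search either has to distrust or risks matching wrongly.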
The bottom line
When in doubt, start at small — it's the best accuracy-per-cost on the ladder and runs anywhere. Size up to medium or large-v3 when your audio is hard or non-English and your hardware can keep up; size down to base or tiny when you're racing through a clean-English backlog. Because the choice is a one-time, reversible setting that stays entirely on your Mac, the cost of getting it slightly wrong is just re-selecting and moving on.
Picked a model? The next question is what runs it. Our companion post covers the two on-device engines — and why Apple Silicon owners can light up the GPU and Neural Engine for a free speed-up.
Try it on your own library
Pick a model, transcribe on your Mac, search by meaning. Free trial — no account, no API keys, nothing uploaded.