Which Whisper model should you pick?

Every transcript MediaFind makes starts with one decision: which speech-to-text model to run. We don't hide it behind a parameter count, and we don't silently pick an extreme for you. The first-run picker lays out five tiers in plain language — but if you want to understand why one fits your library better than another, this is the guide.

Two companion posts cover the surrounding machinery: how the on-device pipeline works end to end, and the two engines (faster-whisper everywhere, or whisper.cpp with Apple's GPU and Neural Engine) that execute whichever model you choose. This one is narrowly about picking the model.

The one thing to understand first

Model size is the accuracy knob, and it's not free. A bigger model transcribes messy, accented, noisy, and non-English audio far more reliably, but it downloads more, holds more in RAM, and takes longer per file. There is no universally correct pick. A podcaster batch-captioning clean studio English wants something very different from someone transcribing noisy multilingual field recordings on a four-year-old laptop.

It's reversible. You're not locked in. The weights are open and fetched once into a local cache, then run fully offline. Change tiers anytime in Settings; the next file uses the new one. So pick the one that looks right, and adjust if the transcripts or the speed disappoint.

The five tiers, with honest numbers

These are the actual figures MediaFind's picker is built on — download is the on-disk footprint of the quantized weights; RAM is a rough working-set estimate while transcribing.

Model	Params	Download	RAM	Speed	Accuracy
`tiny`	39M	~75 MB	~0.5 GB	Very fast (real-time on any CPU)	Basic — frequent errors, weak on non-English
`base`	74M	~145 MB	~0.7 GB	Fast	Fair — noticeably better than tiny
`small` (recommended)	244M	~480 MB	~1.5 GB	Moderate	Good — reliable for most audio & languages
`medium`	769M	~1.5 GB	~4 GB	Slower	Strong — handles hard audio & languages well
`large-v3`	1550M	~3.1 GB	~8 GB	Slow on CPU (GPU/Metal recommended)	Best — state-of-the-art accuracy

The jumps aren't linear. The biggest accuracy gain for the least cost is the step from tiny/base up to small — which is exactly why small is the default recommendation. Going small → medium → large-v3 buys progressively more on the hard cases (accents, names, overlapping speech, other languages) while costing a lot more time and memory on the easy ones.
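To make the cost side of each jump concrete, the table can be loaded into a small structure and the step-to-step ratios computed. This is only a sketch over the figures above: the `Tier` class, its field names, and the helper functions are illustrative and not MediaFind's internals. One observation falls out of the arithmetic: every download works out to roughly two bytes per parameter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    params_m: float     # parameters in millions (from the table)
    download_mb: float  # approximate download in MB (from the table)
    ram_gb: float       # approximate working set in GB (from the table)


# Values copied from the table above, smallest to largest.
TIERS = [
    Tier("tiny", 39, 75, 0.5),
    Tier("base", 74, 145, 0.7),
    Tier("small", 244, 480, 1.5),
    Tier("medium", 769, 1500, 4.0),
    Tier("large-v3", 1550, 3100, 8.0),
]


def bytes_per_param(tier: Tier) -> float:
    """MB divided by millions of parameters gives bytes per parameter."""
    return tier.download_mb / tier.params_m


def step_ratios(tiers):
    """Return (from, to, download ratio, RAM ratio) for each adjacent pair."""
    return [
        (a.name, b.name, b.download_mb / a.download_mb, b.ram_gb / a.ram_gb)
        for a, b in zip(tiers, tiers[1:])
    ]


for t in TIERS:
    print(f"{t.name:9s} ~{bytes_per_param(t):.2f} bytes per parameter")
for src, dst, dl, ram in step_ratios(TIERS):
    print(f"{src} -> {dst}: download x{dl:.1f}, RAM x{ram:.1f}")
```

The ratios only describe cost. The accuracy column is qualitative, so the claim that tiny/base to small is the best-value step rests on the table's descriptions rather than on anything this code can measure.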

As a rough decision flow: clean, single-speaker English rarely needs more than small; hard or non-English audio rewards medium, or large-v3 if you have the GPU/Metal and the memory to run it at a usable speed.
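That flow is small enough to write down as a function. This is a minimal sketch, assuming three yes/no answers as inputs; `pick_tier` and its parameter names are hypothetical and do not describe how MediaFind's picker is implemented.

```python
def pick_tier(hard_audio: bool, english: bool, can_run_large: bool) -> str:
    """Mirror the rough decision flow described above.

    hard_audio:    accents, crosstalk, noise, music beds, far-field mics
    english:       the recording is English
    can_run_large: GPU/Metal acceleration and enough memory are available
    """
    if english and not hard_audio:
        # Clean English rarely needs more than small.
        return "small"
    # Hard or non-English audio rewards a bigger model.
    return "large-v3" if can_run_large else "medium"


print(pick_tier(hard_audio=False, english=True, can_run_large=False))
print(pick_tier(hard_audio=True, english=False, can_run_large=True))
```

The function deliberately never returns tiny or base: in this guide those are speed choices for large backlogs, not answers to an audio-difficulty question.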

Match the model to your situation

Four factors should drive your choice, in roughly this order:

1. How hard is the audio?

This matters most. Clean, close-mic, single-speaker English is forgiving — even tiny is usable, and small is genuinely good. The moment you add accents, crosstalk, background noise, music beds, or far-field mics, the smaller models start dropping and inventing words. That's where medium and large-v3 earn their cost.

2. What language is it?

Non-English audio is the single clearest reason to size up. tiny and base are weak on other languages, while medium and large-v3 handle multilingual and code-switching content far better. If your library isn't English, start at small at the very least, and reach for medium or large-v3 if the transcripts read poorly.

3. What hardware are you on?

large-v3 is slow on a bare CPU. It really wants NVIDIA CUDA or, on a Mac, the whisper.cpp engine driving the Metal GPU and Neural Engine, plus around 16 GB of system RAM so that its roughly 8 GB working set has headroom. medium wants 8 GB or more and benefits from acceleration. small and below run comfortably on any modern machine. Picking a model your machine can't feed just means watching a progress bar crawl.
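The hardware guidance above amounts to a ceiling on which tier is worth selecting. A minimal sketch, assuming the thresholds quoted in this section (about 16 GB plus acceleration for large-v3, 8 GB or more for medium); `hardware_ceiling` and `clamp_to_hardware` are hypothetical helpers, and treating "slow on a bare CPU" as a hard no is a simplification.

```python
# Tiers ordered from smallest to largest.
ORDER = ["tiny", "base", "small", "medium", "large-v3"]


def hardware_ceiling(ram_gb: float, has_accel: bool) -> str:
    """Largest tier this guide would suggest for the given machine.

    has_accel means CUDA, or Metal / Neural Engine on a Mac.
    """
    if has_accel and ram_gb >= 16:
        return "large-v3"
    if ram_gb >= 8:
        return "medium"
    return "small"


def clamp_to_hardware(wanted: str, ram_gb: float, has_accel: bool) -> str:
    """Return the wanted tier, or the ceiling if the wanted tier exceeds it."""
    ceiling = hardware_ceiling(ram_gb, has_accel)
    return min(wanted, ceiling, key=ORDER.index)


print(clamp_to_hardware("large-v3", ram_gb=8, has_accel=False))
print(clamp_to_hardware("base", ram_gb=32, has_accel=True))
```

Nothing stops large-v3 from running without acceleration; the ceiling only encodes where this guide says the wait stops being worth it.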

4. How big is the library, and how soon do you need it?

Transcription is a one-time cost per file, but it adds up across a backlog. For thousands of clips you want searchable today, a faster model (small, or even base) gets the whole library indexed sooner; you can always re-transcribe the handful of files that matter most at higher quality later. For a small set of high-value recordings, just use the best model your hardware allows.
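The two-pass strategy is easy to reason about with a little arithmetic. The sketch below assumes you have measured a speed factor for each model on your own machine (seconds of audio processed per second of wall-clock time). The numbers in the example are hypothetical placeholders, not MediaFind benchmarks.

```python
def backlog_hours(audio_hours: float, speed_factor: float) -> float:
    """Wall-clock hours needed to transcribe audio_hours of material.

    speed_factor is audio seconds processed per wall-clock second,
    so 8.0 means eight times faster than real time.
    """
    if speed_factor <= 0:
        raise ValueError("speed_factor must be positive")
    return audio_hours / speed_factor


def two_pass_hours(audio_hours: float, key_fraction: float,
                   fast_factor: float, slow_factor: float) -> float:
    """Index everything with a fast model, then redo a fraction with a slow one."""
    if not 0.0 <= key_fraction <= 1.0:
        raise ValueError("key_fraction must be between 0 and 1")
    first_pass = backlog_hours(audio_hours, fast_factor)
    second_pass = backlog_hours(audio_hours * key_fraction, slow_factor)
    return first_pass + second_pass


# Hypothetical example: 200 hours of clips, 5% of them worth a second pass.
library_hours = 200.0
both = two_pass_hours(library_hours, 0.05, fast_factor=8.0, slow_factor=1.0)
slow_only = backlog_hours(library_hours, 1.0)
print(f"two passes: {both:.1f} h, slow model only: {slow_only:.1f} h")
```

To get a real speed factor, time one representative file per model and divide its duration by the elapsed time.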

Quick recipes

If you're…	Start with
Not sure / general use	`small` — the recommended sweet spot
Captioning clean English podcasts or screen recordings	`small` (or `base` for a huge backlog)
Indexing thousands of clips fast on a modest laptop	`base` → re-run key files at `small`/`medium`
Transcribing accented, noisy, or multilingual audio	`medium`, or `large-v3` if you can run it
Producing publication-grade transcripts on a capable Mac/GPU	`large-v3` with the Metal / CUDA engine
On a low-RAM or older machine	`tiny`/`base`; avoid `medium`+

Why accuracy is worth more than the words

A better model doesn't just mean fewer typos. MediaFind keeps word-level timestamps and per-segment confidence from every transcription, and those feed everything downstream — semantic search, Ask, chapters, and speaker labels. Accurate timing is what makes "jump to the exact second this was said" precise instead of approximate, and reliable text is what keeps search and Ask from chasing words that were never spoken. When you size up, you're upgrading the foundation the whole app reads from.
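To see why word-level timing matters, here is a minimal sketch of a "jump to where this was said" lookup over a list of timed words. The `Word` record and `find_phrase` function are hypothetical illustrations, not MediaFind's data model, and real search in the app is semantic rather than the exact match shown here.

```python
import string
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Word:
    text: str
    start: float  # seconds from the start of the file
    end: float


def _normalize(token: str) -> str:
    """Lowercase and strip surrounding punctuation so 'Model.' matches 'model'."""
    return token.strip().strip(string.punctuation).lower()


def find_phrase(words: Sequence[Word], phrase: str) -> Optional[float]:
    """Return the start time of the first exact occurrence of phrase, else None."""
    target = [t for t in (_normalize(p) for p in phrase.split()) if t]
    if not target:
        return None
    tokens = [_normalize(w.text) for w in words]
    n = len(target)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == target:
            return words[i].start
    return None


transcript = [
    Word("Pick", 0.0, 0.3),
    Word("the", 0.3, 0.4),
    Word("small", 0.4, 0.8),
    Word("model.", 0.8, 1.2),
]
print(find_phrase(transcript, "small model"))
```

If a weaker model mishears "small" as a different word, the lookup returns None even though the phrase was spoken, which is the failure the paragraph above describes.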

The bottom line

When in doubt, start at small: it's the best accuracy-per-cost on the ladder and runs anywhere. Size up to medium or large-v3 when your audio is hard or non-English and your hardware can keep up; size down to base or tiny when you're racing through a clean-English backlog. Because the setting is reversible and everything stays on your Mac, the cost of getting it slightly wrong is just re-selecting and moving on.

Picked a model? The next question is what runs it. Our companion post covers the two on-device engines — and why Apple Silicon owners can light up the GPU and Neural Engine for a free speed-up.

