OCR search: find text on screen across your entire video library

There are two kinds of text in a video: the words people say, and the words that appear on screen. Transcription handles the first. OCR handles the second — and for a surprisingly large category of footage, it's the only way to find the clip you're looking for.

Think about screen recordings of a product demo, interview lower-thirds naming a guest, a conference talk with slides advancing in the background, footage of a storefront, a price tag in a product review, subtitles burned into an export. None of that is in the transcript. All of it is in the pixels.

The problem with searching video text the hard way

Most people's workflow for finding text-on-screen is: remember roughly when it was, open the file, scrub. If you have one clip and a rough timestamp, that's annoying but survivable. If you have a library of hundreds of files and no memory of which one, it's genuinely broken.

Cloud tools can help: upload your footage, run it through a vision API, and get back searchable text. But that means your footage leaves your machine, you pay per minute, and you wait for the pipeline to finish. For sensitive material, that's a non-starter. For large libraries, the cost adds up fast.

How MediaFind does it at index time

When MediaFind indexes a file, it samples frames at a regular interval and runs OCR on each one. The extracted text is stored in your local index alongside the timestamp of the frame it came from. By the time you search, the heavy lifting is already done — a query takes milliseconds regardless of how long the source clip is.
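To make the idea concrete, here is a minimal sketch of an index that stores OCR text next to the timestamp of the frame it came from, then answers a query with a plain lookup. It is an illustration only, not MediaFind's actual schema: the table name `ocr_text`, the functions `build_index` and `search`, and the sample rows are all hypothetical, and the OCR step itself is assumed to have already produced the text.

```python
import sqlite3

def build_index(frames):
    """Store (clip, seconds, text) rows for frames that were already OCR'd.

    Hypothetical schema for illustration; not MediaFind's real database.
    """
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE ocr_text (clip TEXT, seconds REAL, text TEXT)")
    db.executemany("INSERT INTO ocr_text VALUES (?, ?, ?)", frames)
    db.commit()
    return db

def search(db, query):
    """Return (clip, seconds) for each sampled frame whose text contains the query."""
    rows = db.execute(
        "SELECT clip, seconds FROM ocr_text "
        "WHERE instr(lower(text), lower(?)) > 0 "
        "ORDER BY clip, seconds",
        (query,),
    )
    return rows.fetchall()

# Made-up sample data standing in for OCR output.
db = build_index([
    ("interview.mov", 12.0, "Dr. Jane Smith, Chief Medical Officer"),
    ("keynote.mp4", 95.0, "Q3 Roadmap"),
    ("keynote.mp4", 98.0, "Q3 Roadmap"),
])
print(search(db, "jane smith"))
```

The point of the sketch is the shape of the work: the expensive part (OCR) happens once when rows are written, and a search only reads text that is already stored.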

The OCR runs entirely on your machine using Apple's Vision framework on macOS. No API key. No upload. No per-query cost. The frames never leave your disk.

Sampling rather than processing every frame is intentional. On-screen text, especially the kind worth searching for, is usually displayed for several seconds, so a frame sampled every few seconds captures it reliably. Running OCR on every frame of 30 fps footage would instead mean about 30 OCR passes per second of video, multiplying index time for little gain.
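To put rough numbers on that trade-off, the sketch below compares a fixed-interval sampler with per-frame processing. The 3-second interval and 30 fps frame rate are assumptions chosen for the example, not MediaFind's actual settings, and the function names are hypothetical.

```python
import math

def sample_timestamps(duration_s, interval_s):
    """Timestamps (seconds) a fixed-interval sampler would OCR, starting at 0."""
    if duration_s < 0 or interval_s <= 0:
        raise ValueError("duration must be >= 0 and interval must be > 0")
    count = math.ceil(duration_s / interval_s)
    return [i * interval_s for i in range(count)]

def every_frame_count(duration_s, fps):
    """Number of frames OCR would touch if every frame were processed."""
    return math.ceil(duration_s * fps)

def is_captured(start_s, end_s, timestamps):
    """True if at least one sampled frame falls while the text is on screen."""
    return any(start_s <= t <= end_s for t in timestamps)

# A 10-minute clip, sampled every 3 seconds (assumed) vs. every frame at 30 fps.
stamps = sample_timestamps(600, 3)
print(len(stamps), every_frame_count(600, 30))
# A lower-third shown from 0:41 to 0:46 is caught; a half-second flash may not be.
print(is_captured(41, 46, stamps), is_captured(43.2, 43.7, stamps))
```

Under these assumptions, any text that stays on screen for at least one full sampling interval is guaranteed to land on a sampled frame, which is why the interval can be measured in seconds rather than frames.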

Figure: the indexing pipeline with the OCR step highlighted. OCR runs at index time, entirely on your Mac.

What OCR search actually finds

The range is wider than you might expect:

Lower-thirds — "Dr. Jane Smith, Chief Medical Officer" in an interview. Search her name, jump straight to her introduction.
Presentation slides — a conference talk, a webinar, a screen recording. Each slide title is searchable.
Burned-in subtitles — exports from tools that bake captions into the video. If the original audio is unclear or in another language, the subtitles may be your only hook.
Product labels and prices — product review footage, unboxing videos, retail walkthroughs.
Street signs and storefronts — travel footage, location scouting, documentary B-roll.
Scoreboards and stats — sports footage, live event recordings.
Ticker text — news broadcasts, financial screen recordings.
UI text in screen recordings — error messages, menu items, button labels.

Each of these is a case where neither keyword search over a transcript nor CLIP visual search would reliably surface the clip. OCR is its own distinct modality — and it fills a gap the others can't.

Using it

From your perspective, OCR search is invisible. You type a word into MediaFind's search box, and if that word appears on screen in any of your indexed clips, those clips surface in the results. You'll see a timestamp next to each result, and clicking it takes you directly to the frame where the text appears.

On the play page, the On-screen text panel shows every piece of text MediaFind found in that file, with timestamps. It's useful for skimming the text content of a clip before you decide whether to watch it.
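Because a slide or caption usually stays up across several sampled frames, a timestamped list like this tends to contain runs of identical text. One way such a list could be condensed is sketched below. This is a guess at a reasonable presentation, not a description of how the panel is built: `collapse_detections` and the `max_gap_s` parameter are hypothetical.

```python
def collapse_detections(detections, max_gap_s):
    """Merge consecutive samples with identical text into (start, end, text) spans.

    `detections` is a list of (seconds, text) pairs sorted by time. Two samples
    join the same span only if their text matches and they are at most
    `max_gap_s` seconds apart.
    """
    spans = []
    for seconds, text in detections:
        if spans and spans[-1][2] == text and seconds - spans[-1][1] <= max_gap_s:
            start, _, _ = spans[-1]
            spans[-1] = (start, seconds, text)  # extend the current span
        else:
            spans.append((seconds, seconds, text))  # start a new span
    return spans

# Made-up detections from a clip sampled every 3 seconds.
detections = [
    (0.0, "Welcome"),
    (3.0, "Welcome"),
    (6.0, "Agenda"),
    (9.0, "Agenda"),
    (30.0, "Agenda"),
]
for start, end, text in collapse_detections(detections, max_gap_s=3.0):
    print(f"{start:>5.1f}s to {end:>5.1f}s  {text}")
```

The gap check keeps a slide that reappears later in the clip as its own entry, so each appearance gets its own timestamp to jump to.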

OCR results participate in the same unified ranking as transcript matches, visual matches, and entity matches. If a clip matches on multiple channels — someone says "Acme Corp" and the logo appears on screen — the score reflects both signals.

What it doesn't do well

OCR is a strong modality but not a perfect one. A few things to keep in mind:

Handwriting is hit-or-miss. Printed text on a clean background is reliable. Cursive handwriting, especially at low resolution, often isn't.

Low contrast and motion blur reduce accuracy. Text that's superimposed on a busy background, or that appears while the camera is panning, may be missed or garbled. The frame sampler tries to pick relatively stable frames, but it can't guarantee every frame it picks is blur-free.

Very small text — think fine print at the bottom of a screen — may not be legible at the resolution MediaFind samples at. If it matters, zoom in while shooting.

Non-Latin scripts work for many of the languages Apple's Vision framework supports, but coverage varies. Before relying on it for right-to-left scripts or CJK characters, test on a sample clip first.

None of these are reasons to avoid OCR search — they're just reasons to also use transcript and visual search, and let the ranking combine the signals.

The on-device guarantee

It's worth being explicit about what "on-device OCR" means for your footage. MediaFind never sends your frames to Apple's servers — the Vision framework runs locally on your Mac's CPU and Neural Engine. There's no API key to configure, no account to log in to, and no usage meter ticking. The OCR runs as part of indexing, the results live in your local database, and none of it is observable from outside your machine.

For footage that contains sensitive information — financial data, medical records, internal product roadmaps, client materials — that's not a minor detail. It's the whole point.

How it fits with the other search channels

MediaFind searches across several independent channels at once: spoken words (ASR transcript), meaning (semantic embeddings), visuals (CLIP), entities (people, places, organisations), faces, speakers, logos, actions, songs, and on-screen text (OCR). Each channel contributes a score, and the results you see are ranked by the combined signal.
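As a rough picture of what "ranked by the combined signal" can mean, the sketch below fuses per-channel scores with a weighted sum. The real ranking formula isn't described here, so treat this as one simple possibility: the channel names, weights, scores, and the functions `combine_scores` and `rank` are all assumptions made for illustration.

```python
def combine_scores(channel_scores, weights=None):
    """Weighted sum of per-channel scores; channels without a weight count as 1.0."""
    weights = weights or {}
    return sum(
        score * weights.get(channel, 1.0)
        for channel, score in channel_scores.items()
    )

def rank(clips, weights=None):
    """Return clip names sorted by combined score, highest first."""
    return sorted(
        clips,
        key=lambda name: combine_scores(clips[name], weights),
        reverse=True,
    )

# Made-up scores for the query "Acme Corp".
clips = {
    "keynote.mp4": {"ocr": 0.75, "transcript": 0.5},  # said aloud and shown on screen
    "interview.mov": {"transcript": 0.75},            # said aloud only
    "broll.mov": {"visual": 0.25},                    # weak visual match only
}
print(rank(clips))
```

With equal weights, a clip that matches on two channels outranks a clip that matches equally well on only one, which is the behavior the multi-channel ranking is meant to deliver.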

OCR is the channel that covers what's literally written on screen — the one that no amount of audio analysis or visual similarity search can substitute for. Add it to the mix and you have one fewer class of clip that slips through the cracks.

Figure: example per-channel scores for a single query (scores are illustrative). OCR dominates here because the text appears verbatim on multiple slides, a signal no transcript or visual search could match. All channels combine into the final ranking.

On-screen text is just one of the channels MediaFind searches simultaneously. For the visual side of things — finding clips by what they look like, not what's written — see how CLIP and semantic embeddings work together.

Find what's written in your footage

Index your library once. Search on-screen text, transcripts, faces, and more — all on your Mac.

