Finding the music: on-device song detection & same-track clustering
“Show me the parts with music.” “Which of these clips use the same track?” Both are easy to ask and surprisingly hard to answer without uploading your audio to someone else's fingerprint service. Here's how MediaFind answers them with pure on-device DSP — no API key, no model download, nothing leaving your Mac.
MediaFind already turns speech into a searchable transcript. But a lot of what's interesting in a video library isn't speech — it's the music. The intro sting on every episode, the licensed track under a montage, the song playing in the background of a birthday video. Two questions come up constantly:
“Find the stretches that are music, not talking.” · “Which clips share the same track?”
The obvious way to build this is to call a cloud fingerprint API — the kind that names a song from a few seconds of audio. We deliberately didn't. Naming a track means shipping every musical second of your library off to a third party, and it bolts a network dependency onto a product whose whole promise is that it runs offline. So we scoped the feature to exactly what we can do on-device and named it honestly: detect music, and cluster same-track clips — without ever claiming to know the song's title.
Two phases, one decode
The work splits cleanly into two passes over the same waveform — the 16 kHz mono signal the transcription stage already produced, so music understanding adds zero extra decode cost.
[Figure: Phase 1 writes music spans to the audio_events search channel; Phase 2 fingerprints those spans and unions clips that share a track.]
Phase 1: music vs. speech, without a model
The first job is a discriminator: is this stretch of audio music or speech? A learned audio tagger (YAMNet, PANNs) would do this and emit the full AudioSet vocabulary — but that's another heavy model to download, bundle, and watch silently die in a frozen app. Instead we lean on two robust, decades-old envelope features that separate music from speech with no training at all:
| Cue | What it measures | Reads as… |
|---|---|---|
| 4 Hz modulation energy | How strongly the amplitude pulses near the ~4 Hz syllabic rate of speech | High ⇒ speech |
| Low-energy fraction | The share of frames sitting well below the window's mean energy (the pauses between words) | High ⇒ speech |
Speech is bursty — syllables, then gaps. Sustained music is continuous and isn't modulated at 4 Hz. So a window is labelled "Music" only when both cues look music-like and it isn't silence. That “both” is deliberate: it makes the detector precision-biased. We'd rather miss some music than mislabel someone talking as a song.
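The two cues and the "both must agree" rule can be sketched in a few lines of NumPy. This is a minimal illustration, not MediaFind's actual code: the frame sizes, the 2 to 8 Hz band, and the three thresholds (`max_mod`, `max_low`, `silence_rms`) are all assumed values chosen for the example.

```python
import numpy as np

SR = 16_000   # sample rate of the shared waveform (from the article)
FRAME = 400   # 25 ms analysis frame (assumed)
HOP = 160     # 10 ms hop, so the envelope is sampled at 100 Hz (assumed)

def frame_energy(x, frame=FRAME, hop=HOP):
    """Short-time RMS envelope; expects len(x) >= frame."""
    n = 1 + (len(x) - frame) // hop
    idx = np.arange(frame)[None, :] + hop * np.arange(n)[:, None]
    return np.sqrt(np.mean(x[idx] ** 2, axis=1))

def modulation_4hz(env, env_rate=SR / HOP):
    """Power of the mean-normalised envelope in a 2-8 Hz band around 4 Hz."""
    rel = env / env.mean() - 1.0
    spec = np.abs(np.fft.rfft(rel)) ** 2
    freqs = np.fft.rfftfreq(len(rel), d=1.0 / env_rate)
    band = (freqs >= 2.0) & (freqs <= 8.0)
    return float(2.0 * spec[band].sum() / len(rel) ** 2)

def low_energy_fraction(env, ratio=0.5):
    """Share of frames whose energy sits below ratio * mean energy."""
    return float(np.mean(env < ratio * env.mean()))

def is_music(x, max_mod=0.05, max_low=0.2, silence_rms=1e-3):
    """Precision-biased rule: not silent AND both cues look music-like."""
    env = frame_energy(x)
    if env.mean() < silence_rms:          # silence is never music
        return False
    return modulation_4hz(env) < max_mod and low_energy_fraction(env) < max_low
```

A steady tone passes both checks, while the same tone gated on and off at 4 Hz fails both, which is the behaviour the table describes.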
The waveform is scanned in overlapping windows (3 s wide, hopping 1.5 s), and adjacent music windows are merged into spans. Each detection comes out as a time-anchored tuple:
(start, end, "Music", score) # e.g. (42.0, 71.5, "Music", 0.78)
Those tuples are written to the audio_events channel — the same non-speech audio-event index the search registry already knew how to query. So “music” becomes searchable the moment indexing finishes, with results that link to the exact second the music starts, exactly like a transcript hit.
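The scan-and-merge step might look roughly like the sketch below. The function name `music_spans`, the pluggable `score_window` callable, the default `min_score`, and the choice to average window scores into a span score are all assumptions made for illustration; only the 3 s window, the 1.5 s hop, and the tuple shape come from the text above.

```python
from typing import Callable, List, Sequence, Tuple

Event = Tuple[float, float, str, float]   # (start, end, label, score)

def music_spans(
    samples: Sequence[float],
    sr: int,
    score_window: Callable[[Sequence[float]], float],  # hypothetical scorer
    win_sec: float = 3.0,
    hop_sec: float = 1.5,
    min_score: float = 0.5,                             # assumed default
) -> List[Event]:
    win, hop = int(win_sec * sr), int(hop_sec * sr)
    spans = []  # each entry: [start_sec, end_sec, [window scores]]
    for start in range(0, max(len(samples) - win, 0) + 1, hop):
        score = score_window(samples[start:start + win])
        if score < min_score:
            continue
        t0, t1 = start / sr, (start + win) / sr
        if spans and t0 <= spans[-1][1]:
            # Overlaps or touches the previous music window: extend that span.
            spans[-1][1] = t1
            spans[-1][2].append(score)
        else:
            spans.append([t0, t1, [score]])
    # Span score here is the mean of its window scores (an assumption).
    return [(t0, t1, "Music", sum(s) / len(s)) for t0, t1, s in spans]
```

Because windows overlap by half, two consecutive music windows always touch, so a continuous stretch of music collapses into a single span with one start time to link to.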
"Music" — and the AudioEventBackend seam stays open so a learned model can be dropped in later to emit the full vocabulary without touching anything downstream.Phase 2: clustering same-track clips with a chromagram
Detecting music is half the ask. The other half — “which clips use the same track?” — is a matching problem, and it's where naïve approaches fail. You can't compare raw audio: the same song re-encoded at a different bitrate, in a different container, behind different dialogue produces a totally different waveform. You need a representation that survives all that and still says “same song.”
That representation is the chromagram. It folds the spectrum down onto the 12 pitch classes of the musical octave (C, C♯, D, …) — throwing away timbre, loudness, and which octave a note lands in, keeping only the harmonic content. Two recordings of the same track produce nearly the same chroma pattern across bitrates and containers, which is precisely the invariance same-track matching needs.
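To make the folding concrete, here is one simple way to compute a chromagram with plain NumPy: take a short-time FFT, map each frequency bin to its nearest MIDI note, and sum power by note number modulo 12. The FFT size, hop, and frequency range are assumed values, and production chroma extractors typically add tuning estimation and smoother bin weighting that this sketch leaves out.

```python
import numpy as np

def chromagram(x, sr=16_000, n_fft=4096, hop=2048, fmin=55.0, fmax=4000.0):
    """Return an (n_frames, 12) array of L2-normalised pitch-class energy.

    Expects a 1-D NumPy array with len(x) >= n_fft.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    power = np.abs(np.fft.rfft(x[idx] * window, axis=1)) ** 2

    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    keep = (freqs >= fmin) & (freqs <= fmax)
    # MIDI note of each kept bin, folded to a pitch class (0 = C, 9 = A).
    midi = 69 + 12 * np.log2(freqs[keep] / 440.0)
    pitch_class = np.round(midi).astype(int) % 12

    kept_power = power[:, keep]
    chroma = np.zeros((n_frames, 12))
    for pc in range(12):
        chroma[:, pc] = kept_power[:, pitch_class == pc].sum(axis=1)

    # Per-frame normalisation discards loudness.
    norm = np.linalg.norm(chroma, axis=1, keepdims=True)
    return chroma / np.maximum(norm, 1e-12)
```

An A at 440 Hz and an A one octave up at 880 Hz both land in pitch class 9, which is the octave invariance described above.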
The pipeline is:
- Fingerprint. Take only the samples Phase 1 marked as music, compute the chromagram, and average it into 32 time chunks of 12 pitch classes: a small, fixed-size vector per clip. Chunk count sets the temporal resolution.
- Compare. Two clips are the “same track” when their fingerprints' cosine similarity clears 0.92 (MUSIC_FP_SIM). High by design: we want confident matches, not loose vibes.
- Cluster. A union-find pass merges every pair above threshold into connected components, so if A matches D and D matches G, all three land in one group: no quadratic re-comparison, no central “query song” needed.
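The three steps above can be sketched as follows, assuming each clip already has an (n_frames, 12) chromagram. The function names are hypothetical, and for simplicity this version checks every pair of fingerprints once before handing matches to union-find; it is an illustration of the technique, not the shipped implementation.

```python
import numpy as np

def fingerprint(chroma, n_chunks=32):
    """Average an (n_frames, 12) chromagram into n_chunks x 12, flattened."""
    chunks = np.array_split(chroma, n_chunks, axis=0)
    rows = [c.mean(axis=0) if len(c) else np.zeros(chroma.shape[1]) for c in chunks]
    return np.array(rows).ravel()

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def cluster(fps, threshold=0.92):
    """Group clip indices whose fingerprints are linked by >= threshold matches."""
    parent = list(range(len(fps)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(len(fps)):
        for j in range(i + 1, len(fps)):
            if cosine(fps[i], fps[j]) >= threshold:
                parent[find(j)] = find(i)   # union the two components

    groups = {}
    for i in range(len(fps)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Matching is transitive by construction: two clips end up in the same group whenever a chain of above-threshold pairs connects them, even if the two ends of the chain were never a direct match.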
Every threshold here is env-overridable (MUSIC_WIN_SEC, MUSIC_MIN_SCORE, MUSIC_FP_CHUNKS, MUSIC_FP_SIM), so the precision/recall trade-off is a tuning knob, not a recompile.
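Reading those overrides could be as simple as the sketch below. The helper names and the fall-back-on-bad-input behaviour are assumptions; the defaults for MUSIC_WIN_SEC, MUSIC_FP_CHUNKS, and MUSIC_FP_SIM follow the values quoted in this post, while the MUSIC_MIN_SCORE default of 0.5 is a placeholder because the post does not state it.

```python
import os

def env_float(name, default, environ=None):
    """Read a float from the environment, falling back to default."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None:
        return default
    try:
        return float(raw)
    except ValueError:
        return default   # ignore malformed overrides rather than crash

def load_music_config(environ=None):
    return {
        "win_sec": env_float("MUSIC_WIN_SEC", 3.0, environ),
        "min_score": env_float("MUSIC_MIN_SCORE", 0.5, environ),  # assumed default
        "fp_chunks": int(env_float("MUSIC_FP_CHUNKS", 32, environ)),
        "fp_sim": env_float("MUSIC_FP_SIM", 0.92, environ),
    }
```

Raising MUSIC_FP_SIM tightens matching toward precision, and lowering it trades some false merges for better recall on degraded copies.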
Best-effort, like every other modality
Music understanding follows the same rule as the rest of the pipeline: it never aborts indexing. If detection or fingerprinting throws on one weird file, that file degrades to “no music events” and the indexer moves on. A song feature that could wedge your whole library on a single corrupt clip isn't worth shipping.
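In code, that rule amounts to a guard around the per-file call. The wrapper below is a hypothetical sketch: `safe_music_events` and the `detect` callable are illustrative names, not MediaFind's API.

```python
import logging

log = logging.getLogger("music")

def safe_music_events(path, detect):
    """Run detect(path); on any failure, degrade to 'no music events'."""
    try:
        return list(detect(path))
    except Exception:
        # Record the failure, then let the indexer carry on with the next file.
        log.warning("music detection failed for %s; continuing", path, exc_info=True)
        return []
```

Catching `Exception` broadly is usually discouraged, but here it is the point: one unreadable clip costs its own music events and nothing else.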
And because the entire path is NumPy over a local waveform, the privacy story is unchanged — you can confirm it yourself:
$ mediafind audit
✓ core path opened 0 external sockets.
Detecting music reuses the dormant audio_events channel; clustering reuses the same union-find approach the face pipeline uses for people. That's the pattern across MediaFind — a new capability is usually an old seam, filled in keyless-first. Next time we'll look at the opposite end of the recall problem: finding every clip that mentions a specific name.