On-device song detection and same-track clustering

MediaFind already turns speech into a searchable transcript. But a lot of what's interesting in a video library isn't speech — it's the music. The intro sting on every episode, the licensed track under a montage, the song playing in the background of a birthday video. Two questions come up constantly:

“Find the stretches that are music, not talking.”
“Which clips share the same track?”

The obvious way to build this is to call a cloud fingerprint API — the kind that names a song from a few seconds of audio. We deliberately didn't. Naming a track means shipping every musical second of your library off to a third party, and it bolts a network dependency onto a product whose whole promise is that it runs offline. So we scoped the feature to exactly what we can do on-device and named it honestly: detect music, and cluster same-track clips — without ever claiming to know the song's title.

What we cut, on purpose. No AcoustID, no Shazam-style lookup, no “🎵 Now playing: …”. Title resolution is the one piece that requires a remote database, so it's the one piece we left out. Everything below is keyless, model-free NumPy DSP over audio the transcription pipeline already decoded.

Two phases, one decode

The work splits cleanly into two passes over the same waveform — the 16 kHz mono signal the transcription stage already produced, so music understanding adds zero extra decode cost.

Both phases read the already-decoded waveform. Phase 1 labels musical spans into the audio_events search channel; Phase 2 fingerprints those spans and unions clips that share a track.

Phase 1: music vs. speech, without a model

The first job is a discriminator: is this stretch of audio music or speech? A learned audio tagger (YAMNet, PANNs) would do this and emit the full AudioSet vocabulary, but that's another heavy model to download, bundle, and watch silently die in a frozen (packaged) app. Instead we lean on two robust, decades-old envelope features that separate music from speech with no training at all:

Cue	What it measures	Reads as…
4 Hz modulation energy	How strongly the amplitude pulses near the ~4 Hz syllabic rate of speech	High ⇒ speech
Low-energy fraction	The share of frames sitting well below the window's mean energy (the pauses between words)	High ⇒ speech

Speech is bursty — syllables, then gaps. Sustained music is continuous and isn't modulated at 4 Hz. So a window is labelled "Music" only when both cues look music-like and it isn't silence. That “both” is deliberate: it makes the detector precision-biased. We'd rather miss some music than mislabel someone talking as a song.
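A minimal sketch of the two cues and the “both must look music-like” rule. The function names, the 20 ms frame size, the 2–8 Hz band around 4 Hz, and all three thresholds are illustrative assumptions, not MediaFind's shipped values:

```python
import numpy as np

SR = 16_000  # sample rate of the already-decoded mono waveform


def frame_energy(x, sr=SR, frame_sec=0.02):
    """Mean-square energy of consecutive, non-overlapping frames."""
    n = int(sr * frame_sec)
    usable = (len(x) // n) * n
    frames = np.asarray(x[:usable], dtype=np.float64).reshape(-1, n)
    return (frames ** 2).mean(axis=1)


def modulation_4hz(energy, frame_rate=50.0, band=(2.0, 8.0)):
    """Variance of the relative energy envelope that falls in a band around 4 Hz."""
    mean = energy.mean()
    if mean <= 0:
        return 0.0
    rel = energy / mean - 1.0                      # zero-mean, level-independent
    spec = np.abs(np.fft.rfft(rel)) ** 2
    freqs = np.fft.rfftfreq(len(rel), d=1.0 / frame_rate)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    return float(2.0 * spec[in_band].sum() / len(rel) ** 2)


def low_energy_fraction(energy, ratio=0.5):
    """Share of frames whose energy is below `ratio` times the window mean."""
    return float((energy < ratio * energy.mean()).mean())


def is_music(x, sr=SR, frame_sec=0.02,
             max_mod=0.1, max_lef=0.3, silence_rms=1e-3):
    """True only if the window is not silent AND both cues look music-like.

    The three thresholds are hypothetical placeholders.
    """
    energy = frame_energy(x, sr, frame_sec)
    if energy.size == 0 or np.sqrt(energy.mean()) < silence_rms:
        return False
    mod = modulation_4hz(energy, frame_rate=1.0 / frame_sec)
    lef = low_energy_fraction(energy)
    return bool(mod < max_mod and lef < max_lef)
```

Requiring both cues to pass is what biases the sketch toward precision: a window that fails either test is simply not labelled.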

The waveform is scanned in overlapping windows (3 s wide, hopping 1.5 s), and adjacent music windows are merged into spans. Each detection comes out as a time-anchored tuple:

(start, end, "Music", score)   # e.g. (42.0, 71.5, "Music", 0.78)

Those tuples are written to the audio_events channel — the same non-speech audio-event index the search registry already knew how to query. So “music” becomes searchable the moment indexing finishes, with results that link to the exact second the music starts, exactly like a transcript hit.
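The scan-and-merge step could look roughly like this. It assumes each window has already been given a score, that a window counts as music when its score reaches a `min_score` cutoff, and that a span's score is the mean of its windows; the function name and all of those choices are assumptions for illustration:

```python
def merge_music_windows(window_scores, win_sec=3.0, hop_sec=1.5, min_score=0.5):
    """Merge overlapping or touching music windows into (start, end, "Music", score).

    window_scores[i] is the score of the window that starts at i * hop_sec.
    min_score=0.5 is a placeholder, not a documented default.
    """
    spans = []  # each entry: [start, end, scores_of_merged_windows]
    for i, score in enumerate(window_scores):
        if score < min_score:
            continue
        start = i * hop_sec
        end = start + win_sec
        if spans and start <= spans[-1][1]:
            # This window overlaps (or touches) the previous span: extend it.
            spans[-1][1] = end
            spans[-1][2].append(score)
        else:
            spans.append([start, end, [score]])
    return [(s, e, "Music", sum(sc) / len(sc)) for s, e, sc in spans]
```

With a 3 s window and a 1.5 s hop, consecutive music windows overlap by half, so a run of them collapses into one span whose start is the first window's start.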

An honest, defensible label. A real audio tagger could tell music from applause from a dog barking. This detector only claims the one label it can defend — "Music" — and the AudioEventBackend seam stays open so a learned model can be dropped in later to emit the full vocabulary without touching anything downstream.

Phase 2: clustering same-track clips with a chromagram

Detecting music is half the ask. The other half — “which clips use the same track?” — is a matching problem, and it's where naïve approaches fail. You can't compare raw audio: the same song re-encoded at a different bitrate, in a different container, behind different dialogue produces a totally different waveform. You need a representation that survives all that and still says “same song.”

That representation is the chromagram. It folds the spectrum down onto the 12 pitch classes of the musical octave (C, C♯, D, …) — throwing away timbre, loudness, and which octave a note lands in, keeping only the harmonic content. Two recordings of the same track produce nearly the same chroma pattern across bitrates and containers, which is precisely the invariance same-track matching needs.
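To make the folding concrete, here is one simple way to compute a chromagram with plain NumPy: take a short-time FFT, map each frequency bin to its nearest semitone, and sum power by pitch class. The function name, FFT size, hop, and frequency range are assumptions; this is a sketch of the general technique, not necessarily how MediaFind computes it:

```python
import numpy as np


def chromagram(x, sr=16_000, n_fft=4096, hop=2048, fmin=55.0, fmax=4000.0):
    """Return an (n_frames, 12) array: normalized power per pitch class per frame.

    Column 0 is C, column 9 is A. All parameters are illustrative.
    """
    n_frames = 1 + (len(x) - n_fft) // hop
    if n_frames < 1:
        return np.zeros((0, 12))

    window = np.hanning(n_fft)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    keep = (freqs >= fmin) & (freqs <= fmax)

    # Nearest MIDI note number for each kept bin, folded onto one octave.
    midi = 69.0 + 12.0 * np.log2(freqs[keep] / 440.0)
    pitch_class = np.round(midi).astype(int) % 12

    chroma = np.zeros((n_frames, 12))
    for t in range(n_frames):
        frame = x[t * hop : t * hop + n_fft] * window
        power = np.abs(np.fft.rfft(frame))[keep] ** 2
        chroma[t] = np.bincount(pitch_class, weights=power, minlength=12)

    # Unit-normalize each frame so overall loudness drops out.
    norms = np.linalg.norm(chroma, axis=1, keepdims=True)
    return chroma / np.maximum(norms, 1e-12)
```

In this sketch a 440 Hz tone and an 880 Hz tone both land in column 9 (A), and scaling the signal leaves the output unchanged, which illustrates the octave and loudness invariance described above.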

Each music span is reduced to a chromagram, averaged into a compact 32 × 12 fingerprint, and compared pairwise by cosine similarity. Clips above the threshold are unioned into one “same track” group.

The pipeline is:

Fingerprint. Take only the samples Phase 1 marked as music, compute the chromagram, and average it into 32 time chunks of 12 pitch classes — a small, fixed-size vector per clip. Chunk count sets the temporal resolution.
Compare. Two clips are the “same track” when their fingerprints' cosine similarity clears 0.92 (MUSIC_FP_SIM). High by design — we want confident matches, not loose vibes.
Cluster. A union-find pass merges every pair above threshold into connected components, so if A matches D and D matches G, all three land in one group. Each pair is compared once, transitive matches fall out of the merge itself, and no central “query song” is needed.
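The three steps above can be sketched as follows, taking an (n_frames, 12) chromagram per clip as input. The function names are hypothetical; the 32 chunks and the 0.92 threshold are the values quoted above, and flattening the 32 × 12 fingerprint into one vector before taking the cosine is an assumption:

```python
import numpy as np


def fingerprint(chroma, chunks=32):
    """Average a chromagram into `chunks` time chunks; return a unit-norm flat vector."""
    if len(chroma) < chunks:
        raise ValueError("span too short to fingerprint")
    parts = np.array_split(chroma, chunks, axis=0)
    fp = np.stack([p.mean(axis=0) for p in parts]).ravel()  # chunks * 12 values
    norm = np.linalg.norm(fp)
    if norm == 0:
        raise ValueError("empty chromagram")
    return fp / norm


def cluster_same_track(fps, threshold=0.92):
    """Group clip indices whose fingerprints are linked by cosine >= threshold."""
    parent = list(range(len(fps)))

    def find(i):
        # Follow parent links to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(fps)):
        for j in range(i + 1, len(fps)):
            # Fingerprints are unit-norm, so the dot product is the cosine.
            if float(np.dot(fps[i], fps[j])) >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(fps)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

The result is a list of index groups, one per track; a clip that matches nothing ends up in a group of its own. Because grouping is by connected component, two clips can share a group without clearing the threshold directly, as long as a third clip links them.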

Every threshold here is env-overridable (MUSIC_WIN_SEC, MUSIC_MIN_SCORE, MUSIC_FP_CHUNKS, MUSIC_FP_SIM), so the precision/recall trade-off is a tuning knob, not a recompile.
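One plausible shape for that override layer is shown below. The variable names come from the list above; the helper name, the `MUSIC_MIN_SCORE` default, and the fall-back-on-malformed-value behaviour are assumptions:

```python
import os

# Defaults: 3.0, 32 and 0.92 are the values quoted in this post;
# the MUSIC_MIN_SCORE default is a placeholder.
_DEFAULTS = {
    "MUSIC_WIN_SEC": 3.0,
    "MUSIC_MIN_SCORE": 0.5,
    "MUSIC_FP_CHUNKS": 32,
    "MUSIC_FP_SIM": 0.92,
}


def load_music_config(env=None):
    """Read thresholds from the environment, keeping each default's type."""
    env = os.environ if env is None else env
    config = {}
    for name, default in _DEFAULTS.items():
        raw = env.get(name)
        try:
            config[name] = type(default)(raw) if raw is not None else default
        except ValueError:
            # Assumed policy: a malformed override is ignored, not fatal.
            config[name] = default
    return config
```

For example, launching with `MUSIC_FP_SIM=0.95` set in the environment would tighten matching in this sketch without touching any code.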

Best-effort, like every other modality

Music understanding follows the same rule as the rest of the pipeline: it never aborts indexing. If detection or fingerprinting throws on one weird file, that file degrades to “no music events” and the indexer moves on. A song feature that could wedge your whole library on a single corrupt clip isn't worth shipping.
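In code, that rule amounts to a guard around the per-file music step. A minimal sketch, in which the wrapper name, the logger name, and the `detect` callable are all hypothetical stand-ins:

```python
import logging

log = logging.getLogger("music")  # hypothetical logger name


def music_events_best_effort(detect, waveform, path="<unknown>"):
    """Run a detector; on any failure, degrade to "no music events"."""
    try:
        return list(detect(waveform))
    except Exception:
        # Record the failure for debugging, but never abort indexing.
        log.warning("music detection failed for %s; skipping music events",
                    path, exc_info=True)
        return []
```

The indexer then treats an empty list the same way it treats a file with no music in it.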

And because the entire path is NumPy over a local waveform, the privacy story is unchanged — you can confirm it yourself:

$ mediafind audit
✓ core path opened 0 external sockets.

Detecting music reuses the dormant audio_events channel; clustering reuses the same union-find approach the face pipeline uses for people. That's the pattern across MediaFind — a new capability is usually an old seam, filled in keyless-first. Next time we'll look at the opposite end of the recall problem: finding every clip that mentions a specific name.

