Continuous batch QA: rolling loops, not nightly runs

A nightly QA run finds last night's bugs. A rolling 15-minute inspector loop finds this commit's bugs — before they compound. Here's how MediaFind's continuous batch QA model keeps a complex local-first app honest, with parallel discovery, serialized fixes, and a risk table that knows which code needs what check.

MediaFind is retrieval software. Search, Ask, clip export, face recognition, and speaker diarization sit on a stack of ML models, embeddings, a local database, and a packaged desktop app, all running offline on your Mac. The failure surface is enormous. A change to the embedding pipeline can silently drop the right result from rank 1 to rank 8. An ffmpeg flag can corrupt an export that still opens without an error. A privacy-sensitive feature can turn itself on by default when it shouldn't. A lazy or dynamic import can slip past PyInstaller's static import analysis, get left out of the frozen bundle, and leave an entire feature returning empty results with no error message.

Unit tests catch none of that. The only way to stay confident through active development is to look at the running app, continuously, with the same skepticism a user would bring — and to do it fast enough that problems surface before they compound.

This is the approach we call continuous batch QA: a structured model for running parallel inspection, triaging findings, and shipping fixes in tight loops, all day, while code is changing.

The core model

The operating principle is simple:

One conductor, many isolated inspectors, shared evidence, minimal shared state. Do not maximize agent count. Maximize independent coverage per unit of coordination cost.

The conductor owns the plan and the single source of truth for results. They assign non-overlapping missions, maintain the issue ledger, deduplicate findings, decide severity, hand confirmed defects to fix agents, and run final verification. Critically, the conductor avoids doing broad manual QA themselves — their job is to keep the swarm organized, not to run it all.

Inspector subagents discover issues. They don't edit code during discovery. Each inspector gets a single workflow or surface, isolated state (its own MEDIAFIND_DATA directory, port, browser profile, and file prefix), and a clear definition of what counts as failure and what evidence is required.
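A minimal sketch of how per-inspector isolation might be set up. Only the MEDIAFIND_DATA variable comes from the description above; the `InspectorEnv` class, the directory layout, the `base_port` value, and the `qa-<name>-` prefix are assumptions made for illustration, and how the port reaches the app is left out because the article does not say.

```python
import os
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class InspectorEnv:
    """Isolated state for one inspector (hypothetical structure)."""
    name: str
    data_dir: Path
    port: int
    browser_profile: Path
    file_prefix: str

    def as_environ(self):
        """Copy of the current environment pointing the app at this data dir."""
        env = dict(os.environ)
        env["MEDIAFIND_DATA"] = str(self.data_dir)
        return env


def make_inspector_envs(names, root, base_port=8800):
    """Give every inspector its own data dir, port, browser profile, and prefix."""
    envs = []
    for offset, name in enumerate(names):
        home = Path(root) / name
        data_dir = home / "data"
        profile = home / "browser-profile"
        data_dir.mkdir(parents=True, exist_ok=True)
        profile.mkdir(parents=True, exist_ok=True)
        envs.append(
            InspectorEnv(name, data_dir, base_port + offset, profile, f"qa-{name}-")
        )
    return envs
```

Because nothing is shared between two `InspectorEnv` values, a corrupted library or a stuck port in one inspection cannot contaminate the evidence from another.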

Fix agents work only after the conductor has deduplicated and confirmed a finding. At most two fix agents run at once, and never when their file ownership overlaps. Shared files — index.py, app.py, templates, desktop code, packaging specs — are serialized. Parallel fixes are for isolated modules and tests only.
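The fix-agent rules can be read as a small admission check. The sketch below is one interpretation, not the project's scheduler: the glob patterns for shared files are assumptions, and "serialized" is taken to mean that a fix touching a shared file runs alone.

```python
from fnmatch import fnmatch

MAX_PARALLEL_FIXES = 2
# Shared surfaces named in the text; the glob patterns themselves are assumed.
SHARED_PATTERNS = ("index.py", "app.py", "templates/*", "desktop/*", "*.spec")


def is_shared(path):
    """True if the path, or its basename, matches a shared-file pattern."""
    basename = path.rsplit("/", 1)[-1]
    return any(fnmatch(path, p) or fnmatch(basename, p) for p in SHARED_PATTERNS)


def can_start_fix(candidate_files, active_fixes):
    """Decide whether a new fix agent may start.

    active_fixes is a list of sets, one per running fix agent, holding the
    files that agent owns.
    """
    candidate = set(candidate_files)
    if len(active_fixes) >= MAX_PARALLEL_FIXES:
        return False  # at most two fix agents at once
    if any(candidate & owned for owned in active_fixes):
        return False  # file ownership overlaps
    if active_fixes and any(is_shared(f) for f in candidate):
        return False  # shared files are serialized: wait for an empty queue
    if any(is_shared(f) for owned in active_fixes for f in owned):
        return False  # a shared-file fix is already running alone
    return True
```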

Verifiers close findings. Whenever possible, the verifier is not the same agent that found or fixed the issue. For blocker and high severity findings, self-verification is never allowed.

Six invariants that every batch must protect

Before running anything, it helps to be clear about what you're protecting. MediaFind's QA operates around six invariants that can never regress:

  1. Private, local, keyless by default. Core indexing, search, Ask, playback, export, and summaries must not make external network calls.
  2. Sensitive features stay controlled. Faces and people follow the entitlement-aware default (on for Pro/trial, off for free), and privacy-sensitive features stay local-only.
  3. The index remains durable. Schema migrations are additive, idempotent, and safe across old databases. No QA run should corrupt user data.
  4. Critical journeys stay usable. Add media, index, search, play, Ask, export, organize, license, backup, and relink must work in the real app — not only in unit tests.
  5. Packaged artifacts work. The wheel, frozen server binary, and desktop app must include templates, static assets, migrations, and required runtime resources.
  6. Quality is measured. Search, visual retrieval, Ask grounding, faithfulness, and latency regressions are tracked by evals, not judged only by manual inspection.

These aren't aspirations — they're things a QA run can verify or falsify. The batch structure maps to them.

Six kinds of batches

We group the work into six named batch types. Each targets a failure class that other batches don't cover well.

[Figure: the six batch types and the ledger they feed]
  Batch 0, Automated gate: CI, evals, wheel-smoke
  Batch 1, Core CUJs: Import · Search · Ask · Admin
  Batch 2, Privacy patrol: network blocks, opt-in gates
  Batch 3, Real-computer UI: 4 viewports, keyboard-only
  Batch 4, Packaging artifacts: Wheel · binary · desktop app
  Batch 5, Quality regression: evals + regression gate
  Issue ledger: conductor-owned, deduplicated
Six batch types, each targeting a distinct failure class. All findings flow to a single conductor-owned ledger — the canonical source of truth that prevents duplicates and decides severity.

Batch 0 runs deterministic gates before any real-computer time is spent: scripts/ci-local.sh, make eval, make eval-gate, make wheel-smoke. If the base is red, stop. Don't launch parallel inspection on a broken base unless the batch is explicitly about diagnosing that break.
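As a rough sketch, the gate can be a loop that stops at the first red command. The four commands are the ones listed above; running them in this order, the `run_gate` helper, and its injectable `run` parameter (useful for dry runs) are assumptions.

```python
import subprocess

# Gate commands as listed in the text, assumed to run from the repo root.
GATE_COMMANDS = (
    ("scripts/ci-local.sh",),
    ("make", "eval"),
    ("make", "eval-gate"),
    ("make", "wheel-smoke"),
)


def run_gate(commands=GATE_COMMANDS, run=subprocess.run):
    """Run each gate in order and stop at the first failure.

    Returns (True, None) when everything is green, otherwise
    (False, failed_command) so the conductor knows which gate broke.
    """
    for cmd in commands:
        result = run(list(cmd))
        if result.returncode != 0:
            return False, list(cmd)
    return True, None
```

A conductor script would call `run_gate()` first and only launch inspectors when it returns `(True, None)`.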

Batch 1 splits critical user journeys by workflow — not by "test everything." One agent per workflow surface: first run and import; search, playback, and export; Ask and citations; library organization. Splitting this way keeps mission scope tight enough that an agent can actually cover it in one session.

Batch 2 is a standing batch because privacy is a product promise. Core paths with external sockets blocked, explicit opt-in/opt-out behaviors, SSRF rejection, redacted export checks. A privacy regression is always a blocker, regardless of cadence.
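For the SSRF rejection check, a patrol needs a reference for what "should be rejected" means. The function below is a hedged approximation built on the standard library, not MediaFind's downloader logic: the name `is_rejected_url` and the http/https-only rule are assumptions, and a production check would also resolve hostnames and re-check the resulting addresses.

```python
import ipaddress
from urllib.parse import urlsplit


def is_rejected_url(url):
    """Return True for URLs a downloader should refuse (illustrative rules)."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return True  # assumption: only web URLs are downloadable
    if parts.username or parts.password:
        return True  # credentialed URL
    host = parts.hostname
    if not host:
        return True
    if host == "localhost" or host.endswith(".localhost"):
        return True
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # A DNS name: this sketch lets it through. A real check would
        # resolve it and apply the address tests below to every result.
        return False
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_reserved
        or ip.is_multicast
        or ip.is_unspecified
    )
```

An inspector can feed a fixed list of hostile URLs through the real downloader and compare its verdicts with this table of expectations.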

Batch 3 uses real browser or desktop control — because broken focus, overlapping text, and stalled UI don't show up in API tests. Four inspectors, four viewports: 1440×900 desktop, 1280×720 small desktop, 390px narrow/mobile, and keyboard-only for accessibility.

Batch 4 runs separately because packaged-app failures routinely pass source-level tests. The wheel, the frozen server binary, and the signed macOS app are each tested with their dependencies hidden.

Batch 5 covers quality regression — retrieval metrics, faithfulness, latency percentiles, eval history drift. MediaFind is retrieval software, so quality regressions are product regressions. The eval harness runs automatically; the conductor is the only one who updates baselines, and only after reviewing three to five concrete query/result diffs per regressed suite.

The rolling cadence

The key design decision in our QA model is that nightly and weekly runs are backstops, not the main engine. The main engine is a rolling 15-to-30-minute loop that runs while code is changing.

[Figure: four parallel loops inside a 15-30 min window: Smoke (10-15 min), Privacy (10-20 min), UI (15-30 min), Regression (10-20 min); evidence flows to the ledger]
The four inspector loops run in parallel, not sequentially — so the wall-clock time is the longest single loop, not their sum. A round finishes in 15-30 minutes regardless of how many inspectors are running.
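The wall-clock claim is easy to see in a toy model. In this sketch each loop is a plain callable returning a list of findings; `run_round` and the role names used as dictionary keys are illustrative only.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_round(loops):
    """Run every inspector loop at once and collect its findings.

    loops maps a role name to a zero-argument callable that returns a list
    of findings. Returns (evidence_by_role, elapsed_seconds).
    """
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=len(loops)) as pool:
        futures = {role: pool.submit(fn) for role, fn in loops.items()}
        # Waiting on every future takes about as long as the slowest loop.
        evidence = {role: fut.result() for role, fut in futures.items()}
    return evidence, time.monotonic() - start
```

With four stand-in loops that sleep for 1, 2, 3, and 2 seconds, the elapsed time comes out near 3 seconds rather than 8, which is the same arithmetic as the 15-30 minute window.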

In each loop, an inspector:

The inspector labels (Smoke, Privacy, UI, Regression) are roles, not fixed staff. The same person covers whichever role is most relevant to what's actively changing.

Risk-triggered escalation

The cadence ladder — rolling loops, then hourly, then daily, then release candidate — tells you how often to run. But it doesn't tell you what to run when a specific file changes. That's what the escalation table is for.

The table is an overlay, not a calendar slot. Regardless of where you are in the cadence, when a covered surface is touched, its check goes into the queue immediately:

If touched                              | Add immediately
index.py, migrations, import/export     | Old-schema migration, backup/restore, relink, corruption smoke
app.py, templates, static JS            | Real-computer UI pass, console/network capture, accessibility smoke
Search, embeddings, rerank, Ask, visual | Eval subset plus three representative manual queries
Downloader, capture, update, license    | Privacy/security patrol
Faces, people, speakers                 | Entitlement-aware default check, Pro/free gate check
Clip, ffmpeg, export, share             | Export artifact openability, redaction/leak check
Packaging, desktop, release scripts     | Wheel/frozen app smoke: asset inventory, migrations, runtime resources
API routes, request/response schemas    | Contract smoke, call-site audit, backward-compat check

The table short-circuits the question "does this change need QA?" — almost always yes, but now you know exactly which kind.
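One way to make the escalation table executable is a list of (patterns, checks) pairs. The check names below follow the table; the path patterns are hypothetical, since the article names surfaces rather than directories, and would need to be mapped onto the real repository layout.

```python
from fnmatch import fnmatch

# (path patterns, checks to queue). Patterns are assumed for illustration.
ESCALATION = [
    (("index.py", "migrations/*"),
     ["old-schema migration", "backup/restore", "relink", "corruption smoke"]),
    (("app.py", "templates/*", "static/*.js"),
     ["real-computer UI pass", "console/network capture", "accessibility smoke"]),
    (("search/*", "embeddings/*", "rerank/*", "ask/*", "visual/*"),
     ["eval subset", "three representative manual queries"]),
    (("downloader/*", "capture/*", "update/*", "license/*"),
     ["privacy/security patrol"]),
    (("faces/*", "people/*", "speakers/*"),
     ["entitlement-aware default check", "Pro/free gate check"]),
    (("clip/*", "export/*", "share/*"),
     ["export artifact openability", "redaction/leak check"]),
    (("packaging/*", "desktop/*", "scripts/release*"),
     ["wheel/frozen app smoke"]),
    (("api/*", "schemas/*"),
     ["contract smoke", "call-site audit", "backward-compat check"]),
]


def checks_for(changed_files):
    """Return the checks to queue for a change, in table order, no repeats."""
    queue = []
    for patterns, checks in ESCALATION:
        if any(fnmatch(f, p) for f in changed_files for p in patterns):
            for check in checks:
                if check not in queue:
                    queue.append(check)
    return queue
```

Feeding it the output of a changed-file listing gives the conductor an immediate queue without anyone re-reading the table.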

Why privacy gets a standing batch

Every other batch is event-triggered or cadence-driven. Batch 2 (privacy patrol) is both — it runs on a schedule and whenever a covered surface changes, because privacy isn't a feature, it's the product's core promise.

The specific checks aren't complex. Block all external sockets and run the indexing and search path — anything that tries to phone home shows up immediately. Verify that face indexing follows the entitlement-aware default (on for Pro/trial, off for free) and that speaker diarization uses its expected default. Test that a "redacted" export bundle doesn't contain file paths, transcripts, or face data from items the user didn't include. Reject credentialed and private-network URLs in the downloader.
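A simple in-process version of the socket block can be written by wrapping `socket.socket.connect`. This is a sketch under clear limits: it only sees connections made through Python's `socket` module in the current process, it does not cover `connect_ex`, UDP `sendto`, DNS lookups, or subprocesses such as ffmpeg, and the `ExternalNetworkCall` name is made up. A real patrol would more likely enforce the block at the OS or firewall level.

```python
import ipaddress
import socket
from contextlib import contextmanager


class ExternalNetworkCall(RuntimeError):
    """Raised when code under test tries to reach a non-loopback host."""


def _is_loopback(host):
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # any other hostname counts as external


@contextmanager
def block_external_sockets():
    """Fail loudly on any TCP/IP connect that leaves the machine."""
    real_connect = socket.socket.connect

    def guarded_connect(self, address):
        if self.family in (socket.AF_INET, socket.AF_INET6):
            if not _is_loopback(str(address[0])):
                raise ExternalNetworkCall(f"blocked connect to {address!r}")
        return real_connect(self, address)

    socket.socket.connect = guarded_connect
    try:
        yield
    finally:
        socket.socket.connect = real_connect
```

Running the indexing and search path inside `with block_external_sockets():` turns a silent phone-home into an exception with the offending address in the message.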

Release blockers from this batch are absolute: any unprompted external network call, any privacy leak in a redacted artifact, any sensitive feature that turns on without an explicit user action.

The issue ledger

Every finding goes into one canonical ledger, owned by the conductor. The shape of each entry is fixed:

ID:
Status: new | needs-repro | reproduced | duplicate | fixing | fixed | verified | wontfix
Severity: blocker | high | medium | low
Area:
Environment:
Steps:
Expected:
Actual:
Evidence:
Verification:

A finding without evidence stays at needs-repro and doesn't block a release unless it involves credible data loss, privacy leakage, or launch failure. The status lifecycle matters: fixed moves to verified only when a verifier (ideally not the fixer) reruns the exact steps. The conductor closes verified items. For blocker and high severity, self-verification is never allowed — if no independent verifier is available, the item holds.
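The entry shape and the verification rule can be sketched as a small data model. The field names mirror the template above; `found_by` and `fixed_by` are hypothetical additions needed to check independence, and `mark_verified` is an illustrative helper rather than an existing tool.

```python
from dataclasses import dataclass, field

STATUSES = ("new", "needs-repro", "reproduced", "duplicate",
            "fixing", "fixed", "verified", "wontfix")
SEVERITIES = ("blocker", "high", "medium", "low")


@dataclass
class Finding:
    id: str
    severity: str
    area: str
    steps: str
    expected: str
    actual: str
    environment: str = ""
    evidence: list = field(default_factory=list)
    status: str = "new"
    verification: str = ""
    found_by: str = ""   # hypothetical: not part of the ledger template
    fixed_by: str = ""   # hypothetical: not part of the ledger template

    def __post_init__(self):
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity}")
        if self.status not in STATUSES:
            raise ValueError(f"unknown status: {self.status}")


def mark_verified(finding, verifier, note):
    """Move a fixed finding to verified, enforcing independent verification."""
    if finding.status != "fixed":
        raise ValueError("only fixed findings can be verified")
    independent = verifier not in (finding.found_by, finding.fixed_by)
    if finding.severity in ("blocker", "high") and not independent:
        raise PermissionError("blocker/high findings need an independent verifier")
    finding.status = "verified"
    finding.verification = f"{verifier}: {note}"
    return finding
```

If no independent verifier exists, the call raises and the item simply stays at fixed, which matches the "item holds" rule.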

Conflict reduction

Parallel work on a shared codebase needs explicit rules to avoid stepping on itself:

The reason these rules exist isn't process for its own sake — it's because a shared git index is a shared resource, and two agents touching overlapping files produce merge conflicts that cost more time than the parallelism saved.

What this looks like in practice

On a heavy development day, the rhythm looks like this: a developer pushes a change to search ranking. The Smoke inspector verifies the app still launches and returns results. The Privacy inspector confirms the change didn't add an outbound call. The UI inspector checks that the results page renders correctly at desktop and narrow widths. The Regression inspector reruns the last three fixed bugs to confirm they're still closed. All four run at the same time. If anything fails, the conductor triages, confirms with evidence, and hands it to a fixer — who works on a file no other agent is touching. The verifier closes it. The whole cycle is under 30 minutes.

The nightly run then aggregates: full test suite, evals and eval gate, security scans. The release candidate run adds the signed and notarized macOS app, the frozen binary, and a final verification pass over every release-blocking ledger item.

The winning pattern for MediaFind QA: parallel discovery, serialized fixes, independent verification. Do not maximize agent count — maximize independent coverage per unit of coordination cost.

The full strategy is documented in docs/CONTINUOUS_BATCH_QA.md in the repo, alongside the release QA tiers in docs/RELEASE_QA.md and the eval harness in docs/EVAL.md. The principle underneath all of it: a local-first app with no server-side telemetry has to find its own regressions — and 30-minute rolling loops find them faster than waiting for a nightly run.
