ChirperBench Scores

Overall Leaderboard

Click any column header to sort. Highlighted rows mark highest score, lowest latency, and best score per second.

Model Metrics

Judge issues are model-output mistakes found by the judge. One result can have multiple issues; run failures are shown separately as statuses.

Telemetry Graphs

Score vs

Outcome Graphs

Pass / Fail by Transcript Category

Judge Issues by Type

Issue Type by Severity

Case Matrix

Compare Outputs

Detailed Results

About ChirperBench

What This Measures

ChirperBench checks whether local Ollama models can clean up dictated transcripts without treating the dictated words as instructions to execute.

The suite stresses command-like text, dictated questions, email requests, markdown, URLs, code identifiers, spelling corrections, mixed formatting, and cases where no change is needed.

How A Run Works

Each installed Ollama model is run sequentially against every transcript case.
The model receives the same formatter prompt and only the raw transcript.
Outputs, runtime, statuses, and optional AMD sysfs GPU telemetry are saved in machine-readable run data.
When judging is enabled, Codex CLI uses gpt-5.5 with high reasoning to score the output against the expected result.

Scoring

Scores run from 0 to 100. Passing means the judge accepted the cleaned transcript as meeting the case target.

Judge issues are model-output mistakes, not necessarily process crashes. One result can have several judge issues, such as answering a dictated question, refusing a command-like transcript, leaking spoken edits, or inventing extra text.