What This Measures
ChirperBench checks whether local Ollama models can clean up dictated transcripts without treating the dictated words as instructions to execute.
The suite stresses command-like text, dictated questions, email requests, markdown, URLs, code identifiers, spelling corrections, mixed formatting, and cases where no change is needed.
Scoring
Scores run from 0 to 100. Passing means the judge accepted the cleaned transcript as meeting the case target.
Judge issues are model-output mistakes, not necessarily process crashes. One result can have several judge issues, such as answering a dictated question, refusing a command-like transcript, leaking spoken edits, or inventing extra text.