Evaluating your Dawn agent
Scenario tests and @dawn-ai/testing pin down behaviour: given this input and these fixtures, the agent calls this tool and says this. They answer "is it correct?" for a handful of cases. Evals answer a different question — "how good is it?" — by running your agent over a whole dataset and scoring each output, then aggregating the scores into a single verdict you can gate on.
Use evals when you want a quality bar that survives prompt edits, model swaps, and refactors: a number that goes up or down across a dataset, with an optional threshold that fails CI when quality regresses.
Evals are deterministic and CI-safe by default. Like @dawn-ai/testing, each case replays pre-written aimock fixtures — your tools, prompts, capabilities, and state run normally; only the LLM is replaced. A separate dawn eval --live mode runs the real model locally when you want to measure against actual model output.
Install
pnpm add -D @dawn-ai/evals @dawn-ai/testing
@dawn-ai/evals provides defineEval and the scorers; @dawn-ai/testing provides script() and the harness the dawn eval command drives.
The *.eval.ts convention
Evals live next to the route they exercise, under an evals/ directory:
src/app/chat/
  index.ts
  evals/
    quality.eval.ts
dawn eval discovers src/app/<route>/evals/*.eval.ts the same way dawn test discovers run.test.ts. An eval co-located under a route directory binds to that route automatically; the route field is only needed when the file lives elsewhere.
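The path convention above can be pictured as a simple pattern match. This is a minimal sketch, not Dawn's discovery code: coLocatedRouteDir is a hypothetical helper, and it only recovers the route directory ("/chat"); how the "#agent" part of the route key is derived is not covered here.

```typescript
// Hypothetical helper, for illustration only (not part of @dawn-ai/evals).
// Returns the route directory an eval file is co-located under, or null
// when the file does not follow src/app/<route>/evals/*.eval.ts.
function coLocatedRouteDir(evalFile: string): string | null {
  const match = /^src\/app\/(.+)\/evals\/[^/]+\.eval\.ts$/.exec(evalFile)
  return match ? `/${match[1]}` : null
}
```

A file that returns null here is the case where you would set the route field explicitly.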
defineEval
import { contains, defineEval } from "@dawn-ai/evals"
import { script } from "@dawn-ai/testing"

export default defineEval({
  name: "chat quality",
  // route: "/chat#agent", // optional — inferred from the file location
  dataset: [
    {
      name: "greets the user",
      input: "hello",
      fixtures: script().user("hello").replies("Hi! How can I help?"),
    },
  ],
  scorers: [contains("help", { threshold: 1 })],
  threshold: 1,
})
The fields:
- name: required label for the eval, shown in the report.
- route: the Agent Protocol route key ("/chat#agent"). Optional when the file is co-located under a route; required otherwise.
- dataset: the cases to run (see below).
- scorers: one or more scorers; each runs against every case.
- threshold: sugar for gate.mean(threshold) (see Gating).
- gate: a composable gate policy; takes precedence over threshold.
A dataset case is { input, expected?, name?, fixtures?, metadata? }. input is the user message for an agent route. fixtures is the per-case aimock script used in replay mode (ignored under --live).
Datasets
dataset accepts three shapes:
// 1. Inline array of cases
dataset: [{ input: "hello", expected: "Hi!" }]
// 2. A path to a committed file (.json or .jsonl), relative to the eval file
dataset: "./cases.json"
dataset: "./cases.jsonl"
// 3. A sync or async function returning cases
dataset: async () => loadCasesFromSomewhere()
A .json file must contain a JSON array of cases; a .jsonl file holds one case object per line. Relative paths resolve against the eval file's directory.
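To make the difference between the two file formats concrete, here is a minimal sketch of parsing each one. The EvalCase type and both function names are assumptions for illustration; this is not how @dawn-ai/evals loads datasets, and skipping blank lines in .jsonl is likewise an assumption.

```typescript
// Hypothetical case shape, mirroring the fields this page lists.
type EvalCase = {
  input: string
  expected?: unknown
  name?: string
  metadata?: Record<string, unknown>
}

// .json: the whole file is a single JSON array of cases.
function parseJsonCases(text: string): EvalCase[] {
  const parsed: unknown = JSON.parse(text)
  if (!Array.isArray(parsed)) {
    throw new Error("a .json dataset must contain a JSON array of cases")
  }
  return parsed as EvalCase[]
}

// .jsonl: one case object per line; blank lines are skipped (an assumption).
function parseJsonlCases(text: string): EvalCase[] {
  return text
    .split(/\r?\n/)
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as EvalCase)
}
```

A .jsonl file is convenient when cases are appended over time, since adding a case is a one-line diff.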
Scorers
A scorer reads the run result and the case and returns a score: a number in 0..1, a boolean (true = 1, false = 0), or a rich verdict { score, label?, reason? }. Built-in scorers:
- exactMatch(): final message equals case.expected.
- contains(substring): final message includes the substring.
- regex(re): the regex matches the final message.
- jsonEquals(): case.expected deep-equals the parsed final message (or a value you select).
- toolCalled(name, { withArgs? }): a tool was called (optionally with matching args).
- tokensUnder(budget): fewer than budget tokens were streamed.
Each built-in takes an optional { threshold } — its own pass bar, used by gate.perScorer and per-case pass logic.
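The three return shapes and the per-scorer bar can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the Verdict and Score type names, toVerdict, and meetsBar are all hypothetical, and treating "meets its bar" as score >= threshold is an assumption (0.5 is the default bar described under Gating).

```typescript
// Hypothetical types mirroring the shapes described above.
type Verdict = { score: number; label?: string; reason?: string }
type Score = number | boolean | Verdict

// Normalize any of the three shapes to a rich verdict.
function toVerdict(score: Score): Verdict {
  if (typeof score === "boolean") return { score: score ? 1 : 0 }
  if (typeof score === "number") return { score }
  return score
}

// Assumed pass rule: the normalized score must reach the scorer's threshold.
function meetsBar(score: Score, threshold = 0.5): boolean {
  return toVerdict(score).score >= threshold
}
```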
For anything custom, write your own:
import { custom, llmJudge } from "@dawn-ai/evals"
// Arbitrary scoring logic
custom((run, testCase) => (run.finalMessage.length < 200 ? 1 : 0), { name: "concise" })
// Grade output quality with a model
llmJudge({
criteria: "Does the answer fully address {{input}}? Output: {{output}}",
model: "gpt-4o-mini", // optional
threshold: 0.7, // optional
})
custom((run, testCase) => Score | Promise<Score>) gives you the full run result and case. llmJudge asks a model to grade the output against criteria (which interpolates {{input}}, {{expected}}, {{output}}) and returns { score, reason }. The judge calls a real model, so it needs OPENAI_API_KEY; reserve it for --live runs.
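The placeholder interpolation in criteria amounts to a template substitution. The sketch below shows the idea only; renderCriteria and JudgeVars are hypothetical names, and rendering a missing expected value as an empty string is an assumption rather than documented llmJudge behaviour.

```typescript
// Hypothetical variable bag for the three placeholders named above.
type JudgeVars = { input: string; expected?: string; output: string }

// Replace {{input}}, {{expected}}, and {{output}}; leave any other
// {{...}} token untouched. Missing values render as "" (an assumption).
function renderCriteria(template: string, vars: JudgeVars): string {
  return template.replace(
    /\{\{(input|expected|output)\}\}/g,
    (_match: string, key: string) => vars[key as keyof JudgeVars] ?? "",
  )
}
```

Writing criteria so that it reads sensibly after substitution makes the judge's prompt easier to debug.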
Gating
After every case is scored, a gate turns the aggregated scores into pass/fail. Gate policies compose via the gate helper:
import { gate } from "@dawn-ai/evals"
gate: gate.mean(0.8) // dataset-wide mean ≥ 0.8
gate: gate.passRate(0.9) // ≥ 90% of cases pass (every scorer met its bar)
gate: gate.everyCase(0.6) // every case's mean ≥ 0.6
gate: gate.perScorer() // each scorer's mean ≥ that scorer's own threshold
// Combine them:
gate: gate.all(gate.mean(0.8), gate.perScorer()) // all must pass
gate: gate.any(gate.passRate(0.9), gate.mean(0.95)) // any one passing is enough
There are two shorthands:
- threshold on the eval is sugar for gate.mean(threshold). If gate is set, threshold is ignored.
- A per-scorer threshold sets that scorer's own bar, used by gate.perScorer() and to decide whether a case "passed" (every scorer must meet its bar; default bar 0.5).
If you set neither gate nor threshold, the eval is informational: it always passes and just reports scores. This lets you land an eval and watch the numbers before you commit to a bar.
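The arithmetic behind each gate policy can be sketched over a plain table of scores. Everything here is a hypothetical stand-in written for illustration (meanGate, passRateGate, everyCaseGate, perScorerGate, allOf, anyOf); it is not the @dawn-ai/evals implementation. It assumes every case is scored by the same scorers, so the dataset-wide mean equals the mean of the case means.

```typescript
// table[caseIndex][scorerIndex] holds scores already normalized to 0..1.
type ScoreTable = number[][]

const average = (xs: number[]): number =>
  xs.length === 0 ? 0 : xs.reduce((sum, x) => sum + x, 0) / xs.length

// Per-scorer bars; 0.5 is the default bar this page mentions.
const barFor = (bars: number[], scorerIndex: number): number =>
  bars[scorerIndex] ?? 0.5

// Like gate.mean(bar): the dataset-wide mean clears the bar.
const meanGate = (table: ScoreTable, bar: number): boolean =>
  average(table.map(average)) >= bar

// Like gate.passRate(rate): share of cases where every scorer met its bar.
const passRateGate = (table: ScoreTable, bars: number[], rate: number): boolean => {
  const passed = table.filter((row) => row.every((s, i) => s >= barFor(bars, i)))
  return table.length > 0 && passed.length / table.length >= rate
}

// Like gate.everyCase(bar): every case's own mean clears the bar.
const everyCaseGate = (table: ScoreTable, bar: number): boolean =>
  table.every((row) => average(row) >= bar)

// Like gate.perScorer(): each scorer's mean across cases clears its own bar.
const perScorerGate = (table: ScoreTable, bars: number[]): boolean => {
  const scorerCount = table[0]?.length ?? 0
  for (let i = 0; i < scorerCount; i++) {
    if (average(table.map((row) => row[i] ?? 0)) < barFor(bars, i)) return false
  }
  return true
}

// Like gate.all / gate.any: combine individual verdicts.
const allOf = (...verdicts: boolean[]): boolean => verdicts.every(Boolean)
const anyOf = (...verdicts: boolean[]): boolean => verdicts.some(Boolean)
```

For a two-case, two-scorer table [[1, 1], [0, 1]] with default bars, the dataset mean is 0.75 and one of two cases passes, so a mean gate of 0.8 fails while a per-scorer gate passes (scorer means 0.5 and 1). The policies measure different things, which is why combining them can be useful.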
Execution: replay vs live
By default dawn eval runs in replay mode: each case replays its fixtures through aimock, so no real model is called. Fixtures come from a script() builder inline on the case, or from committed fixture files — the same deterministic, CI-safe path @dawn-ai/testing uses. This is the only mode that should run in CI.
dawn eval
Pass --live to run the real model locally instead:
dawn eval --live
Live mode ignores per-case fixtures and calls the actual provider. It requires OPENAI_API_KEY and is meant for local runs while you tune prompts. Never wire it into CI.
Command reference
dawn eval [path] [--live] [--json [file]] [--cwd <path>]
- [path]: narrow discovery to a subdirectory.
- --live: run the real model (requires OPENAI_API_KEY); never use in CI.
- --json [file]: also write a JSON report. Defaults to .dawn/eval-report.json.
- --cwd <path>: operate on a different app root.
A gated eval that fails causes a non-zero exit, so CI fails when quality drops below the bar. Informational evals never affect the exit code. See the CLI reference for the full command surface.
A full example
import { contains, defineEval, gate, toolCalled } from "@dawn-ai/evals"
import { script } from "@dawn-ai/testing"

export default defineEval({
  name: "chat quality",
  dataset: [
    {
      name: "filters open items",
      input: "Filter open items",
      fixtures: script()
        .user("Filter open items")
        .callsTool("applyFilter", { status: "open" })
        .replies("Found 2 open items — let me know if I can help further."),
    },
  ],
  scorers: [
    contains("help", { threshold: 0.5 }),
    toolCalled("applyFilter", { threshold: 0.5 }),
  ],
  gate: gate.perScorer(),
})
Run it:
dawn eval
The single case calls applyFilter and its reply contains "help", so both scorers score 1.00. gate.perScorer() checks each scorer's mean against its own bar; both are 1.00 ≥ 0.5, so the eval passes:
PASS chat quality › filters open items mean=1.00 [contains(help)=1.00 toolCalled(applyFilter)=1.00]
PASS chat quality mean=1.00
Commit the eval and any fixture files alongside your route. In CI, run dawn eval (replay only); locally, run dawn eval --live when you want to measure against the real model before updating fixtures.