Evaluating your Dawn agent

Scenario tests and @dawn-ai/testing pin down behaviour: given this input and these fixtures, the agent calls this tool and says this. They answer "is it correct?" for a handful of cases. Evals answer a different question — "how good is it?" — by running your agent over a whole dataset and scoring each output, then aggregating the scores into a single verdict you can gate on.

Use evals when you want a quality bar that survives prompt edits, model swaps, and refactors: a number that goes up or down across a dataset, with an optional threshold that fails CI when quality regresses.

Evals make the agent run deterministic by default. Like @dawn-ai/testing, each case replays pre-written aimock fixtures, so the agent's model call is replaced while your tools, prompts, capabilities, and state run normally. Scorers still execute after the replayed run; deterministic scorers are CI-safe, and model-graded scorers such as llmJudge are CI-safe when their chat-completions requests are fixture-backed or mocked. A separate dawn eval --live mode runs the real model locally when you want to measure against actual model output.

Install

bash

pnpm add -D @dawn-ai/evals @dawn-ai/testing

@dawn-ai/evals provides defineEval and the scorers; @dawn-ai/testing provides script() and the harness the dawn eval command drives.

The `*.eval.ts` convention

Evals live next to the route they exercise, under an evals/ directory:

text

src/app/chat/
  index.ts
  evals/
    quality.eval.ts

dawn eval discovers src/app/<route>/evals/*.eval.ts the same way dawn test discovers run.test.ts. An eval co-located under a route directory binds to that route automatically; the route field is only needed when the file lives elsewhere.

`defineEval`

src/app/chat/evals/quality.eval.ts

import { contains, defineEval } from "@dawn-ai/evals"
import { script } from "@dawn-ai/testing"
 
export default defineEval({
  name: "chat quality",
  // route: "/chat#agent",  // optional — inferred from the file location
  dataset: [
    {
      name: "greets the user",
      input: "hello",
      fixtures: script().user("hello").replies("Hi! How can I help?"),
    },
  ],
  scorers: [contains("help", { threshold: 1 })],
  threshold: 1,
})

The fields:

name — required label for the eval, shown in the report.
route — the Agent Protocol route key ("/chat#agent"). Optional when the file is co-located under a route; required otherwise.
dataset — the cases to run (see below).
scorers — one or more scorers; each runs against every case.
threshold — sugar for gate.mean(threshold) (see Gating).
gate — a composable gate policy; takes precedence over threshold.

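A dataset case is { input, expected?, name?, fixtures?, metadata? }. input is the user message for an agent route. expected is the reference value that scorers such as exactMatch() and jsonEquals() compare against. name labels the case in the report. fixtures is the per-case aimock script used in replay mode (ignored under --live).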

Datasets

dataset accepts three shapes:

// 1. Inline array of cases
dataset: [{ input: "hello", expected: "Hi!" }]
 
// 2. A path to a committed file (.json or .jsonl), relative to the eval file
dataset: "./cases.json"
dataset: "./cases.jsonl"
 
// 3. A sync or async function returning cases
dataset: async () => loadCasesFromSomewhere()

A .json file must contain a JSON array of cases; a .jsonl file holds one case object per line. Relative paths resolve against the eval file's directory.
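
To make the two file formats concrete, here is a minimal sketch of how a loader could turn file contents into cases. parseDatasetText and the EvalCase type are illustrative names, not part of @dawn-ai/evals, and the real loader may validate more strictly. Skipping blank lines in .jsonl input is an assumption of this sketch.

```typescript
// Hypothetical case shape, mirroring the dataset case fields described earlier.
type EvalCase = {
  input: string
  expected?: unknown
  name?: string
  metadata?: Record<string, unknown>
}

// Illustrative helper (not a Dawn API): parse the text of a .json or .jsonl dataset file.
function parseDatasetText(text: string, format: "json" | "jsonl"): EvalCase[] {
  if (format === "json") {
    const parsed: unknown = JSON.parse(text)
    if (!Array.isArray(parsed)) {
      throw new Error("A .json dataset must contain a JSON array of cases")
    }
    return parsed as EvalCase[]
  }
  // .jsonl: one case object per line; blank lines are skipped (assumption).
  return text
    .split(/\r?\n/)
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as EvalCase)
}

const fromJson = parseDatasetText('[{"input":"hello","expected":"Hi!"}]', "json")
const fromJsonl = parseDatasetText('{"input":"hello"}\n{"input":"bye"}\n', "jsonl")
console.log(fromJson.length, fromJsonl.length) // prints "1 2"
```

A .jsonl file tends to be easier to review in diffs, because adding a case only adds one line.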

Scorers

A scorer reads the run result and the case and returns a score: a number in 0..1, a boolean (true = 1, false = 0), or a rich verdict { score, label?, reason? }. Built-in scorers:

exactMatch() — final message equals case.expected.
contains(substring) — final message includes the substring.
regex(re) — the regex matches the final message.
jsonEquals() — case.expected deep-equals the parsed final message (or a value you select).
toolCalled(name, { withArgs? }) — a tool was called (optionally with matching args).
tokensUnder(budget) — fewer than budget tokens were streamed.

Each built-in takes an optional { threshold } — its own pass bar, used by gate.perScorer and per-case pass logic.
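
The three score shapes and the per-scorer bar can be pictured with a small sketch. toNumericScore and meetsBar are hypothetical helpers written for this page, not functions exported by @dawn-ai/evals. Clamping out-of-range numbers and treating the bar as inclusive (score ≥ threshold) are assumptions.

```typescript
// The three score shapes a scorer may return.
type Verdict = { score: number; label?: string; reason?: string }
type Score = number | boolean | Verdict

// Illustrative helper (not a Dawn API): reduce any Score to a number in 0..1.
function toNumericScore(score: Score): number {
  if (typeof score === "boolean") return score ? 1 : 0 // true = 1, false = 0
  const value = typeof score === "number" ? score : score.score
  // Clamping out-of-range values is an assumption of this sketch.
  return Math.min(1, Math.max(0, value))
}

// A scorer meets its bar when its numeric score reaches the threshold.
// 0.5 is the default bar mentioned in the Gating section.
function meetsBar(score: Score, threshold = 0.5): boolean {
  return toNumericScore(score) >= threshold
}

console.log(toNumericScore(true)) // prints 1
console.log(toNumericScore({ score: 0.7, reason: "mostly complete" })) // prints 0.7
console.log(meetsBar(0.4)) // prints false
```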

For anything custom, write your own:

import { custom, llmJudge } from "@dawn-ai/evals"
 
// Arbitrary scoring logic
custom((run, testCase) => (run.finalMessage.length < 200 ? 1 : 0), { name: "concise" })
 
// Grade output quality with a model
llmJudge({
  criteria: "Does the answer fully address {{input}}? Output: {{output}}",
  model: "gpt-5-mini", // optional
  threshold: 0.7,       // optional
})

custom((run, testCase) => Score | Promise<Score>) gives you the full run result and case. llmJudge asks a model to grade the output against criteria (which interpolates {{input}}, {{expected}}, {{output}}) and returns { score, reason }. The scorer sends a chat-completions request whenever it runs. In Dawn CLI replay, that request goes through aimock and can be satisfied by fixtures; in live, record, or unmocked programmatic runs, it needs model credentials unless you inject a mocked fetchImpl.
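
As a rough picture of the placeholder interpolation, the sketch below expands {{input}}, {{expected}}, and {{output}} in a criteria string. fillCriteria is a hypothetical helper, not the library's implementation. Rendering a missing value as an empty string is an assumption.

```typescript
// Illustrative helper (not a Dawn API): expand the three supported placeholders.
function fillCriteria(
  criteria: string,
  values: { input: string; expected?: string; output: string },
): string {
  return criteria.replace(/\{\{(input|expected|output)\}\}/g, (_match, key: string) => {
    const value = values[key as "input" | "expected" | "output"]
    // A missing value becomes an empty string (assumption of this sketch).
    return value ?? ""
  })
}

const judgePrompt = fillCriteria(
  "Does the answer fully address {{input}}? Output: {{output}}",
  { input: "hello", output: "Hi! How can I help?" },
)
console.log(judgePrompt)
// prints "Does the answer fully address hello? Output: Hi! How can I help?"
```

Because the filled-in text is part of the judge request, a change to criteria or to the agent's output changes what a replay fixture has to match.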

Gating

After every case is scored, a gate turns the aggregated scores into pass/fail. Gate policies compose via the gate helper:

import { gate } from "@dawn-ai/evals"
 
gate: gate.mean(0.8)        // dataset-wide mean ≥ 0.8
gate: gate.passRate(0.9)    // ≥ 90% of cases pass (every scorer met its bar)
gate: gate.everyCase(0.6)   // every case's mean ≥ 0.6
gate: gate.perScorer()      // each scorer's mean ≥ that scorer's own threshold
 
// Combine them:
gate: gate.all(gate.mean(0.8), gate.perScorer())   // all must pass
gate: gate.any(gate.passRate(0.9), gate.mean(0.95)) // any one passing is enough

There are two shorthands:

threshold on the eval is sugar for gate.mean(threshold). If gate is also set, gate wins and threshold is ignored.
A per-scorer threshold sets that scorer's own bar — used by gate.perScorer() and to decide whether a case "passed" (every scorer must meet its bar; default bar 0.5).

If you set neither gate nor threshold, the eval is informational: it always passes and just reports scores. This lets you land an eval and watch the numbers before you commit to a bar.
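
To see how the policies differ on the same scores, here is a worked sketch that re-implements the arithmetic in plain TypeScript. The function names, the CaseScores shape, and the choice to compute the dataset-wide mean as the mean of per-case means are assumptions for illustration. The real gate helpers live in @dawn-ai/evals.

```typescript
// Scores for one case, keyed by scorer name (hypothetical shape for this sketch).
type CaseScores = Record<string, number>
// Per-scorer thresholds, keyed by scorer name.
type Bars = Record<string, number>

const DEFAULT_BAR = 0.5 // default per-scorer bar

function average(xs: number[]): number {
  return xs.length === 0 ? 0 : xs.reduce((sum, x) => sum + x, 0) / xs.length
}

// Mean of one case across its scorers.
function caseMean(c: CaseScores): number {
  return average(Object.values(c))
}

// A case passes when every scorer meets its own bar.
function casePasses(c: CaseScores, bars: Bars): boolean {
  return Object.entries(c).every(([scorer, score]) => score >= (bars[scorer] ?? DEFAULT_BAR))
}

function meanGate(cases: CaseScores[], min: number): boolean {
  return average(cases.map(caseMean)) >= min
}

function passRateGate(cases: CaseScores[], bars: Bars, min: number): boolean {
  return average(cases.map((c) => (casePasses(c, bars) ? 1 : 0))) >= min
}

function everyCaseGate(cases: CaseScores[], min: number): boolean {
  return cases.every((c) => caseMean(c) >= min)
}

function perScorerGate(cases: CaseScores[], bars: Bars): boolean {
  const scorers = Object.keys(cases[0] ?? {})
  return scorers.every(
    (scorer) => average(cases.map((c) => c[scorer] ?? 0)) >= (bars[scorer] ?? DEFAULT_BAR),
  )
}

// Four cases, two scorers; the second case misses the "contains" check.
const sampleCases: CaseScores[] = [
  { contains: 1, toolCalled: 1 },
  { contains: 0, toolCalled: 1 },
  { contains: 1, toolCalled: 1 },
  { contains: 1, toolCalled: 1 },
]
const sampleBars: Bars = { contains: 0.7, toolCalled: 1 }

console.log(meanGate(sampleCases, 0.8)) // true: dataset mean is 0.875
console.log(passRateGate(sampleCases, sampleBars, 0.9)) // false: 3 of 4 cases pass (0.75)
console.log(everyCaseGate(sampleCases, 0.6)) // false: the second case's mean is 0.5
console.log(perScorerGate(sampleCases, sampleBars)) // true: means are 0.75 and 1
```

The same data passes two policies and fails two others, which is why combining them with gate.all or gate.any changes what a single bad case costs you.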

Execution: replay vs live

Evals use the same aimock fixture system as agent tests — see Testing Agents for how fixtures, script(), and replay-vs-live mode work, and why live runs must never touch CI.

By default dawn eval runs in replay mode: each case replays its fixtures (a script() builder inline on the case, or a committed fixture file) through aimock, so the agent's model call is replaced. This is the mode to run in CI. If the eval includes llmJudge, the scorer still runs during replay, so include a fixture for its judge request or inject a mocked fetchImpl.

bash

dawn eval

Pass --live to run the real model locally instead:

bash

dawn eval --live

Live mode ignores per-case fixtures and calls the actual provider. It requires OPENAI_API_KEY and is meant for local runs while you tune prompts — never wire it into CI.

Recording fixtures with `--record`

dawn eval --record runs the suite against the real model (requires OPENAI_API_KEY; never run in CI) and writes the responses as committed fixture files you can replay later. For each case that has no inline script() fixtures, it writes a sibling file <evalBasename>.<caseSlug>.fixtures.json next to the .eval.ts. A plain dawn eval then auto-loads those files, so the agent run stays deterministic in CI without any code changes.

Cases that already have inline script() fixtures are left alone — --record skips them (logging skipped record (inline fixtures)) and never overwrites them. Inline fixtures stay authoritative. The gate still applies during --record: scores are aggregated and checked against the threshold after all cases run, but fixture files are written per-case before the verdict, so a gate failure never discards captured responses.

--record and --live are mutually exclusive.

bash

# Record real-model responses into sibling fixture files
dawn eval --record
 
# Replay those committed fixtures, including scorer model calls
dawn eval

Command reference

text

dawn eval [path] [--live] [--record] [--json [file]] [--cwd <path>]

[path] — narrow discovery to a subdirectory.
--live — run the real model (requires OPENAI_API_KEY); never use in CI.
--record — record real-model responses into sibling fixture files (requires OPENAI_API_KEY); never use in CI. Mutually exclusive with --live.
--json [file] — also write a JSON report. Defaults to .dawn/eval-report.json.
--cwd <path> — operate on a different app root.

A gated eval that fails causes a non-zero exit, so CI fails when quality drops below the bar. Informational evals never affect the exit code. See the CLI reference for the full command surface.
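
The exit-code rule can be summarised in a few lines. This is a sketch of the behaviour described above, not the CLI's actual code. EvalOutcome and exitCodeFor are hypothetical names, the gated and passed fields follow the EvalReport fields listed under Programmatic API, and the specific non-zero value 1 is an assumption.

```typescript
// Minimal slice of a per-eval result (hypothetical type for this sketch).
type EvalOutcome = { name: string; gated: boolean; passed: boolean }

// Illustrative helper (not a Dawn API): fold eval outcomes into a process exit code.
function exitCodeFor(outcomes: EvalOutcome[]): number {
  // Only gated evals can fail the run; informational evals are ignored.
  const failedGate = outcomes.some((outcome) => outcome.gated && !outcome.passed)
  return failedGate ? 1 : 0 // the exact non-zero value is an assumption
}

console.log(
  exitCodeFor([
    { name: "chat quality", gated: true, passed: true },
    { name: "tone", gated: false, passed: true },
  ]),
) // prints 0
console.log(exitCodeFor([{ name: "chat quality", gated: true, passed: false }])) // prints 1
```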

A full example

src/app/chat/evals/quality.eval.ts

import { contains, defineEval, gate, toolCalled } from "@dawn-ai/evals"
import { script } from "@dawn-ai/testing"
 
export default defineEval({
  name: "chat quality",
  dataset: [
    {
      name: "filters open items",
      input: "Filter open items",
      fixtures: script()
        .user("Filter open items")
        .callsTool("applyFilter", { status: "open" })
        .replies("Found 2 open items — let me know if I can help further."),
    },
  ],
  scorers: [
    contains("help", { threshold: 0.5 }),
    toolCalled("applyFilter", { threshold: 0.5 }),
  ],
  gate: gate.perScorer(),
})

Run it:

bash

dawn eval

The single case calls applyFilter and its reply contains "help", so both scorers score 1.00. gate.perScorer() checks each scorer's mean against its own bar — both are 1.00 ≥ 0.5, so the eval passes:

text

PASS chat quality › filters open items mean=1.00 [contains(help)=1.00 toolCalled(applyFilter)=1.00]
PASS chat quality mean=1.00

Commit the eval and any fixture files alongside your route. In CI, run dawn eval in replay mode with fixtures for every model call, including any llmJudge scorer calls. Locally, run dawn eval --live when you want to measure the agent against the real model before updating fixtures.

Your scaffolded app already has an eval

create-dawn-ai-app generates src/app/research/evals/research-quality.eval.ts next to the default research route, adds @dawn-ai/evals as a dev dependency, and wires an eval npm script. Run it right after scaffolding:

bash

npm run eval

This wraps dawn eval and runs the agent cases in replay mode by default. The generated eval has two dataset cases and four scorers:

toolCalled("searchCorpus", { threshold: 1 }) — asserts the agent called searchCorpus.
contains("[corpus/", { threshold: 1 }) — asserts the reply cites a corpus source.
A custom cites-source scorer: custom((run) => run.finalMessage.includes("corpus/") ? 1 : 0, { name: "cites-source", threshold: 1 }).
llmJudge({ criteria: "The report answers the question and cites at least one source document.", model: "gpt-5-mini", threshold: 0.7 }).

The gate is gate.all(gate.passRate(1), gate.perScorer()): every case must pass and every scorer must meet its threshold. The generated fixtures include the extra judge turns used by llmJudge, so the default npm run eval runs offline and needs no API key.

To run the agent cases against the real model locally:

bash

npm run eval -- --live

Grow it by adding cases to the dataset and more scorers.

Programmatic API

runEval from @dawn-ai/evals lets you drive an eval definition from your own script — useful for custom CI tooling, reporting pipelines, or running evals as part of a larger orchestration.

import { defineEval, contains, runEval } from "@dawn-ai/evals"
import { script } from "@dawn-ai/testing"
import { fileURLToPath } from "node:url"
 
const myEval = defineEval({
  name: "chat quality",
  dataset: [{ input: "hello", fixtures: script().user("hello").replies("Hi!") }],
  scorers: [contains("Hi", { threshold: 1 })],
  threshold: 1,
})
 
const report = await runEval(myEval, {
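  runCase: async (testCase) => {
    // Return an AgentRunResult for this case. `harness` is not defined in this
    // snippet: it is assumed to be created earlier in your script, typically
    // with createAgentHarness.
    return harness.run({ input: testCase.input, fixtures: testCase.fixtures })
  },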
  baseDir: fileURLToPath(new URL(".", import.meta.url)), // for resolving dataset paths
})
 
console.log(report.passed, report.mean)

runEval(def, options) accepts a RunEvalOptions with two fields: runCase (required — executes one case and returns an AgentRunResult) and baseDir (optional — base directory for resolving a string dataset path). It returns a Promise<EvalReport> containing { name, cases, byScorer, mean, gated, passed, reason? }.