Testing your Dawn agent
Agent tests should be deterministic and CI-safe — they must never call a live model. @dawn-ai/testing achieves this by intercepting the model's HTTP endpoint with aimock, a local mock that replays pre-written fixtures. Your tools, prompts, capabilities, and state all run normally; only the LLM is replaced.
The fixture is the source of truth. Never assert against a live model response.
Install
pnpm add -D @dawn-ai/testing vitest
First test
Ten lines. This is the full shape you'll use for most agent tests.
import { fileURLToPath } from "node:url"
import { afterAll, it } from "vitest"
import { createAgentHarness, expectFinalMessage, expectToolCalled, script } from "@dawn-ai/testing"
const appRoot = fileURLToPath(new URL("../..", import.meta.url))
const h = await createAgentHarness({ appRoot, route: "/chat#agent" })
afterAll(() => h.close())
it("filters open items", async () => {
  const run = await h.run({
    input: "Filter open items",
    fixtures: script()
      .user("Filter open items")
      .callsTool("applyFilter", { status: "open" })
      .replies("Found 2 open items."),
  })
  expectToolCalled(run, "applyFilter").withArgs({ status: "open" })
  expectFinalMessage(run).toContain("Found 2")
}, 60_000)
createAgentHarness boots aimock on a random port, runs typegen, and resolves your route, all in the same process. The route string is the Agent Protocol key: "/chat#agent" means the agent export from src/app/chat/index.ts. Call h.close() in afterAll to stop aimock and restore env vars.
Fixture script
script() compiles a multi-turn conversation into aimock fixtures. Each .user() starts a new turn group. .callsTool() tells aimock to respond with a tool call; .replies() tells it to respond with a text message.
script()
  .user("Summarize my project") // aimock matches when user message contains this
  .callsTool("readFile", { path: "README.md" }) // model responds: call readFile
  .replies("Here's a summary…") // model responds after tool result
  .build() // returns AimockFixture[]; or pass builder directly to h.run()
aimock matches fixtures by substring on the latest user message plus turnIndex (the count of assistant messages already in the thread). You do not need to call .build(): pass the builder directly to h.run({ fixtures }) and Dawn unwraps it for you.
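As a mental model for the matching rule, here is a minimal sketch of substring-plus-turnIndex lookup. The SketchFixture and SketchMessage types and the findFixture helper are hypothetical names invented for illustration; the real aimock fixture format and matching code may differ, and the turnIndex values assigned here (0 for the tool call, 1 for the reply) are an assumption.

```typescript
// Hypothetical shapes for illustration only; not the real aimock fixture format.
interface SketchFixture {
  userContains: string; // substring expected in the latest user message
  turnIndex: number; // assistant messages already in the thread when this fixture applies
  response:
    | { kind: "tool_call"; name: string; args: Record<string, unknown> }
    | { kind: "text"; content: string };
}

interface SketchMessage {
  role: "user" | "assistant" | "tool";
  content: string;
}

// Pick the fixture whose substring occurs in the latest user message
// and whose turnIndex equals the number of assistant messages so far.
function findFixture(
  fixtures: SketchFixture[],
  thread: SketchMessage[],
): SketchFixture | undefined {
  const lastUser = [...thread].reverse().find((m) => m.role === "user");
  if (!lastUser) return undefined;
  const turnIndex = thread.filter((m) => m.role === "assistant").length;
  return fixtures.find(
    (f) => lastUser.content.includes(f.userContains) && f.turnIndex === turnIndex,
  );
}

const fixtures: SketchFixture[] = [
  {
    userContains: "Summarize my project",
    turnIndex: 0,
    response: { kind: "tool_call", name: "readFile", args: { path: "README.md" } },
  },
  {
    userContains: "Summarize my project",
    turnIndex: 1,
    response: { kind: "text", content: "Here's a summary…" },
  },
];

// First model request: only the user message is in the thread.
const firstRequest: SketchMessage[] = [
  { role: "user", content: "Summarize my project, please" },
];

// Second model request: the tool call and its result have been appended.
const secondRequest: SketchMessage[] = [
  ...firstRequest,
  { role: "assistant", content: "" },
  { role: "tool", content: "# README" },
];

const first = findFixture(fixtures, firstRequest);
const second = findFixture(fixtures, secondRequest);
```

Under these assumptions the same user text selects the tool call on the first request and the text reply on the second, which is why one .user() line can be followed by several responses.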
Recipes
Assert a tool was called with specific args
expectToolCalled(run, "applyFilter").withArgs({ status: "open" })
withArgs does a partial (subset) match: extra args are ignored. Chain .times(n) to assert exact call count, or .never() to assert the tool was not called.
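To illustrate what a partial (subset) match means, here is a minimal sketch. The matchesSubset helper is a hypothetical name, not part of @dawn-ai/testing, and how the real withArgs compares nested objects and arrays is an assumption (this sketch recurses into objects and requires arrays to match element by element).

```typescript
// Hypothetical helper: true when every key in `expected` exists in `actual`
// with a matching value. Extra keys in `actual` are ignored.
function matchesSubset(expected: unknown, actual: unknown): boolean {
  if (Array.isArray(expected)) {
    // Assumption: arrays must have the same length and match per element.
    return (
      Array.isArray(actual) &&
      actual.length === expected.length &&
      expected.every((item, i) => matchesSubset(item, actual[i]))
    );
  }
  if (typeof expected === "object" && expected !== null) {
    if (typeof actual !== "object" || actual === null || Array.isArray(actual)) {
      return false;
    }
    const expectedRecord = expected as Record<string, unknown>;
    const actualRecord = actual as Record<string, unknown>;
    return Object.keys(expectedRecord).every(
      (key) => key in actualRecord && matchesSubset(expectedRecord[key], actualRecord[key]),
    );
  }
  // Primitives compare by strict equality.
  return expected === actual;
}

// Example args the tool might have been called with (illustrative values).
const calledWith = { status: "open", limit: 20 };

const subsetMatches = matchesSubset({ status: "open" }, calledWith);
const valueMismatch = matchesSubset({ status: "closed" }, calledWith);
const missingKey = matchesSubset({ assignee: "me" }, calledWith);
```

Here subsetMatches is true because the extra limit key is ignored, while valueMismatch and missingKey are both false.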
Assert the final message
expectFinalMessage(run).toContain("Found 2")
expectFinalMessage(run).toMatch(/found \d+ items/i)
expectFinalMessage(run).toEqual("Found 2 open items.")
Assert streamed tokens arrived
expectStreamedTokens(run) // throws if zero tokens were streamed
Multi-turn: run the agent twice
const run1 = await h.run({ input: "List tasks", fixtures: script().user("List tasks").replies("You have 3 tasks.") })
h.reset() // start a fresh thread; aimock port stays stable
const run2 = await h.run({ input: "Mark first done", fixtures: script().user("Mark first done").replies("Done.") })
Call h.reset() between turns when you want a clean thread. Without it, follow-up messages accumulate in the same thread (useful for testing multi-turn memory).
Tool output offloading
When large tool outputs are offloaded to a stub file, assert the offload marker:
expectOffloaded(run, "generateReport")
Assert tools were called in order
expectToolSequence(run, ["searchCorpus", "readDoc", "writeFile"])
Asserts that the named tools were called in that order (subsequence match by default; other tools may appear between them). Pass { strict: true } to require contiguity: the tools must appear consecutively with nothing else in between.
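To make the difference between the two modes concrete, here is a minimal sketch of both checks over a plain list of tool names. The helper names isSubsequence and isContiguousRun and the listDocs tool are hypothetical, and the sketch assumes that strict mode still allows other calls before or after the contiguous run; the real expectToolSequence may behave differently at the edges.

```typescript
// Default mode (sketch): expected names appear in order,
// but other tool calls may sit between them.
function isSubsequence(expected: string[], actual: string[]): boolean {
  let next = 0;
  for (const name of actual) {
    if (next < expected.length && name === expected[next]) next++;
  }
  return next === expected.length;
}

// Strict mode (sketch): expected names appear as one contiguous run.
function isContiguousRun(expected: string[], actual: string[]): boolean {
  if (expected.length === 0) return true;
  for (let start = 0; start + expected.length <= actual.length; start++) {
    if (expected.every((name, i) => actual[start + i] === name)) return true;
  }
  return false;
}

// Illustrative call order; "listDocs" is a made-up tool sitting in between.
const observed = ["searchCorpus", "listDocs", "readDoc", "writeFile"];
const wanted = ["searchCorpus", "readDoc", "writeFile"];

const looseOk = isSubsequence(wanted, observed);
const strictOk = isContiguousRun(wanted, observed);
const strictTailOk = isContiguousRun(["readDoc", "writeFile"], observed);
```

With this call order looseOk is true, strictOk is false because listDocs interrupts the run, and strictTailOk is true because readDoc and writeFile are adjacent.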
expectToolSequence(run, ["validate", "save"], { strict: true })
Assert no tool returned an error
expectNoToolErrors(run)
Asserts that no tool returned an error result. HITL permission interrupts are not counted as errors.
To inspect tool results manually, use run.toolResults — a ReadonlyArray<ObservedToolResult> derived from the final conversation messages by deriveToolResults (also exported). Each entry has the shape { name: string, status?: "error" | "success", content: unknown, isError: boolean }. This is useful for asserting on specific error messages or content when expectNoToolErrors is too broad.
import { expect } from "vitest"
import { deriveToolResults } from "@dawn-ai/testing" // also exported, for deriving results from messages yourself
const failing = run.toolResults.filter((r) => r.isError)
expect(failing).toHaveLength(0)
State assertions
expectState(run).messages.toHaveLength(3)
expectState(run).field("runningSummary").toBeTruthy()
expectState(run).field("todos").toEqual([{ content: "ship it", status: "pending" }])
run.state is the full agent state after the run: the same object a checkpointer would persist.
Record-first workflow
For complex conversations you don't want to hand-write, record a real provider interaction locally:
import { record } from "@dawn-ai/testing"
// Run once locally — requires OPENAI_API_KEY in env. Never run in CI.
record({ out: "test/fixtures/my-scenario.json" })
Then hand-trim the JSON to keep only the turns that matter, commit it, and replay in CI. CI replays strict and read-only; it never re-records. Add a git diff --exit-code test/fixtures/ guard in CI to catch uncommitted fixture edits.
The three execution modes
createAgentHarness accepts three modes via the mode option (default: "in-process"), though only "in-process" is implemented today:
- "in-process" (default): runs your tools, prompts, capabilities, and state inside the test process via Dawn's runtime. Fastest; covers the vast majority of assertions. No port binding.
- "http-inject": not yet implemented. Passing mode: "http-inject" throws at harness construction time. injectAgentProtocol is exported as a standalone helper for advanced use, but the harness mode option does not support it yet.
- "subprocess": not yet implemented. Passing mode: "subprocess" throws at harness construction time. startSubprocessApp is exported as a standalone helper for advanced use, but the harness mode option does not support it yet.
For now, write all harness tests with the default "in-process" mode. The standalone injectAgentProtocol and startSubprocessApp exports are available for custom orchestration outside the harness API.
CI setup
No extra setup needed. @dawn-ai/testing is CI-safe by default: aimock runs on a random port and stops when the harness closes. The only thing to guard is that fixture files are committed:
- run: pnpm exec vitest --run
- run: git diff --exit-code test/fixtures/ # catch uncommitted fixture edits
Fixture files: author, commit, replay
For most tests an inline script() is all you need. When a scenario grows complex — multi-tool chains, multi-turn conversations — save the fixture to a JSON file so it is shared, version-controlled, and auditable.
Author inline and snapshot to a file
writeFixtures serialises a script() builder (or a bare FixtureSet) to { "fixtures": [...] } JSON and creates parent directories automatically.
import { fileURLToPath } from "node:url"
import { writeFixtures, loadFixtures, script } from "@dawn-ai/testing"
const fixturesPath = fileURLToPath(new URL("fixtures/filter-open.fixture.json", import.meta.url))
// Run once to mint the file, then commit it.
writeFixtures(
  fixturesPath,
  script()
    .user("Filter open items")
    .callsTool("applyFilter", { status: "open" })
    .replies("Found 2 open items."),
)
After running this once, commit the resulting .fixture.json. From that point on, tests load it rather than re-building the script every time.
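As a rough picture of what ends up on disk, the sketch below writes and reads a { "fixtures": [...] } file with Node's standard library. writeFixturesSketch and loadFixturesSketch are hypothetical stand-ins, not the real writeFixtures and loadFixtures, and the fixture entry is a placeholder because the actual entry shape is not shown here.

```typescript
import { mkdirSync, mkdtempSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { dirname, join } from "node:path";

// Placeholder entry type; the real fixture entry shape is an unknown here.
type SketchFixture = Record<string, unknown>;

// Hypothetical stand-in: create parent directories, then write
// the entries wrapped as { "fixtures": [...] } JSON.
function writeFixturesSketch(path: string, fixtures: SketchFixture[]): void {
  mkdirSync(dirname(path), { recursive: true });
  writeFileSync(path, JSON.stringify({ fixtures }, null, 2) + "\n", "utf8");
}

// Hypothetical stand-in: read the file back and unwrap the array.
function loadFixturesSketch(path: string): SketchFixture[] {
  const parsed = JSON.parse(readFileSync(path, "utf8")) as { fixtures: SketchFixture[] };
  return parsed.fixtures;
}

// Write into a fresh temp directory; "nested" does not exist beforehand.
const tempDir = mkdtempSync(join(tmpdir(), "fixtures-sketch-"));
const fixtureFile = join(tempDir, "nested", "filter-open.fixture.json");

writeFixturesSketch(fixtureFile, [{ note: "placeholder fixture entry" }]);
const roundTripped = loadFixturesSketch(fixtureFile);
```

Because the file is plain JSON with a single top-level fixtures key, diffs stay readable in review and hand-trimming a fixture is a matter of deleting array entries.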
Record from a real model (local only)
When you want the fixture to match what a real model would actually say, record a live interaction:
import { record } from "@dawn-ai/testing"
// Run once locally — requires OPENAI_API_KEY. Never run in CI.
record({ out: "test/fixtures/filter-open.fixture.json" })
record() calls aimock's --record mode, which proxies through to the real OpenAI API and writes the captured exchange to disk. Trim the JSON to the turns that matter, then commit it.
Replay a fixture file in tests
Pass loadFixtures(path) wherever fixtures is accepted — at harness creation or per run:
import { fileURLToPath } from "node:url"
import { afterAll, it } from "vitest"
import {
  createAgentHarness,
  expectFinalMessage,
  expectToolCalled,
  loadFixtures,
} from "@dawn-ai/testing"
const appRoot = fileURLToPath(new URL("../..", import.meta.url))
const fixturesPath = fileURLToPath(new URL("fixtures/filter-open.fixture.json", import.meta.url))
// Fixture at harness level: used for every run unless overridden.
const h = await createAgentHarness({
  appRoot,
  route: "/chat#agent",
  fixtures: loadFixtures(fixturesPath),
})
afterAll(() => h.close())
it("filters open items", async () => {
  // Per-run override: loadFixtures() returns a plain FixtureSet.
  const run = await h.run({
    input: "Filter open items",
    fixtures: loadFixtures(fixturesPath),
  })
  expectToolCalled(run, "applyFilter").withArgs({ status: "open" })
  expectFinalMessage(run).toContain("Found 2")
}, 60_000)
CI rules. Commit every .fixture.json that tests depend on. CI replays fixtures strict and read-only; aimock never re-records in replay mode. Add a drift guard to catch edits that were not committed:
- run: git diff --exit-code test/fixtures/
Live mode (real model)
Pass live: true to run the agent against the real OpenAI API via aimock's proxy-record mode. The model responses are real; run.systemPrompt is still captured from the intercepted requests.
import { fileURLToPath } from "node:url"
import { afterAll, it } from "vitest"
import {
  createAgentHarness,
  expectFinalMessage,
  expectNoToolErrors,
  expectSystemPrompt,
  expectToolCalled,
  expectToolSequence,
} from "@dawn-ai/testing"
const appRoot = fileURLToPath(new URL("../..", import.meta.url))
const h = await createAgentHarness({ appRoot, route: "/chat#agent", live: true })
afterAll(() => h.close())
it("filters open items against real model", async () => {
  const run = await h.run({ input: "Filter open items" })
  // Real models are nondeterministic, so assert loosely.
  expectToolCalled(run, "applyFilter")
  expectFinalMessage(run).toMatch(/open/i)
  expectSystemPrompt(run).toContain("You are")
}, 120_000)
it("calls tools in order and cleanly", async () => {
  const run = await h.run({ input: "…" })
  expectToolSequence(run, ["searchCorpus", "readDoc", "writeFile"])
  expectNoToolErrors(run)
}, 120_000)
Requirements and constraints.
- OPENAI_API_KEY must be set in the environment. createAgentHarness throws immediately without it.
- Real models are nondeterministic. Assert on presence and shape, not exact values: use expectToolCalled(run, name), expectFinalMessage(run).toContain(...), expectSystemPrompt(run).toContain(...). Avoid withArgs with exact values or expectFinalMessage(run).toEqual(...).
- Gate live tests so CI skips them automatically:
it.skipIf(!process.env.OPENAI_API_KEY)("filters open items against real model", async () => {
  // ...
})
Do not put your API key in CI. Live tests are local-development tools for validating prompt changes against a real model before you mint or update fixture files.
Your scaffolded app already has a test
create-dawn-ai-app generates test/research.test.ts that imports @dawn-ai/testing and covers the default /research#agent route. Run it immediately after scaffolding:
npm test
The generated file has four tests. The first, reproduced verbatim here, verifies the corpus search and citation path:
import { fileURLToPath } from "node:url"
import { afterAll, it } from "vitest"
import type { FixtureSet } from "@dawn-ai/testing"
import {
  createAgentHarness,
  expectFinalMessage,
  expectInterrupt,
  expectOffloaded,
  expectSubagent,
  expectToolCalled,
  script,
} from "@dawn-ai/testing"
const appRoot = fileURLToPath(new URL("..", import.meta.url))
const h = await createAgentHarness({ appRoot, route: "/research#agent" })
afterAll(() => h.close())
it("searches the corpus and writes a cited answer", async () => {
  h.reset()
  const run = await h.run({
    input: "What are common agent architectures?",
    fixtures: script()
      .user("What are common agent architectures?")
      .callsTool("searchCorpus", { query: "agent architectures" })
      .callsTool("readDoc", { path: "corpus/agent-architectures.md" })
      .replies("ReAct and plan-and-execute are common. [corpus/agent-architectures.md]"),
  })
  expectToolCalled(run, "searchCorpus")
  expectToolCalled(run, "readDoc")
  expectFinalMessage(run).toContain("[corpus/")
}, 60_000)
The remaining three tests cover: researcher subagent dispatch (expectSubagent); offloading of a large readDoc result (expectOffloaded); and the HITL permission gate for runBash with resume (expectInterrupt + harness.resume). The last uses its own dedicated harness so the interrupt is clean.
Grow it by:
- Adding more script() scenarios: one it block per behaviour you want to pin.
- Committing fixtures for complex flows: use writeFixtures / record + loadFixtures for multi-turn or multi-tool scenarios that are tedious to hand-write.