Testing your Dawn agent

Agent tests should be deterministic and CI-safe — they must never call a live model. @dawn-ai/testing achieves this by intercepting the model's HTTP endpoint with aimock, a local mock that replays pre-written fixtures. Your tools, prompts, capabilities, and state all run normally; only the LLM is replaced.

The fixture is the source of truth. Never assert against a live model response.

Install

bash
pnpm add -D @dawn-ai/testing vitest

First test

This is the full shape you'll use for most agent tests.

src/app/chat/agent.test.ts
import { fileURLToPath } from "node:url"
import { afterAll, it } from "vitest"
import { createAgentHarness, expectFinalMessage, expectToolCalled, script } from "@dawn-ai/testing"
 
const appRoot = fileURLToPath(new URL("../../..", import.meta.url))
const h = await createAgentHarness({ appRoot, route: "/chat#agent" })
afterAll(() => h.close())
 
it("filters open items", async () => {
  const run = await h.run({
    input: "Filter open items",
    fixtures: script()
      .user("Filter open items")
      .callsTool("applyFilter", { status: "open" })
      .replies("Found 2 open items."),
  })
  expectToolCalled(run, "applyFilter").withArgs({ status: "open" })
  expectFinalMessage(run).toContain("Found 2")
}, 60_000)

createAgentHarness boots aimock on a random port, runs typegen, and resolves your route — all in the same process. The route string is the Agent Protocol key: "/chat#agent" means the agent export from src/app/chat/index.ts. Call h.close() in afterAll to stop aimock and restore env vars.
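To make the route-key convention concrete, here is a minimal sketch of how a key such as "/chat#agent" could map to a module path and an export name. The parseRouteKey helper and the RouteTarget shape are hypothetical and are not part of @dawn-ai/testing; only the src/app/<route>/index.ts convention comes from the paragraph above.

```typescript
// Hypothetical helper: illustrates the "/path#export" convention only.
interface RouteTarget {
  modulePath: string // file expected to hold the route
  exportName: string // named export to load from that file
}

function parseRouteKey(routeKey: string): RouteTarget {
  const hashIndex = routeKey.indexOf("#")
  if (!routeKey.startsWith("/") || hashIndex === -1) {
    throw new Error(`Expected "/path#export", got "${routeKey}"`)
  }
  const routePath = routeKey.slice(1, hashIndex)
  const exportName = routeKey.slice(hashIndex + 1)
  if (routePath === "" || exportName === "") {
    throw new Error(`Route key "${routeKey}" is missing a path or an export name`)
  }
  return { modulePath: `src/app/${routePath}/index.ts`, exportName }
}

const target = parseRouteKey("/chat#agent")
console.log(target.modulePath, target.exportName)
```

How the real harness resolves the key may differ; the sketch only shows which part of the string names the directory and which names the export.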

Fixture script

script() compiles a multi-turn conversation into aimock fixtures. Each .user() starts a new turn group. .callsTool() tells aimock to respond with a tool call; .replies() tells it to respond with a text message.

ts
script()
  .user("Summarize my project")      // aimock matches when user message contains this
  .callsTool("readFile", { path: "README.md" })  // model responds: call readFile
  .replies("Here's a summary…")      // model responds after tool result
  .build()                           // returns AimockFixture[]; or pass builder directly to h.run()

aimock matches fixtures by substring on the latest user message plus turnIndex (count of assistant messages already in the thread). You do not need to call .build() — pass the builder directly to h.run({ fixtures }) and Dawn unwraps it for you.
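The matching rule can be sketched as a small lookup. This is an illustration of the rule as described above, not aimock's implementation: the SketchMessage and SketchFixture shapes and the findFixture function are assumptions made up for this example, and aimock's real fixture format may differ.

```typescript
// Hypothetical shapes for illustration; aimock's real fixture format may differ.
interface SketchMessage {
  role: "user" | "assistant" | "tool"
  content: string
}

interface SketchFixture {
  userContains: string // substring expected in the latest user message
  turnIndex: number // assistant messages already in the thread
  response: string // stand-in for the mocked model output
}

function findFixture(
  fixtures: SketchFixture[],
  thread: SketchMessage[],
): SketchFixture | undefined {
  // Most recent user message, found by scanning a reversed copy.
  const latestUser = [...thread].reverse().find((m) => m.role === "user")
  if (latestUser === undefined) return undefined
  // turnIndex is the number of assistant messages sent so far.
  const turnIndex = thread.filter((m) => m.role === "assistant").length
  return fixtures.find(
    (f) => latestUser.content.includes(f.userContains) && f.turnIndex === turnIndex,
  )
}

const sketchFixtures: SketchFixture[] = [
  { userContains: "Summarize my project", turnIndex: 0, response: "tool call: readFile" },
  { userContains: "Summarize my project", turnIndex: 1, response: "text reply" },
]

const firstRequest: SketchMessage[] = [
  { role: "user", content: "Summarize my project, please" },
]
const secondRequest: SketchMessage[] = [
  ...firstRequest,
  { role: "assistant", content: "tool call: readFile" },
  { role: "tool", content: "# README" },
]

console.log(findFixture(sketchFixtures, firstRequest)?.response)
console.log(findFixture(sketchFixtures, secondRequest)?.response)
```

The first request has no assistant messages yet, so it selects the tool-call fixture. Once the tool call and its result are in the thread, the same user text selects the text reply.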

Recipes

Assert a tool was called with specific args

ts
expectToolCalled(run, "applyFilter").withArgs({ status: "open" })

withArgs does a partial (subset) match — extra args are ignored. Chain .times(n) to assert exact call count, or .never() to assert the tool was not called.
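As a rough illustration of what a partial (subset) match means, the sketch below checks that every expected key is present with an equal value and ignores everything else. isSubsetMatch is a hypothetical helper, not the library's matcher; in particular, comparing values through JSON.stringify is an assumption made for brevity and is sensitive to key order in nested objects.

```typescript
type ToolArgs = Record<string, unknown>

// True when every key in `expected` exists in `actual` with an equal value.
// Keys that appear only in `actual` are ignored.
function isSubsetMatch(expected: ToolArgs, actual: ToolArgs): boolean {
  return Object.entries(expected).every(
    ([key, value]) =>
      key in actual && JSON.stringify(actual[key]) === JSON.stringify(value),
  )
}

console.log(isSubsetMatch({ status: "open" }, { status: "open", limit: 10 })) // true
console.log(isSubsetMatch({ status: "open" }, { status: "closed" })) // false
console.log(isSubsetMatch({ status: "open" }, {})) // false
```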

Assert the final message

ts
expectFinalMessage(run).toContain("Found 2")
expectFinalMessage(run).toMatch(/found \d+ items/i)
expectFinalMessage(run).toEqual("Found 2 open items.")

Assert streamed tokens arrived

ts
expectStreamedTokens(run)  // throws if zero tokens were streamed

Multi-turn: run the agent twice

ts
const run1 = await h.run({ input: "List tasks", fixtures: script().user("List tasks").replies("You have 3 tasks.") })
h.reset()  // start a fresh thread; aimock port stays stable
const run2 = await h.run({ input: "Mark first done", fixtures: script().user("Mark first done").replies("Done.") })

Call h.reset() between turns when you want a clean thread. Without it, follow-up messages accumulate in the same thread (useful for testing multi-turn memory).

Tool output offloading

When large tool outputs are offloaded to a stub file, assert the offload marker:

ts
expectOffloaded(run, "generateReport")

Assert tools were called in order

ts
expectToolSequence(run, ["searchCorpus", "readDoc", "writeFile"])

Asserts that the named tools were called in that order (subsequence match by default — other tools may appear between them). Pass { strict: true } to require contiguity: the tools must appear consecutively with nothing else in between.

ts
expectToolSequence(run, ["validate", "save"], { strict: true })
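The difference between the two matching modes can be sketched in a few lines. matchesSequence is a hypothetical stand-in for illustration, not the code behind expectToolSequence. It assumes that strict mode looks for the expected names as one consecutive run and still allows other calls before or after that run, which the text above does not specify.

```typescript
function matchesSequence(called: string[], expected: string[], strict = false): boolean {
  if (strict) {
    // Contiguous: the expected names must appear as one consecutive run.
    if (expected.length === 0) return true
    for (let start = 0; start + expected.length <= called.length; start++) {
      if (expected.every((name, i) => called[start + i] === name)) return true
    }
    return false
  }
  // Subsequence: order must hold, but other calls may sit in between.
  let next = 0
  for (const name of called) {
    if (next < expected.length && name === expected[next]) next++
  }
  return next === expected.length
}

const calls = ["searchCorpus", "logEvent", "readDoc", "writeFile"]
const wanted = ["searchCorpus", "readDoc", "writeFile"]

console.log(matchesSequence(calls, wanted)) // true
console.log(matchesSequence(calls, wanted, true)) // false
console.log(matchesSequence(calls, ["readDoc", "writeFile"], true)) // true
```

The logEvent call (an invented tool name) breaks contiguity for the three-tool expectation but leaves the order intact, so only the strict check fails.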

Assert no tool returned an error

ts
expectNoToolErrors(run)

Asserts that no tool returned an error result. HITL permission interrupts are not counted as errors.

To inspect tool results manually, use run.toolResults — a ReadonlyArray<ObservedToolResult> derived from the final conversation messages by deriveToolResults (also exported). Each entry has the shape { name: string, status?: "error" | "success", content: unknown, isError: boolean }. This is useful for asserting on specific error messages or content when expectNoToolErrors is too broad.

ts
import { expect } from "vitest"
 
// Collect every tool result flagged as an error, then assert there are none.
const failing = run.toolResults.filter((r) => r.isError)
expect(failing).toHaveLength(0)
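For a narrower check than "no errors anywhere", one option is to filter the results by tool name first. The sketch below declares the ObservedToolResult shape locally, copied from the description above, so that it runs without the package. errorsFor is a hypothetical helper and the sample data is invented for illustration.

```typescript
// Shape as described in the text; declared locally so the sketch is self-contained.
interface ObservedToolResult {
  name: string
  status?: "error" | "success"
  content: unknown
  isError: boolean
}

// Hypothetical helper: error results for one named tool.
function errorsFor(
  results: ReadonlyArray<ObservedToolResult>,
  toolName: string,
): ObservedToolResult[] {
  return results.filter((r) => r.name === toolName && r.isError)
}

// Invented sample data standing in for run.toolResults.
const sampleResults: ReadonlyArray<ObservedToolResult> = [
  { name: "searchCorpus", status: "success", content: ["doc-1"], isError: false },
  { name: "readDoc", status: "error", content: "ENOENT: no such file", isError: true },
]

const readDocErrors = errorsFor(sampleResults, "readDoc")
console.log(readDocErrors.length) // 1
console.log(String(readDocErrors[0]?.content).includes("ENOENT")) // true
```

In a real test, run.toolResults would take the place of sampleResults, and the two console.log calls would become assertions.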

State assertions

ts
expectState(run).messages.toHaveLength(3)
expectState(run).field("runningSummary").toBeTruthy()
expectState(run).field("todos").toEqual([{ content: "ship it", status: "pending" }])

run.state is the full agent state after the run — the same object a checkpointer would persist.

Record-first workflow

For complex conversations you don't want to hand-write, record a real provider interaction locally:

ts
import { record } from "@dawn-ai/testing"
 
// Run once locally — requires OPENAI_API_KEY in env. Never run in CI.
record({ out: "test/fixtures/my-scenario.json" })

Then hand-trim the JSON to keep only the turns that matter, commit it, and replay in CI. CI replays strict and read-only — it never re-records. Add a git diff --exit-code test/fixtures/ guard in CI to catch uncommitted fixture edits.

The three execution modes

createAgentHarness supports three modes via the mode option (default: "in-process"):

  • "in-process" (default): runs your tools, prompts, capabilities, and state inside the test process via Dawn's runtime. Fastest; covers the vast majority of assertions. No port binding.
  • "http-inject": not yet implemented. Passing mode: "http-inject" throws at harness construction time. injectAgentProtocol is exported as a standalone helper for advanced use, but the harness mode option does not support it yet.
  • "subprocess": not yet implemented. Passing mode: "subprocess" throws at harness construction time. startSubprocessApp is exported as a standalone helper for advanced use, but the harness mode option does not support it yet.

For now, write all harness tests with the default "in-process" mode. The standalone injectAgentProtocol and startSubprocessApp exports are available for custom orchestration outside the harness API.

CI setup

No extra setup needed. @dawn-ai/testing is CI-safe by default: aimock runs on a random port and stops when the harness closes. The only thing to guard is that fixture files are committed:

.github/workflows/ci.yml
- run: pnpm exec vitest --run
- run: git diff --exit-code test/fixtures/   # catch uncommitted fixture edits

Fixture files: author, commit, replay

For most tests an inline script() is all you need. When a scenario grows complex — multi-tool chains, multi-turn conversations — save the fixture to a JSON file so it is shared, version-controlled, and auditable.

Author inline and snapshot to a file

writeFixtures serialises a script() builder (or a bare FixtureSet) to { "fixtures": [...] } JSON and creates parent directories automatically.

test/agent.test.ts
import { fileURLToPath } from "node:url"
import { writeFixtures, loadFixtures, script } from "@dawn-ai/testing"
 
const fixturesPath = fileURLToPath(new URL("fixtures/filter-open.fixture.json", import.meta.url))
 
// Run once to mint the file, then commit it.
writeFixtures(
  fixturesPath,
  script()
    .user("Filter open items")
    .callsTool("applyFilter", { status: "open" })
    .replies("Found 2 open items."),
)

After running this once, commit the resulting .fixture.json. From that point on, tests load it rather than re-building the script every time.
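To show what "serialise to { "fixtures": [...] } and create parent directories" amounts to, here is a minimal sketch built only on Node's standard library. writeFixtureFile is a hypothetical stand-in, not the real writeFixtures, and it treats fixture entries as opaque values because their exact shape is not described here.

```typescript
import { mkdirSync, mkdtempSync, readFileSync, writeFileSync } from "node:fs"
import { tmpdir } from "node:os"
import { dirname, join } from "node:path"

// Hypothetical stand-in for writeFixtures; fixture entries are opaque here.
function writeFixtureFile(path: string, fixtures: unknown[]): void {
  // recursive: true creates any missing parent directories and is a no-op if they exist.
  mkdirSync(dirname(path), { recursive: true })
  writeFileSync(path, JSON.stringify({ fixtures }, null, 2) + "\n", "utf8")
}

// Write into a fresh temp directory so the sketch does not touch the project tree.
const sketchDir = mkdtempSync(join(tmpdir(), "fixtures-sketch-"))
const sketchPath = join(sketchDir, "nested", "example.fixture.json")
writeFixtureFile(sketchPath, [{ note: "placeholder entry" }])

const roundTripped = JSON.parse(readFileSync(sketchPath, "utf8")) as { fixtures: unknown[] }
console.log(roundTripped.fixtures.length) // 1
```

Writing pretty-printed JSON with a trailing newline keeps diffs of committed fixture files readable. That formatting is a choice made for this sketch, not a documented property of writeFixtures.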

Record from a real model (local only)

When you want the fixture to match what a real model would actually say, record a live interaction:

test/agent.test.ts
import { record } from "@dawn-ai/testing"
 
// Run once locally — requires OPENAI_API_KEY. Never run in CI.
record({ out: "test/fixtures/filter-open.fixture.json" })

record() calls aimock's --record mode, which proxies through to the real OpenAI API and writes the captured exchange to disk. Trim the JSON to the turns that matter, then commit it.

Replay a fixture file in tests

Pass loadFixtures(path) wherever fixtures is accepted — at harness creation or per run:

test/agent.test.ts
import { fileURLToPath } from "node:url"
import { afterAll, it } from "vitest"
import {
  createAgentHarness,
  expectFinalMessage,
  expectToolCalled,
  loadFixtures,
} from "@dawn-ai/testing"
 
const appRoot = fileURLToPath(new URL("..", import.meta.url))
const fixturesPath = fileURLToPath(new URL("fixtures/filter-open.fixture.json", import.meta.url))
 
// Fixture at harness level — used for every run unless overridden.
const h = await createAgentHarness({
  appRoot,
  route: "/chat#agent",
  fixtures: loadFixtures(fixturesPath),
})
afterAll(() => h.close())
 
it("filters open items", async () => {
  // Per-run override — loadFixtures() returns a plain FixtureSet.
  const run = await h.run({
    input: "Filter open items",
    fixtures: loadFixtures(fixturesPath),
  })
  expectToolCalled(run, "applyFilter").withArgs({ status: "open" })
  expectFinalMessage(run).toContain("Found 2")
}, 60_000)

CI rules. Commit every .fixture.json that tests depend on. CI replays fixtures strict and read-only — aimock never re-records in replay mode. Add a drift guard to catch edits that were not committed:

.github/workflows/ci.yml
- run: git diff --exit-code test/fixtures/

Live mode (real model)

Pass live: true to run the agent against the real OpenAI API via aimock's proxy-record mode. The model responses are real; run.systemPrompt is still captured from the intercepted requests.

test/agent.live.test.ts
import { fileURLToPath } from "node:url"
import { afterAll, it } from "vitest"
import {
  createAgentHarness,
  expectFinalMessage,
  expectNoToolErrors,
  expectSystemPrompt,
  expectToolCalled,
  expectToolSequence,
} from "@dawn-ai/testing"
 
const appRoot = fileURLToPath(new URL("..", import.meta.url))
const h = await createAgentHarness({ appRoot, route: "/chat#agent", live: true })
afterAll(() => h.close())
 
it("filters open items against real model", async () => {
  const run = await h.run({ input: "Filter open items" })
  // Real models are nondeterministic — assert loosely.
  expectToolCalled(run, "applyFilter")
  expectFinalMessage(run).toMatch(/open/i)
  expectSystemPrompt(run).toContain("You are")
}, 120_000)
 
it("calls tools in order and cleanly", async () => {
  const run = await h.run({ input: "…" })
  expectToolSequence(run, ["searchCorpus", "readDoc", "writeFile"])
  expectNoToolErrors(run)
}, 120_000)

Requirements and constraints.

  • OPENAI_API_KEY must be set in the environment. createAgentHarness throws immediately without it.
  • Real models are nondeterministic. Assert on presence and shape, not exact values: use expectToolCalled(run, name), expectFinalMessage(run).toContain(...), expectSystemPrompt(run).toContain(...). Avoid withArgs with exact values or expectFinalMessage(run).toEqual(...).
  • Gate live tests so CI skips them automatically:
ts
it.skipIf(!process.env.OPENAI_API_KEY)("filters open items against real model", async () => {
  // ...
})

Do not put your API key in CI. Live tests are local-development tools for validating prompt changes against a real model before you mint or update fixture files.

Your scaffolded app already has a test

create-dawn-ai-app generates test/research.test.ts that imports @dawn-ai/testing and covers the default /research#agent route. Run it immediately after scaffolding:

bash
npm test

The generated file has four tests. The first, reproduced verbatim here, verifies the corpus search and citation path:

test/research.test.ts
import { fileURLToPath } from "node:url"
import { afterAll, it } from "vitest"
import type { FixtureSet } from "@dawn-ai/testing"
import {
  createAgentHarness,
  expectFinalMessage,
  expectInterrupt,
  expectOffloaded,
  expectSubagent,
  expectToolCalled,
  script,
} from "@dawn-ai/testing"
 
const appRoot = fileURLToPath(new URL("..", import.meta.url))
const h = await createAgentHarness({ appRoot, route: "/research#agent" })
afterAll(() => h.close())
 
it("searches the corpus and writes a cited answer", async () => {
  h.reset()
  const run = await h.run({
    input: "What are common agent architectures?",
    fixtures: script()
      .user("What are common agent architectures?")
      .callsTool("searchCorpus", { query: "agent architectures" })
      .callsTool("readDoc", { path: "corpus/agent-architectures.md" })
      .replies("ReAct and plan-and-execute are common. [corpus/agent-architectures.md]"),
  })
  expectToolCalled(run, "searchCorpus")
  expectToolCalled(run, "readDoc")
  expectFinalMessage(run).toContain("[corpus/")
}, 60_000)

The remaining three tests cover: researcher subagent dispatch (expectSubagent); offloading of a large readDoc result (expectOffloaded); and the HITL permission gate for runBash with resume (expectInterrupt + harness.resume) — the last uses its own dedicated harness so the interrupt is clean.

Grow it by:

  1. Adding more script() scenarios — one it block per behaviour you want to pin.
  2. Committing fixtures for complex flows — use writeFixtures / record + loadFixtures for multi-turn or multi-tool scenarios that are tedious to hand-write.
