Regression testing

AgentDbg ships three CLI commands that turn traced runs into lightweight regression tests: baseline, assert, and diff. Together they let you capture a known-good run, check future runs against it, and drill into what changed when something breaks.


Why

Agent behavior is non-deterministic. A prompt tweak, model upgrade, or tool change can silently increase token usage, add unexpected tool calls, or introduce loops. agentdbg assert gives you a one-line check — locally or in CI — that catches these regressions before they reach production.


Workflow overview

1. Run your agent                 python your_agent.py
2. Inspect the traced run         agentdbg view
3. Capture a baseline             agentdbg baseline <run_id>
4. Run the agent again            python your_agent.py
5. Assert against the baseline    agentdbg assert <new_run_id> --baseline .agentdbg/baselines/my_agent.json
6. Diff when a check fails        agentdbg diff <new_run_id> --baseline .agentdbg/baselines/my_agent.json

Step 1: Capture a baseline

After a successful run that represents the expected behavior:

agentdbg baseline <RUN_ID>

This creates a JSON snapshot at .agentdbg/baselines/<run_name>.json (or the run ID if no name was set). Use --out to control the path:

agentdbg baseline a1b2c3d4 --out baselines/support_agent_v1.json

What gets captured:

| Field | Description |
|-------|-------------|
| schema_version | Baseline format version ("0.1") |
| source_run_id | The run this baseline was created from |
| source_run_name | Run name (if set) |
| summary | Aggregate metrics: total events, LLM calls, tool calls, errors, loop warnings, duration, tokens |
| tool_path | Sorted list of unique tool names used |
| tool_call_counts | Per-tool invocation counts |
| llm_models_used | Models seen in LLM_CALL events |
| event_type_sequence | Ordered list of event types |
| guardrail_events | Any guardrail-triggered events |
| final_status | Run status ("ok" or "error") |
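
Putting these fields together, a captured baseline looks roughly like this (values are illustrative, and the exact key names inside summary may differ from your version):

{
  "schema_version": "0.1",
  "source_run_id": "a1b2c3d4-...",
  "source_run_name": "my_agent",
  "summary": {
    "total_events": 38,
    "llm_calls": 8,
    "tool_calls": 10,
    "errors": 0,
    "loop_warnings": 0,
    "duration_ms": 12400,
    "total_tokens": 5230
  },
  "tool_path": ["calculator", "database_lookup"],
  "tool_call_counts": {"calculator": 6, "database_lookup": 4},
  "llm_models_used": ["gpt-4o"],
  "event_type_sequence": ["LLM_CALL", "TOOL_CALL", "..."],
  "guardrail_events": [],
  "final_status": "ok"
}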

Check the baseline file into version control so the team shares the same reference point.
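For example:

git add .agentdbg/baselines/support_agent_v1.json
git commit -m "Add support agent baseline"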


Step 2: Assert against a baseline

agentdbg assert <RUN_ID> --baseline .agentdbg/baselines/my_agent.json

Exit codes:

  0   all checks pass
  1   one or more checks failed
  2   run or baseline not found
  10  internal error
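
Because each failure mode has its own exit code, scripts can branch on the result. A minimal sh sketch:

agentdbg assert "$RUN_ID" --baseline baseline.json
case $? in
  0)  echo "Behavior matches baseline" ;;
  1)  echo "Regression detected" >&2; exit 1 ;;
  2)  echo "Run or baseline not found" >&2; exit 2 ;;
  *)  echo "agentdbg internal error" >&2; exit 1 ;;
esac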

What gets checked

Checks are controlled by the assertion policy — a combination of a policy YAML file and CLI flags. By default, if a baseline is provided, every numeric metric is compared with a 50% tolerance. You can tighten or customize this with a policy file or CLI flags.
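
For example, if the baseline recorded 38 steps, the default 50% tolerance means a new run passes the step_count check as long as it takes at most 38 × 1.5 = 57 steps.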

Standalone thresholds (no baseline needed)

You can assert without a baseline by setting hard caps:

agentdbg assert <RUN_ID> --max-steps 80 --max-tool-calls 30 --no-loops

Combining baseline and thresholds

When both a baseline and a max_* threshold are set, the effective limit is the lesser of the two:

limit = min(baseline_value * (1 + tolerance), max_value)
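
For instance, with a baseline of 10 tool calls, the default 50% tolerance, and --max-tool-calls 30, the effective limit is min(10 × 1.5, 30) = 15; lowering the flag to --max-tool-calls 12 would make the cap 12 instead.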

See the Policy YAML reference for the full decision table.


Step 3: Use a policy file

Instead of passing many CLI flags, commit a .agentdbg/policy.yaml file:

assert:
  max_steps: 80
  step_tolerance: 0.2
  max_tool_calls: 30
  no_loops: true
  no_new_tools: true
  expect_status: ok

agentdbg assert auto-detects .agentdbg/policy.yaml in the current directory. To use a different path:

agentdbg assert <RUN_ID> --baseline baseline.json --policy ci-policy.yaml

Precedence: CLI flags > policy file > defaults. See the full policy reference for all fields, threshold semantics, and override rules.
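
For example, if ci-policy.yaml sets max_steps: 80, the flag below still wins and the effective cap is 60:

agentdbg assert <RUN_ID> --policy ci-policy.yaml --max-steps 60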


Output formats

Use --format (-f) to choose the output format.

Text (default)

agentdbg assert <RUN_ID> --baseline baseline.json
  ✓ step_count: 42 steps (baseline: 38, tolerance: 50%)
  ✓ tool_calls: 12 tool calls (baseline: 10, tolerance: 50%)
  ✗ no_loops: 2 loop warning(s) detected
  ✓ expect_status: status is 'ok'

RESULT: FAILED (1 of 4 checks failed)

JSON

agentdbg assert <RUN_ID> --baseline baseline.json --format json
{
  "run_id": "a1b2c3d4-...",
  "baseline_run_id": "e5f6a7b8-...",
  "passed": false,
  "results": [
    {
      "check_name": "step_count",
      "passed": true,
      "message": "42 steps (baseline: 38, tolerance: 50%)",
      "expected": "57",
      "actual": "42"
    },
    {
      "check_name": "no_loops",
      "passed": false,
      "message": "2 loop warning(s) detected",
      "expected": null,
      "actual": "2"
    }
  ]
}
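
The JSON output is easy to post-process. For example, with jq you can extract just the failing checks:

agentdbg assert <RUN_ID> --baseline baseline.json --format json \
  | jq '[.results[] | select(.passed == false)]'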

Markdown

Designed for GitHub PR comments and step summaries:

agentdbg assert <RUN_ID> --baseline baseline.json --format markdown
## AgentDbg Regression Report

| Check | Status | Details |
|-------|--------|---------|
| step_count | ✅ Pass | 42 steps (baseline: 38, tolerance: 50%) |
| no_loops | ❌ Fail | 2 loop warning(s) detected |

Result: **FAILED**
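
To post the report as a PR comment with the GitHub CLI (assuming gh is installed and authenticated, and $PR_NUMBER holds the pull request number):

agentdbg assert <RUN_ID> --baseline baseline.json --format markdown \
  | gh pr comment "$PR_NUMBER" --body-file -

Note that in a pipeline the assert exit code is masked by gh unless your shell sets pipefail.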

Step 4: Drill into failures with diff

When agentdbg assert fails, use agentdbg diff to see exactly what changed.

Diff against a baseline

agentdbg diff <RUN_ID> --baseline .agentdbg/baselines/my_agent.json

Diff two runs directly

agentdbg diff <RUN_A> <RUN_B>

Sample output

Run comparison: a1b2c3d4 vs e5f6a7b8

Summary:
  total_events: 38 -> 42 (+11%)
  tool_calls: 10 -> 14 (+40%)
  loop_warnings: 0 -> 2 (NEW)

Tool path changes:
  + web_search (new)

Event type distribution:
  LLM_CALL: 8 -> 8
  TOOL_CALL: 10 -> 14 (+40%)
  LOOP_WARNING: 0 -> 2 (NEW)

The diff shows summary-level metric changes, new or removed tools, and shifts in the event type distribution.


GitHub Actions example

Run your agent in CI, then assert against the checked-in baseline:

- name: Run agent
  run: python my_agent.py

- name: Assert agent behavior
  run: |
    RUN_ID=$(agentdbg list --json | python -c "import sys,json; print(json.load(sys.stdin)['runs'][0]['run_id'])")
    agentdbg assert "$RUN_ID" \
      --baseline .agentdbg/baselines/my_agent.json \
      --format markdown >> "$GITHUB_STEP_SUMMARY"

If the assertion fails, the step exits with code 1 and the markdown report appears in the GitHub Actions step summary.
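
The snippet above assumes agentdbg is already installed in the job. A minimal preamble, assuming the package is published on PyPI under the name agentdbg (adjust to your install method):

- uses: actions/checkout@v4
- uses: actions/setup-python@v5
  with:
    python-version: "3.12"
- name: Install agentdbg
  run: pip install agentdbg  # assumed package name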