
Regression testing

Maida ships three CLI commands that turn traced runs into lightweight regression tests: baseline, assert, and diff. Together they let you capture a known-good run, check future runs against it, and drill into what changed when something breaks.


Why

Agent behavior is non-deterministic. A prompt tweak, model upgrade, or tool change can silently increase token usage, add unexpected tool calls, or introduce loops. maida assert gives you a one-line check — locally or in CI — that catches these regressions before they reach production.


Workflow overview

1. Run your agent              python your_agent.py
2. Capture a baseline          maida baseline
3. Run the agent again         python your_agent.py
4. Assert against baseline     maida assert --baseline .maida/baselines/my_agent.json
5. If it fails, diff           maida diff --baseline .maida/baselines/my_agent.json

baseline, assert, and diff default to the latest run when no run ID is given; pass an OTel trace ID or short prefix to target a specific run.

To see the whole workflow on canned data first, run maida demo --regression.


Step 1: Capture a baseline

After a successful run that represents the expected behavior:

maida baseline <RUN_ID>

This creates a JSON snapshot at .maida/baselines/<run_name>.json; if the run has no name, the run ID is used as the file name instead. Use --out to control the path:

maida baseline a1b2c3d4 --out baselines/support_agent_v1.json

What gets captured:

Field                 Description
--------------------  ------------------------------------------------------------
schema_version        Baseline format version ("0.2")
source_run_id         The resolved OTel trace ID this baseline was created from
source_run_name       Run name (if set)
summary               Aggregate metrics: total events, LLM calls, tool calls, errors, loop warnings, duration, tokens
tool_path             Sorted list of unique tool names used
tool_call_counts      Per-tool invocation counts
llm_models_used       Models seen in LLM_CALL events
event_type_sequence   Ordered list of event types
guardrail_events      Any guardrail-triggered events
final_status          Run status ("ok" or "error")

Check the baseline file into version control so the team shares the same reference point.
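
To make the snapshot more concrete, here is a minimal Python sketch that summarizes a baseline file. The top-level field names are taken from the table above; the helper names `load_baseline` and `describe_baseline` and all sample values are hypothetical, and the real file may contain more detail than this sketch assumes.

```python
import json
from pathlib import Path


def load_baseline(path: str) -> dict:
    """Read a baseline snapshot from disk (path is whatever you passed to --out)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def describe_baseline(baseline: dict) -> str:
    """Return a one-line summary built from a few documented fields."""
    tools = ", ".join(baseline.get("tool_path", [])) or "(none)"
    total_calls = sum(baseline.get("tool_call_counts", {}).values())
    return (
        f"baseline v{baseline.get('schema_version', '?')} "
        f"from run {baseline.get('source_run_id', '?')[:8]}: "
        f"status={baseline.get('final_status', '?')}, "
        f"tools=[{tools}], tool calls={total_calls}"
    )


# Hypothetical sample data, shaped after the field table (not real output).
sample = {
    "schema_version": "0.2",
    "source_run_id": "a1b2c3d4e5f67890a1b2c3d4e5f67890",
    "tool_path": ["lookup_order", "send_email"],
    "tool_call_counts": {"lookup_order": 7, "send_email": 3},
    "final_status": "ok",
}
print(describe_baseline(sample))
```

A summary like this can be handy in code review, where a one-line description of a changed baseline is easier to scan than the raw JSON.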


Step 2: Assert against a baseline

maida assert <RUN_ID> --baseline .maida/baselines/my_agent.json

Exit codes: 0 = all checks pass, 1 = one or more checks failed, 2 = run or baseline not found, 10 = internal error.
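
If you call maida assert from a script rather than directly in CI, the exit codes above can be mapped to readable outcomes. The following is a sketch only: `explain_exit_code` and `run_assert` are hypothetical helper names, and the command line uses just the `--baseline` flag shown in this guide.

```python
import subprocess

# Exit codes as documented for `maida assert`.
EXIT_MEANINGS = {
    0: "all checks passed",
    1: "one or more checks failed",
    2: "run or baseline not found",
    10: "internal error",
}


def explain_exit_code(code: int) -> str:
    """Translate an exit code into a short description."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_assert(baseline_path: str) -> int:
    """Run `maida assert` on the latest run and report the outcome."""
    result = subprocess.run(
        ["maida", "assert", "--baseline", baseline_path],
        check=False,  # a non-zero exit is an expected outcome, not an exception
    )
    print(f"maida assert: {explain_exit_code(result.returncode)}")
    return result.returncode


print(explain_exit_code(2))  # run or baseline not found
```

Distinguishing code 1 from codes 2 and 10 lets a wrapper treat a real regression differently from a setup problem such as a missing baseline file.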

What gets checked

Checks are controlled by the assertion policy, which combines a policy YAML file with CLI flags. By default, if a baseline is provided, every numeric metric is compared against its baseline value with a 50% tolerance; for example, a baseline of 38 steps allows up to 57. You can tighten or customize this with a policy file or CLI flags.

Standalone thresholds (no baseline needed)

You can assert without a baseline by setting hard caps:

maida assert <RUN_ID> --max-steps 80 --max-tool-calls 30 --no-loops

Combining baseline and thresholds

When both a baseline and a max_* threshold are set, the effective limit is the lower of two values, the tolerance-adjusted baseline and the hard cap:

limit = min(baseline_value * (1 + tolerance), max_value)

See the Policy YAML reference for the full decision table.
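
As a worked illustration of the formula, the sketch below computes the effective limit for the three situations described so far: baseline only, baseline plus hard cap, and hard cap only. The function name `effective_limit` is hypothetical, and any rounding Maida applies is not modeled here; the decision table in the policy reference remains the authority.

```python
from typing import Optional


def effective_limit(
    baseline_value: Optional[float],
    tolerance: float = 0.5,
    max_value: Optional[float] = None,
) -> Optional[float]:
    """limit = min(baseline_value * (1 + tolerance), max_value), skipping unset inputs."""
    candidates = []
    if baseline_value is not None:
        candidates.append(baseline_value * (1 + tolerance))
    if max_value is not None:
        candidates.append(max_value)
    return min(candidates) if candidates else None


print(effective_limit(38))                  # 57.0 (baseline only, 50% tolerance)
print(effective_limit(38, max_value=50))    # 50   (hard cap is tighter)
print(effective_limit(None, max_value=80))  # 80   (standalone threshold)
```

The first case matches the sample output later in this guide, where a baseline of 38 steps yields an expected value of 57.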


Step 3: Use a policy file

Instead of passing many CLI flags, commit a .maida/policy.yaml file:

assert:
  max_steps: 80
  step_tolerance: 0.2
  max_tool_calls: 30
  no_loops: true
  no_new_tools: true
  expect_status: ok

maida assert auto-detects .maida/policy.yaml in the current directory. To use a different path:

maida assert <RUN_ID> --baseline baseline.json --policy ci-policy.yaml

Precedence: CLI flags > policy file > defaults. See the full policy reference for all fields, threshold semantics, and override rules.
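
The precedence rule can be pictured as a layered merge. This is a sketch of the idea, not Maida's implementation: `resolve_policy` is a hypothetical name, the default for `step_tolerance` is assumed from the 50% default described earlier, and `None` stands in for a CLI flag that was not passed.

```python
# Assumed default, based on the documented 50% tolerance.
DEFAULTS = {"step_tolerance": 0.5}


def resolve_policy(defaults: dict, policy_file: dict, cli_flags: dict) -> dict:
    """Merge settings so that CLI flags > policy file > defaults."""
    resolved = dict(defaults)
    for layer in (policy_file, cli_flags):  # later layers win
        resolved.update({k: v for k, v in layer.items() if v is not None})
    return resolved


policy_file = {
    "max_steps": 80,
    "step_tolerance": 0.2,
    "max_tool_calls": 30,
    "no_loops": True,
}
cli_flags = {"max_steps": 60, "max_tool_calls": None}  # only --max-steps was passed

print(resolve_policy(DEFAULTS, policy_file, cli_flags))
```

Here `max_steps` comes from the CLI, `step_tolerance`, `max_tool_calls`, and `no_loops` come from the policy file, and nothing is left at its default.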


Output formats

Use --format (-f) to choose the output format.

Text (default)

maida assert <RUN_ID> --baseline baseline.json
  ✓ step_count: 42 steps (baseline: 38, tolerance: 50%)
  ✓ tool_calls: 14 tool calls (baseline: 10, tolerance: 50%)
  ✗ no_loops: 2 loop warning(s) detected
  ✓ expect_status: status is 'ok'

RESULT: FAILED (1 of 4 checks failed)

JSON

maida assert <RUN_ID> --baseline baseline.json --format json
{
  "run_id": "a1b2c3d4e5f67890a1b2c3d4e5f67890",
  "baseline_run_id": "e5f6a7b8c9d001122334455667788990",
  "passed": false,
  "results": [
    {
      "check_name": "step_count",
      "passed": true,
      "message": "42 steps (baseline: 38, tolerance: 50%)",
      "expected": "57",
      "actual": "42"
    },
    {
      "check_name": "no_loops",
      "passed": false,
      "message": "2 loop warning(s) detected",
      "expected": null,
      "actual": "2"
    }
  ]
}
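
The JSON format is the easiest one to post-process. As a sketch, the snippet below extracts the failed checks from a report shaped like the sample above; `failed_checks` is a hypothetical helper, and it relies only on the `results`, `check_name`, `passed`, and `message` fields shown in that sample.

```python
import json
from typing import List


def failed_checks(report_json: str) -> List[str]:
    """Return 'check_name: message' for every check that did not pass."""
    report = json.loads(report_json)
    return [
        f"{result['check_name']}: {result['message']}"
        for result in report.get("results", [])
        if not result.get("passed", False)
    ]


# Trimmed copy of the sample report from this guide.
sample_report = """
{
  "passed": false,
  "results": [
    {"check_name": "step_count", "passed": true,
     "message": "42 steps (baseline: 38, tolerance: 50%)"},
    {"check_name": "no_loops", "passed": false,
     "message": "2 loop warning(s) detected"}
  ]
}
"""
print(failed_checks(sample_report))  # ['no_loops: 2 loop warning(s) detected']
```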

Markdown

Designed for GitHub PR comments and step summaries:

maida assert --baseline baseline.json --format markdown
## ❌ Maida gate: agent behavior regressed

**1 of 4 checks failed** · run `a1b2c3d4` vs baseline `e5f6a7b8`

| Check | Expected | Actual | Details |
|---|---|---|---|
| ❌ `no_loops` | — | 2 | 2 loop warning(s) detected |

<details>
<summary>✅ 3 passing checks</summary>
...
</details>

### What changed vs baseline

| Metric | Baseline | Current | Change |
|---|---|---|---|
| tool_calls | 10 | 14 | +40% |
| loop_warnings | 0 | 2 | NEW |

**Tool changes:**
- ➕ `web_search` — new tool, not in baseline

The report leads with the verdict, lists failed checks first with expected vs actual values, collapses passing checks, and — when a baseline is provided — embeds the structural diff and a copy-pasteable local-repro snippet.


Step 4: Drill into failures with diff

When maida assert fails, use maida diff to see exactly what changed.

Diff against a baseline

maida diff --baseline .maida/baselines/my_agent.json

Diff two runs directly

maida diff <RUN_A> <RUN_B>

Sample output

Run comparison: a1b2c3d4 vs e5f6a7b8

Summary:
  total_events: 38 -> 42 (+11%)
  tool_calls: 10 -> 14 (+40%)
  loop_warnings: 0 -> 2 (NEW)

Tool path changes:
  + web_search (new)

Event type distribution:
  LLM_CALL: 8 -> 8
  TOOL_CALL: 10 -> 14 (+40%)
  LOOP_WARNING: 0 -> 2 (NEW)

The diff shows summary-level metric changes, new or removed tools, and shifts in the event type distribution.
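
The change annotations in the sample output, such as (+11%) and (NEW), can be reproduced with a few lines of arithmetic. The sketch below is inferred from that sample rather than from Maida's source: `change_label` is a hypothetical name, and rounding to the nearest whole percent is an assumption.

```python
def change_label(before: int, after: int) -> str:
    """Describe the change from a baseline value to a current value."""
    if before == after:
        return ""     # unchanged metrics carry no annotation
    if before == 0:
        return "NEW"  # a percentage is undefined when the baseline is zero
    pct = round((after - before) / before * 100)
    return f"{pct:+d}%"


print(change_label(38, 42))  # +11%
print(change_label(10, 14))  # +40%
print(change_label(0, 2))    # NEW
```

The zero-baseline case is why loop_warnings shows NEW instead of a percentage.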


GitHub Actions example

The easiest path is the packaged action, maida-ai/maida-assert — scaffold it with maida init --github. It runs your traced agent, asserts the run, and posts the regression report as a sticky PR comment.

To wire it up by hand instead, run your agent in CI, then assert against the checked-in baseline (maida assert picks up the latest run automatically):

- name: Run agent
  run: python my_agent.py

- name: Assert agent behavior
  run: |
    maida assert \
      --baseline .maida/baselines/my_agent.json \
      --format markdown >> "$GITHUB_STEP_SUMMARY"

If the assertion fails, the step exits with code 1 and the markdown report appears in the GitHub Actions step summary.