Public benchmark scoreboard

Append-only weekly history of every published AiSOC eval run. Each row is one end-to-end run of scripts/run_evals.py against the 200-incident corpus, labelled with the agent version, the commit SHA, and whether it was a deterministic substrate run or a real wet-eval against a live LLM.

Substrate rows ≠ live agent performance

Rows tagged substrate are the deterministic CI gate: they execute in microseconds, make no LLM call, and cost nothing. Their token and USD figures are budget projections computed from the 4-chars/token estimator and the illustrative public rate card, not real bills. Wet-eval rows (real agent, real LLM) start arriving once the T5.5 weekly job lands in CI. The two row types share the same columns so the table reads uniformly, but never quote a substrate row as live agent performance.
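
To make the budget projection concrete, here is a minimal sketch of a 4-chars/token estimate multiplied by a per-token rate. The function names, the rounding direction (ceiling), and the rate value are assumptions for illustration; the project's actual estimator and rate card are not reproduced here.

```python
import math

CHARS_PER_TOKEN = 4  # the 4-chars/token estimator named above

# Hypothetical rate in USD per 1,000 tokens; NOT the project's real rate card.
ILLUSTRATIVE_USD_PER_1K_TOKENS = 0.01


def estimate_tokens(text: str) -> int:
    """Estimate tokens as ceil(len(text) / 4); rounding up is an assumption."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def project_budget(texts: list[str],
                   usd_per_1k: float = ILLUSTRATIVE_USD_PER_1K_TOKENS) -> tuple[int, float]:
    """Return (projected tokens, projected USD) for a batch of prompt texts."""
    tokens = sum(estimate_tokens(t) for t in texts)
    return tokens, tokens / 1000 * usd_per_1k
```

Because nothing here calls a model, the result is a projection of what a run would cost, which is why substrate rows label their USD figure as a budget.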

Per-PR freshness gate (Phase E1)

The newest substrate row is no longer hand-maintained: scripts/check_scoreboard.py runs in CI on every PR, executes the deterministic live-agent MITRE-accuracy eval over the 200-incident corpus, and fails the build if the published mitre_accuracy drifts more than 0.02 from the fresh run, the JSON breaks its schema, or a substrate row is mislabelled. The headline substrate number here is therefore within 0.02 of what the current agent scores, not a stale figure.
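
The drift and labelling checks can be sketched as follows. This is not the code of scripts/check_scoreboard.py; it assumes that mitre_accuracy is stored as a fraction (0.97 rather than 97), that substrate is a boolean, and that substrate rows carry an eval_mode beginning with "substrate". Schema validation is left out.

```python
DRIFT_TOLERANCE = 0.02  # maximum allowed drift, as stated above


def check_freshness(published_row: dict, fresh_accuracy: float,
                    tolerance: float = DRIFT_TOLERANCE) -> list[str]:
    """Return a list of failure messages; an empty list means the gate passes."""
    failures = []

    # Drift check: published figure vs. the accuracy from the fresh run.
    drift = abs(published_row["mitre_accuracy"] - fresh_accuracy)
    if drift > tolerance:
        failures.append(f"mitre_accuracy drifted by {drift:.3f} (limit {tolerance})")

    # Labelling check: the boolean flag and the mode string must agree.
    is_substrate = bool(published_row.get("substrate"))
    mode = str(published_row.get("eval_mode", ""))
    if is_substrate != mode.startswith("substrate"):
        failures.append(f"row labelled {mode!r} but substrate={is_substrate}")

    return failures
```

A CI wrapper would print the returned messages and exit non-zero when the list is not empty.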

Figure: MITRE accuracy over time (y-axis 85–100%). Legend: substrate (CI gate, no LLM) and wet eval (T5.5 weekly). Hovering a dot shows the exact run. A single data point renders as one dot; the line appears once two or more rows exist.

AiSOC public benchmark scoreboard — weekly eval results, newest first. Substrate rows are deterministic CI gates (no LLM); wet-eval rows are real LangGraph agent runs.
Date	Agent	Commit	Mode	MITRE acc.	MTC p50	MTC p95	USD total	Tokens total
2026-07-13	`v7.5.0`	`27c52b9`	substrate	97.0%	n/a	n/a	$0.0000 budget	0
2026-05-13	`v1.4.1`	`4ff1b7f`	substrate	97.0%	n/a	n/a	$2.94 budget	437,200

How this page is updated

The scoreboard is sourced from a single checked-in JSON file at apps/docs/static/data/scoreboard.json, validated against scoreboard.schema.json on every docs build via pnpm --filter @aisoc/docs scoreboard:check.

There are two ways a row reaches that file:

Substrate rows (per-PR CI gate). A row is appended whenever the substrate snapshot drifts enough to publish; it is captured during release tagging and committed by hand with the commit message feat(eval): scoreboard substrate row for v<X.Y.Z>.
Wet-eval rows (T5.5 weekly job). The wet-eval-weekly.yml GitHub Action runs the live agent against the same 200-incident corpus on a Sunday cadence, captures real latency / token / USD telemetry, and opens an auto-PR appending one row to scoreboard.json. Wet-eval rows show up at the top of the table and on the trend chart.

This append-only contract is deliberate: the scoreboard becomes more informative the longer it runs. We never silently rewrite history; if a historic row turns out to be wrong we add a follow-up row with the correction in notes and link the issue.
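
As an illustration of the append-only contract, the sketch below adds a correction as a new row rather than editing the old one. It works on an in-memory list of row dictionaries; the top-level layout of scoreboard.json is not confirmed here, and the helper name and the wording of the note are hypothetical.

```python
def append_correction(rows: list[dict], corrected: dict, issue_url: str) -> list[dict]:
    """Return a new history with a follow-up row appended.

    Existing rows are never modified or removed; the correction is
    recorded in the follow-up row's `notes` field together with the issue link.
    """
    follow_up = dict(corrected)  # copy so the caller's dict is left untouched
    follow_up["notes"] = (
        f"Correction for run at {corrected['commit_sha']}; see {issue_url}"
    )
    return [*rows, follow_up]
```

The original row stays in the history, so readers can see both the published figure and the later correction.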

Reproducing any single row

Every row in the table can be reproduced from a fresh clone:

git clone https://github.com/beenuar/AiSOC.git
cd AiSOC
git checkout <commit_sha>     # the value in the "Commit" column
pnpm install
pnpm eval:public               # writes eval_report.json + eval/charts/

For wet-eval rows you additionally need an OPENAI_API_KEY (or another provider exposed via the same --telemetry-model flag) and access to the weekly workflow inputs documented in benchmark-methodology.md → How to reproduce.

Schema and column reference

Column	JSON field	Notes
Date	`date`	ISO date of the eval run.
Agent	`agent_version`	Tagged release of `services/agents` (e.g. `v1.4.1`).
Commit	`commit_sha`	Short or full git SHA the run was produced against.
Mode	`eval_mode` + `substrate`	`substrate-only` (no LLM) or `wet-eval-*` (live agent). The badge colour repeats this distinction.
MITRE acc.	`mitre_accuracy`	Per-case accuracy on the 200-incident corpus.
MTC p50	`mtc_p50_seconds`	Mean time to closure, p50, end-to-end. `n/a` on substrate rows because the substrate runs in microseconds — not a meaningful end-to-end timing.
MTC p95	`mtc_p95_seconds`	Same, p95.
USD total	`usd_total`	On wet-eval rows, real spend. On substrate rows, `budget` projection from the rate card.
Tokens total	`tokens_total`	Total tokens across the 200 investigations.
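
The column mapping above can be read as a typed record plus a small formatter. The sketch below uses the JSON field names from the table; the Python types, the use of None for n/a, and the display formats are assumptions rather than the behaviour of the real renderer in apps/docs/src/components/Scoreboard/index.tsx.

```python
from typing import Optional, TypedDict


class ScoreboardRow(TypedDict):
    date: str
    agent_version: str
    commit_sha: str
    eval_mode: str
    substrate: bool                     # assumed boolean flag
    mitre_accuracy: float               # assumed fraction, e.g. 0.97
    mtc_p50_seconds: Optional[float]    # None assumed to stand for n/a
    mtc_p95_seconds: Optional[float]
    usd_total: float
    tokens_total: int


def render_cells(row: ScoreboardRow) -> list[str]:
    """Format one row into the nine table cells (hypothetical formats)."""
    def seconds(value: Optional[float]) -> str:
        return "n/a" if value is None else f"{value:.1f}s"

    usd = f"${row['usd_total']:.2f}" + (" budget" if row["substrate"] else "")
    return [
        row["date"],
        row["agent_version"],
        row["commit_sha"],
        "substrate" if row["substrate"] else "wet eval",
        f"{row['mitre_accuracy'] * 100:.1f}%",
        seconds(row["mtc_p50_seconds"]),
        seconds(row["mtc_p95_seconds"]),
        usd,
        f"{row['tokens_total']:,}",
    ]
```

Fed the values of the 2026-05-13 row, this formatter yields the same cells as the table shows for that row.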

The full per-suite breakdown (alert_reduction, investigation_completeness, response_quality, playbook_completion_rate, per-template macros) lives in scoreboard.json and is rendered on the main benchmark page for the latest run; the scoreboard table keeps a tight five-number summary so the trend remains scannable.

Comparing your own runs

If you reproduce one of the rows on your own laptop and the numbers move, that's a signal worth filing — either the harness is non-deterministic on your platform (a bug we want to know about) or your fork has diverged. Open an issue on github.com/beenuar/AiSOC/issues with the JSON output of pnpm eval:public attached and the AiSOC team will investigate.
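
Before filing, it can help to list exactly which numbers moved. The sketch below compares two flat dictionaries of metrics. It assumes your local results can be reduced to the same field names as the scoreboard row, which is not confirmed for eval_report.json; the helper name is hypothetical.

```python
def moved_metrics(published: dict, local: dict,
                  fields: tuple[str, ...]) -> dict:
    """Map each differing field to its (published, local) pair."""
    return {
        name: (published.get(name), local.get(name))
        for name in fields
        if published.get(name) != local.get(name)
    }
```

For a deterministic substrate row any non-empty result is worth reporting, and the returned pairs make a compact summary to paste into the issue.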

If you reproduce against a different model or rate card and want your row on the public scoreboard, see community submissions.

Provenance

Data file: apps/docs/static/data/scoreboard.json
JSON Schema: apps/docs/static/data/scoreboard.schema.json
Renderer: apps/docs/src/components/Scoreboard/index.tsx
Validator: pnpm --filter @aisoc/docs scoreboard:check → apps/docs/scripts/validate-scoreboard.mjs
Methodology: Benchmark methodology
Latest snapshot tables: Benchmark
