Public benchmark scoreboard
Append-only weekly history of every published AiSOC eval run. Each row is one
end-to-end run of scripts/run_evals.py
against the 200-incident corpus,
labelled with the agent version, the commit SHA, and whether it was a
deterministic substrate run or a real wet-eval against a live LLM.
Rows tagged substrate are the deterministic CI gate: they execute in
microseconds with no LLM call and no spend. Their token and USD figures are
budget projections computed from the 4-chars/token estimator and the
illustrative public rate card,
not real bills. Wet-eval rows (real agent, real LLM) start arriving the
moment the T5.5 weekly job
lands in CI. The two row types share the same columns so the table reads
uniformly, but never quote a substrate row as live agent performance.
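The substrate projection described above can be sketched roughly as follows. This is an illustration only, not the code in scripts/run_evals.py: the function names, the rounding direction, and the flat rate of 5.00 USD per million tokens are all hypothetical stand-ins for the real estimator and rate card.

```python
import math

CHARS_PER_TOKEN = 4  # the 4-chars/token estimator mentioned above


def estimate_tokens(text: str) -> int:
    """Estimate a token count from character length.

    Rounding up is an assumption; the real harness may round differently.
    """
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def project_usd(tokens: int, usd_per_million_tokens: float) -> float:
    """Project a budget figure from a token count and a flat per-token rate."""
    return tokens / 1_000_000 * usd_per_million_tokens


# Hypothetical rate, NOT taken from the project's rate card.
HYPOTHETICAL_USD_PER_MILLION = 5.00

prompt = "x" * 1000
tokens = estimate_tokens(prompt)
budget = project_usd(tokens, HYPOTHETICAL_USD_PER_MILLION)
```

Because no model is called, the result depends only on input length and the chosen rate, which is why these figures are projections rather than bills.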
| Date | Agent | Commit | Mode | MITRE acc. | MTC p50 | MTC p95 | USD total | Tokens total |
|---|---|---|---|---|---|---|---|---|
| 2026-05-13 | v1.4.1 | 4ff1b7f | substrate | 97.0% | n/a | n/a | $2.94 budget | 437,200 |
How this page is updated
The scoreboard is sourced from a single checked-in JSON file at
apps/docs/static/data/scoreboard.json,
validated against
scoreboard.schema.json
on every docs build via pnpm --filter @aisoc/docs scoreboard:check.
There are two ways a row reaches that file:
- Substrate rows (per-PR CI gate). A row is appended whenever the
  substrate snapshot drifts enough to publish, captured during release
  tagging and committed by hand under
  feat(eval): scoreboard substrate row for v<X.Y.Z>.
- Wet-eval rows (T5.5 weekly job). The wet-eval-weekly.yml GitHub Action
  runs the live agent against the same 200-incident corpus on a Sunday
  cadence, captures real latency / token / USD telemetry, and opens an
  auto-PR appending one row to scoreboard.json. Wet-eval rows show up at
  the top of the table and on the trend chart.
This append-only contract is deliberate: the scoreboard becomes more
informative the longer it runs. We never silently rewrite history; if a
historic row turns out to be wrong we add a follow-up row with the
correction in notes and link the issue.
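One way to picture the append-only contract is a check that every previously published row survives unchanged in the new file. The sketch below is an assumption about how such a check could look, not the project's validator: `is_append_only` is a hypothetical helper, and the example rows use only a few of the documented fields with made-up values.

```python
import json
from collections import Counter


def _canonical(row: dict) -> str:
    """Serialize a row with sorted keys so key order does not matter."""
    return json.dumps(row, sort_keys=True)


def is_append_only(old_rows: list, new_rows: list) -> bool:
    """Return True if every old row is still present, byte-for-byte equal.

    Row position is ignored, so new rows may be placed anywhere.
    """
    old = Counter(_canonical(r) for r in old_rows)
    new = Counter(_canonical(r) for r in new_rows)
    # Counter subtraction keeps only rows that are missing from `new`.
    return not (old - new)


# Illustrative rows; values and the shape of `notes` are assumptions.
published = [{"date": "2026-05-13", "agent_version": "v1.4.1", "notes": ""}]
correction = {
    "date": "2026-05-20",
    "agent_version": "v1.4.1",
    "notes": "Corrects the 2026-05-13 row; see linked issue.",
}

ok = is_append_only(published, published + [correction])
rewritten = [{**published[0], "notes": "edited in place"}]
bad = is_append_only(published, rewritten)
```

Here `ok` is true because the original row is untouched and the correction arrives as a new row, while `bad` is false because the historic row was edited in place.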
Reproducing any single row
Every row in the table can be reproduced from a fresh clone:
```bash
git clone https://github.com/beenuar/AiSOC.git
cd AiSOC
git checkout <commit_sha>   # the value in the "Commit" column
pnpm install
pnpm eval:public            # writes eval_report.json + eval/charts/
```
For wet-eval rows you additionally need an OPENAI_API_KEY (or another
provider exposed via the same --telemetry-model flag) and access to the
weekly workflow inputs documented in
benchmark-methodology.md → How to reproduce.
Schema and column reference
| Column | JSON field | Notes |
|---|---|---|
| Date | date | ISO date of the eval run. |
| Agent | agent_version | Tagged release of services/agents (e.g. v1.4.1). |
| Commit | commit_sha | Short or full git SHA the run was produced against. |
| Mode | eval_mode + substrate | substrate-only (no LLM) or wet-eval-* (live agent). The badge colour repeats this distinction. |
| MITRE acc. | mitre_accuracy | Per-case accuracy on the 200-incident corpus. |
| MTC p50 | mtc_p50_seconds | Mean time to closure, p50, end-to-end. n/a on substrate rows because the substrate runs in microseconds — not a meaningful end-to-end timing. |
| MTC p95 | mtc_p95_seconds | Same, p95. |
| USD total | usd_total | On wet-eval rows, real spend. On substrate rows, budget projection from the rate card. |
| Tokens total | tokens_total | Total tokens across the 200 investigations. |
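To make the column mapping concrete, here is a small sketch that turns one JSON row into a table line. It is not the Scoreboard renderer; `render_row` is a hypothetical helper, and the encodings it relies on are assumptions: `substrate` as a boolean, `mitre_accuracy` as a fraction, and `null` for the MTC fields on substrate rows.

```python
from typing import Optional


def fmt_seconds(value: Optional[float]) -> str:
    """Render a timing, or n/a when the field is absent (assumed null)."""
    return "n/a" if value is None else f"{value:.1f}s"


def render_row(row: dict) -> str:
    """Format one scoreboard row using the JSON fields listed above."""
    is_substrate = bool(row["substrate"])
    mode = "substrate" if is_substrate else row["eval_mode"]
    # Substrate USD figures are projections, so they carry a "budget" suffix.
    usd = f"${row['usd_total']:.2f}" + (" budget" if is_substrate else "")
    cells = [
        row["date"],
        row["agent_version"],
        row["commit_sha"][:7],
        mode,
        f"{row['mitre_accuracy']:.1%}",
        fmt_seconds(row["mtc_p50_seconds"]),
        fmt_seconds(row["mtc_p95_seconds"]),
        usd,
        f"{row['tokens_total']:,}",
    ]
    return "| " + " | ".join(cells) + " |"


example = {
    "date": "2026-05-13",
    "agent_version": "v1.4.1",
    "commit_sha": "4ff1b7f",
    "eval_mode": "substrate-only",
    "substrate": True,
    "mitre_accuracy": 0.97,      # assumed to be stored as a fraction
    "mtc_p50_seconds": None,     # assumed encoding of n/a
    "mtc_p95_seconds": None,
    "usd_total": 2.94,
    "tokens_total": 437200,
}

line = render_row(example)
```

Under these assumptions, `line` matches the substrate row shown in the table at the top of the page.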
The full per-suite breakdown (alert_reduction,
investigation_completeness, response_quality, playbook_completion_rate,
per-template macros) lives in scoreboard.json
and is rendered on the main benchmark page for the latest
run; the scoreboard table keeps a tight five-number summary so the trend
remains scannable.
Comparing your own runs
If you reproduce one of the rows on your own laptop and the numbers move,
that's a signal worth filing — either the harness is non-deterministic on
your platform (a bug we want to know about) or your fork has diverged. Open
an issue on
github.com/beenuar/AiSOC/issues
with the JSON output of pnpm eval:public attached and the AiSOC team will
investigate.
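Before filing, it can help to pin down exactly which headline numbers moved. The sketch below compares a published row with a reproduced one on a few scoreboard fields. `diff_rows` is a hypothetical helper, and it assumes you have already extracted the matching fields from your own output; the layout of eval_report.json is not specified here, and the numbers are illustrative.

```python
COMPARED_FIELDS = ("mitre_accuracy", "usd_total", "tokens_total")


def diff_rows(published: dict, reproduced: dict, fields=COMPARED_FIELDS) -> dict:
    """Return {field: (published_value, reproduced_value)} for fields that differ.

    Substrate runs are deterministic, so exact equality is expected there.
    """
    return {
        name: (published.get(name), reproduced.get(name))
        for name in fields
        if published.get(name) != reproduced.get(name)
    }


published = {"mitre_accuracy": 0.97, "usd_total": 2.94, "tokens_total": 437200}
reproduced = {"mitre_accuracy": 0.97, "usd_total": 2.94, "tokens_total": 437250}

moved = diff_rows(published, reproduced)
```

An empty result means the reproduction matched on the compared fields; anything else is worth attaching to the issue alongside the JSON output.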
If you reproduce against a different model or rate card and want your row on the public scoreboard, see community submissions.
Provenance
- Data file: apps/docs/static/data/scoreboard.json
- JSON Schema: apps/docs/static/data/scoreboard.schema.json
- Renderer: apps/docs/src/components/Scoreboard/index.tsx
- Validator: pnpm --filter @aisoc/docs scoreboard:check → apps/docs/scripts/validate-scoreboard.mjs
- Methodology: Benchmark methodology
- Latest snapshot tables: Benchmark