
Public benchmark scoreboard

Append-only weekly history of every published AiSOC eval run. Each row is one end-to-end run of scripts/run_evals.py against the 200-incident corpus, labelled with the agent version, the commit SHA, and whether it was a deterministic substrate run or a real wet-eval against a live LLM.

Substrate rows ≠ live agent performance

Rows tagged substrate are the deterministic CI gate: they execute in microseconds, make no LLM call, and cost nothing. Their token and USD figures are budget projections computed from the 4-chars/token estimator and the illustrative public rate card, not real bills. Wet-eval rows (real agent, real LLM) will start arriving once the T5.5 weekly job lands in CI. The two row types share the same columns so the table reads uniformly, but a substrate row must never be quoted as live agent performance.
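As a rough illustration of how such a budget projection could be computed, the sketch below applies a 4-chars/token estimate and multiplies by a rate card. The rate values, the function names, and the choice to round up are assumptions made for this example; they are not the project's actual rate card or code.

```python
import math

CHARS_PER_TOKEN = 4  # the estimator named in the text above

# Hypothetical rate card in USD per 1M tokens (illustrative values only).
RATE_CARD_USD_PER_MTOK = {"input": 2.50, "output": 10.00}


def estimate_tokens(text: str) -> int:
    """Approximate a token count as ceil(len(text) / 4). Rounding up is an assumption."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def project_usd(input_tokens: int, output_tokens: int,
                rates: dict = RATE_CARD_USD_PER_MTOK) -> float:
    """Project spend from token counts; no LLM is called and nothing is billed."""
    cost = input_tokens * rates["input"] + output_tokens * rates["output"]
    return cost / 1_000_000


if __name__ == "__main__":
    prompt = "x" * 4000      # 4,000 characters -> 1,000 estimated tokens
    completion = "y" * 800   # 800 characters -> 200 estimated tokens
    usd = project_usd(estimate_tokens(prompt), estimate_tokens(completion))
    print(f"projected budget: ${usd:.4f}")
```

Because the inputs are character counts and a fixed table, the projection is fully deterministic, which is what lets a substrate row carry a USD figure without any real spend.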

[Chart: MITRE accuracy over time, y-axis 85–100%. Series: substrate (CI gate, no LLM) and wet eval (T5.5 weekly). One data point renders as a single dot; the line appears once two or more rows exist.]
AiSOC public benchmark scoreboard — weekly eval results, newest first. Substrate rows are deterministic CI gates (no LLM); wet-eval rows are real LangGraph agent runs.
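| Date | Agent | Commit | Mode | MITRE acc. | MTC p50 | MTC p95 | USD total | Tokens total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2026-05-13 | v1.4.1 | 4ff1b7f | substrate | 97.0% | n/a | n/a | $2.94 budget | 437,200 |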

How this page is updated

The scoreboard is sourced from a single checked-in JSON file at apps/docs/static/data/scoreboard.json, validated against scoreboard.schema.json on every docs build via pnpm --filter @aisoc/docs scoreboard:check.
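The real check runs against scoreboard.schema.json, whose contents are not shown on this page. As a minimal stand-in, the sketch below validates rows using only the field names from the column reference further down. It assumes the file's top level is a JSON list of row objects, which may not match the actual layout, and `check_row` / `check_file` are hypothetical names.

```python
import json
import re
from pathlib import Path

# Assumed required fields and types, inferred from the column reference on this page.
REQUIRED_FIELDS = {
    "date": str,
    "agent_version": str,
    "commit_sha": str,
    "eval_mode": str,
    "mitre_accuracy": (int, float),
    "tokens_total": int,
}


def check_row(row: dict) -> list[str]:
    """Return a list of human-readable problems for one row (empty means OK)."""
    errors = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in row:
            errors.append(f"missing field: {field}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(row[field], bool) or not isinstance(row[field], expected):
            errors.append(f"wrong type for {field}")
    date = row.get("date")
    if isinstance(date, str) and not re.fullmatch(r"\d{4}-\d{2}-\d{2}", date):
        errors.append("date is not YYYY-MM-DD")
    return errors


def check_file(path: str) -> list[str]:
    """Validate every row in the file; assumes a top-level JSON list of rows."""
    rows = json.loads(Path(path).read_text(encoding="utf-8"))
    problems = []
    for index, row in enumerate(rows):
        problems.extend(f"row {index}: {msg}" for msg in check_row(row))
    return problems
```

A build step in this style would fail when `check_file` returns a non-empty list, so a malformed row never reaches the published page.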

There are two ways a row reaches that file:

  1. Substrate rows (per-PR CI gate). A row is appended whenever the substrate snapshot drifts enough to publish — captured during release tagging and committed by hand under feat(eval): scoreboard substrate row for v<X.Y.Z>.
  2. Wet-eval rows (T5.5 weekly job). The wet-eval-weekly.yml GitHub Action runs the live agent against the same 200-incident corpus on a Sunday cadence, captures real latency / token / USD telemetry, and opens an auto-PR appending one row to scoreboard.json. Wet-eval rows show up at the top of the table and on the trend chart.

This append-only contract is deliberate: the scoreboard becomes more informative the longer it runs. We never silently rewrite history; if a historic row turns out to be wrong we add a follow-up row with the correction in notes and link the issue.
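One way to enforce an append-only contract mechanically is to compare the previous and proposed row lists and flag any earlier row that no longer appears verbatim. The sketch below is an assumption about how such a guard might look, not a description of the project's CI; `removed_or_rewritten` is a hypothetical helper.

```python
import json
from collections import Counter


def canonical(row: dict) -> str:
    """Serialise a row with sorted keys so key order does not affect comparison."""
    return json.dumps(row, sort_keys=True, separators=(",", ":"))


def removed_or_rewritten(old_rows: list[dict], new_rows: list[dict]) -> list[dict]:
    """Return every old row that is missing or altered in new_rows.

    An empty result means the change only appended rows. Row position is
    ignored, so new rows may be added at the top or the bottom of the file.
    """
    remaining = Counter(canonical(row) for row in new_rows)
    missing = []
    for row in old_rows:
        key = canonical(row)
        if remaining[key] > 0:
            remaining[key] -= 1  # consume one matching copy
        else:
            missing.append(row)
    return missing


if __name__ == "__main__":
    old = [{"date": "2026-05-13", "mitre_accuracy": 0.97}]
    # A correction is a follow-up row with notes, not an edit to the old row.
    new = [{"date": "2026-05-20", "notes": "corrects 2026-05-13 row"}] + old
    print(removed_or_rewritten(old, new))  # empty list: history is intact
```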

Reproducing any single row

Every row in the table can be reproduced from a fresh clone:

git clone https://github.com/beenuar/AiSOC.git
cd AiSOC
git checkout <commit_sha> # the value in the "Commit" column
pnpm install
pnpm eval:public # writes eval_report.json + eval/charts/

For wet-eval rows you additionally need an OPENAI_API_KEY (or another provider exposed via the same --telemetry-model flag) and access to the weekly workflow inputs documented in benchmark-methodology.md → How to reproduce.

Schema and column reference

| Column | JSON field | Notes |
| --- | --- | --- |
| Date | date | ISO date of the eval run. |
| Agent | agent_version | Tagged release of services/agents (e.g. v1.4.1). |
| Commit | commit_sha | Short or full git SHA the run was produced against. |
| Mode | eval_mode + substrate | substrate-only (no LLM) or wet-eval-* (live agent). The badge colour repeats this distinction. |
| MITRE acc. | mitre_accuracy | Per-case accuracy on the 200-incident corpus. |
| MTC p50 | mtc_p50_seconds | Mean time to closure, p50, end-to-end. n/a on substrate rows because the substrate runs in microseconds, which is not a meaningful end-to-end timing. |
| MTC p95 | mtc_p95_seconds | Same, p95. |
| USD total | usd_total | On wet-eval rows, real spend. On substrate rows, budget projection from the rate card. |
| Tokens total | tokens_total | Total tokens across the 200 investigations. |
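To make the column rules concrete, here is a small sketch of how one JSON row might be turned into the displayed table cells. The `ScoreboardRow` class and `render_cells` function are hypothetical, and several details are assumptions: that mitre_accuracy is stored as a fraction, that substrate rows carry null MTC values, and that wet-eval timings are shown in seconds.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ScoreboardRow:
    date: str
    agent_version: str
    commit_sha: str
    eval_mode: str                     # "substrate-only" or "wet-eval-*"
    mitre_accuracy: float              # assumed to be a fraction in [0, 1]
    mtc_p50_seconds: Optional[float]   # assumed None on substrate rows
    mtc_p95_seconds: Optional[float]
    usd_total: float
    tokens_total: int


def is_substrate(row: ScoreboardRow) -> bool:
    return row.eval_mode.startswith("substrate")


def fmt_seconds(value: Optional[float]) -> str:
    return "n/a" if value is None else f"{value:.1f}s"


def render_cells(row: ScoreboardRow) -> list[str]:
    """Produce the nine display cells in table-column order."""
    usd = f"${row.usd_total:,.2f}"
    if is_substrate(row):
        usd += " budget"  # a projection from the rate card, not real spend
    return [
        row.date,
        row.agent_version,
        row.commit_sha,
        "substrate" if is_substrate(row) else "wet eval",
        f"{row.mitre_accuracy:.1%}",
        fmt_seconds(row.mtc_p50_seconds),
        fmt_seconds(row.mtc_p95_seconds),
        usd,
        f"{row.tokens_total:,}",
    ]
```

Keeping the "budget" suffix in the rendering step, rather than in the stored number, means both row types can share one numeric usd_total field.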

The full per-suite breakdown (alert_reduction, investigation_completeness, response_quality, playbook_completion_rate, per-template macros) lives in scoreboard.json and is rendered on the main benchmark page for the latest run; the scoreboard table keeps a tight five-number summary so the trend remains scannable.

Comparing your own runs

If you reproduce one of the rows on your own laptop and the numbers move, that's a signal worth filing — either the harness is non-deterministic on your platform (a bug we want to know about) or your fork has diverged. Open an issue on github.com/beenuar/AiSOC/issues with the JSON output of pnpm eval:public attached and the AiSOC team will investigate.
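Before filing, it can help to pin down exactly which numbers moved. The sketch below compares a published row with your reproduced metrics field by field. It assumes both are flat dictionaries keyed by the JSON field names from the column reference; the actual layout of eval_report.json is not documented here, so you may need to extract the values first. `diff_metrics` is a hypothetical helper.

```python
import math

DEFAULT_FIELDS = ("mitre_accuracy", "tokens_total", "usd_total")


def diff_metrics(published: dict, reproduced: dict,
                 fields: tuple = DEFAULT_FIELDS,
                 rel_tol: float = 1e-9) -> dict:
    """Return {field: (published, reproduced)} for every field that differs."""
    drift = {}
    for field in fields:
        a, b = published.get(field), reproduced.get(field)
        if a is None or b is None:
            # A value present on only one side counts as drift.
            if a is not b:
                drift[field] = (a, b)
        elif not math.isclose(a, b, rel_tol=rel_tol):
            drift[field] = (a, b)
    return drift


if __name__ == "__main__":
    published = {"mitre_accuracy": 0.97, "tokens_total": 437200, "usd_total": 2.94}
    mine = {"mitre_accuracy": 0.965, "tokens_total": 437200, "usd_total": 2.94}
    for name, (want, got) in diff_metrics(published, mine).items():
        print(f"{name}: published={want} reproduced={got}")
```

Pasting this per-field output into the issue alongside the raw JSON makes the drift easier to triage.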

If you reproduce against a different model or rate card and want your row on the public scoreboard, see community submissions.

Provenance