Benchmark methodology
This page is the open reference for AiSOC's public evaluation. It exists so that anyone — a contributor, a regulated buyer, an auditor, or a researcher — can understand how every number on the benchmark page is produced, reproduce them on their own laptop, and surface where the eval genuinely cannot speak to real-world performance.
We follow one workspace-wide rule throughout this page: be transparent about what is synthetic versus what is real, and never present a substrate self-check as live agent performance. Every section below labels its scope.
1. Why this exists
Vendor-published AI SOC benchmarks are typically not reproducible by the buyer — the dataset, the rubric, and the runner are private. AiSOC takes the opposite position: ship the dataset, the harness, and the CI gate in the repo, label which numbers are real measurements vs. substrate self-checks, and invite reproductions.
This page captures the design decisions behind that harness so they can be critiqued, extended, or replaced over time.
2. Dataset
2.1 200 synthetic incidents
The eval drives all suites against a deterministic 200-incident corpus at
services/agents/tests/eval_data/synthetic_incidents.json.
The corpus is regenerated by scripts/generate_eval_incidents.py
with a fixed seed, so two contributors on different machines produce
byte-identical files.
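As a rough illustration of what fixed-seed generation buys, the sketch below builds a tiny corpus twice and compares the hashes. It is not the code in scripts/generate_eval_incidents.py: the templates, placeholder pools, seed value, and the `generate_corpus` helper are all hypothetical.

```python
import hashlib
import json
import random

# Hypothetical templates and placeholder pools (not the real generator's data).
TEMPLATES = [
    "Suspicious login for {user} from {ip}",
    "Possible ransomware activity on {host}",
    "Phishing campaign {campaign} targeting {user}",
]
POOLS = {
    "user": ["alice", "bob", "carol"],
    "host": ["ws-01", "srv-02", "db-03"],
    "ip": ["203.0.113.7", "198.51.100.23"],
    "campaign": ["blue-heron", "red-finch"],
}


def generate_corpus(seed: int, total: int) -> bytes:
    """Build `total` cases from a private, seeded RNG and serialise them stably."""
    rng = random.Random(seed)  # private RNG: global random state cannot leak in
    cases = []
    for i in range(total):
        template_id = i % len(TEMPLATES)  # cycle templates round-robin
        fields = {name: rng.choice(values) for name, values in POOLS.items()}
        cases.append({
            "id": f"case-{i:03d}",
            "template": template_id,
            "title": TEMPLATES[template_id].format(**fields),
        })
    # sort_keys plus fixed separators keep the serialised bytes stable.
    return json.dumps(cases, sort_keys=True, separators=(",", ":")).encode("utf-8")


first = generate_corpus(seed=42, total=10)
second = generate_corpus(seed=42, total=10)
print(hashlib.sha256(first).hexdigest() == hashlib.sha256(second).hexdigest())
```

The two design points that matter are the private `random.Random(seed)` instance and the stable JSON serialisation; either one missing is enough to break byte-for-byte equality.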
| Property | Value |
|---|---|
| Total cases | 200 |
| Distinct templates (macro families) | 55 |
| Cases per template | 3–4 (cycled through {user}/{host}/{ip}/{campaign}) |
| Severity distribution | low / medium / high / critical, weighted toward medium |
| MITRE ATT&CK tactic coverage | 11 of 14 v15 tactics |
| Synthetic vs real | 100 % synthetic. No customer telemetry, no scraped logs, no proprietary IOCs. |
The 55-template macro decomposition is a deliberate trade-off:
- Why 55, not 200? A unique narrative for every case would drift toward "one template per case" — a single broken template would be invisible in a 1/200 = 0.5 % per-case dilution. Cycling 3–4 cases per template gives the per-template macro a much sharper regression signal (≈ 1.8 % per broken template) without inflating the generator.
- Why 200, not 50 or 1 000? 200 is small enough to run substrate suites in milliseconds (the per-PR gate runs the whole corpus in ~35 ms) and large enough that wet-eval timing variance per-template stays under 10 % at the published p95 floor.
- Why deterministic? A flaky corpus is a flaky CI gate. Deterministic generation means two PRs that land on different days run against byte-identical inputs.
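The dilution arithmetic behind the 55-template choice can be checked with a short sketch. It assumes each case is a `(template_id, passed)` pair and that one template fails on every one of its cases; both helper functions are illustrative, not part of the harness.

```python
from collections import defaultdict


def per_template_macro(results):
    """Average the per-template pass rate so every template weighs the same."""
    by_template = defaultdict(list)
    for template_id, passed in results:
        by_template[template_id].append(passed)
    rates = [sum(v) / len(v) for v in by_template.values()]
    return sum(rates) / len(rates)


def corpus(total_cases, n_templates, broken_template):
    """Cycle cases through templates; every case of the broken template fails."""
    return [
        (i % n_templates, (i % n_templates) != broken_template)
        for i in range(total_cases)
    ]


unique = per_template_macro(corpus(200, 200, broken_template=0))
cycled = per_template_macro(corpus(200, 55, broken_template=0))
print(f"one template per case: drop = {100 * (1 - unique):.1f} pp")  # 0.5 pp
print(f"55 cycled templates:   drop = {100 * (1 - cycled):.1f} pp")  # 1.8 pp
```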
2.2 Synthetic telemetry corpus
Each incident ships with at least one backing event in the companion
synthetic_telemetry.jsonl
(361 events across 14 sources today: Sysmon, Windows Security, M365 audit,
Azure sign-in, CloudTrail, Linux auditd / journald, EDR, DNS, web access,
Kubernetes audit, GitHub audit, VPN, DB audit). The shape of every event
is what a real connector would pivot on — {user, host, ip, query, action, …} — so connector and Sigma PRs have a contract to develop against
without provisioning a real tenant.
This corpus is structurally faithful to real connectors but is not captured from a live tenant. It is hand-shaped by a deterministic generator to model the field set a connector would emit, not the noise distribution a real environment would produce. See § 8 Limitations.
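A connector author could treat the event shape as a contract with a field-presence check along these lines. This is a minimal sketch: the `source` key, the per-source required fields, and the sample events are assumptions for illustration, since the real schema is not listed on this page.

```python
import json

# Hypothetical per-source contracts (assumed, not taken from the real corpus).
REQUIRED_FIELDS = {
    "sysmon": {"host", "user", "action"},
    "dns": {"host", "ip", "query"},
}


def missing_fields(event: dict) -> set:
    """Return the required fields that are absent or empty for the event's source."""
    required = REQUIRED_FIELDS.get(event.get("source"), set())
    return {field for field in required if event.get(field) in (None, "")}


# Two sample JSONL lines; the second one lacks the `query` field on purpose.
sample_jsonl = "\n".join([
    json.dumps({"source": "sysmon", "host": "ws-01", "user": "alice",
                "action": "process_create"}),
    json.dumps({"source": "dns", "host": "ws-01", "ip": "203.0.113.7"}),
])

for line in sample_jsonl.splitlines():
    event = json.loads(line)
    gaps = missing_fields(event)
    print(event["source"], "ok" if not gaps else f"missing {sorted(gaps)}")
```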
2.3 What's synthetic vs what's real
| Component | Class | Notes |
|---|---|---|
| 200-incident corpus | Synthetic | Deterministic, regenerable |
| 361-event telemetry | Synthetic | Hand-shaped to real connector field set |
| Substrate code under test | Real | Production-bound modules in services/fusion, services/agents |
| Live agent (LangGraph orchestrator) | Real | Driven only by wet-eval (§ 3.2) |
| LLM responses (wet-eval) | Real | Real OpenAI / Anthropic / Ollama calls, real cost |
| Rate card (§ 5) | Real | Public list prices, dated at the top of each row |
3. Substrate vs. wet eval
This is the most important distinction on the benchmark page, so we restate it here in detail.
3.1 Substrate self-check (per-PR, milliseconds, no LLM)
The substrate suite drives deterministic code only:
- Keyword extractors: pulled out of services/agents/app/extractors/
- Fusion grouping: a faithful in-harness re-implementation of the Tier 1/2/3 rules used by services/fusion, minus the DB-backed dedup and the ML scorer
- Report and plan templates: the deterministic synthesisers used as fall-back when an LLM call fails
- Offline judges: keyword-coverage and rubric-score checks
These suites run in milliseconds and gate every PR. They do not measure
LLM agent quality; they measure substrate consistency. Three of the four
scoring suites (mitre_accuracy, investigation_completeness,
response_quality) are tautological by construction — they pass as long
as the substrate stays internally consistent and fail loudly if it does not.
This makes them an excellent regression signal but a poor
capability score. Anyone treating the headline mitre_accuracy = 0.97
as "the agent is 97 % accurate at MITRE classification" is misreading the
suite. The substrate suite is gating the substrate, not the agent.
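One way such a tautology can arise is sketched below: when the expected label and the extractor are both derived from the same keyword table, the score is perfect by construction, and it drops only when the two fall out of sync. The table, the case builder, and the extractor are illustrative assumptions, not the code under services/agents.

```python
# Hypothetical keyword table; the technique IDs are real ATT&CK identifiers.
KEYWORD_TO_TECHNIQUE = {
    "phishing": "T1566",
    "ransomware": "T1486",
    "brute force": "T1110",
}


def make_case(keyword):
    """Build a case whose expected label comes from the same table the extractor uses."""
    return {
        "text": f"Alert: possible {keyword} activity detected",
        "expected": KEYWORD_TO_TECHNIQUE[keyword],
    }


def extract(text, table):
    """Return the technique of the first keyword found in the text, else None."""
    for keyword, technique in table.items():
        if keyword in text:
            return technique
    return None


def accuracy(cases, table):
    hits = sum(extract(case["text"], table) == case["expected"] for case in cases)
    return hits / len(cases)


cases = [make_case(keyword) for keyword in KEYWORD_TO_TECHNIQUE]
print(accuracy(cases, KEYWORD_TO_TECHNIQUE))  # consistent substrate: 1.0

# Simulate a regression: someone drops a keyword from the extractor's table.
broken = {k: v for k, v in KEYWORD_TO_TECHNIQUE.items() if k != "ransomware"}
print(accuracy(cases, broken))  # 2 of 3 cases still match
```

The perfect score says nothing about how an LLM agent would classify unseen alerts; the drop after the simulated regression is the useful signal.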
3.2 Wet eval (weekly, real LLM, real money)
Wet eval drives the live services/agents LangGraph orchestrator
end-to-end against the same 200-incident corpus. Every LLM call is real,
every token costs real money, and every latency number reflects real
network round trips.
| Wet-eval property | Value |
|---|---|
| Cadence | Weekly, plus on-demand for PRs that touch the agent graph or prompts |
| Driver | scripts/run_evals.py with --wet (added by T2.4) |
| Dataset | Same synthetic_incidents.json (deterministic) |
| LLMs | Configurable via env (AISOC_BENCH_PROVIDER set to openai, anthropic, or ollama) |
| Telemetry captured | Wall-clock latency p50/p95/p99, prompt + completion tokens, USD cost |
| CI gate | .github/workflows/wet-eval-weekly.yml (T5.5) |
| Output | eval_report.json -> wet_eval block + render to apps/docs/static/eval/ |
The weekly job pushes its results into the same eval-results branch as the
substrate run, so historical trend lines stay in one place.
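For readers who want to recompute the latency columns from raw samples, a nearest-rank percentile is one reasonable definition. The sketch below assumes that definition and uses made-up latencies; the harness may interpolate differently.

```python
def percentile(samples, pct: int):
    """Nearest-rank percentile: the smallest sample with at least pct % at or below it."""
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding surprises.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Hypothetical wall-clock latencies in milliseconds for 200 investigations.
latencies_ms = [100 + 5 * i for i in range(200)]

for pct in (50, 95, 99):
    print(f"p{pct} = {percentile(latencies_ms, pct)} ms")
```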
Cross-link: wet-eval-weekly.yml
is added by T5.5 in the v8.0 plan. Until that lands, the per-template
latency / token / USD cells on the benchmark page are placeholders rather
than imputed numbers.
4. Suites and what each one actually measures
The full per-suite description (what each gate enforces, what it cannot catch) lives on the benchmark page. The summary, with class labels:
| Suite | Class | Headline metric |
|---|---|---|
| mitre_accuracy | Substrate self-check | Per-template macro accuracy of the keyword extractor |
| alert_reduction | Real measurement | 1 000-alert noisy stream → fused incident count |
| investigation_completeness | Substrate self-check | Mean keyword coverage of the report template |
| response_quality | Substrate self-check | Mean rubric score of the synthesised plan |
| playbook_completion_rate | Operational coverage gate | Fraction of in-scope incidents with a matched playbook |
| synthetic_telemetry | Schema / coverage gate | Per-source field presence + diversity |
| latency_* (wet) | Real measurement | Wall-clock p50/p95/p99 |
| tokens_* (wet) | Real measurement | Total LLM tokens per investigation |
| usd_* (wet) | Real measurement | Rate-card-multiplied cost per investigation |
5. Rate card
USD figures on the benchmark page are computed from the public list prices
below at the time of each wet-eval run. The rate at run time is recorded
inside the JSON report (wet_eval.usd.rate_card_at_run), so historic dollars
do not silently revalue when prices change.
| Provider | Model | Input ($/1M tok) | Output ($/1M tok) | Effective from |
|---|---|---|---|---|
| OpenAI | gpt-4o-2024-11-20 | 2.50 | 10.00 | 2025-11-20 |
| OpenAI | gpt-4o-mini | 0.15 | 0.60 | 2024-07-18 |
| Anthropic | claude-3.5-sonnet | 3.00 | 15.00 | 2024-10-22 |
| Anthropic | claude-3.5-haiku | 0.80 | 4.00 | 2024-11-04 |
| Local | ollama/llama3.1:8b | 0.00 | 0.00 | n/a |
Rates evolve. Providers update prices, and AiSOC ships a new wet-eval run when they do. We do not retroactively rewrite historic dollar numbers: old runs are recomputed only if you ask for them with the --rate-card override on scripts/run_evals.py. The "Effective from" column is the publish date of the rate, not the date of any specific eval run.
If a provider raises prices mid-week, the next weekly wet-eval run will reflect the new rate. The diff is visible in
eval-results/eval/results/<sha>.json -> wet_eval.usd.rate_card_at_run.
If you are a maintainer landing a price update: bump the _RATE_CARD dict
in scripts/run_evals.py, set effective_from, and open a PR. The next
wet-eval run will pick it up automatically.
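The cost arithmetic is tokens times the per-million rate, with the rate snapshotted next to the result. The sketch below uses two rows from the rate card above; the dict shape, the `usd_cost` helper, and the token counts are assumptions, and the real `_RATE_CARD` structure in scripts/run_evals.py may differ.

```python
from decimal import Decimal

# Assumed shape: USD per 1M tokens, copied from two rows of the table above.
RATE_CARD_SKETCH = {
    "gpt-4o-mini": {"input": Decimal("0.15"), "output": Decimal("0.60")},
    "claude-3.5-haiku": {"input": Decimal("0.80"), "output": Decimal("4.00")},
}


def usd_cost(model, prompt_tokens, completion_tokens, rate_card):
    """Multiply token counts by the per-million rates for the given model."""
    rate = rate_card[model]
    total = prompt_tokens * rate["input"] + completion_tokens * rate["output"]
    return total / Decimal(1_000_000)


# Hypothetical token counts for a single investigation.
model = "gpt-4o-mini"
cost = usd_cost(model, prompt_tokens=12_000, completion_tokens=1_500,
                rate_card=RATE_CARD_SKETCH)

# Store the rate next to the result so later price changes cannot revalue it.
report_fragment = {
    "model": model,
    "usd": str(cost),
    "rate_card_at_run": {k: str(v) for k, v in RATE_CARD_SKETCH[model].items()},
}
print(report_fragment)
```

`Decimal` is used here so that small per-call costs sum without binary floating-point drift; that is a design choice of this sketch, not a statement about the harness.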
6. How to reproduce
Every number on the benchmark page is reproducible from a fresh clone in under 30 seconds (substrate) or 10–30 minutes (wet eval, depending on provider and concurrency).
6.1 Substrate run (per-PR, no LLM key required)
git clone https://github.com/beenuar/AiSOC.git
cd AiSOC
pnpm install
python scripts/run_evals.py --json --out my-eval.json
python scripts/render_eval_charts.py my-eval.json
Or, in one command:
pnpm eval:public
This runs the unified eval runner, writes eval_report.json, and renders
a per-suite markdown chart bundle into eval/results/charts/ for inclusion
in dashboards or PR comments. No API key is required for the substrate
portion. The pnpm eval:public script is documented in
package.json.
If a Makefile is added in the future, the same target will be exposed as
make eval-public — both names are intended to remain interchangeable.
6.2 Wet eval (weekly, requires LLM credentials)
export AISOC_BENCH_PROVIDER=openai
export OPENAI_API_KEY=sk-...
python scripts/run_evals.py --wet --concurrency 8 --out my-wet-eval.json
python scripts/render_eval_charts.py my-wet-eval.json
Wet eval respects the BYOK flow documented in
apps/docs/docs/operations/credentials.md.
The --wet flag and per-call telemetry are added by T2.4 in the v8.0
plan. Until that lands, this command falls back to substrate-only mode.
The credentials never touch a configuration file: they are read from the environment for the duration of the run and dropped on exit.
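An environment-only credential lookup might look like the sketch below. AISOC_BENCH_PROVIDER and OPENAI_API_KEY appear on this page; the ANTHROPIC_API_KEY name, the assumption that Ollama needs no key, and the `resolve_provider` helper are illustrative guesses.

```python
import os

# Provider -> environment variable holding its key (None: no key assumed needed).
KEY_ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",  # assumed name, not confirmed by this page
    "ollama": None,
}


def resolve_provider(environ=os.environ):
    """Validate the provider and confirm its key is present, without storing it."""
    provider = environ.get("AISOC_BENCH_PROVIDER", "")
    if provider not in KEY_ENV_VARS:
        raise ValueError(f"unsupported AISOC_BENCH_PROVIDER: {provider!r}")
    key_var = KEY_ENV_VARS[provider]
    if key_var is not None and not environ.get(key_var):
        raise ValueError(f"{key_var} must be set for provider {provider!r}")
    # Return the variable name only; callers read the secret at call time.
    return provider, key_var


# Demo with an explicit mapping so the example runs without real credentials.
print(resolve_provider({"AISOC_BENCH_PROVIDER": "ollama"}))
```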
6.3 Comparing your run to the published numbers
Every published wet-eval run pushes a snapshot to the
eval-results branch.
Diff yours against latest.json:
python scripts/run_evals.py --baseline eval/results/latest.json \
--max-regression-pp 1.0 --out my-eval.json
That fails the run if mitre_accuracy regresses by more than 1.0 percentage
point versus the baseline.
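The gate's logic amounts to a percentage-point comparison against the baseline. The sketch below assumes a flat `{metric: score}` report shape and hypothetical helper names; the actual eval_report.json layout and the runner's implementation may differ.

```python
def regression_pp(baseline: float, current: float) -> float:
    """Drop from baseline to current in percentage points (positive means worse)."""
    # Rounding absorbs binary floating-point noise right at the threshold.
    return round((baseline - current) * 100, 6)


def check_regressions(baseline_report, current_report, max_regression_pp,
                      metrics=("mitre_accuracy",)):
    """Return {metric: drop_pp} for every gated metric that regressed too far."""
    failures = {}
    for metric in metrics:
        drop = regression_pp(baseline_report[metric], current_report[metric])
        if drop > max_regression_pp:
            failures[metric] = drop
    return failures


baseline = {"mitre_accuracy": 0.97}
print(check_regressions(baseline, {"mitre_accuracy": 0.965}, 1.0))  # {}
print(check_regressions(baseline, {"mitre_accuracy": 0.95}, 1.0))   # {'mitre_accuracy': 2.0}
```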
7. How to compare against your own SOC
This eval is designed to be reproducible against any SIEM or SOAR that can ingest a JSON corpus and emit grouped incidents + investigation reports. The protocol is intentionally minimal so a vendor or internal team can run it in an afternoon and post the result publicly.
- Use the same dataset. Clone the repo at the commit you want to test against and use services/agents/tests/eval_data/synthetic_incidents.json verbatim. Do not regenerate; the SHA-256 of the file should match the provenance row on the benchmark page.
- Use the same metrics. Score per-case mean and per-template macro, not just per-case mean: the dataset has 55 templates cycled 3–4×, and per-case-only is a weaker signal.
- Run the same five suites you can reproduce. mitre_accuracy, alert_reduction, investigation_completeness, response_quality, and playbook_completion_rate all live in services/agents/tests/ and require no AiSOC-specific runtime. Replace the substrate code under test with your equivalents.
- For wet eval, declare your model and rate card. Latency, tokens, and USD figures are not comparable across providers without these labels; record them in your submission alongside your numbers.
- Submit the report. A reproductions-and-comparisons issue template will live at .github/ISSUE_TEMPLATE/benchmark_reproduction.yml (added alongside the public scoreboard in T5.4 of the v8.0 plan). Until the template lands, open a regular issue tagged benchmark:reproduction with the commit SHA, dataset SHA, eval mode, model, and the attached eval_report.json. Comparisons that follow these rules are added to the public scoreboard.
We welcome any reproduction — including ones that show AiSOC losing on a particular template family. If you find a regression, please file an issue or open a PR; the harness exists precisely to surface those.
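The dataset check in the first step of the protocol can be done with the standard library alone. In this sketch the `sha256_of` helper is hypothetical, and the demo hashes a throwaway file; in practice you would point it at services/agents/tests/eval_data/synthetic_incidents.json and compare against the hash on the benchmark page.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large corpora do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a temporary stand-in file rather than the real corpus.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "synthetic_incidents.json"
    demo_file.write_bytes(b"[]")
    demo_digest = sha256_of(demo_file)

print(demo_digest)
```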
8. Limitations and threats to validity
Open-source honesty: there are real things this eval cannot tell you.
8.1 Synthetic data is not production data
The 200-incident corpus is hand-shaped to exercise structurally diverse templates, but real SOC alert streams differ in three important ways:
- Distribution. Real environments are dominated by 5–10 noisy rules. The eval evenly distributes templates so all of them get exercised.
- Adversarial drift. Real attackers mutate. The corpus is static. The adversary_eval suite (heavy / medium / light buckets) partially addresses this but is not a substitute for red-team data.
- Context shape. Real incidents carry org-specific context (asset criticality, business hours, owner). The eval normalises this away.
What the eval can tell you: whether AiSOC's substrate gets worse over time on a fixed corpus, what the wet-eval cost looks like in steady state, and where template families regress.
8.2 Substrate self-checks are tautological by design
mitre_accuracy, investigation_completeness, and response_quality are
written to gate substrate consistency, not LLM capability. We say this in
several places on the benchmark page and again here, because it is the
single most-misread part of the eval. If you want to claim "AiSOC's agent
is 97 % accurate at MITRE classification" — the substrate score does not
support that. The wet-eval LLM-as-judge variant (post-T2.4) is the right
metric for that claim.
8.3 Wet-eval variance
LLM outputs are non-deterministic even at temperature 0 because providers batch differently across calls. Latency depends on provider load. Token counts depend on the model's tokeniser version. The published p50/p95/p99 are derived from a single weekly run; we do not report confidence intervals because n=200 is small enough that the headline number moves visibly run-over-run. Use the trend over many weeks rather than any single weekly snapshot when comparing.
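One simple way to read the trend rather than a single snapshot is a trailing median over recent weeks. The sketch below is a suggestion, not something the harness is stated to compute, and the weekly p95 values are invented.

```python
from statistics import median


def rolling_median(values, window: int):
    """Median of each trailing window; the first window - 1 points are skipped."""
    return [
        median(values[i - window + 1 : i + 1])
        for i in range(window - 1, len(values))
    ]


# Hypothetical weekly p95 latencies in seconds; week 4 is a noisy outlier.
weekly_p95 = [4.1, 4.4, 3.9, 6.2, 4.0, 4.3]
print(rolling_median(weekly_p95, window=3))  # [4.1, 4.4, 4.0, 4.3]
```

The outlier week never becomes the reported value, which is the property you want when a single run can move the headline number visibly.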
8.4 Cost is rate-card cost, not infra cost
The USD per-investigation figure includes the LLM API spend only. It does not include infrastructure (compute, storage, observability), engineering time, or the cost of human review of the agent's recommendations. A real SOC TCO model has to add those.
8.5 The playbook completion gate is coverage, not quality
The 50.5 % overall completion rate on the benchmark page is the share of
synthetic incidents whose category, severity, and response action are
matched by at least one playbook in playbooks/packs/v1/. It does not
execute the playbook, time its steps, or measure its reliability against
live telemetry. Step execution is covered by the playbook engine's own
unit tests in services/agents/tests/test_playbook_engine.py.
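A coverage-style rate of this kind reduces to "does at least one playbook match on all three attributes". The sketch below assumes hypothetical incident and playbook shapes; the real pack format under playbooks/packs/v1/ is not described on this page.

```python
def playbook_matches(playbook: dict, incident: dict) -> bool:
    """A playbook covers an incident only if category, severity and action all match."""
    return (
        incident["category"] in playbook["categories"]
        and incident["severity"] in playbook["severities"]
        and incident["response_action"] in playbook["actions"]
    )


def completion_rate(incidents, playbooks) -> float:
    """Share of incidents covered by at least one playbook; nothing is executed."""
    covered = sum(
        any(playbook_matches(p, incident) for p in playbooks)
        for incident in incidents
    )
    return covered / len(incidents)


# Hypothetical data for illustration only.
playbooks = [
    {"categories": {"phishing"}, "severities": {"medium", "high"},
     "actions": {"quarantine_email"}},
]
incidents = [
    {"category": "phishing", "severity": "high", "response_action": "quarantine_email"},
    {"category": "phishing", "severity": "medium", "response_action": "quarantine_email"},
    {"category": "phishing", "severity": "low", "response_action": "quarantine_email"},
    {"category": "ransomware", "severity": "critical", "response_action": "isolate_host"},
]
print(completion_rate(incidents, playbooks))  # 0.5
```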
8.6 Connector / SIEM coverage gaps
The synthetic telemetry corpus covers 14 log sources today. A connector that has not been hand-modelled into the corpus (e.g. a niche identity provider) will not appear in any suite. Submitting a new template + backing telemetry is the right path for parity — see Help us harden the harness.
9. Where to go from here
- The benchmark page for the actual numbers and the per-suite explanations.
- The eval-results branch for every historic snapshot.
- The v8.0 north-star plan for the planned expansions: token + USD telemetry (T2.4), public scoreboard (T5.4), weekly wet-eval CI (T5.5), and public-dataset fidelity benchmark (T5.3).
- Open-source agent code lives under services/agents/; every wet-eval run drives the same code.