Benchmark methodology

This page is the open reference for AiSOC's public evaluation. It exists so that anyone — a contributor, a regulated buyer, an auditor, or a researcher — can understand how every number on the benchmark page is produced, reproduce them on their own laptop, and surface where the eval genuinely cannot speak to real-world performance.

We follow one workspace-wide rule throughout this page: be transparent about what is synthetic versus what is real, and never present a substrate self-check as live agent performance. Every section below labels its scope.

1. Why this exists

Vendor-published AI SOC benchmarks are typically not reproducible by the buyer — the dataset, the rubric, and the runner are private. AiSOC takes the opposite position: ship the dataset, the harness, and the CI gate in the repo, label which numbers are real measurements vs. substrate self-checks, and invite reproductions.

This page captures the design decisions behind that harness so they can be critiqued, extended, or replaced over time.

2. Dataset

2.1 200 synthetic incidents

The eval drives all suites against a deterministic 200-incident corpus at services/agents/tests/eval_data/synthetic_incidents.json. The corpus is regenerated by scripts/generate_eval_incidents.py with a fixed seed, so two contributors on different machines produce byte-identical files.
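
As a minimal sketch of why a fixed seed gives byte-identical output, consider the toy generator below. It is not the code in scripts/generate_eval_incidents.py: `generate_corpus`, the value pools, and the seed are hypothetical, and the sketch assumes both contributors run the same Python version. The two ingredients are a private seeded RNG and a canonical JSON serialisation.

```python
import hashlib
import json
import random

# Hypothetical placeholder pools; the real generator's values are not shown on this page.
USERS = ["alice", "bob", "carol"]
HOSTS = ["ws-01", "srv-02", "db-03"]


def generate_corpus(seed: int, n_cases: int) -> bytes:
    """Build a toy corpus from a private, seeded RNG and serialise it canonically."""
    rng = random.Random(seed)  # private RNG: independent of global random state
    cases = [
        {"id": i, "user": rng.choice(USERS), "host": rng.choice(HOSTS)}
        for i in range(n_cases)
    ]
    # sort_keys and fixed separators keep the byte layout stable between runs
    return json.dumps(cases, sort_keys=True, separators=(",", ":")).encode("utf-8")


first = generate_corpus(seed=1234, n_cases=200)
second = generate_corpus(seed=1234, n_cases=200)
digest_first = hashlib.sha256(first).hexdigest()
digest_second = hashlib.sha256(second).hexdigest()
```

Comparing the two digests is the cheapest way to confirm that a regenerated file matches the committed one.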

Property	Value
Total cases	200
Distinct templates (macro families)	55
Cases per template	3–4 (cycled through `{user}/{host}/{ip}/{campaign}`)
Severity distribution	low / medium / high / critical, weighted toward medium
MITRE ATT&CK tactic coverage	11 of 14 v15 tactics
Synthetic vs real	100 % synthetic. No customer telemetry, no scraped logs, no proprietary IOCs.

The 55-template macro decomposition is a deliberate trade-off:

Why 55, not 200? A unique narrative for every case would drift toward "one template per case", where a single broken template moves the score by only 1/200 = 0.5 % and is easy to miss. Cycling 3–4 cases per template gives the per-template macro a much sharper regression signal (1/55 ≈ 1.8 % per broken template) without inflating the generator.
Why 200, not 50 or 1 000? 200 is small enough to run substrate suites in milliseconds (the per-PR gate runs the whole corpus in ~35 ms) and large enough that wet-eval timing variance per-template stays under 10 % at the published p95 floor.
Why deterministic? A flaky corpus is a flaky CI gate. Deterministic generation means two PRs that land on different days are evaluated against byte-identical inputs.
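
The difference between the two aggregates can be sketched in a few lines. The assignment of cases to templates by `i % 55` is an assumption made for illustration, and `per_case_mean` and `per_template_macro` are hypothetical helper names, not the harness's own functions.

```python
from collections import defaultdict


def per_case_mean(results):
    """Plain mean over all cases: each case has weight 1/N."""
    return sum(correct for _, correct in results) / len(results)


def per_template_macro(results):
    """Mean of per-template means: each template has weight 1/T."""
    by_template = defaultdict(list)
    for template_id, correct in results:
        by_template[template_id].append(correct)
    template_means = [sum(v) / len(v) for v in by_template.values()]
    return sum(template_means) / len(template_means)


# 200 cases cycled over 55 templates (assumed layout); template 54 is fully broken.
results = [(i % 55, 0 if i % 55 == 54 else 1) for i in range(200)]

case_score = per_case_mean(results)        # 197 of 200 cases pass
macro_score = per_template_macro(results)  # 54 of 55 templates pass
```

In this toy layout the broken template costs 1.5 percentage points on the per-case mean but about 1.8 on the macro, and the macro drop stays the same whether the template has three cases or four.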

2.2 Synthetic telemetry corpus

Each incident ships with at least one backing event in the companion synthetic_telemetry.jsonl (361 events across 14 sources today: Sysmon, Windows Security, M365 audit, Azure sign-in, CloudTrail, Linux auditd / journald, EDR, DNS, web access, Kubernetes audit, GitHub audit, VPN, DB audit). The shape of every event is what a real connector would pivot on — {user, host, ip, query, action, …} — so connector and Sigma PRs have a contract to develop against without provisioning a real tenant.

This corpus is structurally faithful to real connectors but is not captured from a live tenant. It is hand-shaped by a deterministic generator to model the field set a connector would emit, not the noise distribution a real environment would produce. See § 8 Limitations.
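
One way to check the "at least one backing event per incident" contract is sketched below. The `incident_id` and `source` field names and the sample rows are assumptions for illustration; the real schema of synthetic_telemetry.jsonl may differ.

```python
import json

# Hypothetical JSONL sample; "incident_id" and "source" are assumed field names.
SAMPLE_JSONL = """\
{"incident_id": "INC-001", "source": "sysmon", "user": "alice", "host": "ws-01", "action": "process_create"}
{"incident_id": "INC-001", "source": "dns", "host": "ws-01", "query": "example.test", "action": "lookup"}
{"incident_id": "INC-002", "source": "vpn", "user": "bob", "ip": "203.0.113.7", "action": "login"}
"""


def incidents_without_events(incident_ids, jsonl_text):
    """Return the incident IDs that have no backing telemetry event."""
    backed = set()
    for line in jsonl_text.splitlines():
        if line.strip():  # tolerate blank lines in the JSONL stream
            backed.add(json.loads(line)["incident_id"])
    return sorted(set(incident_ids) - backed)


missing = incidents_without_events(["INC-001", "INC-002", "INC-003"], SAMPLE_JSONL)
```

An empty result means every incident is backed; here `INC-003` would be flagged.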

2.3 What's synthetic vs what's real

Component	Class	Notes
200-incident corpus	Synthetic	Deterministic, regenerable
361-event telemetry	Synthetic	Hand-shaped to real connector field set
Substrate code under test	Real	Production-bound modules in `services/fusion`, `services/agents`
Live agent (LangGraph orchestrator)	Real	Driven only by wet-eval (§ 3.2)
LLM responses (wet-eval)	Real	Real OpenAI / Anthropic / Ollama calls, real cost
Rate card (§ 5)	Real	Public list prices, dated at the top of each row

3. Substrate vs. wet eval

This is the most important distinction on the benchmark page, so we restate it here in detail.

3.1 Substrate self-check (per-PR, milliseconds, no LLM)

The substrate suite drives deterministic code only:

Keyword extractors — pulled out of services/agents/app/extractors/
Fusion grouping — a faithful in-harness re-implementation of the Tier 1/2/3 rules used by services/fusion, minus the DB-backed dedup and the ML scorer
Report and plan templates — the deterministic synthesisers used as fall-back when an LLM call fails
Offline judges — keyword-coverage and rubric-score checks

These suites run in milliseconds and gate every PR. They do not measure LLM agent quality; they measure substrate consistency. Three of the four scoring suites (mitre_accuracy, investigation_completeness, response_quality) are tautological by construction — they pass as long as the substrate stays internally consistent and fail loudly if it does not.

This makes them an excellent regression signal but a poor capability score. Anyone treating the headline mitre_accuracy = 0.97 as "the agent is 97 % accurate at MITRE classification" is misreading the suite. The substrate suite is gating the substrate, not the agent.

3.2 Wet eval (weekly, real LLM, real money)

Wet eval drives the live services/agents LangGraph orchestrator end-to-end against the same 200-incident corpus. Every LLM call is real, every token costs real money, and every latency number reflects real network round trips.

Wet-eval property	Value
Cadence	Weekly, plus on-demand for PRs that touch the agent graph or prompts
Driver	`scripts/run_evals.py` with `--wet` (added by T2.4)
Dataset	Same `synthetic_incidents.json` (deterministic)
LLMs	Configurable via env (`AISOC_BENCH_PROVIDER=openai|anthropic|ollama`)
Telemetry captured	Wall-clock latency p50/p95/p99, prompt + completion tokens, USD cost
CI gate	`.github/workflows/wet-eval-weekly.yml` (T5.5)
Output	`eval_report.json -> wet_eval` block + render to `apps/docs/static/eval/`
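
The latency row above reports p50/p95/p99. This page does not state which percentile definition the runner uses, so the sketch below assumes the nearest-rank method; `percentile` is a hypothetical helper and the latencies are synthetic placeholders.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct % of samples at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))  # 1-based rank
    return ordered[rank - 1]


# Placeholder wall-clock latencies in milliseconds, one per incident (1.0 .. 200.0).
latencies_ms = [float(ms) for ms in range(1, 201)]

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
```

With n = 200, p99 is decided by the two or three slowest investigations, which is one reason a single run moves visibly.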

The weekly job pushes its results into the same eval-results branch as the substrate run, so historical trend lines stay in one place.

Cross-link: wet-eval-weekly.yml is added by T5.5 in the v8.0 plan. Until that lands, the per-template latency / token / USD cells on the benchmark page are placeholders rather than imputed numbers.

4. Suites and what each one actually measures

The full per-suite description (what each gate enforces, what it cannot catch) lives on the benchmark page. The summary, with class labels:

Suite	Class	Headline metric
`mitre_accuracy`	Substrate self-check	Per-template macro accuracy of the keyword extractor
`alert_reduction`	Real measurement	1 000-alert noisy stream → fused incident count
`investigation_completeness`	Substrate self-check	Mean keyword coverage of the report template
`response_quality`	Substrate self-check	Mean rubric score of the synthesised plan
`playbook_completion_rate`	Operational coverage gate	Fraction of in-scope incidents with a matched playbook
`synthetic_telemetry`	Schema / coverage gate	Per-source field presence + diversity
`latency_*` (wet)	Real measurement	Wall-clock p50/p95/p99
`tokens_*` (wet)	Real measurement	Total LLM tokens per investigation
`usd_*` (wet)	Real measurement	Rate-card-multiplied cost per investigation

5. Rate card

USD figures on the benchmark page are computed from the public list prices below at the time of each wet-eval run. The rate at run time is recorded inside the JSON report (wet_eval.usd.rate_card_at_run), so historic dollars do not silently revalue when prices change.

Provider	Model	Input ($/1M tok)	Output ($/1M tok)	Effective from
OpenAI	`gpt-4o-2024-11-20`	2.50	10.00	2024-11-20
OpenAI	`gpt-4o-mini`	0.15	0.60	2024-07-18
Anthropic	`claude-3.5-sonnet`	3.00	15.00	2024-10-22
Anthropic	`claude-3.5-haiku`	0.80	4.00	2024-11-04
Local	`ollama/llama3.1:8b`	0.00	0.00	n/a

Rates evolve. Providers update prices, and AiSOC ships a new wet-eval run when they do. We do not retroactively rewrite historic dollar numbers — old runs are recomputed only if you ask for them with the --rate-card override on scripts/run_evals.py. The "Effective from" column is the publish date of the rate, not the date of any specific eval run.

If a provider raises prices mid-week, the next weekly wet-eval run will reflect the new rate. The diff is visible in eval-results/eval/results/<sha>.json -> wet_eval.usd.rate_card_at_run.

If you are a maintainer landing a price update: bump the _RATE_CARD dict in scripts/run_evals.py, set effective_from, and open a PR. The next wet-eval run will pick it up automatically.
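
The rate-card arithmetic in this section can be sketched as follows. The dict layout and the `usd_cost` helper are assumptions for illustration, not the actual `_RATE_CARD` structure in scripts/run_evals.py. The two rates are copied from the table above.

```python
# Assumed layout: USD per 1M tokens, keyed by model name.
RATE_CARD = {
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "claude-3.5-haiku": {"input": 0.80, "output": 4.00},
}


def usd_cost(model, prompt_tokens, completion_tokens, rate_card=RATE_CARD):
    """Cost of one investigation: tokens times the per-million rate, input and output priced separately."""
    rate = rate_card[model]
    total = prompt_tokens * rate["input"] + completion_tokens * rate["output"]
    return total / 1_000_000


def usd_block(model, prompt_tokens, completion_tokens, rate_card=RATE_CARD):
    """Store the rate next to the dollars so later price changes cannot revalue the run."""
    return {
        "usd": usd_cost(model, prompt_tokens, completion_tokens, rate_card),
        "rate_card_at_run": dict(rate_card[model]),
    }


cost = usd_cost("gpt-4o-mini", prompt_tokens=12_000, completion_tokens=2_000)
block = usd_block("claude-3.5-haiku", prompt_tokens=10_000, completion_tokens=1_000)
```

Keeping a copy of the rate inside the report is what lets a historic figure stay stable after a price change.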

6. How to reproduce

Every number on the benchmark page is reproducible from a fresh clone in under 30 seconds (substrate) or 10–30 minutes (wet eval, depending on provider and concurrency).

6.1 Substrate run (per-PR, no LLM key required)

git clone https://github.com/beenuar/AiSOC.git
cd AiSOC
pnpm install
python scripts/run_evals.py --json --out my-eval.json
python scripts/render_eval_charts.py my-eval.json

Or, in one command:

pnpm eval:public

This runs the unified eval runner, writes eval_report.json, and renders a per-suite markdown chart bundle into eval/results/charts/ for inclusion in dashboards or PR comments. No API key is required for the substrate portion. The pnpm eval:public script is documented in package.json.

If a Makefile is added in the future, the same target will be exposed as make eval-public — both names are intended to remain interchangeable.

6.2 Wet eval (weekly, requires LLM credentials)

export AISOC_BENCH_PROVIDER=openai
export OPENAI_API_KEY=sk-...
python scripts/run_evals.py --wet --concurrency 8 --out my-wet-eval.json
python scripts/render_eval_charts.py my-wet-eval.json

Wet eval respects the BYOK flow documented in apps/docs/docs/operations/credentials.md. The --wet flag and per-call telemetry are added by T2.4 in the v8.0 plan. Until that lands, this command falls back to substrate-only mode.

The credentials never touch a configuration file: they are read from the environment for the duration of the run and dropped on exit.
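
A minimal sketch of environment-only credential resolution, under stated assumptions: `resolve_credentials` is a hypothetical helper, and `ANTHROPIC_API_KEY` is assumed by analogy with `OPENAI_API_KEY` rather than taken from this page. The key lives only in process memory and is never written to disk.

```python
import os

# Provider -> environment variable holding its key (None means no key is needed).
PROVIDER_KEY_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",  # assumed name, not confirmed by this page
    "ollama": None,
}


def resolve_credentials(env=None):
    """Read provider and key from the environment; raise instead of falling back to a file."""
    env = os.environ if env is None else env
    provider = env.get("AISOC_BENCH_PROVIDER", "")
    if provider not in PROVIDER_KEY_VARS:
        raise ValueError(f"unsupported provider: {provider!r}")
    key_var = PROVIDER_KEY_VARS[provider]
    if key_var is None:
        return provider, None
    key = env.get(key_var)
    if not key:
        raise RuntimeError(f"{key_var} is not set")
    return provider, key


# Example with an explicit mapping, so the sketch runs without real credentials.
provider, key = resolve_credentials(
    {"AISOC_BENCH_PROVIDER": "openai", "OPENAI_API_KEY": "sk-example"}
)
```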

6.3 Comparing your run to the published numbers

Every published wet-eval run pushes a snapshot to the eval-results branch. Diff yours against latest.json:

python scripts/run_evals.py --baseline eval/results/latest.json \
    --max-regression-pp 1.0 --out my-eval.json

That fails the run if mitre_accuracy regresses by more than 1.0 percentage point versus the baseline.
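
The gate logic amounts to a percentage-point comparison. The sketch below assumes scores are stored as fractions in [0, 1]; `regression_pp` and `passes_gate` are hypothetical names, not functions from scripts/run_evals.py.

```python
def regression_pp(baseline, current):
    """Regression in percentage points; positive means the current run is worse."""
    return (baseline - current) * 100.0


def passes_gate(baseline, current, max_regression_pp=1.0):
    """True when the drop versus the baseline stays within the allowed budget."""
    return regression_pp(baseline, current) <= max_regression_pp


small_drop_ok = passes_gate(baseline=0.97, current=0.965)  # 0.5 pp drop
large_drop_ok = passes_gate(baseline=0.97, current=0.95)   # 2.0 pp drop
improvement_ok = passes_gate(baseline=0.97, current=0.98)  # improvement
```

Percentage points are absolute: a move from 0.97 to 0.95 is a 2.0 pp regression regardless of the relative change.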

7. How to compare against your own SOC

This eval is designed to be reproducible against any SIEM or SOAR that can ingest a JSON corpus and emit grouped incidents + investigation reports. The protocol is intentionally minimal so a vendor or internal team can run it in an afternoon and post the result publicly.

Use the same dataset. Clone the repo at the commit you want to test against and use services/agents/tests/eval_data/synthetic_incidents.json verbatim. Do not regenerate; the SHA-256 of the file should match the provenance row on the benchmark page.
Use the same metrics. Score per-case mean and per-template macro, not just per-case mean — the dataset has 55 templates cycled 3–4×, and per-case-only is a weaker signal.
Run the five suites you can reproduce. mitre_accuracy, alert_reduction, investigation_completeness, response_quality, and playbook_completion_rate all live in services/agents/tests/ and require no AiSOC-specific runtime. Replace the substrate code under test with your equivalents.
For wet eval, declare your model and rate card. Latency, tokens, and USD figures are not comparable across providers without these labels — record them in your submission alongside your numbers.
Submit the report. A reproductions-and-comparisons issue template will live at .github/ISSUE_TEMPLATE/benchmark_reproduction.yml (added alongside the public scoreboard in T5.4 of the v8.0 plan). Until the template lands, open a regular issue tagged benchmark:reproduction with the commit SHA, dataset SHA, eval mode, model, and the attached eval_report.json. Comparisons that follow these rules are added to the public scoreboard.
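
The dataset check in the first step of the protocol above can be sketched as a streamed SHA-256 comparison. The helper names are hypothetical, and the demo writes a throwaway file rather than reading the real corpus; in practice the expected digest would come from the provenance row on the benchmark page.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large corpora do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_provenance(path: Path, expected_hex: str) -> bool:
    """Compare against a published digest, ignoring case and surrounding whitespace."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Demo on a temporary stand-in file, not the real corpus.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "synthetic_incidents.json"
    payload = b'[{"id": 1}]'
    sample.write_bytes(payload)
    expected = hashlib.sha256(payload).hexdigest()
    ok = matches_provenance(sample, expected)
    tampered = matches_provenance(sample, "0" * 64)
```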

We welcome any reproduction — including ones that show AiSOC losing on a particular template family. If you find a regression, please file an issue or open a PR; the harness exists precisely to surface those.

8. Limitations and threats to validity

Open-source honesty: there are real things this eval cannot tell you.

8.1 Synthetic data is not production data

The 200-incident corpus is hand-shaped to exercise structurally diverse templates, but real SOC alert streams differ in three important ways:

Distribution. Real environments are dominated by 5–10 noisy rules. The eval evenly distributes templates so all of them get exercised.
Adversarial drift. Real attackers mutate. The corpus is static. The adversary_eval suite (heavy / medium / light buckets) partially addresses this but is not a substitute for red-team data.
Context shape. Real incidents carry org-specific context (asset criticality, business hours, owner). The eval normalises this away.

What the eval can tell you: whether AiSOC's substrate gets worse over time on a fixed corpus, what the wet-eval cost looks like in steady state, and where template families regress.

8.2 Substrate self-checks are tautological by design

mitre_accuracy, investigation_completeness, and response_quality are written to gate substrate consistency, not LLM capability. We say this in several places on the benchmark page and again here, because it is the single most-misread part of the eval. The substrate score does not support a claim such as "AiSOC's agent is 97 % accurate at MITRE classification". The wet-eval LLM-as-judge variant (post-T2.4) is the right metric for that claim.

8.3 Wet-eval variance

LLM outputs are non-deterministic even at temperature 0 because providers batch differently across calls. Latency depends on provider load. Token counts depend on the model's tokeniser version. The published p50/p95/p99 are derived from a single weekly run; we do not report confidence intervals because n=200 is small enough that the headline number moves visibly run-over-run. Use the trend over many weeks rather than any single weekly snapshot when comparing.
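
One simple way to read the trend rather than a single snapshot is a trailing median over weekly values. This is a sketch only: `rolling_median` is a hypothetical helper, the three-week window is an arbitrary choice, and the p95 figures are made-up placeholders, not published results.

```python
from statistics import median


def rolling_median(series, window):
    """Median of each trailing window; early points use the samples available so far."""
    return [
        median(series[max(0, i - window + 1): i + 1])
        for i in range(len(series))
    ]


# Made-up weekly p95 latencies in milliseconds, oldest first.
weekly_p95_ms = [1800.0, 2100.0, 1900.0, 2500.0, 2000.0]
trend = rolling_median(weekly_p95_ms, window=3)
```

A median damps a single noisy week (the 2500 ms outlier here) more than a mean would.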

8.4 Cost is rate-card cost, not infra cost

The USD per-investigation figure includes the LLM API spend only. It does not include infrastructure (compute, storage, observability), engineering time, or the cost of human review of the agent's recommendations. A real SOC TCO model has to add those.

8.5 The playbook completion gate is coverage, not quality

The 50.5 % overall completion rate on the benchmark page is the share of synthetic incidents whose category, severity, and response action are matched by at least one playbook in playbooks/packs/v1/. It does not execute the playbook, time its steps, or measure its reliability against live telemetry. Step execution is covered by the playbook engine's own unit tests in services/agents/tests/test_playbook_engine.py.
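
The coverage definition above can be sketched as a pure matching check. The incident and playbook field names (`category`, `severity`, `response_action`, `categories`, `severities`, `actions`) are assumptions for illustration and may not match the schema in playbooks/packs/v1/.

```python
def is_covered(incident, playbooks):
    """An incident is covered when one playbook matches category, severity and action together."""
    return any(
        incident["category"] in pb["categories"]
        and incident["severity"] in pb["severities"]
        and incident["response_action"] in pb["actions"]
        for pb in playbooks
    )


def playbook_completion_rate(incidents, playbooks):
    """Share of incidents with at least one matching playbook; nothing is executed."""
    incidents = list(incidents)
    if not incidents:
        raise ValueError("no in-scope incidents")
    return sum(is_covered(i, playbooks) for i in incidents) / len(incidents)


# Hypothetical toy data.
playbooks = [
    {"categories": {"phishing"}, "severities": {"medium", "high"}, "actions": {"reset_password"}},
    {"categories": {"malware"}, "severities": {"critical"}, "actions": {"isolate_host"}},
]
incidents = [
    {"category": "phishing", "severity": "high", "response_action": "reset_password"},
    {"category": "malware", "severity": "critical", "response_action": "isolate_host"},
    {"category": "malware", "severity": "low", "response_action": "isolate_host"},
    {"category": "exfiltration", "severity": "high", "response_action": "block_ip"},
]
rate = playbook_completion_rate(incidents, playbooks)
```

All three fields must match within the same playbook, so the third incident stays uncovered even though its category and action appear in the pack.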

8.6 Connector / SIEM coverage gaps

The synthetic telemetry corpus covers 14 log sources today. A connector that has not been hand-modelled into the corpus (e.g. a niche identity provider) will not appear in any suite. Submitting a new template + backing telemetry is the right path for parity — see Help us harden the harness.

9. Where to go from here

The benchmark page for the actual numbers and the per-suite explanations.
The eval-results branch for every historic snapshot.
The v8.0 north-star plan for the planned expansions: token + USD telemetry (T2.4), public scoreboard (T5.4), weekly wet-eval CI (T5.5), and public-dataset fidelity benchmark (T5.3).
Open-source agent code lives under services/agents/ — every wet-eval run drives the same code.
