Skip to main content

AiSOC Public Eval Harness

North-star performance

p50 sub-minute, p95 sub-2-minute end-to-end on the 200-incident eval. Token + USD-per-investigation budgets are reported alongside latency in the "Performance, tokens, and cost" section below. The full provenance — commit SHA, dataset SHA, and eval mode — is in the provenance footer.

These targets are wet-eval (live LLM agent) numbers, not substrate self-checks. Substrate suites are reported separately in "Latest results" further down. Read What's substrate vs wet? before quoting any of these figures.

What's substrate vs wet?

This page reports two completely different classes of measurement, and we keep them visually separate so they're never confused:

ClassWhat it measuresSuites on this page
Substrate self-checkDetermines whether AiSOC's deterministic substrate (extractors, fusion logic, report and plan templates, judges) is internally consistent. Runs in milliseconds, no LLM, no DB. CI gates every PR on it.mitre_accuracy, investigation_completeness, response_quality, playbook_completion_rate, synthetic-telemetry coverage
Wet eval (live agent)Drives the live services/agents LangGraph orchestrator end-to-end against the same 200-incident corpus, with real LLM calls. Measures latency, token usage, USD cost, and (with an LLM-as-judge variant) live agent accuracy. Runs weekly, not per-PR.latency p50 / p95 / p99, tokens per investigation, USD per investigation

Workspace rule we follow: never present a substrate self-check as live agent performance. Every table below is labelled with its class.

An open, deterministic regression harness over the AiSOC substrate.

This page is not a leaderboard for AI SOC agents. It is a CI-gated harness that exercises the deterministic substrate underneath AiSOC — the keyword extractors, the in-harness fusion grouping (a faithful re-implementation of the production Tier 1/2/3 logic in services/fusion, minus the DB-backed dedup and ML scoring), the report and response templates, and the offline judges that grade them. The dataset, the harness, and the CI gate are all in the repo. You can reproduce every number on this page in under 10 seconds on a laptop.

What's new (v1.4):

  1. Every synthetic incident now ships with a backing synthetic telemetry corpus — Sysmon / Windows Security / M365 audit / Azure sign-in / CloudTrail / Linux auditd / journald / EDR / DNS / web access / Kubernetes audit / GitHub audit / VPN / DB audit events — written to synthetic_telemetry.jsonl. Connector and Sigma PRs now have something concrete to wire against. See Synthetic telemetry corpus below.
  2. Each of the substrate suites now reports a per-template macro alongside the per-case mean. The 200-case dataset draws from 55 distinct templates, so a single broken template moves the per-case headline by ~0.5 % but moves the per-template macro by ~1.8 %. The macro is a stronger regression signal that doesn't dilute when the dataset is enlarged. See Per-case vs. per-template metrics.

MITRE Accuracy Alert Reduction Investigation Completeness Response Quality

Read this first

This harness does not exercise the live LLM agent (services/agents LangGraph orchestrator), and the alert_reduction suite does not call the production services/fusion engine — it calls a standalone re-implementation of the same Tier 1/2/3 grouping rules that lives in the test file. It runs deterministic substrate code against synthetic data so we can gate every PR targeting main / develop in milliseconds. Three of the four metrics on this page measure the internal consistency of that substrate — not agent accuracy. We explain exactly what each suite measures — and doesn't — below.

Why this exists

Vendor claims about AI SOC performance — alert reduction percentages, MITRE coverage, analyst throughput — are typically not reproducible by buyers. The dataset, the baseline, and the rubric are not published. AiSOC takes the opposite approach: ship a small harness, label which metrics are real measurements and which are substrate self-checks, and let anyone reproduce the numbers.

  1. The dataset is in the reposervices/agents/tests/eval_data/synthetic_incidents.json (200 cases, deterministic, drawn from 55 distinct templates) plus its companion synthetic_telemetry.jsonl (361 backing events across 14 log sources). Both are regenerable from scripts/generate_eval_incidents.py. Three of the four offline suites use the incident dataset; the alert-reduction suite uses a separately generated 1 000-alert stream produced by generate_noisy_alert_stream in the test file.
  2. The harness is in the repo — five pytest suites under services/agents/tests/ (four scoring suites + a synthetic-telemetry schema/coverage gate).
  3. The CI gate runs on every PR targeting main / developlatest run. (CI is currently scoped to those two branches; PRs to long-lived feature branches are not gated.)
  4. Historical numbers are queryable — every successful build pushes its report (written by scripts/run_evals.py --out) to the eval-results branch as eval/results/<commit_sha>.json.

Latest results

The numbers below are produced by scripts/run_evals.py. The MITRE, completeness, and response-quality suites run against the 200-incident synthetic dataset; the alert-reduction suite runs against a separately generated 1 000-alert noisy stream; the playbook-completion suite runs against the same 200-incident corpus and the v1 playbook pack. The whole run takes roughly 35 ms total (no LLM calls, no DB) so it's cheap enough to gate every PR targeting main or develop.

SuiteMetricPer-casePer-template macroTargetWhat it checks
Alert reduction ratioreduction75.3 %n/a≥ 70 %Real measurement of the 3-tier fusion logic on a noisy 1 000-alert stream
MITRE ATT&CK tactic accuracyaccuracy97.0 %96.4 % (n=55)≥ 80 %Substrate self-consistency — keyword extractor vs. dataset written for it
Investigation completenessmean keyword coverage94.2 %94.3 % (n=55)≥ 85 %Substrate self-consistency — report template wraps the description; judge finds keywords from the description
Response-plan qualitymean rubric score1.0001.000 (n=55)≥ 0.80Substrate self-consistency — synthesizer embeds the keywords the rubric checks for
Playbook completion ratecompletion rate50.5 %100 % H/C (mapped)≥ 50 %Operational coverage gate — every incident in scope has a matching playbook with aligned response action; orphan playbooks/templates fail CI

The synthetic telemetry suite is a schema/coverage gate, not a scoring suite, so it does not appear in the table. It checks that every incident has ≥ 1 backing event, that all {user}/{host}/{ip}/{campaign} placeholders resolve, that every event carries the fields a real connector pivots on, and that the source distribution is not concentrated on a single template. It currently passes against 361 events spanning 14 distinct log sources wired to all 200 incidents.

The playbook completion rate is also an operational coverage gate — it grades the v1 playbook pack itself, not the substrate's per-incident output. The headline 50.5 % overall figure is the share of synthetic incidents whose category, severity, and response action are matched by at least one playbook in playbooks/packs/v1/. The 100 % high+critical figure is over the mapped subset (incidents whose template the pack claims to cover) — the raw H/C figure is reported in JSON but not gated, because the dataset includes documented v1 coverage gaps. See section 7.

These numbers move with the codebase. The current snapshot lives at eval-results/eval/results/latest.json.

Weekly history: the row above is the latest snapshot only. The full append-only weekly history (date, agent version, MITRE accuracy, MTC p50/p95, total USD, total tokens — substrate and wet-eval rows visually separated) lives on the public scoreboard. The T5.5 weekly job appends one row to that scoreboard every Sunday once wet-eval CI lands.

Performance, tokens, and cost

cells with fabricated numbers. The cells stay as placeholders until either T2.4's deterministic budget or T5.5's wet-eval run produces them. -->

This section presents two parallel views of the same workload:

  • A deterministic-substrate budget projection computed by T2.4. This runs in milliseconds without any LLM call and is gated on every PR.
  • A wet-eval measurement populated by T5.5's weekly job. This is the honest "live agent on real LLM" view; at substrate-only commits the cells are placeholders, never imputed.

We keep them visually separate so substrate budgets are never read as agent performance.

Deterministic-substrate budget projection (T2.4)

Substrate budget — projection, not wet eval

The four tables and four charts immediately below are computed from eval_report.json -> per_investigation. They are a projection of what the live agent will burn on the same workload, derived from a 4-chars-per-token estimator and an illustrative 2025-era public rate card. They are not real LLM calls, real wall-clock latency, or real billing. Quote them as a CI-gated upper bound; quote the wet-eval block below for live performance.

Reproduce locally::

python3 scripts/run_evals.py --telemetry-only --json --out report.json python3 scripts/render_eval_charts.py report.json --no-markdown

Headline numbers, current substrate snapshot (gpt-4o rate card, illustrative — input $2.50/M, output $10.00/M; n = 200 incidents, 55 templates):

Metric (substrate budget)meanmedianp95p99
Total tokens / investigation2,1862,1142,4522,478
Prompt tokens / investigation9569631,0311,050
Completion tokens / investigation1,2301,1201,4401,440
USD / investigation$0.01469$0.01368$0.01693$0.01699
Latency / investigation (ms, substrate path)0.00720.00730.01070.0137

The latency numbers above are substrate path only — incident description tokenize + telemetry-event JSON serialize. They are microseconds on a laptop, not seconds. The wet-eval table below is the right place to read live wall-clock latency. We surface the substrate distribution because the shape across templates is informative — a template with ten times the telemetry events takes ten times longer to walk, even at substrate speed, and that ratio is preserved when wet eval runs.

Latency distribution (substrate path)

p50 / p95 / p99 latency, deterministic substrate

Substrate-path wall-clock per investigation. Bars: p50 / p95 / p99 / mean. Wet-eval (T5.5) replaces the absolute numbers with real-LLM wall-clock; the relative shape across templates carries over.

Tokens per investigation distribution

Tokens per investigation, deterministic substrate

Histogram across all 200 incidents. The tail on the right is dominated by critical-severity templates (ransomware-encryption, process-hollowing-svchost, …) whose larger response plans push completion_tokens up by ≈ 1.8×. Median and p95 are marked.

USD per investigation distribution

USD per investigation, deterministic substrate

Same incidents, multiplied through the illustrative gpt-4o rate card ($2.50/M input, $10.00/M output). The p95 budget sits just under $0.017 / investigation. Substitute your own rate card by passing --telemetry-model to scripts/run_evals.py; the JSON report carries the full token matrix so the same eval can be re-priced against any model.

Latency p95 by template (top 20)

Latency p95 by template, deterministic substrate

Slowest 20 templates by substrate-path p95 latency, with p50 and p95 shown as overlaid bars. Useful for spotting templates whose telemetry shape balloons the prompt budget; under wet eval the same templates will dominate end-to-end latency too.

Wet-eval, not substrate

The three tables in the next subsection are a wet eval measurement — they require the live services/agents LangGraph orchestrator and real LLM calls, which is populated by the weekly wet-eval CI job (.github/workflows/wet-eval.yml, landed by T5.5). T2.4's deterministic-substrate budget projection lives directly above and in the JSON report, but is not substituted into the wet-eval tables below because substrate timings would not be honest representations of agent performance.

The workflow runs every Monday at 07:00 UTC, dispatches the 200-incident corpus through the live agent, captures real token / USD / latency metrics from the LLM provider's response metadata, then opens a PR titled chore(bench): weekly wet-eval YYYY-MM-DD with the refreshed numbers. Forks without billing configured see a clean no-op via the preflight check. The two CI secrets that drive it (WET_EVAL_OPENAI_KEY and AISOC_BENCH_BOT_TOKEN) are documented on the Secrets and CI tokens page; setup is a one-time configuration in the GitHub repo settings.

Wet eval — Table 1 — Latency per investigation

End-to-end wall-clock time from "incident received" to "investigation complete + response plan synthesised", per template and aggregate. Run against the 200-incident synthetic_incidents.json corpus with the live agent driving real LLM calls. Lower is better.

Substrate budget for the same metric is in the substrate-budget section above (microsecond range — substrate path only, not wet wall-clock).

Template familyp50 (s)p95 (s)p99 (s)n
Aggregate (all 200)200
Endpoint compromise
Identity / OAuth phish
Cloud (AWS / Azure / GCP)
Network / WAF / DNS
Application / SaaS

Target gates — aggregate p50 ≤ 60 s, aggregate p95 ≤ 120 s. A weekly CI job (T5.5 — wet-eval-weekly.yml) regrades the corpus and fails if either gate regresses by more than 10 % week-over-week.

Wet eval — Table 2 — Tokens per investigation

Total prompt + completion tokens consumed by the agent for one investigation, across every LLM call in the LangGraph topology. Includes context-bundle tokens, tool-call tokens, and final-plan synthesis tokens.

Substrate budget for the same metric: aggregate p95 ≈ 2,452 tokens at the time of writing. Read the substrate-budget section as a CI-gated upper bound that runs every PR.

Template familymeanmedianp95n
Aggregate (all 200)200
Endpoint compromise
Identity / OAuth phish
Cloud (AWS / Azure / GCP)
Network / WAF / DNS
Application / SaaS

Tokens are reported as totals (prompt + completion) so the table is model- independent. Per-call splits live in the JSON report under per_investigation.tokens_per_investigation (T2.4 deterministic budget) or wet_eval.tokens.by_call (T5.5 wet eval). The current rate card the dollar figures below use is in Rate card.

Wet eval — Table 3 — USD per investigation

Same denominator as Table 2, multiplied through the rate card current at the time of the run. Recorded in the JSON report so historic rate-card changes don't silently revalue old runs.

Substrate budget for the same metric: aggregate p95 ≈ $0.01693 at the illustrative gpt-4o rate card. Read the substrate-budget section for the per-template breakdown and chart.

Template familymean ($)median ($)p95 ($)n
Aggregate (all 200)200
Endpoint compromise
Identity / OAuth phish
Cloud (AWS / Azure / GCP)
Network / WAF / DNS
Application / SaaS

The full breakdown by model (gpt-4o, gpt-4o-mini, local Ollama, …) and per-call class (router, planner, judge) lives under per_investigation.rate_card_per_m_tokens_usd (T2.4) or wet_eval.usd.by_model (T5.5) in the JSON report. Rate-card sources and effective dates live on the methodology page.

Per-case vs. per-template metrics

The 200-case dataset is built by drawing each case from one of 55 distinct templates (Sysmon process hollowing, M365 OAuth-consent phish, CloudTrail EC2 IMDS credential theft, Azure AD impossible travel, Linux SUID abuse, …) and swapping the {user}/{host}/{ip}/{campaign} slot in each one. That gives the substrate a wider blast radius to regress against without inflating the generator. Two metrics are reported for every scoring suite:

  • Per-case mean — the headline number, weighted across all 200 incidents. Closest to "how often does the substrate get an answer right".
  • Per-template macro — the unweighted mean across the 55 distinct templates. A single broken template (≈ 4 cases) moves the per-case mean by only ~0.5 % but moves the per-template macro by ~1.8 %. This is the dilution-resistant signal that catches template-class regressions.

Both gates have to pass for CI to be green. The harness output prints the per-template macro under each suite headline, plus the IDs of any individual templates that fall below the per-template floor (those are surfaced as information, not as failures, as long as the macro stays above the gate). This addresses a fair concern raised on the launch thread that 200 cases cycled from 55 templates can hide regressions behind the duplicates: the macro is exactly the metric that surfaces them.

Reproduce these numbers

You don't have to take our word for it. From a fresh clone:

git clone https://github.com/beenuar/AiSOC && cd AiSOC
python3 scripts/run_evals.py

That's it. No Docker, no API key, no GPU, no LLM. Expected output:

==============================================================================
AiSOC Pillar-1 Eval - 200-incident synthetic benchmark
==============================================================================
[PASS] mitre_accuracy accuracy 0.970 (target >= 0.80)
per-template macro 0.964 (target >= 0.80, n=55 templates) [PASS]
[PASS] alert_reduction reduction_ratio 0.753 (target >= 0.70)
[PASS] investigation_completeness mean_keyword_coverage 0.942 (target >= 0.85)
per-template macro 0.943 (target >= 0.80, n=55 templates) [PASS]
[PASS] response_quality mean_rubric_score 1.000 (target >= 0.80)
per-template macro 1.000 (target >= 0.75, n=55 templates) [PASS]
[PASS] playbook_completion_rate completion_rate 0.505 (target >= 0.50)
high+critical (mapped) 1.000 (target >= 0.95) [PASS]
action alignment 0.939 (target >= 0.85) [PASS]
orphan playbooks / templates 0 / 0 [PASS]
------------------------------------------------------------------------------
Synthetic telemetry: 361 events across 14 sources,
200 incidents wired up
(services/agents/tests/eval_data/synthetic_telemetry.jsonl)
==============================================================================
ALL GATES PASSED

To regenerate the dataset and its backing telemetry from scratch (e.g. after adding a template):

python3 scripts/generate_eval_incidents.py

That writes services/agents/tests/eval_data/synthetic_incidents.json and the companion synthetic_telemetry.jsonl deterministically (seeded RNG).

For machine-readable output (CI/dashboards):

python3 scripts/run_evals.py --json
# or, fail non-zero on regression:
python3 scripts/run_evals.py --ci --out report.json

What each suite actually measures

1. Alert reduction ratio — Real measurement

Source: services/agents/tests/test_alert_reduction.py

A 1 000-alert noisy stream — pure duplicates, near-duplicates within a 30-minute host window, multi-host rule storms, and benign low-score chatter — is fed into the in-harness fuse_alerts function. That function is a deterministic, in-memory re-implementation of the same Tier 1/2/3 grouping rules used by the production services/fusion engine — minus the DB-backed deduplicator and the ML scorer. The grouping logic itself is the same:

  • Tier 1 — same (rule, host, user) within 10 minutes → 1 incident
  • Tier 2 — same (rule, host) within 30 minutes → merge into a Tier-1 incident
  • Tier 3 — same rule within 5 minutes across ≥ 3 hosts → "storm" incident

Incidents below the noise threshold (score < 0.35) are dropped. The output is whatever the code produces — a fusion-rule regression will move the number. This is a legitimate measurement of grouping behavior on a controlled dataset, but it is not end-to-end coverage of the production fusion service.

The reported ~75 % is the actual output of the in-harness grouping function on this fixed dataset. It is not tuned to match a marketing number.

2. MITRE ATT&CK tactic accuracy — Substrate self-consistency

Source: services/agents/tests/test_mitre_accuracy.py

Each synthetic incident is generated with a labeled MITRE tactic and a description that is, by design, written to include keywords the hand-curated extractor in the test recognizes. A case is "correct" if the predicted tactic set has at least one overlap with the curated expected-tactic set.

The 97 % therefore mostly checks that dataset and extractor agree with each other. It is useful as:

  • A regression sentinel — if someone breaks the extractor or rewrites the dataset without updating the other, this suite catches it.
  • A schema sanity check — every incident carries at least one tactic the extractor can reach.

It is not:

  • A measure of LLM agent accuracy on real telemetry.
  • A score that should be compared to vendor MITRE benchmarks.

Treat it as a regression sentinel for the substrate, not a leaderboard score.

3. Investigation completeness — Substrate self-consistency

Source: services/agents/tests/test_investigation_completeness.py

Each synthetic incident ships with a list of evidence_keywords. A deterministic report simulator wraps the incident's description field into a Markdown report; the judge then looks for those evidence keywords in the report.

Because the description is what produces the evidence keywords in the first place, and the simulator pastes the description back into the report verbatim, the score is close to a string-copy tautology. It confirms:

  • The report template still includes the description.
  • The keyword judge can still tokenize and match.

It does not confirm an LLM agent wrote a complete investigation. The real value of this suite is catching template breakage — not LLM quality.

4. Response-plan quality — Substrate self-consistency

Source: services/agents/tests/test_response_quality.py

A deterministic response-plan synthesizer produces a containment plan for each incident. By construction, the synthesizer embeds:

  • The expected MITRE techniques into the plan summary.
  • The first evidence_keyword into the plan steps.

An offline judge then scores each plan against a 5-criterion rubric:

  1. Action aligned — the plan's action class matches the curated response_class
  2. Severity aware — plan tone scales with severity
  3. MITRE aligned — plan references at least one expected tactic
  4. Evidence grounded — plan references at least one expected evidence keyword
  5. Actionable — plan contains concrete imperative verbs and step-by-step structure

Because the synthesizer embeds exactly what the rubric checks for, criteria 3 and 4 are essentially guaranteed; 1, 2, and 5 are also driven by the templated generator. The score is ~1.000 by construction.

This catches a broken templating pipeline (e.g. someone removes the MITRE references from the synthesizer, or the rubric stops matching) — it is not a grade of LLM-written response plans.

5. Synthetic telemetry corpus — Schema and coverage gate

Source: services/agents/tests/test_synthetic_telemetry.py · Output: synthetic_telemetry.jsonl

Every synthetic incident now ships with at least one backing telemetry event written to a companion JSONL file. This addresses a real ask from the community: connector and Sigma rule PRs need concrete events to wire against without having to make up their own. The corpus currently covers 14 sources:

SourceWhat it representsCommon pivot fields
sysmonWindows process / network / image-load events (EID 1, 3, 7, 11)Computer, User, Image, CommandLine
windows_securityWindows Security log (logon, privilege use, account changes)Computer, TargetUserName, EventID
m365_auditMicrosoft 365 unified audit logUserId, Operation, Workload, ClientIP
azure_signinAzure AD / Entra sign-in loguserPrincipalName, appDisplayName, ipAddress
cloudtrailAWS CloudTrail management eventseventName, userIdentity, awsRegion
linux_auditdLinux auditd records (execve, syscall)type, syscall, auid, exe
linux_journaldLinux journald / syslog_SYSTEMD_UNIT, MESSAGE, _HOSTNAME
edrGeneric EDR detection (CrowdStrike / SentinelOne shape)rule, severity, device.hostname
dnsDNS resolver / sinkholequery_name, query_type, client_ip
web_accessWeb access / WAF / proxy loghttp_method, url, status_code
k8s_auditKubernetes audit log (audit.k8s.io/v1)verb, objectRef.resource, user.username
github_auditGitHub audit log (org / repo / app events)action, actor, repo
vpnVPN concentrator (auth and tunnel events)action, user, client_ip
db_auditDatabase audit trail (Postgres / Oracle / SQL Server shapes)user, operation, query

Each event has its {user}/{host}/{ip}/{campaign} placeholders resolved against the parent incident, so an event for INC-EVAL-044 carries the same user and host as the incident itself.

The schema/coverage gate (test_synthetic_telemetry.py) checks five things on every CI run:

  1. No unresolved placeholders — every event survives a recursive walk without finding a stray {...} slot.
  2. Per-source required fields — for each declared source, the fields a real connector pivots on are present and non-empty.
  3. Coverage — every incident in synthetic_incidents.json has ≥ 1 backing event in the JSONL corpus.
  4. Source diversity — at least 12 distinct sources appear across the corpus (we currently ship 14).
  5. No single-template concentration — no one template accounts for more than 5 % of the events, which keeps the corpus useful for connector regressions instead of being dominated by one scenario.

This is not a scoring suite. It does not gate detection accuracy or agent quality — it gates "did the synthetic substrate produce something an external connector or Sigma rule can be run against." A failing synthetic-telemetry test means a template stopped emitting events of the shape it promised, not that the agent got worse.

If you are landing a new connector, point it at this file:

head -n 5 services/agents/tests/eval_data/synthetic_telemetry.jsonl

Each line is a self-contained event with incident_id, template_id, source, and the event payload. Filter by source to focus your tests.

6. AI-vs-AI adversary eval — Graceful-degradation gate

Source: services/agents/tests/test_adversary_eval.py · Dataset: eval_data/adversary_incidents.json · Generator: scripts/generate_adversary_incidents.py

A deterministic attacker-LLM mutator rewrites every defender keyword in the 200-incident dataset into evasive synonyms, character obfuscation, and fragmentation. Three intensity buckets control how aggressively the text is mutated:

BucketShareMutation
heavy~45 %Every keyword swapped to synonym / obfuscated
medium~35 %One expected tactic preserved cleanly
light~20 %Leetspeak only (control bucket)

Three regression floors are enforced:

GateFloor / ceilingRationale
Overall catch rate≥ 0.40Under heavy mutation the extractor is expected to drop ~50 pp from its unmutated 0.95 baseline. The floor keeps graceful degradation honest.
Light-bucket catch rate≥ 0.85Light-tier obfuscation is just leetspeak; if the defender fails this bucket, a heavier failure is hiding a deeper regression.
Heavy-bucket catch rate≤ 0.50If the heavy bucket starts catching too much, the dataset is no longer adversarial — the mutation grammar has drifted.

This suite answers a specific question: does the substrate fall off a cliff when an attacker deliberately evades the keyword catalogue? It does not attempt to prove the defender is great under adversarial pressure — it proves the defender doesn't silently collapse to zero. The per-bucket accuracy curve is the metric to watch over time.

To regenerate the adversary dataset:

python3 scripts/generate_adversary_incidents.py

7. Playbook completion rate — Operational coverage gate

Source: services/agents/tests/test_playbook_completion_rate.py · Dataset: eval_data/synthetic_incidents.json · Pack under test: playbooks/packs/v1/

This suite is not a quality measurement of playbook execution — it is a coverage gate over the v1 playbook pack itself. For every one of the 200 synthetic incidents, the harness asks: does at least one playbook in the pack match this incident's category, severity, and expected response action? That answer is then rolled up into a small set of CI-enforced sub-gates so a PR that touches playbooks/packs/ cannot silently shrink coverage.

The metric reports five things; the suite passes only when all of them pass:

Sub-gateFloor / valueWhat it catches
Overall completion rate≥ 0.50Wholesale regressions (e.g. a deletion or category rename that drops dozens of incidents into the "no playbook" bucket). The floor is intentionally honest about v1's coverage, not aspirational.
High+critical completion rate (mapped)≥ 0.95Among incidents whose templates the pack claims to cover, every severe one must have a containment playbook. We measure this over the mapped subset to avoid forcing a pass by inflating coverage with mismatched playbooks. The raw rate (over all H/C templates) is reported for transparency but not gated.
Action alignment rate≥ 0.85Among matched incidents, the playbook's first-line steps must align with the dataset's response_class (e.g. block / quarantine / disable_user / reset_credentials). Stops "any matching playbook is good enough" drift.
Orphan playbooksexactly 0 (with allowlist)Playbooks that match zero incidents in the corpus. An explicit allowlist (_PLAYBOOKS_NOT_IN_BENCHMARK_DATASET) documents playbooks that are deliberately off-corpus (e.g. removable-media exfil, volumetric DDoS) so adding new orphans is loud.
Orphan templatesexactly 0 (with allowlist)Mapped templates with zero playbook hits. Templates whose category isn't in v1 scope are explicitly listed in _TEMPLATES_WITHOUT_PACK_COVERAGE; the gate only triggers when a previously-mapped template loses coverage.

The mapped-vs-raw distinction matters: the synthetic dataset includes ~22 endpoint-compromise / persistence / defense-evasion templates that are documented v1 coverage gaps. Gating on raw high+critical coverage would either force dishonest coverage (mapping a credential-stuffing playbook to ransomware just to clear the floor) or punish CI for known scope decisions. Gating on mapped coverage instead lets v1's scope evolve cleanly: as new playbooks land and templates move from _TEMPLATES_WITHOUT_PACK_COVERAGE into _TEMPLATE_CATEGORIES, the mapped denominator grows with them.

What this suite does not do: it does not execute playbooks, time their steps, or measure their reliability against live telemetry. Step execution is covered by the playbook engine's own unit tests; this gate is the inventory check that runs before every push.

The full per-suite payload is emitted as JSON via scripts/run_evals.py --json --out report.json under suites.playbook_completion_rate.details — including per-severity breakdown, per-category coverage, and the orphan lists — so the same diff that adds a playbook can be reviewed alongside its coverage delta.

Community benchmark scoreboard

The dataset and the harness are MIT-licensed and fully reproducible. Any third party — another open-source project, a vendor, or an internal team — can run the same suite against the same 200 incidents and submit a result:

python3 scripts/run_evals.py --json --out report.json

Submissions go through a structured GitHub issue template (.github/ISSUE_TEMPLATE/benchmark_submission.yml). Accepted entries are rendered on the benchmark scoreboard in the web console. Submission rules:

  1. Same fixed dataset — run against the deterministic 200-incident dataset on the commit you submit. No private fixtures.
  2. Same harness — run scripts/run_evals.py --json --out report.json with no flags that disable gates. Attach the full report.json so per-template macros are auditable.
  3. Open agent or label as closed — if your agent code is open, link it. If it is closed, the entry is accepted but labeled "closed-source".
  4. No template-stuffing — the three substrate self-consistency suites are gameable by stuffing keywords into reports. Submissions caught doing this are rejected; the alert-reduction measurement is not gameable in the same way.

Comparison to other AI SOC offerings

CapabilityAiSOCWazuhSplunkClosed-source AI SOC
Open-source (MIT)yesyesnono
Self-hostableyesyesyesno
Agent decisions step-by-step auditableyesn/an/ano
Public, reproducible regression harnessyesnonono
Eval dataset shipped in the repoyesnonono
Substrate-level regression gate in CIyesnonono
Plugin SDK (Python + Go)yesyesyespartial
Freeyesyesnono

A self-hostable, MIT-licensed agent with a published regression harness is something an auditor or regulated buyer can review directly. Vendor cloud agents typically cannot be reviewed at the same level.

What this is not

A few caveats:

  • No LLM agent runs in this harness. It exercises deterministic extractors and templated report/plan synthesis. The live services/agents/ LangGraph orchestrator that talks to OpenAI or Anthropic is not under test here. A separate online eval (LLM-as-judge, real orchestrator) is on the roadmap and will run nightly. That is where actual agent accuracy gets measured.
  • The dataset is synthetic. 200 incidents drawn from 55 templates is enough to flag major regressions and to give connector PRs concrete events to wire against, but it is not enough to claim production parity. Federated, opt-in real-customer evaluation is on the roadmap.
  • The synthetic telemetry corpus is hand-shaped, not captured from a live tenant. It models the structure that real connectors pivot on (process tree, principal, source IP, log source) but is not a substitute for capturing real M365 / CloudTrail / Sysmon events from a production environment. Treat it as a contract for connector development, not as a red-team dataset.
  • Three of the four scoring judges are tautological by design. The dataset, the templates, and the judge were written together to keep the gate fast and deterministic. They will pass as long as the substrate is internally consistent. They will fail if it is not. The per-template macro adds a non-tautological dimension on top: a single broken template stops being hidden behind 199 working duplicates.
  • The playbook completion gate is a coverage check, not a quality check. It verifies that every in-scope incident has a playbook with the right category and response action — it does not execute the playbook, time its steps, or measure its reliability against live telemetry. Step execution is covered by the playbook engine's own unit tests in services/agents/tests/test_playbook_engine.py.
  • "Public eval harness" means this harness, not a third-party leaderboard. These numbers are reproducible by anyone with python3. They are not comparable to MITRE Engenuity, MLPerf, or any other external evaluator.

Historical results

Every CI run on main writes a snapshot into the eval-results branch:

eval/results/<commit_sha>.json # one snapshot per commit
eval/results/latest.json # always points to most recent passing build
eval/results/badge-*.json # shields.io endpoints

You can git clone -b eval-results to graph the trend yourself, or open the Actions tab for per-run job summaries.

Help us harden the harness

Pull requests welcome. The fastest ways to make this harness honestly stronger:

  • Land the online LLM-as-judge variant. Wire OPENAI_API_KEY / ANTHROPIC_API_KEY through the harness so the report and response judges run against actual LLM output instead of the templated synthesizer. That is what turns this page into a real agent benchmark.
  • Add a connector and a Sigma rule against the synthetic telemetry corpus. Pick a source from synthetic_telemetry.jsonl (e.g. m365_audit or cloudtrail), wire a connector that ingests events of that shape into the fusion service, and land a Sigma rule that fires on the events backing the matching INC-EVAL-* cases. The corpus is exactly the contract you can develop against without provisioning a real tenant.
  • Add a new template with backing telemetry. Drop a new entry into _TEMPLATES in scripts/generate_eval_incidents.py with a unique template_id and a tuple of telemetry events. Re-run the generator and the per-template gate will keep us honest about whether the substrate handles the new class.
  • Find a template the keyword extractor misses. Watch the per-template MITRE macro under each suite — if it dips, the failing-templates list is printed inline. Fixtures for those cases land as a single PR against the extractor.
  • Find a fusion miss. Add a contrived alert pattern that should de-dupe but doesn't. The reduction-ratio gate will block the regression.
  • Tighten the report and plan rubrics. The completeness and quality suites are intentionally permissive in v1. PRs that add stricter evidence-grounding or that decouple the synthesizer from the judge keywords are highly welcome.

See CONTRIBUTING.md for the full path.

Provenance

Every published number on this page comes from a single deterministic pipeline. The provenance footer below is regenerated by the weekly wet-eval CI job (.github/workflows/wet-eval-weekly.yml, landed by T5.5) and the per-PR substrate run (.github/workflows/ci.yml). The fields are populated from eval_report.json so anyone can reproduce them.

FieldValue (substrate run)Source
Commit SHAgit rev-parse HEAD at eval time
Run date (UTC)eval_report.json -> generated_at
Dataset SHA-256sha256(synthetic_incidents.json + synthetic_telemetry.jsonl)
Eval modesubstrate (per-PR)eval_report.json -> mode
Harness versionscripts/run_evals.py @ <commit>repo path
Rate card datewet-eval onlyRate card

The wet-eval row is populated by the weekly job once T2.4's telemetry lands. Until then, the cells are placeholders rather than imputed values — see the methodology page for why we do not backfill.

Reproduce these numbers locally:

git clone https://github.com/beenuar/AiSOC.git
cd AiSOC
pnpm eval:public # runs run_evals.py + render_eval_charts.py

Full instructions, dataset description, rate card, and limitations live on the methodology page.