Alert Reduction

Synthetic, deterministic, reproducible in under a second. This page documents the AiSOC alert-reduction benchmark — what it measures, how the 3-tier fusion logic collapses a noisy alert stream into actionable incidents, and how to reproduce the numbers locally.

The headline number on the Public Eval Harness page (75.3 %) is the output of this benchmark. This page is the long-form companion: the dataset shape, the per-tier collapse breakdown, the critical-alert preservation guarantee, and the road to per-rule false-positive rates (tracked under issue #5).

Read this first

This is a synthetic benchmark. The 1,000-alert stream is generated by generate_noisy_alert_stream in services/agents/tests/test_alert_reduction.py with a fixed RNG seed; the fusion code under test is a faithful in-harness re-implementation of the production Tier 1 / Tier 2 / Tier 3 grouping rules in services/fusion, minus the DB-backed deduplicator and the ML scorer. It runs deterministically on every PR targeting main / develop and gates fusion-logic regressions in milliseconds. It is not an end-to-end measurement of the production fusion service against real customer telemetry.

TL;DR

| Metric | Value | Target |
| --- | --- | --- |
| Reduction ratio | 75.3 % (1,000 alerts → 247 incidents) | ≥ 70 % |
| Critical alerts preserved | 93 / 93 (zero drop) | 100 % |
| Storm incidents collapsed (Tier 3) | 16 (avg ~8.9 alerts each, 3–8 distinct hosts) | ≥ 1 |
| Determinism | Same RNG seed → same byte-for-byte output | Required |
| Wall time | ~8 ms on a laptop | < 100 ms |

The current snapshot lives at eval/results/latest.json on the eval-results branch, under suites.alert_reduction. Every successful build on main / develop adds a new commit-keyed snapshot alongside it (see CI artefact location).

What this measures

A SOC's first-order pain isn't "did we catch the threat" — it's "did we drown in the catch." Modern detection stacks routinely emit thousands of alerts per hour, and the same underlying event (a single misconfigured agent, one phish campaign, one credential-stuffing burst) can light up dozens of rules at once. The job of fusion is to take that raw rule output and emit a small, deduplicated, human-reviewable list of incidents.

This benchmark answers exactly one question:

Given a controlled, realistic distribution of noise, what fraction of incoming alerts collapse into incidents — and does the substrate drop any of the alerts that an analyst genuinely needed to see?

It is a real measurement, not a self-consistency gate. The grouping logic under test mirrors the Tier 1/2/3 algorithm shipped in services/fusion, so a regression in the grouping rules moves this number. The other three substrate suites on the public eval page are tautological by design (dataset and judge written together to gate template breakage); this one is not.

The synthetic dataset

The 1,000-alert stream is generated by generate_noisy_alert_stream with a fixed seed so every run is byte-for-byte identical. The mix is deliberately shaped to look like the worst hour of a real SIEM:

| Bucket | Share | What it represents |
| --- | --- | --- |
| Pure duplicates | ~25 % | Same (rule, host, user) within the same minute: the agent re-firing on the same telemetry |
| Near-duplicates | ~30 % | Same (rule, host), different user or ±5-minute window: same root event, slight metadata drift |
| Storms | ~15 % | One rule firing across many hosts within a 5-minute window: campaign / worm / misconfigured push |
| Benign noise | ~10 % | Low-score rules (score < 0.35): the chatter every SOC silences but the SIEM still emits |
| Unique high-signal events | ~20 % | Distinct (rule, host, user) events with no friends: the things you actually want to look at |
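As a rough illustration of how a fixed seed makes such a mix reproducible, here is a minimal sketch. It is not the body of generate_noisy_alert_stream: the function name, the seed value, and the weighted-choice approach are all assumptions. Only the bucket shares come from the table above.

```python
import random
from collections import Counter

# Bucket shares copied from the table above; everything else is illustrative.
BUCKET_SHARES = {
    "pure_duplicate": 0.25,
    "near_duplicate": 0.30,
    "storm": 0.15,
    "benign_noise": 0.10,
    "unique_high_signal": 0.20,
}


def sketch_bucket_assignment(n_alerts=1000, seed=42):
    """Assign each synthetic alert to a bucket using a private, seeded RNG.

    The seed value 42 is hypothetical; the real seed lives in the test file.
    """
    rng = random.Random(seed)  # private RNG, so global random state is untouched
    names = list(BUCKET_SHARES)
    weights = list(BUCKET_SHARES.values())
    return rng.choices(names, weights=weights, k=n_alerts)


first = sketch_bucket_assignment()
second = sketch_bucket_assignment()
assert first == second  # same seed, same sequence of buckets
print(Counter(first))
```

Using a private random.Random instance rather than the module-level functions keeps the stream independent of whatever else in the test session touches the global RNG.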

The full distribution from the latest run:

| Severity | Alerts in | Incidents out (post-fusion) |
| --- | --- | --- |
| critical | 93 | 38 |
| high | 417 | 130 |
| medium | 303 | 79 |
| low | 187 | 0 (collapsed below the noise floor) |
| Total | 1,000 | 247 |

The low-severity bucket disappears entirely: those alerts carry score < 0.35 by construction and get dropped by the noise threshold. This is intentional — the benign-noise bucket exists to verify the noise-floor cut still fires, not to be shown to a human.
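The headline ratio can be recomputed from the per-severity counts. The snippet below is a small arithmetic check, assuming the ratio is defined as 1 − incidents / alerts, which is consistent with 1,000 alerts → 247 incidents → 75.3 %.

```python
# (alerts in, incidents out) per severity, copied from the table above.
severity_table = {
    "critical": (93, 38),
    "high": (417, 130),
    "medium": (303, 79),
    "low": (187, 0),
}

alerts_in = sum(alerts for alerts, _ in severity_table.values())
incidents_out = sum(incidents for _, incidents in severity_table.values())

# Assumed definition: the fraction of alerts that did not become their own incident.
reduction_ratio = 1 - incidents_out / alerts_in

print(alerts_in, incidents_out, round(reduction_ratio, 3))  # 1000 247 0.753
```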

The 3-tier fusion logic

The grouping algorithm has three rules, applied in order. The exact code lives in fuse_alerts inside the test file (so the benchmark stays self-contained) and mirrors the production logic in services/fusion.

Tier 1 — strict identity collapse

Same (rule_id, host, user) within a 10-minute window → 1 incident. This is the "the agent re-fired on the same telemetry" case. Most pure duplicates and same-source loops collapse here.
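A minimal sketch of this rule follows. It is not the fuse_alerts code: the dict field names and the choice to measure the 10-minute window from the first alert of each group are assumptions made for illustration.

```python
TIER1_WINDOW_S = 10 * 60  # 10-minute window from the rule above


def tier1_collapse(alerts):
    """Group alerts sharing (rule_id, host, user) within 10 minutes of a group's first alert."""
    open_groups = {}  # key -> most recent group for that key
    incidents = []
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        key = (alert["rule_id"], alert["host"], alert["user"])
        group = open_groups.get(key)
        if group is not None and alert["ts"] - group[0]["ts"] <= TIER1_WINDOW_S:
            group.append(alert)  # same identity, still inside the window
        else:
            group = [alert]  # first sighting, or the window has expired
            open_groups[key] = group
            incidents.append(group)
    return incidents


tier1_alerts = [
    {"rule_id": "R1", "host": "h1", "user": "alice", "ts": 0},
    {"rule_id": "R1", "host": "h1", "user": "alice", "ts": 30},
    {"rule_id": "R1", "host": "h1", "user": "alice", "ts": 900},  # 15 minutes later
    {"rule_id": "R1", "host": "h1", "user": "bob", "ts": 10},
]
tier1_sizes = [len(group) for group in tier1_collapse(tier1_alerts)]
print(tier1_sizes)  # [2, 1, 1]
```

Sorting by timestamp first is what makes the result independent of input order, which matters for the determinism property described later on this page.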

Tier 2 — host-scoped merge

Same (rule_id, host) within a 30-minute window, but spanning multiple users (or the same user with ±5-minute drift) → merged into one Tier-1 incident. This catches "same root event, slightly different metadata" duplicates, including a single attacker pivoting between local accounts on the same host.
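One way to picture the host-scoped merge is as a second pass over Tier 1 output that ignores the user field. The sketch below takes that view. It assumes each incident is a list of alert dicts sorted by timestamp and that the 30-minute window is measured from the first alert of the merged incident; neither detail is confirmed by the production code.

```python
TIER2_WINDOW_S = 30 * 60  # 30-minute window from the rule above


def tier2_merge(incidents):
    """Merge incidents that share (rule_id, host) and start within 30 minutes of each other."""
    merged = []
    open_by_host = {}  # (rule_id, host) -> most recent merged incident
    for incident in sorted(incidents, key=lambda group: group[0]["ts"]):
        head = incident[0]
        key = (head["rule_id"], head["host"])
        target = open_by_host.get(key)
        if target is not None and head["ts"] - target[0]["ts"] <= TIER2_WINDOW_S:
            target.extend(incident)  # different user, same root event
        else:
            target = list(incident)
            open_by_host[key] = target
            merged.append(target)
    return merged


tier1_output = [
    [{"rule_id": "R1", "host": "h1", "user": "alice", "ts": 0},
     {"rule_id": "R1", "host": "h1", "user": "alice", "ts": 30}],
    [{"rule_id": "R1", "host": "h1", "user": "bob", "ts": 120}],
    [{"rule_id": "R1", "host": "h2", "user": "alice", "ts": 60}],
    [{"rule_id": "R1", "host": "h1", "user": "carol", "ts": 7200}],  # two hours later
]
tier2_result = tier2_merge(tier1_output)
print([len(group) for group in tier2_result])  # [3, 1, 1]
```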

Tier 3 — rule-storm collapse

Same rule_id firing within a 5-minute window across ≥ 3 distinct hosts → one "storm" incident with host = <storm:N-hosts>. This is the "campaign / worm / misconfigured deployment push" case — the kind of event that historically generates a hundred near-identical tickets.
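The storm rule can be sketched as a per-rule scan over fixed 5-minute windows. This is a simplification that works on raw alerts rather than on Tier 1/2 output, and the tumbling-window strategy and field names are assumptions. Only the ≥ 3 host threshold, the 5-minute window, and the `<storm:N-hosts>` label come from the text above.

```python
STORM_WINDOW_S = 5 * 60  # 5-minute window from the rule above
STORM_MIN_HOSTS = 3      # minimum distinct hosts for a storm


def tier3_storms(alerts):
    """Return (storm_incidents, leftover_alerts) using per-rule tumbling windows."""
    by_rule = {}
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        by_rule.setdefault(alert["rule_id"], []).append(alert)

    storms, leftovers = [], []
    for rule_id, rule_alerts in by_rule.items():
        i = 0
        while i < len(rule_alerts):
            start = rule_alerts[i]["ts"]
            j = i
            # Extend the window while alerts stay within 5 minutes of its first alert.
            while j < len(rule_alerts) and rule_alerts[j]["ts"] - start <= STORM_WINDOW_S:
                j += 1
            window = rule_alerts[i:j]
            hosts = {a["host"] for a in window}
            if len(hosts) >= STORM_MIN_HOSTS:
                storms.append({
                    "rule_id": rule_id,
                    "host": f"<storm:{len(hosts)}-hosts>",
                    "members": window,
                })
            else:
                leftovers.extend(window)
            i = j
    return storms, leftovers


storm_input = [{"rule_id": "R7", "host": f"h{n}", "ts": n * 60} for n in range(1, 5)]
storm_input += [
    {"rule_id": "R9", "host": "h1", "ts": 0},
    {"rule_id": "R9", "host": "h2", "ts": 10},  # only two hosts, so no storm
]
storms, leftovers = tier3_storms(storm_input)
print(storms[0]["host"], len(leftovers))  # <storm:4-hosts> 2
```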

Noise-floor drop

Any incident with aggregate score < 0.35 is dropped before being emitted. This is the configurable knob a tenant uses to silence chronically chatty rules without de-listing them; in the benchmark it means the 187 low alerts vanish.
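The text does not say how the aggregate score is derived from member scores. The sketch below assumes the maximum member score, purely for illustration; a mean or a weighted scheme would fit the description equally well. The threshold value 0.35 is the one quoted above.

```python
NOISE_FLOOR = 0.35  # threshold quoted above; configurable per tenant


def apply_noise_floor(incidents, threshold=NOISE_FLOOR):
    """Keep only incidents whose aggregate score reaches the threshold."""
    kept = []
    for members in incidents:
        # Assumption: aggregate score = highest score among member alerts.
        aggregate = max(alert["score"] for alert in members)
        if aggregate >= threshold:
            kept.append(members)
    return kept


scored_incidents = [
    [{"score": 0.20}, {"score": 0.30}],  # all members below the floor
    [{"score": 0.10}, {"score": 0.90}],  # one strong member keeps it alive
    [{"score": 0.35}],                   # exactly at the floor, so kept
]
kept_incidents = apply_noise_floor(scored_incidents)
print(len(kept_incidents))  # 2
```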

Latest results

The numbers below are produced by python3 scripts/run_evals.py --json on the current main and rendered into eval/results/latest.json on the eval-results branch by the CI gate.

Headline

| Field | Value |
| --- | --- |
| metric | reduction_ratio |
| value | 0.753 |
| target | 0.70 |
| passed | true |
| duration_ms | 7.9 |

Collapse breakdown

| Bucket | Count | Avg alerts collapsed | Notes |
| --- | --- | --- | --- |
| Tier 1/2 multi-member incidents (non-storm) | 56 | ~8.8 alerts each | Same root event, deduplicated |
| Tier 3 storm incidents | 16 | ~8.9 alerts each | 3–8 distinct hosts per storm |
| Single-member incidents | 175 | 1 each | Genuine unique signal: exactly what an analyst needs to triage |
| Total emitted incidents | 247 | from 1,000 raw alerts | |

The 175 single-member incidents are the floor: those are the alerts that should survive fusion, because they represent distinct events. Anything beyond that floor is a deduplication win.
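As a sanity check, the breakdown can be reconciled against the severity counts reported earlier on this page. The averages are rounded, so the reconstruction is only approximate. It also assumes that all 187 low-severity alerts fall outside the emitted incidents, which matches the "0 incidents out" figure for that row.

```python
# (bucket, incident count, approximate alerts per incident) from the table above.
breakdown = [
    ("tier1_2_multi_member", 56, 8.8),
    ("tier3_storm", 16, 8.9),
    ("single_member", 175, 1.0),
]

emitted_incidents = sum(count for _, count, _ in breakdown)
alerts_in_incidents = sum(count * avg for _, count, avg in breakdown)

# Alerts expected inside incidents: everything except the dropped low-severity alerts.
expected_alerts = 1000 - 187

print(emitted_incidents)           # 247
print(round(alerts_in_incidents))  # 810, close to the expected 813
```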

Critical-alert preservation

| Severity tier | Alerts in | Preserved in incidents | Dropped |
| --- | --- | --- | --- |
| critical | 93 | 93 (100 %) | 0 |

This is the non-negotiable invariant: fusion never silently drops a critical alert. The substrate may merge a critical alert into a multi-member incident (it does — 93 critical alerts are represented across the 38 critical incidents emitted), but it will never delete one. The test_critical_alerts_never_dropped test asserts this directly and is part of the same CI gate.
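The invariant reduces to a set difference: every critical alert id in the input must appear as a member of some emitted incident. The sketch below shows that check. The `id`, `severity`, and `members` field names are hypothetical, and this is not the body of test_critical_alerts_never_dropped.

```python
def find_dropped_critical(alerts, incidents):
    """Return the ids of critical alerts that appear in no emitted incident."""
    critical_ids = {a["id"] for a in alerts if a["severity"] == "critical"}
    represented = {m["id"] for incident in incidents for m in incident["members"]}
    return critical_ids - represented


sample_alerts = [
    {"id": 1, "severity": "critical"},
    {"id": 2, "severity": "critical"},
    {"id": 3, "severity": "low"},  # may be dropped by the noise floor
]

# Both critical alerts merged into one incident: merged, not deleted.
merged_ok = [{"members": [sample_alerts[0], sample_alerts[1]]}]
assert find_dropped_critical(sample_alerts, merged_ok) == set()

# A fusion bug that loses alert 2 would surface here.
buggy = [{"members": [sample_alerts[0]]}]
print(find_dropped_critical(sample_alerts, buggy))  # {2}
```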

Determinism

Two consecutive runs on the same commit produce byte-for-byte identical output. The benchmark uses a fixed RNG seed inside generate_noisy_alert_stream, and the fusion algorithm is order-stable on its inputs. This matters for two reasons:

  1. CI flakiness has a single root cause. If two CI runs disagree on the reduction ratio, it's a code change, not a coincidence.
  2. Diff-friendly regression review. When a PR moves the number, you can inspect which incidents changed shape (single-member ↔ multi-member ↔ storm) instead of being told "the noise moved."

The test_run_is_deterministic test enforces this — it runs fuse_alerts twice and compares the incident lists element-for-element.
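A sketch of what such a check can look like, using a toy stream and a toy grouping function in place of generate_noisy_alert_stream and fuse_alerts. All names, the seed, and the digest step are illustrative assumptions. Hashing a canonical JSON serialization is one way to express the "byte-for-byte" comparison.

```python
import hashlib
import json
import random


def toy_stream(seed, n=50):
    """Stand-in for the real generator: a small seeded alert stream."""
    rng = random.Random(seed)
    return [
        {
            "rule_id": f"R{rng.randint(1, 5)}",
            "host": f"h{rng.randint(1, 4)}",
            "ts": rng.randint(0, 3600),
        }
        for _ in range(n)
    ]


def toy_fuse(alerts):
    """Stand-in for fusion: group by (rule_id, host) after a total-order sort."""
    groups = {}
    ordered = sorted(alerts, key=lambda a: (a["ts"], a["rule_id"], a["host"]))
    for alert in ordered:
        groups.setdefault((alert["rule_id"], alert["host"]), []).append(alert)
    return list(groups.values())  # dicts preserve insertion order


def digest(incidents):
    """SHA-256 of a canonical JSON encoding, for byte-level comparison."""
    payload = json.dumps(incidents, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


run_a = toy_fuse(toy_stream(seed=1234))
run_b = toy_fuse(toy_stream(seed=1234))
assert run_a == run_b                  # element-for-element comparison
assert digest(run_a) == digest(run_b)  # byte-for-byte comparison
```

Sorting on a key that includes every field leaves no ties for the sort to break arbitrarily, which is one simple way to make a grouping pass order-stable.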

Reproduce locally

From a fresh clone:

git clone https://github.com/beenuar/AiSOC && cd AiSOC
python3 scripts/run_evals.py

Expected output (excerpt):

[PASS] alert_reduction reduction_ratio 0.753 (target >= 0.70)

For machine-readable output:

python3 scripts/run_evals.py --json --out /tmp/eval.json
python3 -c "import json; print(json.dumps(json.load(open('/tmp/eval.json'))['suites']['alert_reduction'], indent=2))"

To run only the alert-reduction tests under pytest:

cd services/agents
python3 -m pytest tests/test_alert_reduction.py -v

That runs four tests:

  • test_reduction_ratio_meets_target — the headline ≥ 70 % gate
  • test_critical_alerts_never_dropped — the zero-loss invariant
  • test_run_is_deterministic — same seed → same output
  • test_storms_collapse — Tier 3 fires when ≥ 3 hosts share a rule in a 5-minute window

CI artefact location

Every push to main or develop triggers .github/workflows/ci.yml, which runs scripts/run_evals.py --ci --out report.json. On success, the report is committed to the orphan eval-results branch:

eval/results/<commit_sha>.json # one snapshot per commit
eval/results/latest.json # most recent passing build on main
eval/results/badge-reduction.json # shields.io endpoint backing the badge above

Reading the latest snapshot from CI:

curl -s https://raw.githubusercontent.com/beenuar/AiSOC/eval-results/eval/results/latest.json \
| python3 -c "import json,sys; d=json.load(sys.stdin); print(json.dumps(d['suites']['alert_reduction'], indent=2))"

Drop that into a dashboard, a Slack notifier, or a tenant-facing report — it's the same JSON shape run_evals.py writes locally.
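For a notifier or report, the suite entry can be turned into a one-line summary. The sketch below assumes the suite object carries the fields listed in the Headline table (metric, value, target, passed, duration_ms). The function name and message format are made up for illustration.

```python
def summarize_alert_reduction(report):
    """Format the alert_reduction suite entry as a single status line."""
    suite = report["suites"]["alert_reduction"]
    status = "PASS" if suite["passed"] else "FAIL"
    return (
        f"[{status}] alert_reduction {suite['metric']} {suite['value']:.3f} "
        f"(target >= {suite['target']:.2f}, {suite['duration_ms']} ms)"
    )


# Sample payload built from the Headline table above, not fetched from CI.
sample_report = {
    "suites": {
        "alert_reduction": {
            "metric": "reduction_ratio",
            "value": 0.753,
            "target": 0.70,
            "passed": True,
            "duration_ms": 7.9,
        }
    }
}
summary_line = summarize_alert_reduction(sample_report)
print(summary_line)
# [PASS] alert_reduction reduction_ratio 0.753 (target >= 0.70, 7.9 ms)
```

The same function works on the dict loaded from /tmp/eval.json or from the raw latest.json URL, provided the field names match.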

What this benchmark does not tell you

In the spirit of being painfully honest about synthetic vs. real:

  • It does not measure the production fusion service end-to-end. The in-harness fuse_alerts is a faithful re-implementation of the Tier 1/2/3 rules from services/fusion, but the production service additionally runs a Postgres-backed deduplicator (handles cross-window persistence) and an ML scorer (lifts and depresses scores based on historical FP rate). Those components have their own unit tests; they're out of scope here because this gate has to run in milliseconds in CI.
  • It does not exercise real telemetry. The 1,000-alert stream is shaped to look like the worst hour of a SIEM, but it isn't a capture from one. Real-customer evaluation is on the v1.5 roadmap and will be opt-in / federated; see apps/docs/docs/benchmark.md.
  • It does not compute per-rule false-positive rates. That is the next stage of this benchmark, tracked under issue #5 — per-rule cross-fire FP eval gate (PR #89, in flight). Once that ships, this page will gain a "per-rule FP rate" section that surfaces which rules contribute most to the noise floor and how their merge rate trends over time.
  • It does not score "alerts an analyst would have wanted." That requires labeled human-judgement data, which we don't have synthetically. The closest proxy here is the critical-alert preservation invariant: zero critical alerts are silently dropped.

Roadmap

The pieces that turn this from a substrate gate into a buyer-grade benchmark:

  • Per-rule FP rate breakdown — depends on issue #5 / PR #89. Will surface per-rule contribution to the noise floor and gate net-new noisy detections in CI.
  • Per-tenant fusion config replay — let an operator paste their fusion knobs (window sizes, noise threshold, severity overrides) and replay the stream against them. Lets a buyer answer "what would my number look like" before deploying.
  • Live-fusion online benchmark — once the production services/fusion service has a stable replay API, run the same 1,000-alert stream through the real DB-backed deduplicator + ML scorer in a nightly job. The delta vs. this in-harness number is the "what does the substrate gate miss" signal.
  • Federated, opt-in real-tenant evaluation — collapsed against tenant telemetry under DP-style aggregation so the headline number reflects a population, not a single deployment.

PRs against any of those are welcome; the "Help us harden the harness" section on the umbrella benchmark page lists the contribution path.

See also