Claims about an autonomous red team agent are cheap. Decepticon ships with a benchmark harness that runs against public, reproducible challenges — so the claims are measurable.

What’s Benchmarked

Decepticon’s benchmark suite integrates with the xbow validation-benchmarks — a public set of CTF-style challenges across difficulty levels and tag-based categories (web, crypto, AD, cloud, etc.). Each benchmark run measures whether Decepticon can:
  1. Generate a valid OPPLAN for the challenge from a minimal prompt.
  2. Execute the kill chain end-to-end without operator intervention.
  3. Recover the FLAG (proof of solve) within the time budget.

The Harness

The benchmark harness lives under benchmark/ in the source repository. It is a separate Python package with a Typer CLI:
python -m benchmark.runner run \
  --level 1 \
  --tags web,sqli \
  --range-start 1 --range-end 10 \
  --batch-size 3 \
  --timeout 1800 \
  --parallel
| Flag | Purpose |
| --- | --- |
| --level | Challenge difficulty tier (1–3) |
| --tags | Filter by ATT&CK-aligned tag |
| --range-start / --range-end | Subset selection from the suite |
| --batch-size | Number of challenges per scoring batch |
| --timeout | Per-challenge wall-clock budget (seconds) |
| --parallel | Run challenges concurrently |
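To make the interaction between the selection flags concrete, here is a minimal sketch of how a level filter, a tag filter, a range, and a batch size could combine. It is not the harness's actual code: the `Challenge` record, the `select` and `batches` helpers, the "any requested tag matches" rule, and the inclusive range over challenge numbers are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class Challenge:
    # Hypothetical record; the real suite's metadata layout may differ.
    number: int
    level: int
    tags: frozenset[str]


def select(challenges: list[Challenge], level: int, tags: list[str],
           range_start: int, range_end: int) -> list[Challenge]:
    """Keep challenges at `level`, carrying any requested tag, inside the range.

    Assumptions: the range is inclusive and applies to the challenge number;
    an empty tag list means "no tag filter".
    """
    wanted = set(tags)
    return [
        c for c in challenges
        if c.level == level
        and (not wanted or wanted & c.tags)
        and range_start <= c.number <= range_end
    ]


def batches(items: list[Challenge], size: int) -> Iterator[list[Challenge]]:
    """Yield consecutive groups of at most `size` challenges."""
    for i in range(0, len(items), size):
        yield items[i:i + size]
```

Under these assumptions, the invocation shown earlier would select level-1 challenges numbered 1 through 10 that carry the web or sqli tag, then process them in groups of three.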

Lifecycle Per Challenge

For every challenge, the harness performs the same four-step lifecycle:
  1. Setup: build the challenge’s Docker environment and inject a unique FLAG.
  2. Invoke: hand the challenge prompt to the LangGraph platform; Decepticon’s orchestrator generates an OPPLAN and executes it.
  3. Evaluate: grep the agent’s workspace for the FLAG{...} pattern. Match → pass; no match → fail.
  4. Teardown: run docker compose down -v to fully reset state for the next challenge.
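The lifecycle can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the harness's implementation: `run_challenge`, `find_flag`, and the three callables are hypothetical names, the regular expression is a guess at the flag shape, and comparing against the injected value (rather than accepting any string that matches the pattern) is an assumption as well. Timeout handling is left out.

```python
import re
from pathlib import Path
from typing import Callable

# Assumed flag shape: the literal FLAG, then a non-empty body in braces.
FLAG_RE = re.compile(r"FLAG\{[^}]+\}")


def find_flag(workspace: Path, expected: str) -> bool:
    """Return True if any file under `workspace` contains the expected flag."""
    for path in workspace.rglob("*"):
        if not path.is_file():
            continue
        text = path.read_text(errors="ignore")  # tolerate binary files
        if expected in FLAG_RE.findall(text):
            return True
    return False


def run_challenge(setup: Callable[[], str],
                  invoke: Callable[[], None],
                  teardown: Callable[[], None],
                  workspace: Path) -> bool:
    """Run setup, invoke, evaluate, teardown; return the binary result."""
    flag = setup()  # hypothetical: builds the environment, returns the injected flag
    try:
        invoke()  # hypothetical: the agent runs and writes into `workspace`
        return find_flag(workspace, flag)
    finally:
        teardown()  # always reset state, even if the agent run raised
```

Putting teardown in a `finally` block is the design point here: a crashed or failed run should not leave containers or volumes behind to contaminate the next challenge.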

Scoring

Scoring is binary per challenge — flag captured or not. The aggregate report breaks down results by:
  • Level (1 / 2 / 3) — pass rate per difficulty tier
  • Tag (web, ad, cloud, …) — pass rate per category
  • Wall-clock — median and p95 time-to-flag
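As a rough illustration of the breakdown above, the following sketch aggregates per-challenge results into pass rates and timing statistics. The result dictionary keys (`level`, `tags`, `passed`, `seconds`) are hypothetical, and two choices are assumptions rather than documented behavior: time-to-flag is computed over solved challenges only, and p95 uses the nearest-rank method.

```python
from collections import defaultdict
from statistics import median


def pass_rates(results, key):
    """Pass rate per group; `key` maps a result to one or more group labels."""
    passed, total = defaultdict(int), defaultdict(int)
    for r in results:
        for label in key(r):
            total[label] += 1
            passed[label] += bool(r["passed"])
    return {label: passed[label] / total[label] for label in total}


def p95(values):
    """95th percentile by nearest rank (integer arithmetic avoids float error)."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def summarize(results):
    """Aggregate binary results by level and tag, plus time-to-flag stats."""
    # Assumption: only solved challenges have a meaningful time-to-flag.
    times = [r["seconds"] for r in results if r["passed"]]
    return {
        "by_level": pass_rates(results, lambda r: [r["level"]]),
        "by_tag": pass_rates(results, lambda r: r["tags"]),
        "median_seconds": median(times) if times else None,
        "p95_seconds": p95(times) if times else None,
    }
```

A challenge with several tags counts toward each of its tags, so per-tag totals can sum to more than the number of challenges run.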

Reports

The reporter produces two artifacts per run:
  • report.json — machine-readable result for CI
  • report.md — human-readable summary with per-challenge transcripts and timing
These are versioned alongside the source so regressions surface in code review.

Why Binary Scoring

A red team agent that “almost solved it” is functionally identical to one that did not. Binary scoring keeps the metric honest — the only thing that counts is whether the engagement objective was achieved. For nuance — how the agent solved it, what tradecraft it used, what false starts it made — the per-challenge transcript is the artifact to read.

Continuous Evaluation

The benchmark harness is wired into CI. Every change to the agent system, skill library, or model routing triggers a partial benchmark run; major releases trigger the full suite. This is how Decepticon catches regressions before they ship: a model swap that improves a few tasks but breaks others is exactly the kind of failure that per-challenge binary scoring surfaces immediately.
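One way to see why per-challenge results matter for regression detection is to diff two runs. The sketch below assumes each run has been reduced to a mapping from a challenge identifier to a pass/fail boolean; that shape, the identifiers, and the function names are hypothetical and not taken from the actual report.json schema.

```python
def regressions(baseline: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Challenges that passed in `baseline` but do not pass in `current`."""
    return sorted(
        cid for cid, ok in baseline.items()
        if ok and not current.get(cid, False)
    )


def improvements(baseline: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Challenges that pass in `current` but did not pass in `baseline`."""
    return sorted(
        cid for cid, ok in current.items()
        if ok and not baseline.get(cid, False)
    )


def pass_rate(run: dict[str, bool]) -> float:
    """Fraction of challenges solved in a run."""
    return sum(run.values()) / len(run)


# Hypothetical runs: the aggregate pass rate is unchanged,
# yet one previously solved challenge now fails.
before = {"c1": True, "c2": True, "c3": False}
after = {"c1": True, "c2": False, "c3": True}
```

With these example runs, both pass rates are equal while `regressions(before, after)` still reports `c2`, so a CI gate keyed on the regression list would flag a trade-off that an aggregate number alone would hide.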

Multi-Model Routing

Benchmark runs are also how new model profiles are validated before they become the default.