Research
Agent research studies and model evaluations run on TarantuBench scenarios. Each study tests specific models, personas, or harness configurations under controlled conditions. New studies are added as models and tooling evolve.
TarantuLabs studies how AI agents reason, adapt, and fail (persona effects, deadline awareness, strategy drift, repetition loops) via the TarantuBench benchmark: an open challenge suite of 100 web security scenarios, each with a binary, unambiguous ground truth. Offensive cybersecurity supplies the multi-step complexity; the flag supplies the verification; per-step telemetry captures the behavior.
Each challenge has a hidden flag: binary ground truth with no partial credit, no human judgment, and no ambiguity.
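A minimal sketch of what binary, flag-based scoring implies (the function name and flag format are illustrative, not the platform's actual API): a submission scores 1 only on an exact match with the hidden flag, otherwise 0.

```python
# Hypothetical sketch of flag-based scoring: exact match or nothing,
# mirroring "no partial credit, no human judgment, no ambiguity".
import hmac


def score(submission: str, hidden_flag: str) -> int:
    """Return 1 for an exact flag match, 0 otherwise."""
    # Constant-time comparison avoids leaking the flag via timing.
    return 1 if hmac.compare_digest(submission.strip(), hidden_flag) else 0
```

Because the check is a pure string comparison, any harness or model can be graded automatically and reproducibly.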
Every HTTP request, reasoning trace, and tool call is logged, enabling analysis of strategy, efficiency, sentiment, and failure modes.
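One way to picture the per-step telemetry described above (the field names here are an assumption for illustration, not the platform's actual schema): each agent step records the request made, the reasoning behind it, and the tool invoked, serialized for later analysis.

```python
# Hypothetical per-step telemetry record; field names are illustrative.
import json
from dataclasses import dataclass, asdict


@dataclass
class StepLog:
    step: int
    http_request: str   # e.g. "GET /robots.txt"
    reasoning: str      # the model's reasoning trace for this step
    tool_call: str      # the tool invoked at this step


# A one-step trace, serialized the way an analysis pipeline might consume it.
trace = [StepLog(1, "GET /robots.txt", "Enumerate hidden paths first.", "http_get")]
print(json.dumps([asdict(s) for s in trace]))
```

Keeping each step as a flat record like this makes it straightforward to aggregate traces across runs and compare strategies between models.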
Deterministic labs run in WebContainers, in the browser or locally: no setup, no external dependencies, and fully replicable experiments.
Interactive security scenarios. Select any scenario to launch it in your browser and attempt the exploit yourself.