Research Organization

Agentic AI behavioral research via a verifiable benchmark

TarantuLabs studies how AI agents reason, adapt, and fail — persona effects, deadline awareness, strategy drift, repetition loops — via the TarantuBench benchmark: an open challenge suite of 100 web security scenarios, each with a binary, unambiguous ground truth. Offensive cybersecurity provides the multi-step complexity; the flag provides the verification; per-step telemetry captures the behavior.

View on GitHub
100 Verifiable Scenarios

Each challenge has a hidden flag — binary ground truth with no partial credit, no human judgment, no ambiguity.
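As a minimal sketch of what binary grading implies, assuming a hypothetical scenario record with a hidden flag field (the names below are illustrative, not TarantuBench's actual schema):

```ts
// Hypothetical scenario record; field names are assumptions.
interface Scenario {
  id: string;
  flag: string; // hidden ground truth, never shown to the agent
}

// Solved means exact flag equality: no partial credit, no judge model.
function isSolved(scenario: Scenario, submitted: string): boolean {
  return submitted.trim() === scenario.flag;
}
```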

Rich Telemetry

Every HTTP request, reasoning trace, and tool call is logged. Analyze strategy, efficiency, sentiment, and failure modes.
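One way such a log could be modeled, with field names assumed for illustration; the helper shows how per-step telemetry supports a behavioral metric like the repetition loops mentioned above:

```ts
// Illustrative telemetry event union; the real log format is an assumption.
type TelemetryEvent =
  | { kind: 'http_request'; step: number; method: string; url: string; status: number }
  | { kind: 'reasoning'; step: number; text: string }
  | { kind: 'tool_call'; step: number; tool: string; args: Record<string, unknown> };

// Fraction of HTTP requests that are exact repeats: a crude signal
// for repetition loops.
function repeatedRequestRatio(events: TelemetryEvent[]): number {
  const reqs = events.filter(
    (e): e is Extract<TelemetryEvent, { kind: 'http_request' }> =>
      e.kind === 'http_request',
  );
  if (reqs.length === 0) return 0;
  const unique = new Set(reqs.map((r) => `${r.method} ${r.url}`));
  return 1 - unique.size / reqs.length;
}
```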

100% Reproducible

Deterministic labs run in WebContainers — in the browser or locally. No setup, no external dependencies, fully replicable experiments.
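WebContainers are scriptable from the browser via the @webcontainer/api package; a minimal boot sketch might look like the following (the file tree is a placeholder, not an actual TarantuBench lab):

```ts
import { WebContainer } from '@webcontainer/api';

// Boot a Node.js runtime inside the browser tab (requires a
// cross-origin-isolated page) and mount a placeholder lab.
const container = await WebContainer.boot();
await container.mount({
  'server.js': {
    file: {
      contents:
        "require('http').createServer((_, res) => res.end('lab')).listen(3000);",
    },
  },
});

// Same file tree, same runtime, same behavior on every run.
await container.spawn('node', ['server.js']);
container.on('server-ready', (port, url) => {
  console.log(`lab listening on ${url} (port ${port})`);
});
```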

Research

TarantuBench supports both model evaluation and deeper agent research. Compare frontier models under controlled conditions, or use the rich telemetry to study reasoning, tool-use patterns, and behavioral differences.
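For the model-comparison side, a controlled evaluation reduces to aggregating binary outcomes per model; here is a sketch over a hypothetical run record (field names assumed):

```ts
// Hypothetical per-run record; field names are illustrative.
interface RunRecord {
  model: string;
  scenarioId: string;
  solved: boolean; // binary outcome from the flag check
  steps: number; // steps used before solving or giving up
}

// Solve rate per model under identical scenarios and limits.
function solveRates(runs: RunRecord[]): Map<string, number> {
  const tally = new Map<string, { solved: number; total: number }>();
  for (const r of runs) {
    const t = tally.get(r.model) ?? { solved: 0, total: 0 };
    t.solved += r.solved ? 1 : 0;
    t.total += 1;
    tally.set(r.model, t);
  }
  return new Map([...tally].map(([m, t]) => [m, t.solved / t.total]));
}
```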

Latest Research

Frontier Model Comparison — April 2026

Claude 4.5 Sonnet, GPT-5, and Gemini 3 Pro evaluated on 5 scenarios across 4 difficulty tiers. HTTP-only tooling, no code execution, 30-step limit.

View full results →
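The constraints above (HTTP-only tooling, no code execution, a 30-step budget) translate to a simple harness loop; the sketch below is an assumption about the shape of such a harness, with `callModel` and the flag check as placeholders:

```ts
// Sketch of one episode under the stated constraints: one HTTP tool,
// no code execution, hard 30-step budget. Names are placeholders.
const MAX_STEPS = 30;

type Action = { url: string } | { flag: string };

async function runEpisode(
  callModel: (history: string[]) => Promise<Action>,
  expectedFlag: string, // hidden ground truth held by the harness
): Promise<boolean> {
  const history: string[] = [];
  for (let step = 0; step < MAX_STEPS; step++) {
    const action = await callModel(history);
    if ('flag' in action) return action.flag === expectedFlag; // binary outcome
    const res = await fetch(action.url); // HTTP-only tooling
    history.push(`${res.status} ${await res.text()}`);
  }
  return false; // exhausting the budget counts as a failure
}
```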

Scenario Catalog

Interactive security scenarios. Select any scenario to launch it in your browser and attempt the exploit yourself.
