AgentContract-Bench v2

Benchmarks

293 scenarios across 12 domains. 100% pass rate. Live LLM results.

Benchmark Overview

The most comprehensive behavioral contract benchmark for AI agents.

293 Benchmark Scenarios
100% Pass Rate
12 Domains
3 LLMs Evaluated

Live LLM Results

Aggregate compliance scores (Θ) across all 293 scenarios.

Claude Sonnet 4.6: Θ = 0.823 (Highest Compliance)
Mistral-Large-3: Θ = 0.813 (Strong Compliance)
GPT-5.3: Θ = 0.688 (Moderate Compliance)

Domain Breakdown

Scenario distribution across 12 enterprise domains.

Domain              Scenarios  Coverage
E-Commerce          42         Order, inventory, pricing
Finance             38         Transactions, compliance, risk
Healthcare          35         Triage, records, referrals
Retail              28         Returns, support, recommendations
Telecom             26         Billing, provisioning, support
Dev Tools           24         Code generation, review, CI/CD
Research            22         Literature search, summarization
General             20         Multi-domain, cross-cutting
MCP                 18         Tool calling, protocol compliance
RAG                 16         Retrieval accuracy, grounding
Content Moderation  14         Safety, policy enforcement
Legal               10         Document review, compliance
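
As a quick sanity check on the table above, the per-domain counts can be summed and turned into percentage shares. This is a minimal sketch, not part of the benchmark's own tooling: the names DOMAIN_SCENARIOS and scenario_shares are hypothetical, and the counts are copied by hand from the table.

```python
# Scenario counts per domain, copied from the Domain Breakdown table.
# DOMAIN_SCENARIOS and scenario_shares are illustrative names, not benchmark APIs.
DOMAIN_SCENARIOS = {
    "E-Commerce": 42,
    "Finance": 38,
    "Healthcare": 35,
    "Retail": 28,
    "Telecom": 26,
    "Dev Tools": 24,
    "Research": 22,
    "General": 20,
    "MCP": 18,
    "RAG": 16,
    "Content Moderation": 14,
    "Legal": 10,
}


def scenario_shares(counts):
    """Return each domain's share of all scenarios, as a percentage (1 decimal)."""
    total = sum(counts.values())
    return {domain: round(100 * n / total, 1) for domain, n in counts.items()}


total = sum(DOMAIN_SCENARIOS.values())
shares = scenario_shares(DOMAIN_SCENARIOS)

if __name__ == "__main__":
    print(f"domains: {len(DOMAIN_SCENARIOS)}, total scenarios: {total}")
    for domain, pct in shares.items():
        print(f"{domain:<20}{pct:>5.1f}%")
```

The twelve counts add up to the 293 scenarios quoted in the overview, with E-Commerce the largest domain and Legal the smallest.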

Explore the Benchmarks

Full benchmark source code, scenarios, and results are available on GitHub.