AgentContract-Bench v2
293 scenarios across 12 domains. 100% pass rate. Live LLM results.
Benchmark Overview
The most comprehensive behavioral contract benchmark for AI agents.
| Metric | Value |
|---|---|
| Benchmark scenarios | 293 |
| Pass rate | 100% |
| Domains | 12 |
| LLMs evaluated | 3 |
Live LLM Results
Aggregate compliance scores (Θ) for each evaluated model, computed across all 293 scenarios.
| Model | Θ | Rating |
|---|---|---|
| Claude Sonnet 4.6 | 0.823 | Highest Compliance |
| Mistral-Large-3 | 0.813 | Strong Compliance |
| GPT-5.3 | 0.688 | Moderate Compliance |
Domain Breakdown
Scenario distribution across 12 enterprise domains.
| Domain | Scenarios | Coverage |
|---|---|---|
| E-Commerce | 42 | Order, inventory, pricing |
| Finance | 38 | Transactions, compliance, risk |
| Healthcare | 35 | Triage, records, referrals |
| Retail | 28 | Returns, support, recommendations |
| Telecom | 26 | Billing, provisioning, support |
| Dev Tools | 24 | Code generation, review, CI/CD |
| Research | 22 | Literature search, summarization |
| General | 20 | Multi-domain, cross-cutting |
| MCP | 18 | Tool calling, protocol compliance |
| RAG | 16 | Retrieval accuracy, grounding |
| Content Moderation | 14 | Safety, policy enforcement |
| Legal | 10 | Document review, compliance |
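As a quick sanity check on the distribution, the counts in the table can be tallied to confirm they add up to the stated 293 scenarios and to see each domain's share. The sketch below is illustrative only: the `DOMAIN_SCENARIOS` dictionary and the `domain_shares` helper are hypothetical names, not part of the benchmark's code, and the numbers are copied by hand from the table.

```python
# Scenario counts copied by hand from the domain table above.
# The dictionary name and helper below are illustrative, not part of the benchmark.
DOMAIN_SCENARIOS = {
    "E-Commerce": 42,
    "Finance": 38,
    "Healthcare": 35,
    "Retail": 28,
    "Telecom": 26,
    "Dev Tools": 24,
    "Research": 22,
    "General": 20,
    "MCP": 18,
    "RAG": 16,
    "Content Moderation": 14,
    "Legal": 10,
}


def domain_shares(counts):
    """Return each domain's share of all scenarios, in percent (one decimal)."""
    total = sum(counts.values())
    return {domain: round(100 * n / total, 1) for domain, n in counts.items()}


total = sum(DOMAIN_SCENARIOS.values())
shares = domain_shares(DOMAIN_SCENARIOS)

if __name__ == "__main__":
    print(f"{len(DOMAIN_SCENARIOS)} domains, {total} scenarios")
    for domain, pct in shares.items():
        print(f"{domain:<20}{DOMAIN_SCENARIOS[domain]:>4}{pct:>7.1f}%")
```

The three largest domains (E-Commerce, Finance, Healthcare) together account for 115 of the 293 scenarios, so aggregate scores lean somewhat toward those areas.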
Explore the Benchmarks
Full benchmark source code, scenarios, and results are available on GitHub.