Question 1

What is AgentContract-Bench v2?

Accepted Answer

AgentContract-Bench v2 is a benchmark suite of 293 scenarios across 12 domains for evaluating how well AI agents comply with behavioral contracts. With the agentAssert framework, the suite reaches a 100% pass rate.

Question 2

Which LLMs have been tested?

Accepted Answer

Live results are available for GPT-5.3 (Theta=0.688), Claude Sonnet 4.6 (Theta=0.823), and Mistral-Large-3 (Theta=0.813). All models were evaluated across the full 293-scenario suite.

Question 3

How are benchmark scores calculated?

Accepted Answer

The aggregate compliance score Theta is computed across all contract constraints (preconditions, invariants, guarantees, and recovery) weighted by domain-specific importance. A higher Theta indicates better behavioral contract adherence.
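
The answer above does not give the exact formula, so the following is only a minimal sketch of one plausible reading: each domain gets a pass fraction over its constraint checks, and Theta is the weighted mean of those fractions. The function name `theta`, the `(domain, kind, passed)` record shape, and the weight values are all assumptions made for illustration, not part of the benchmark.

```python
from collections import defaultdict

# Constraint kinds named in the answer above.
KINDS = ("precondition", "invariant", "guarantee", "recovery")


def theta(checks, domain_weights):
    """Weighted mean of per-domain pass fractions (an assumed formula).

    checks: iterable of (domain, kind, passed) tuples.
    domain_weights: hypothetical mapping of domain -> importance weight.
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    for domain, kind, ok in checks:
        if kind not in KINDS:
            raise ValueError(f"unknown constraint kind: {kind}")
        total[domain] += 1
        passed[domain] += int(ok)

    weight_sum = sum(domain_weights[d] for d in total)
    if weight_sum == 0:
        raise ValueError("no weighted checks to score")
    return sum(domain_weights[d] * passed[d] / total[d] for d in total) / weight_sum


# Made-up checks and weights, purely for illustration.
checks = [
    ("Finance", "precondition", True),
    ("Finance", "invariant", True),
    ("Finance", "guarantee", False),
    ("Finance", "recovery", True),
    ("Retail", "invariant", True),
    ("Retail", "guarantee", False),
]
weights = {"Finance": 2.0, "Retail": 1.0}
score = theta(checks, weights)
print(round(score, 3))  # 0.667
```

Here Finance passes 3 of 4 checks and Retail 1 of 2, so the score is (2.0 * 0.75 + 1.0 * 0.5) / 3.0. Under this reading, a failure in a heavily weighted domain lowers Theta more than the same failure in a lightly weighted one.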

Domain	Scenarios	Coverage
E-Commerce	42	Order, inventory, pricing
Finance	38	Transactions, compliance, risk
Healthcare	35	Triage, records, referrals
Retail	28	Returns, support, recommendations
Telecom	26	Billing, provisioning, support
Dev Tools	24	Code generation, review, CI/CD
Research	22	Literature search, summarization
General	20	Multi-domain, cross-cutting
MCP	18	Tool calling, protocol compliance
RAG	16	Retrieval accuracy, grounding
Content Moderation	14	Safety, policy enforcement
Legal	10	Document review, compliance
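
As a quick consistency check, the per-domain counts in the table can be tallied against the 293 scenarios and 12 domains quoted in the first answer. This is a small sketch using only the numbers listed above; the variable names are illustrative.

```python
# Scenario counts copied from the domain table above.
scenarios = {
    "E-Commerce": 42,
    "Finance": 38,
    "Healthcare": 35,
    "Retail": 28,
    "Telecom": 26,
    "Dev Tools": 24,
    "Research": 22,
    "General": 20,
    "MCP": 18,
    "RAG": 16,
    "Content Moderation": 14,
    "Legal": 10,
}

total = sum(scenarios.values())
# Share of the full suite contributed by each domain.
shares = {domain: count / total for domain, count in scenarios.items()}

print(len(scenarios), total)          # 12 293
print(f"{shares['E-Commerce']:.1%}")  # 14.3%
```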
