SOTA Benchmarks for Agentic Models
This is a note — quick thoughts, possibly AI-assisted. Not a fully fleshed article.
Evaluating agentic models requires different benchmarks than standard chat/reasoning evals. Each benchmark makes sharp design tradeoffs — what it optimizes for determines what it's blind to.
BFCL — Berkeley Function Calling Leaderboard
- ~4,700 test cases across ~21 scoring categories
- Four test groups: single-turn (simple/parallel/multiple), live (user-contributed schemas), multi-turn (dialogue with tool use), and agentic (memory backends, web search)
How it scores:
- AST checker for single-turn: parses output as function call, checks name + params + types + values against a `possible_answer` list. String values are heavily normalized (lowercase, strip punctuation/spaces)
- Execution checker for multi-turn: runs model's tool calls against simulated backends, compares resulting state to ground truth
- Agentic checker: only verifies the final answer string appears in the response — does not validate intermediate tool-call chains
- All scoring is binary pass/fail per test case
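A minimal sketch of the AST-style check described above, assuming the model emits a Python-style call string with keyword arguments and that every parameter has a list of acceptable values. The helper names (`normalize`, `parse_call`, `ast_check`) and the exact normalization rule are illustrative assumptions, not BFCL's actual implementation.

```python
import ast
import string

def normalize(value):
    """Lowercase strings and strip punctuation/spaces; leave other types alone."""
    if isinstance(value, str):
        drop = string.punctuation + " "
        return value.lower().translate(str.maketrans("", "", drop))
    return value

def parse_call(text):
    """Parse 'name(arg=value, ...)' into (name, {arg: value}) with the ast module."""
    node = ast.parse(text.strip(), mode="eval").body
    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        raise ValueError("not a simple function call")
    if node.args:
        raise ValueError("this sketch only handles keyword arguments")
    # literal_eval rejects anything that is not a plain literal (raises ValueError).
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return node.func.id, kwargs

def ast_check(model_output, expected_name, possible_answer):
    """Binary pass/fail: name, parameter set, types, and values must all match.

    Simplification: every parameter in possible_answer is treated as required.
    """
    try:
        name, kwargs = parse_call(model_output)
    except (SyntaxError, ValueError):
        return False
    if name != expected_name or set(kwargs) != set(possible_answer):
        return False
    for param, accepted in possible_answer.items():
        got = kwargs[param]
        # The type must match an accepted value exactly, then the normalized value.
        if not any(type(got) is type(a) and normalize(got) == normalize(a)
                   for a in accepted):
            return False
    return True

answer = {"city": ["San Francisco", "SF"], "days": [3]}
print(ast_check('get_weather(city="san francisco", days=3)', "get_weather", answer))
print(ast_check('get_weather(city="San Francisco", days="3")', "get_weather", answer))
```

The first call passes because string normalization erases case and spacing; the second fails on the type check, since `"3"` is a string where an integer is expected.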
Optimizes for: controllability and deterministic evaluation. Synthetic schemas with known correct answers make scoring unambiguous.
Sacrifices:
- Realism — synthetic schemas don't capture the messiness of real APIs (ambiguous descriptions, overlapping tools)
- Semantic quality — checks structural correctness of the call, not whether the model's reasoning was sound
- Robustness — scores are fragile to naturalistic query rephrasing and toolkit expansion with similar tools
- Format instruction compliance — IFEval-FC specifically criticizes BFCL for not testing adherence to format instructions embedded in parameter descriptions
- Sample size: some live categories have only 15-23 entries, too few for per-category scores to be statistically meaningful
Bottom line: BFCL answers "can this model produce structurally correct tool calls?" — a necessary but not sufficient condition for agentic work. Good first filter, especially for smaller models where schema hallucination is the primary failure mode.
τ-bench (Tau-bench)
- 165 tasks across 2 domains: retail (115 tasks, 17 tools) and airline (50 tasks, 15 tools)
- Each domain has a policy wiki, constraint rules, mock database, and Python tool implementations
How it scores:
- Database state match: after conversation ends, replays ground-truth actions on a fresh DB copy and compares hashes to the agent's resulting DB state
- Output match: expected strings must appear (case-insensitive substring) in agent responses
- Both must pass — binary 0/1 per task
- Key metric is pass^k: run each task n times (n ≥ k), count the c successful runs, and estimate the probability that k independent runs all succeed as `C(c,k)/C(n,k)`, averaged over tasks. This measures reliability, not just average accuracy. GPT-4o scored <50% pass^1 and <25% pass^8 on retail
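The `C(c,k)/C(n,k)` estimator can be sketched in a few lines. Here `c` is the number of successful runs out of `n` trials for one task; the function names are illustrative, not taken from the τ-bench codebase.

```python
from math import comb

def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate P(k independent runs all succeed) from c successes in n trials."""
    if not 0 < k <= trials:
        raise ValueError("need 1 <= k <= trials")
    # comb(successes, k) is 0 when successes < k, so the estimate is 0 in that case.
    return comb(successes, k) / comb(trials, k)

def benchmark_pass_hat_k(per_task_successes: list[int], trials: int, k: int) -> float:
    """Average the per-task estimate over all tasks."""
    scores = [pass_hat_k(c, trials, k) for c in per_task_successes]
    return sum(scores) / len(scores)

# One task that succeeded in 6 of 8 runs:
print(pass_hat_k(6, 8, 1))  # 0.75
print(pass_hat_k(6, 8, 8))  # 0.0
```

With k = 1 the estimate reduces to the plain success rate; as k grows, a single failed run is enough to pull it toward zero.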
The user simulation twist: a separate LLM (default: GPT-4o) plays the customer, following natural conversation instructions. It doesn't dump all info at once — it behaves like a real customer. This adds realism but introduces non-determinism.
Optimizes for: end-to-end task completion with realistic multi-turn interaction. The pass^k metric specifically targets reliability — penalizing models that succeed sometimes but fail unpredictably.
Sacrifices:
- Determinism — results depend on the user-simulator model
- Credit granularity — binary scoring means a conversation that's 90% correct and one that's 0% correct score the same
- DB hash comparison is fragile: a different but equally valid action sequence (e.g., equivalent payment method) scores 0
- Breadth — only 2 narrow customer service domains, 165 total tasks
- Efficiency measurement — no evaluation of how many turns were needed, only whether the task succeeded
Bottom line: τ-bench answers "can this model reliably complete multi-turn tasks?" — the key word being reliably. The pass^k metric is the real innovation here: it surfaces the difference between a model that works 80% of the time (unusable in production) and one that works 99% of the time.
SWE-bench
- Full: 2,294 instances from 12 Python repos (Django, scikit-learn, sympy, matplotlib, Flask, etc.)
- Lite: 300 instances filtered for single-file patches with ≤3 hunks, problem statements >40 words, no images/links
- Verified: 500-instance human-validated subset, curated for quality
How it scores:
- Model receives a codebase at a specific commit + GitHub issue text, must produce a `.patch` file
- Eval runs in Docker: applies patch via `git apply`, runs repo's test suite
- FAIL_TO_PASS: tests that failed before the gold PR must now pass (all of them)
- PASS_TO_PASS: tests that passed before must still pass (no regressions)
- Instance "resolved" only when both hit 100%. No partial credit — fixing 4/5 failing tests scores the same as fixing 0
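The all-or-nothing rule above is easy to state in code. This is a sketch under the assumption that test results arrive as a mapping from test id to pass/fail; `is_resolved` and the test ids are illustrative, not the SWE-bench harness API.

```python
def is_resolved(test_results: dict[str, bool],
                fail_to_pass: list[str],
                pass_to_pass: list[str]) -> bool:
    """An instance is resolved only if every listed test passes after the patch.

    test_results maps a test id to True when it passed; a missing test counts
    as a failure.
    """
    fixed = all(test_results.get(t, False) for t in fail_to_pass)
    no_regressions = all(test_results.get(t, False) for t in pass_to_pass)
    return fixed and no_regressions

f2p = ["test_a", "test_b", "test_c", "test_d", "test_e"]
p2p = ["test_existing"]

# Fixing 4 of 5 target tests still counts as unresolved.
results = {t: True for t in f2p + p2p}
results["test_e"] = False
print(is_resolved(results, f2p, p2p))  # False
```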
Optimizes for: real-world grounding. These are actual PRs from real repos with real test suites. No synthetic tasks, no simplified environments. Docker isolation makes evaluation reproducible.
Sacrifices:
- Speed — Docker-based eval is heavyweight
- Language diversity — Python only in the original benchmark
- Solution leakage: SWE-bench+ found that in ~33% of the patches counted as successful, the solution was stated directly in the issue text or comments
- Weak tests: ~31% of successful patches were suspicious because the test cases were inadequate. After filtering both issues, SWE-Agent+GPT-4 dropped from 12.47% to 3.97%
- Data contamination — 94%+ of issues predate LLM training cutoffs. SWE-bench-Live addresses this with post-2024 issues
- Test overfitting — systems that generate their own tests and iterate can pass the benchmark's tests without truly fixing the issue
Bottom line: SWE-bench is the hardest and most revealing benchmark here, but also the most gamed. The Lite/Verified/Live splits represent an ongoing arms race between benchmark designers and leaderboard optimizers. When reading SWE-bench scores, always check which split and whether the agent had access to repo tests during generation.
AgentBench
- 8 core environments: OS interaction (bash), database (SQL), knowledge graph (SPARQL), card game, lateral thinking puzzles, ALFWorld (household tasks), Mind2Web (web browsing), WebShop (product search); plus Avalon (social deduction), which is not among the original eight
- Each environment has its own metric, interaction protocol, and Docker container
How it scores:
- No unified scoring function. Each environment has a bespoke metric: OS (accuracy), DB (category accuracy), KG (custom score), ALFWorld (success rate), Mind2Web (step success rate), WebShop (reward score), Card Game (win rate)
- Client-server architecture with max-flow scheduling for concurrent evaluation
- Detailed failure tracking: context limit exceeded, validation failed, invalid action, task limit reached
Optimizes for: breadth of environment coverage and standardized multi-model comparison (25+ models). The interactive multi-turn format (not single-shot) is closer to real agent behavior than most benchmarks.
Sacrifices:
- Depth — round limits (8 for OS, 15 for DB) artificially cap complex reasoning chains
- Unified comparison — no aggregate score, results are a table of per-environment numbers that resist summarization
- Reproducibility — Docker dependency + external data for some environments (Mind2Web, WebShop) makes setup non-trivial
- Opponent calibration — game environments use naive bots, not strong baselines
- Long-horizon planning: the authors identify it as a main failure mode, yet the short episode format doesn't test it deeply
Bottom line: AgentBench is the broadest benchmark but the shallowest per-domain. Useful for a capability radar chart across environments, not for deep evaluation of any single agentic skill. The lack of unified scoring makes it hard to produce a single "agentic capability" number.
Design Tradeoff Matrix
| | BFCL | τ-bench | SWE-bench | AgentBench |
|---|---|---|---|---|
| What it tests | Tool call correctness | Task completion via tools | Real-world code patches | Multi-env agent tasks |
| Realism | Low (synthetic schemas) | Medium (simulated customer) | High (real repos/issues) | Medium (Docker envs) |
| Scoring | Binary, structural match | Binary, state + output match | Binary, test suite pass | Per-env bespoke metrics |
| Partial credit | No | No | No | Varies |
| Multi-turn | Yes (subset) | Always | Implicit (agent decides) | Always |
| Determinism | High | Low (LLM user sim) | High (Docker + tests) | Medium |
| Setup cost | Low (Python harness) | Low (pip install) | High (Docker per repo) | High (Docker + external data) |
| Task count | ~4,700 | 165 | 300 (Lite) / 2,294 (Full) | ~500+ across envs |
| Known weaknesses | Query rephrasing fragility | User sim dependence | Solution leakage, test overfitting | Naive bot opponents |
Supporting Dimensions
These aren't agentic-specific but matter when interpreting agent performance:
| Dimension | Benchmark | Why it matters for agents |
|---|---|---|
| Schema adherence | BFCL irrelevance + parallel | Smaller models hallucinate field names, wrong types, nonexistent tools |
| Multi-turn coherence | MT-Bench | Agents that lose context mid-task produce inconsistent tool calls |
| Instruction following | IFEval | Agents must respect constrained output formats — critical for structured pipelines |
| Long context recall | RULER / Needle-in-a-Haystack | MoE models may degrade differently at long context vs dense models |
| Hallucination rate | TruthfulQA / FActScore | Models reasoning about tool results can confidently fabricate; worse in multi-hop chains |
Companies Publishing Product-Specific Evals
Academic benchmarks test general capabilities. These companies built evals for their workload — worth studying as templates.
CursorBench (Cursor) — Tasks sourced from real Cursor sessions via "Cursor Blame" (traces committed code back to the agent request that produced it). Uses LLM graders, not just unit tests, because most dev requests have multiple valid solutions. Plots correctness against token cost to capture the quality/latency tradeoff. Key insight: real developer prompts are short and underspecified — nothing like the verbose issue descriptions in SWE-bench.
Cognition-Golden (Cognition/Devin) — Internal benchmark using "realistic evaluations for economically valuable tasks," sometimes on codebases with millions of lines. Train/test split to prevent overfitting. Production Devin scores 74.2% on the test split. Designed around what paying customers actually need.
Stripe Integration Benchmark (Stripe) — 11 environments testing autonomous Stripe integration building (backend-only, full-stack, focused exercises). Grading is deterministic: automated tests exercise the finished software via API calls and UI tests, then validates Stripe API objects were created correctly. Their philosophy: "a mostly correct integration is a failure; payments require 100% accuracy."
v0 Evals + next-evals-oss (Vercel) — Multi-layered: code-based checks (valid imports, file structure), LLM-based grading for subjective quality, and human feedback. Every PR that affects output requires eval documentation. The open-source Next.js eval has 20 fixture-based tasks with vitest assertions. Notable finding: an 8KB AGENTS.md file achieved 100% pass rate vs 79% for skills-based retrieval.
Terminal-Bench (Stanford / Laude Institute, also tracked on the VALS AI leaderboard): covers the full terminal-based dev workflow (read code, run tests, interpret output, fix issues, commit). Used by OpenAI to benchmark Codex. More operational than SWE-bench, since it measures agent competence at using terminal tools, not just generating patches.
Common pattern: every product-specific eval shares the same structure — real task inputs sourced from production (not synthetic), domain-appropriate environments, deterministic outcome verification where possible, and LLM-based grading for subjective quality.
Benchmarks vs Evals
Worth clarifying the distinction:
- Benchmarks are standardized, public, and comparative. They answer "how does model A compare to model B on a known task distribution?" BFCL, SWE-bench, and τ-bench are benchmarks.
- Evals are product-specific, often private, and diagnostic. They answer "does this model/prompt/system work for our use case?" CursorBench and Stripe's integration benchmark are evals.
Benchmarks are useful for model selection. Evals are useful for shipping. You need both — but if you're building a product, evals are the ones that actually prevent regressions.
Designing a Product-Specific Eval
Based on Anthropic's eval guide and patterns from the companies above:
Start small, source from real failures
- 20-50 tasks is enough to start. Early changes have large effect sizes — small samples suffice to detect them.
- Source tasks from real user failures and production data, not invented scenarios. Cursor uses "Cursor Blame" to trace agent-generated code back to the original request. Stripe uses real integration patterns.
- If you don't have production data yet, start with your own dogfooding sessions — record the inputs that broke.
Three types of graders
| Grader type | Speed | Flexibility | When to use |
|---|---|---|---|
| Code-based (string match, assertion, API check) | Fast, deterministic | Rigid | Structured outputs, API state, pass/fail outcomes |
| Model-based (LLM judges the output) | Slower, needs calibration | Flexible | Subjective quality, multiple valid solutions, natural language outputs |
| Human | Slow, expensive | Gold standard | Calibrating model graders, ambiguous cases, initial eval design |
Most product evals use a mix. Stripe leans heavily on code-based (deterministic API checks). Cursor leans on model-based (many valid code solutions). Start with code-based where you can, add model-based for the rest.
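One way to combine grader types is to run the cheap deterministic check first and call the model-based judge only when it passes. The sketch below assumes the judge is any callable returning a score in [0, 1]; the field names (`status`, `charge_id`, `summary`) and the 0.7 threshold are hypothetical placeholders, not taken from any of the evals above.

```python
from typing import Callable

def code_grader(output: dict) -> bool:
    """Deterministic check: the run reported success and produced a charge id."""
    return output.get("status") == "ok" and "charge_id" in output

def grade(output: dict, judge: Callable[[str], float], threshold: float = 0.7) -> bool:
    """Code-based check first; the (slower) judge runs only if that passes."""
    if not code_grader(output):
        return False
    return judge(output.get("summary", "")) >= threshold

# A stub stands in for the LLM judge so the sketch runs offline.
def stub_judge(text: str) -> float:
    return 0.9 if "charge" in text.lower() else 0.1

run = {"status": "ok", "charge_id": "ch_123", "summary": "Created a charge for the order"}
print(grade(run, stub_judge))  # True
```

Ordering the graders this way keeps judge calls, and judge noise, out of runs that already fail a hard requirement.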
Grade outcomes, not paths
Agents find alternative valid solutions. If you check "did it call tool X with param Y at step 3", you'll get false failures from agents that solved the task differently. τ-bench's DB hash comparison is a cautionary tale — correct behavior via a different action sequence scores 0.
Instead: check the end state. Did the database have the right records? Did the API return the right response? Did the tests pass? Did the user's request get fulfilled?
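The difference between path grading and outcome grading can be sketched with a toy order database. Everything here (tool names, field names, the choice to hash only task-relevant fields) is an illustrative assumption rather than how τ-bench or any specific eval is implemented.

```python
import hashlib
import json

def state_digest(db: dict, fields: list[str]) -> str:
    """Hash only the task-relevant fields, serialized with a canonical key order."""
    relevant = {name: db.get(name) for name in fields}
    blob = json.dumps(relevant, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

def grade_path(actions: list[str], expected_actions: list[str]) -> bool:
    """Path grading: the exact tool-call sequence must match."""
    return actions == expected_actions

def grade_outcome(final_db: dict, expected_db: dict, fields: list[str]) -> bool:
    """Outcome grading: only the end state of the relevant fields must match."""
    return state_digest(final_db, fields) == state_digest(expected_db, fields)

# The agent cancels the same order but looks the user up a different way.
expected_actions = ["find_user_by_email", "get_order", "cancel_order"]
run_actions = ["find_user_by_name_zip", "get_order", "cancel_order"]

expected_db = {"order_status": "cancelled", "refund_total": 120, "last_lookup": "email"}
run_db = {"order_status": "cancelled", "refund_total": 120, "last_lookup": "name_zip"}

fields = ["order_status", "refund_total"]  # what the task actually requires
print(grade_path(run_actions, expected_actions))   # False
print(grade_outcome(run_db, expected_db, fields))  # True
```

Choosing which fields count as the outcome is the real design decision: hash too much state and the grader drifts back toward path grading.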
Capability evals vs regression evals
Two different purposes, two different design philosophies:
- Capability evals measure potential. Start at low pass rates, expect improvement over time. These are your frontier tasks — the hardest things your agent should eventually handle. When pass rate exceeds ~80%, graduate tasks to the regression suite and create harder ones.
- Regression evals ensure reliability. Target ~100% pass rate. These are tasks your agent already handles — the eval exists to catch regressions when you change prompts, models, or tools. A failure here is a bug.
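The graduation rule can be sketched as a small triage step. The 0.8 threshold mirrors the "~80%" above; the task names and function names are hypothetical.

```python
def triage(pass_rates: dict[str, float], graduate_above: float = 0.8):
    """Split tasks into (capability, regression) lists by measured pass rate."""
    regression = sorted(t for t, r in pass_rates.items() if r > graduate_above)
    capability = sorted(t for t, r in pass_rates.items() if r <= graduate_above)
    return capability, regression

def regression_failures(results: dict[str, bool]) -> list[str]:
    """The regression suite targets ~100%, so every failing task is reported."""
    return sorted(task for task, passed in results.items() if not passed)

rates = {"refund_flow": 0.95, "multi_item_exchange": 0.4, "address_change": 0.8}
capability, regression = triage(rates)
print(capability)  # ['address_change', 'multi_item_exchange']
print(regression)  # ['refund_flow']
```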
Measure reliability, not just accuracy
Steal τ-bench's pass^k metric. For any customer-facing agent:
- pass@k = "did it succeed at least once in k tries" — measures potential, useful during development
- pass^k = "did it succeed every time in k tries" — measures reliability, what matters in production
A model with 90% pass@1 sounds great until you realize that, with independent runs, it's about 35% pass^10 (0.9^10 ≈ 0.35). Would you ship a product that slips up at least once in 65% of 10-interaction sessions?
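The arithmetic behind that comparison, assuming every run succeeds independently with the same probability `p` (a simplification; real runs on the same task are often correlated):

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent runs succeeds."""
    return 1 - (1 - p) ** k

def pass_pow_k(p: float, k: int) -> float:
    """Probability that all k independent runs succeed (pass^k)."""
    return p ** k

p = 0.9
print(round(pass_at_k(p, 10), 4))   # 1.0
print(round(pass_pow_k(p, 10), 4))  # 0.3487
```

The same per-run success rate looks near-perfect under pass@k and poor under pass^k, which is why the two metrics answer different questions.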
Keep the eval alive
- Refresh tasks regularly — models memorize patterns, and your product evolves
- Gate PRs on eval results — Vercel requires eval documentation for every output-affecting PR
- Feed production failures back — when users report issues, add them to the eval suite
- Monitor saturation — an eval where everything passes is no longer providing signal
Starter checklist
- Collect 20-50 real tasks from production/dogfooding failures
- Define what "success" looks like for each (end-state, not path)
- Build code-based graders for deterministic checks
- Add model-based graders for subjective quality
- Split into capability (hard, low pass rate) and regression (known-good, target 100%)
- Run each task k times, track pass^k
- Set up CI to run regression evals on every change
- Review failures manually — catch grader bugs, not just agent bugs
When Your Eval Reveals a Gap
A model topping BFCL but failing your product eval doesn't mean you picked the wrong model. It means the base capability is there — the model understands tool calling mechanics — but it hasn't learned your domain. That's a fine-tuning problem, not a fundamental capability gap.
The pipeline
- Select via benchmarks — BFCL, τ-bench, SWE-bench tell you the model has the raw skill (tool calling, multi-turn coherence, code reasoning). This narrows your candidates.
- Evaluate on your tasks — your product eval reveals the specific gap: wrong tool selection? bad argument formatting? fails on multi-step workflows? The failure mode tells you what to train on.
- SFT (supervised fine-tuning) — teach it your specific tool schemas, output formats, and domain patterns. This is where most product-specific gains come from. Collect examples of correct tool-use traces from your eval tasks (or human demonstrations) and fine-tune on those.
- RL (reinforcement learning) — optimize for task completion, not just format correctness. Especially valuable when there are multiple valid paths to a solution — exactly the case where rigid benchmarks and SFT both struggle. Use your eval's outcome-based graders as the reward signal.
SFT vs RL: when to use which
- SFT is the default starting point. It's simpler, more data-efficient, and directly addresses "the model doesn't know my tool schemas / output format." If your eval failures are about format or domain knowledge, SFT is probably enough.
- RL shines when the model knows the tools but makes bad decisions about when and how to use them. It learns policies (prefer tool A over tool B in situation X) that SFT can't easily encode from static examples. Outcome-based RL (reward = did the task succeed?) is particularly powerful because it lets the model discover strategies you didn't anticipate.
- Both together is the standard production recipe: SFT first to get the model into the right ballpark, then RL to optimize the policy against your eval metric.
Your eval is your reward function
This is where benchmarks and evals connect to training. The product eval you built (with code-based and model-based graders, measuring pass^k) is exactly the reward signal RL needs. A well-designed eval does double duty:
- During development: tells you whether to ship
- During training: tells the model what "good" looks like
This is why grading outcomes (not paths) matters for training too — if your eval penalizes valid alternative solutions, your RL reward will push the model toward a narrow strategy instead of robust task completion.
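A sketch of the "double duty" idea: the same outcome graders produce both an eval pass rate and a scalar reward per episode. The names (`make_reward`, `pass_rate`) and the binary 0/1 reward are assumptions for illustration; a real RL setup would plug the reward into its own training loop.

```python
from typing import Callable, Iterable

Grader = Callable[[dict], bool]

def make_reward(graders: Iterable[Grader]) -> Callable[[dict], float]:
    """Build a binary reward: 1.0 only if every outcome grader accepts the end state."""
    grader_list = list(graders)

    def reward(final_state: dict) -> float:
        return 1.0 if all(g(final_state) for g in grader_list) else 0.0

    return reward

def pass_rate(final_states: list[dict], reward: Callable[[dict], float]) -> float:
    """Development-time view of the same signal: mean reward over episodes."""
    scores = [reward(state) for state in final_states]
    return sum(scores) / len(scores)

reward = make_reward([
    lambda s: s.get("order_status") == "cancelled",
    lambda s: s.get("refund_total") == 120,
])

episodes = [
    {"order_status": "cancelled", "refund_total": 120},
    {"order_status": "cancelled", "refund_total": 0},
]
print([reward(e) for e in episodes])  # [1.0, 0.0]
print(pass_rate(episodes, reward))    # 0.5
```

Because the graders look only at the end state, any action sequence that reaches it earns the same reward.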
Picking the Right Benchmark
- Raw tool-calling correctness → BFCL
- Multi-turn agent reliability → τ-bench (the pass^k metric is uniquely valuable)
- Coding agent / PR automation → SWE-bench (Lite or Verified; always note which split)
- Broad capability scan → AgentBench
- Diagnosing long-task failures → RULER + MT-Bench
- Small model (≤8B) deployment → BFCL irrelevance detection + IFEval
- Your specific product workload → build your own eval (see above)
Takeaway
Academic benchmarks tell you which model to pick. Product evals tell you whether to ship. You need both, but they serve different stages:
- Model selection — use BFCL, τ-bench, SWE-bench to narrow your options
- Product validation — build your own eval from real failures, grade outcomes not paths, measure reliability with pass^k
- Ongoing quality — gate deployments on regression evals, feed production failures back into the suite
The meta-lesson: benchmark scores are most useful when you understand what the benchmark doesn't test. A model that tops BFCL but bombs τ-bench can call tools perfectly but can't complete tasks. A model that tops SWE-bench Lite but was never tested on SWE-bench+ might just be pattern-matching leaked solutions. Always pair a primary benchmark with supporting dimensions that cover its blind spots — and always supplement with evals that test your workload.