Performance Certification
Measuring effectiveness, correctness, reliability, and latency under real and synthetic workloads.
Goal: Determine whether the agent performs its intended tasks effectively and efficiently.
Evaluation Criteria (Pass/Fail per category):
- Tool Selection Accuracy: Does the agent choose the correct tool for the user’s request?
- Invocation Correctness: Are tools called with proper formats, inputs, and parameters?
- Execution Reliability: Do tool calls complete successfully without recurring errors? (via error logging)
- Speed & Responsiveness: Does the agent meet expected performance latency targets? (averaged from usage telemetry)
Pass → Agent demonstrates correctness, consistency, and speed within thresholds.
Fail → Any persistent tool misuse, recurring errors, or latency above baseline.
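As a rough illustration, the four categories can be treated as independent verdicts that must all pass. A minimal sketch, assuming boolean per-category results; the field names are illustrative, not the pipeline's actual schema:

```python
from dataclasses import dataclass, asdict

@dataclass
class PerformanceCertification:
    """Illustrative per-category verdicts; field names are hypothetical."""
    tool_selection_accuracy: bool    # correct tool chosen for the request
    invocation_correctness: bool     # parameters and formats valid
    execution_reliability: bool      # error rate below threshold (from logs)
    latency_within_target: bool      # telemetry-averaged latency within target

    @property
    def passed(self) -> bool:
        # The agent passes only if every category passes.
        return all(asdict(self).values())

report = PerformanceCertification(True, True, True, False)
print("PASS" if report.passed else "FAIL")   # FAIL: latency above target
```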

What we measure
- Correct Tool Use (Can be Scaled)
  - Chooses the right tool for the user’s request
  - Supplies the correct parameter formats
- Execution Reliability (Requires Usage Data)
  - Tools run without errors (from aggregated error logs)
  - This step is difficult to scale automatically because many agents call external APIs or require specific setup. In these cases, collecting aggregated and anonymised error logs helps determine pass/fail performance.
- Latency (Requires Usage Data)
  - End-to-end response time; use observed usage data where available
  - Fall back to container startup time averages if live data is absent
Correct Tool Use
- Tasks, tools, and expected outcomes form the basis of automated checks
- Objective pass/fail signals are computed from the ground truth derived by static analysis
From the agent’s metadata, we generate tool-targeted tasks that can be evaluated objectively.
Examples:
- Figma MCP Agent → `get_file` must be chosen with the correct format to fetch a shared design file.
- Search MCP Server → `google_search` must retrieve and summarize current results with references.
- Backup MCP → `backup_projects` must select projects updated in the last 7 days.
- Research MCP → `research_paper_search` returns recent, concise paper summaries.
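To make these examples concrete, here is a minimal sketch of a tool-targeted task and a rule-based check on an observed tool call. The task structure, field names, and the `file_key` parameter are illustrative assumptions, not the certification pipeline's actual format:

```python
# A hypothetical tool-targeted synthetic task derived from agent metadata.
synthetic_task = {
    "prompt": "Fetch the shared design file for review.",
    "expected_tool": "get_file",
    "required_params": {"file_key"},   # illustrative required parameter
}

def check_tool_selection(task: dict, observed_call: dict) -> bool:
    """Pass if the agent picked the expected tool and supplied required params."""
    right_tool = observed_call.get("tool") == task["expected_tool"]
    has_params = task["required_params"] <= set(observed_call.get("params", {}))
    return right_tool and has_params

observed = {"tool": "get_file", "params": {"file_key": "abc123"}}
print(check_tool_selection(synthetic_task, observed))  # True
```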
Why synthetic tasks? They enable scalable, repeatable, and automation-friendly checks over many agents.
We statically analyze the agent code and tool descriptions to derive ground truth for:
- Expected tool selection for a task
- Required parameter shapes & formats
- Output structure and success criteria
This provides the ground truth on which to evaluate the synthetic tasks.
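As an illustration of the parameter-shape checks, the sketch below validates call parameters against a JSON Schema that static analysis might derive from a search tool's description. The schema and parameter names are hypothetical:

```python
from jsonschema import validate, ValidationError

# Hypothetical ground-truth schema for a search tool's parameters.
ground_truth_schema = {
    "type": "object",
    "properties": {
        "query": {"type": "string", "minLength": 1},
        "max_results": {"type": "integer", "minimum": 1, "maximum": 50},
    },
    "required": ["query"],
    "additionalProperties": False,
}

def params_are_valid(params: dict) -> bool:
    """Objective pass/fail signal for parameter shapes and formats."""
    try:
        validate(instance=params, schema=ground_truth_schema)
        return True
    except ValidationError:
        return False

print(params_are_valid({"query": "recent LLM papers", "max_results": 5}))  # True
print(params_are_valid({"max_results": 5}))                                # False: missing "query"
```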
MCP-Bench
The generation of these synthetic tasks is based on MCP-Bench, our state-of-the-art evaluation harness for quantifying agent behavior in real-world settings, which we have open-sourced to the community (Apache 2.0).
MCP-Bench automates task synthesis across real-world MCP servers and evaluates agent execution with rule- and LLM-based judges. This expertise in evaluating agentic and tool-use performance is the foundation of our performance certification.
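For intuition, the sketch below shows the general pattern of combining rule-based checks with an LLM judge. It does not reflect MCP-Bench's actual interfaces; the judge here is a stub standing in for a real model call:

```python
from typing import Callable

def evaluate_trace(trace: dict,
                   rule_checks: list[Callable[[dict], bool]],
                   llm_judge: Callable[[dict], float]) -> dict:
    """Rule checks gate hard failures; the LLM judge scores planning quality."""
    rules_passed = all(check(trace) for check in rule_checks)
    quality = llm_judge(trace) if rules_passed else 0.0
    return {"rules_passed": rules_passed, "quality_score": quality}

# Stub judge and rule check, for illustration only.
stub_judge = lambda trace: 0.72
schema_ok = lambda trace: trace.get("schema_valid", False)

print(evaluate_trace({"schema_valid": True}, [schema_ok], stub_judge))
```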
MCP-Bench overall scores highlight relative tool-use performance across the leading foundation models that power agents.
- Schema-level understanding is mostly saturated. Strong frontier models have little trouble calling tools and understanding their schemas, scoring very high on name validity and schema compliance (e.g. >98%).
- Higher-level reasoning remains the bottleneck. Differences between models are largest in planning, dependency awareness, and parallel orchestration.
- For instance, top models like GPT-5, o3, and GPT-OSS-120B achieve overall scores ~0.70–0.75, while smaller or weaker models lag significantly.
- Performance often degrades when tasks span multiple servers, especially for weaker models; top-tier models show much more stable behavior across single- and multi-server settings.
- The number of tool calls and rounds required highlights the complexity of tasks: even for good models, tasks often require multiple planning rounds and substantial tool invocation.
- While many models have mastered execution fidelity, the real frontier lies in long-horizon planning, cross-domain orchestration, and adaptive reasoning.
- Our benchmarking, whose techniques we also use to generate synthetic data for Agent Gallery, surfaces these gaps systematically.
| Resource | Link |
|---|---|
| Paper | ArXiv, to appear at NeurIPS (SEA) 25 |
| Code | Github |
| Leaderboard | Huggingface Spaces |
| License | Apache-2.0 |
Execution Reliability
We evaluate execution reliability by sampling real usage traces and verifying that tool calls complete without retriable or user-visible errors.
Production telemetry is necessary to provide pass/fail reliability signals through anonymised, aggregated error logs, because synthetic tasks and automated pipeline execution are limited by the large share of gated agents (i.e. those that require API keys or credentials).
- Acceptance criteria: error rates stay below threshold across recent releases
- Escalation path: recurring failures trigger manual review
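A minimal sketch of the reliability gate over aggregated, anonymised error logs; the record shape and the 5% threshold are illustrative assumptions, not the published acceptance criteria:

```python
ERROR_RATE_THRESHOLD = 0.05   # illustrative threshold

def execution_reliability_pass(log_records: list[dict]) -> bool:
    """Pass if the share of failed tool calls stays below the threshold."""
    if not log_records:
        return False                      # no usage data: cannot certify
    failures = sum(1 for r in log_records if r.get("status") == "error")
    return failures / len(log_records) < ERROR_RATE_THRESHOLD

logs = [{"status": "ok"}] * 97 + [{"status": "error"}] * 3
print(execution_reliability_pass(logs))   # True (3% error rate)
```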
Latency
Latency certification uses end-to-end response times observed in live environments where available.
When live telemetry is missing, we run representative synthetic tasks in containerized environments to approximate startup and execution costs.
- Primary sources: usage telemetry for startup time and query response time
- Fallback: benchmark runs against container startup time and tool execution averages
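A minimal sketch of the telemetry-first, benchmark-fallback policy; the 60-second maximum and sub-30-second average mirror the startup targets in the pass/fail criteria below, while the function and parameter names are illustrative:

```python
from statistics import mean

def latency_pass(telemetry_startups_s: list[float] | None,
                 benchmark_startups_s: list[float]) -> bool:
    """Prefer live telemetry; fall back to containerized benchmark runs."""
    samples = telemetry_startups_s or benchmark_startups_s
    return max(samples) <= 60.0 and mean(samples) < 30.0

print(latency_pass(None, [12.4, 18.9, 25.1]))     # benchmark fallback: True
print(latency_pass([8.0, 70.0, 20.0], []))        # telemetry: False (max > 60s)
```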
Pass / Fail Criteria
An agent passes performance when:
- Tool selection accuracy meets threshold
- Parameter/format validity meets threshold
- Observed error rate stays below threshold
- Latency is within the target service level (e.g. startup time: a maximum of 1 minute and an average below 30 seconds)
Performance results are continuously updated from live telemetry when available and versioned per agent release.
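Conceptually, the verdict is a conjunction of threshold checks, as in the sketch below; apart from the startup targets above, the threshold values are illustrative placeholders, not the published thresholds:

```python
THRESHOLDS = {
    "tool_selection_accuracy": 0.90,   # illustrative
    "param_format_validity":   0.95,   # illustrative
    "max_error_rate":          0.05,   # illustrative
    "max_startup_s":           60.0,   # from the criteria above
    "avg_startup_s":           30.0,   # from the criteria above
}

def certify(metrics: dict) -> bool:
    """All criteria must hold for the agent to pass performance."""
    return (metrics["tool_selection_accuracy"] >= THRESHOLDS["tool_selection_accuracy"]
            and metrics["param_format_validity"] >= THRESHOLDS["param_format_validity"]
            and metrics["error_rate"] < THRESHOLDS["max_error_rate"]
            and metrics["max_startup_s"] <= THRESHOLDS["max_startup_s"]
            and metrics["avg_startup_s"] < THRESHOLDS["avg_startup_s"])

print(certify({"tool_selection_accuracy": 0.93, "param_format_validity": 0.97,
               "error_rate": 0.02, "max_startup_s": 41.0, "avg_startup_s": 19.5}))  # True
```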
What to include in your agent
- Tool specs with strict schemas and examples
- Clear error messages and remediation hints
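A hypothetical tool spec for the `backup_projects` example above, showing a strict parameter schema with an example call and an error message that carries a remediation hint; the parameter name and response shape are illustrative:

```python
TOOL_SPEC = {
    "name": "backup_projects",
    "description": "Back up projects updated within a given window.",
    "input_schema": {
        "type": "object",
        "properties": {"updated_within_days": {"type": "integer", "minimum": 1}},
        "required": ["updated_within_days"],
        "additionalProperties": False,        # strict: reject unknown parameters
    },
    "examples": [{"updated_within_days": 7}],
}

def run_backup(params: dict) -> dict:
    if "updated_within_days" not in params:
        # Clear error plus a remediation hint for the calling agent.
        return {"error": "Missing required parameter 'updated_within_days'.",
                "hint": "Pass an integer >= 1, e.g. {\"updated_within_days\": 7}."}
    return {"status": "ok", "backed_up_window_days": params["updated_within_days"]}

print(run_backup({}))
```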