Performance Certification
Measuring effectiveness, correctness, reliability, and latency under real and synthetic workloads.
Goal: Determine whether the agent performs its intended tasks effectively and efficiently.
Evaluation Criteria (Pass/Fail per category):
- Tool Selection Accuracy: Does the agent choose the correct tool for the user’s request?
- Invocation Correctness: Are tools called with proper formats, inputs, and parameters?
- Execution Reliability: Do tool calls complete successfully without recurring errors? (via error logging)
- Speed & Responsiveness: Does the agent meet expected performance latency targets? (averaged from usage telemetry)
Pass → Agent demonstrates correctness, consistency, and speed within thresholds.
Fail → Any persistent tool misuse, recurring errors, or latency above baseline.
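As a rough illustration, the four categories can be treated as independent verdicts that must all pass. A minimal sketch, assuming boolean per-category results; the field names are illustrative, not the pipeline's actual schema:

```python
from dataclasses import dataclass, asdict

@dataclass
class PerformanceCertification:
    """Illustrative per-category verdicts; field names are hypothetical."""
    tool_selection_accuracy: bool    # correct tool chosen for the request
    invocation_correctness: bool     # parameters and formats valid
    execution_reliability: bool      # error rate below threshold (from logs)
    latency_within_target: bool      # telemetry-averaged latency within target

    @property
    def passed(self) -> bool:
        # The agent passes only if every category passes.
        return all(asdict(self).values())

report = PerformanceCertification(True, True, True, False)
print("PASS" if report.passed else "FAIL")   # FAIL: latency above target
```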

What we measure
- Correct Tool Use (Can be Scaled)
  - Chooses the right tool for the user’s request
  - Supplies the correct parameter formats
- Execution Reliability (Requires Usage Data)
  - Tools run without errors (from aggregated error logs)
  - This step is difficult to scale automatically because many agents call external APIs or require specific setup. In these cases, collecting aggregated and anonymised error logs helps determine pass/fail performance.
- Latency (Requires Usage Data)
  - End-to-end response time; use observed usage data where available
  - Fall back to container startup time averages if live data is absent
Correct Tool Use
- Tasks, tools, and expected outcomes form the basis of automated checks
- Objective pass/fail signals are computed from the ground truth derived by static analysis
From the agent’s metadata, we generate tool-targeted tasks that can be evaluated objectively.
Examples:
- Figma MCP Agent → `get_file` must be chosen with the correct format to fetch a shared design file.
- Search MCP Server → `google_search` must retrieve and summarize current results with references.
- Backup MCP → `backup_projects` must select projects updated in the last 7 days.
- Research MCP → `research_paper_search` returns recent, concise paper summaries.
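To make these examples concrete, here is a minimal sketch of a tool-targeted task and a rule-based check on an observed tool call. The task structure, field names, and the `file_key` parameter are illustrative assumptions, not the certification pipeline's actual format:

```python
# A hypothetical tool-targeted synthetic task derived from agent metadata.
synthetic_task = {
    "prompt": "Fetch the shared design file for review.",
    "expected_tool": "get_file",
    "required_params": {"file_key"},   # illustrative required parameter
}

def check_tool_selection(task: dict, observed_call: dict) -> bool:
    """Pass if the agent picked the expected tool and supplied required params."""
    right_tool = observed_call.get("tool") == task["expected_tool"]
    has_params = task["required_params"] <= set(observed_call.get("params", {}))
    return right_tool and has_params

observed = {"tool": "get_file", "params": {"file_key": "abc123"}}
print(check_tool_selection(synthetic_task, observed))  # True
```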
Why synthetic tasks? They enable scalable, repeatable, and automation-friendly checks over many agents.
We statically analyze the agent code and tool descriptions to derive ground truth for:
- Expected tool selection for a task
- Required parameter shapes & formats
- Output structure and success criteria
This provides the ground truth on which to evaluate the synthetic tasks.
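As an illustration of the parameter-shape checks, the sketch below validates call parameters against a JSON Schema that static analysis might derive from a search tool's description. The schema and parameter names are hypothetical:

```python
from jsonschema import validate, ValidationError

# Hypothetical ground-truth schema for a search tool's parameters.
ground_truth_schema = {
    "type": "object",
    "properties": {
        "query": {"type": "string", "minLength": 1},
        "max_results": {"type": "integer", "minimum": 1, "maximum": 50},
    },
    "required": ["query"],
    "additionalProperties": False,
}

def params_are_valid(params: dict) -> bool:
    """Objective pass/fail signal for parameter shapes and formats."""
    try:
        validate(instance=params, schema=ground_truth_schema)
        return True
    except ValidationError:
        return False

print(params_are_valid({"query": "recent LLM papers", "max_results": 5}))  # True
print(params_are_valid({"max_results": 5}))                                # False: missing "query"
```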
MCP-Bench
The generation of these synthetic tasks is based on MCP-Bench, our state-of-the-art evaluation harness for quantifying agent behavior in real-world settings, which we have open-sourced to the community (Apache 2.0).
MCP-Bench automates task synthesis across real-world MCP servers and evaluates agent execution with rule- and LLM-based judges. This expertise in evaluating agentic and tool-use performance is the foundation of our performance certification.
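For intuition, the sketch below shows the general pattern of combining rule-based checks with an LLM judge. It does not reflect MCP-Bench's actual interfaces; the judge here is a stub standing in for a real model call:

```python
from typing import Callable

def evaluate_trace(trace: dict,
                   rule_checks: list[Callable[[dict], bool]],
                   llm_judge: Callable[[dict], float]) -> dict:
    """Rule checks gate hard failures; the LLM judge scores planning quality."""
    rules_passed = all(check(trace) for check in rule_checks)
    quality = llm_judge(trace) if rules_passed else 0.0
    return {"rules_passed": rules_passed, "quality_score": quality}

# Stub judge and rule check, for illustration only.
stub_judge = lambda trace: 0.72
schema_ok = lambda trace: trace.get("schema_valid", False)

print(evaluate_trace({"schema_valid": True}, [schema_ok], stub_judge))
```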
MCP-Bench overall scores highlight relative tool-use performance across the leading foundation models that power agents.
- Schema-level understanding is mostly saturated. Strong frontier models have little trouble calling tools and understanding their schemas, scoring very high on name validity and schema compliance (e.g. >98%).
- Higher-level reasoning remains the bottleneck. Differences between models are largest in planning, dependency awareness, and parallel orchestration.
- For instance, top models like GPT-5, o3, and GPT-OSS-120B achieve overall scores ~0.70–0.75, while smaller or weaker models lag significantly.
- Performance often degrades when tasks span multiple servers, especially for weaker models; top-tier models show much more stable behavior across single- and multi-server settings.
- The number of tool calls and rounds required highlights the complexity of tasks: even for good models, tasks often require multiple planning rounds and substantial tool invocation.
- While many models have mastered execution fidelity, the real frontier lies in long-horizon planning, cross-domain orchestration, and adaptive reasoning.
- Our benchmarking, whose techniques we also use to generate synthetic data for Agent Gallery, surfaces these gaps systematically.
| Resource | Link |
|---|---|
| Paper | ArXiv, to appear at NeurIPS (SEA) 25 |
| Code | Github |
| Leaderboard | Huggingface Spaces |
| License | Apache-2.0 |
Execution Reliability
We evaluate execution reliability by sampling real usage traces and verifying that tool calls complete without retriable or user-visible errors.
Production telemetry is necessary to provide pass/fail reliability signals through anonymised, aggregated error logs, because synthetic tasks and automated pipeline execution are limited by the large share of gated agents (i.e. those that require API keys or credentials).
- Acceptance criteria: error rates stay below threshold across recent releases
- Escalation path: recurring failures trigger manual review
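A minimal sketch of the reliability gate over aggregated, anonymised error logs; the record shape and the 5% threshold are illustrative assumptions, not the published acceptance criteria:

```python
ERROR_RATE_THRESHOLD = 0.05   # illustrative threshold

def execution_reliability_pass(log_records: list[dict]) -> bool:
    """Pass if the share of failed tool calls stays below the threshold."""
    if not log_records:
        return False                      # no usage data: cannot certify
    failures = sum(1 for r in log_records if r.get("status") == "error")
    return failures / len(log_records) < ERROR_RATE_THRESHOLD

logs = [{"status": "ok"}] * 97 + [{"status": "error"}] * 3
print(execution_reliability_pass(logs))   # True (3% error rate)
```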
Latency
Latency certification uses end-to-end response times observed in live environments where available.
When live telemetry is missing, we run representative synthetic tasks in containerized environments to approximate startup and execution costs.
- Primary sources: usage telemetry for startup time and query response time
- Fallback: benchmark runs against container startup time and tool execution averages
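A minimal sketch of the telemetry-first, benchmark-fallback policy; the 60-second maximum and sub-30-second average mirror the startup targets in the pass/fail criteria below, while the function and parameter names are illustrative:

```python
from statistics import mean

def latency_pass(telemetry_startups_s: list[float] | None,
                 benchmark_startups_s: list[float]) -> bool:
    """Prefer live telemetry; fall back to containerized benchmark runs."""
    samples = telemetry_startups_s or benchmark_startups_s
    return max(samples) <= 60.0 and mean(samples) < 30.0

print(latency_pass(None, [12.4, 18.9, 25.1]))     # benchmark fallback: True
print(latency_pass([8.0, 70.0, 20.0], []))        # telemetry: False (max > 60s)
```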
Pass / Fail Criteria
An agent passes performance when:
- Tool selection accuracy meets threshold
- Parameter/format validity meets threshold
- Observed error rate stays below threshold
- Latency is within the target service level (e.g. startup time: a maximum of 1 minute and an average below 30 seconds)
Performance results are continuously updated from live telemetry when available and versioned per agent release.
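Conceptually, the verdict is a conjunction of threshold checks, as in the sketch below; apart from the startup targets above, the threshold values are illustrative placeholders, not the published thresholds:

```python
THRESHOLDS = {
    "tool_selection_accuracy": 0.90,   # illustrative
    "param_format_validity":   0.95,   # illustrative
    "max_error_rate":          0.05,   # illustrative
    "max_startup_s":           60.0,   # from the criteria above
    "avg_startup_s":           30.0,   # from the criteria above
}

def certify(metrics: dict) -> bool:
    """All criteria must hold for the agent to pass performance."""
    return (metrics["tool_selection_accuracy"] >= THRESHOLDS["tool_selection_accuracy"]
            and metrics["param_format_validity"] >= THRESHOLDS["param_format_validity"]
            and metrics["error_rate"] < THRESHOLDS["max_error_rate"]
            and metrics["max_startup_s"] <= THRESHOLDS["max_startup_s"]
            and metrics["avg_startup_s"] < THRESHOLDS["avg_startup_s"])

print(certify({"tool_selection_accuracy": 0.93, "param_format_validity": 0.97,
               "error_rate": 0.02, "max_startup_s": 41.0, "avg_startup_s": 19.5}))  # True
```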
What to include in your agent
- Tool specs with strict schemas and examples
- Clear error messages and remediation hints
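A hypothetical tool spec for the `backup_projects` example above, showing a strict parameter schema with an example call and an error message that carries a remediation hint; the parameter name and response shape are illustrative:

```python
TOOL_SPEC = {
    "name": "backup_projects",
    "description": "Back up projects updated within a given window.",
    "input_schema": {
        "type": "object",
        "properties": {"updated_within_days": {"type": "integer", "minimum": 1}},
        "required": ["updated_within_days"],
        "additionalProperties": False,        # strict: reject unknown parameters
    },
    "examples": [{"updated_within_days": 7}],
}

def run_backup(params: dict) -> dict:
    if "updated_within_days" not in params:
        # Clear error plus a remediation hint for the calling agent.
        return {"error": "Missing required parameter 'updated_within_days'.",
                "hint": "Pass an integer >= 1, e.g. {\"updated_within_days\": 7}."}
    return {"status": "ok", "backed_up_window_days": params["updated_within_days"]}

print(run_backup({}))
```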