Byviz Analytics

📖 AgentBench for Elastic — Benchmark Methodology

Agentic AI Evaluation Framework for Elastic Agent Builder 9.3 — Complete methodology: evaluation process, claim decomposition, metrics and scoring explained step by step

AgentBench for Elastic is an Agentic AI Evaluation Framework — a specialized AI agent evaluation framework designed specifically for Elastic Agent Builder 9.3. It's not a generic language model benchmark: here we evaluate how each LLM behaves as a real agent, with real tools, real data, and real production constraints.

We use Claim Decomposition — an advanced evaluation system that breaks down each response into atomic claims and evaluates them individually for Correctness (geometric mean) and Groundedness (arithmetic mean). We measure tool calling, multi-step reasoning, adversarial input resistance, multi-turn context retention, accuracy, reliability, latency, and cost across 30 real tests against deterministic data.

📑 Table of Contents

  1. What is AgentBench for Elastic?
  2. How it works — Evaluation flow
  3. The Judge: GPT-5.2 and Claim Decomposition
  4. Claim-based evaluation: Correctness, Groundedness and more
  5. Automatically calculated metrics
  6. ★ Adjusted Overall — Weights, difficulty and failure penalty
  7. Test categories (11 categories)
  8. Difficulty levels (easy → expert)
  9. The 30 benchmark tests
  10. Tool Calling: what it is and why it matters
  11. Efficiency metrics
  12. How to interpret the report

1. What is AgentBench for Elastic?

AgentBench for Elastic (v1.0) is an end-to-end evaluation framework that measures how different language models (LLMs) perform as AI agents within Elastic Agent Builder 9.3. Unlike generic benchmarks (MMLU, HumanEval, etc.) that evaluate isolated model capabilities, AgentBench evaluates real agentic performance — the model integrated into a production stack, with real tools, data, and constraints. We measure:

  • 🔧 Using tools correctly — choosing the right tool (search, ES|QL, mapping...)
  • 🎯 Giving correct answers (Correctness) — verified against detailed ground truth through claim decomposition with geometric mean (a single critical error drives the score to 0)
  • ⚓ Grounding answers in real data (Groundedness) — each claim is compared against actual tool output with arithmetic mean (fair proportion of grounded claims)
  • 📋 Following instructions — format, filters, specific constraints
  • 🧠 Multi-step reasoning — analyze, search, combine results across indices
  • 💬 Maintaining context — remembering previous turns in conversations of up to 5 turns
  • 🛡️ Resisting adversarial inputs — non-existent fields, contradictory requests, impossible operations
  • 🔀 Working across indices — correlating data from multiple sources (cross-index)
  • 📐 Strict format compliance — responding with only JSON, only a number, or exactly N bullet points
  • ⚡ Being efficient — responding quickly and at low cost
⚠️ Important: The 30 tests (organized in 11 categories and 4 difficulty levels) run against two real indices: benchmark-ecommerce (1,000 e-commerce order documents) and benchmark-customers (20 customer profiles with tiers and segmentation). No simulated data or mocks — the agent interacts with real Elasticsearch through the Agent Builder API. The dataset is deterministic: always the same data so results are comparable between models and between runs.

2. How it works — Evaluation flow

Each test follows this 6-step flow:

📝 Test (predefined question) → 🤖 Agent Builder (Kibana API + LLM) → 🔧 Tools (Search, ES|QL, Mapping) → 💬 Response (from agent to user) → 🧑‍⚖️ Judge GPT-5.2 (Claim Decomposition) → 📊 Scoring (Geometric + Arithmetic)

Step by step:

  1. The question is sent to the agent through the Kibana API (POST /api/agent_builder/converse). In multi-turn tests, multiple turns are sent in the same conversation.
  2. The LLM decides what to do — reasons, chooses tools, executes queries against Elasticsearch. The agent has access to search, ES|QL, mappings and index listing.
  3. Everything is captured — the response, tools used (with their actual outputs), latency (TTLT), tokens (input/output), real cost from the provider via OpenRouter.
  4. The judge (GPT-5.2) decomposes the response into claims — each atomic claim is individually evaluated for Correctness (vs ground truth) and Groundedness (vs tool output), assigning verdicts and severity. The judge receives the actual tool output (truncated to 6,000 chars) to verify hallucinations.
  5. Metrics are calculated locally — Correctness with geometric mean (a critical error drives the score to 0), Groundedness with arithmetic mean (fair proportion of grounded claims), plus programmatic metrics (latency, cost, overlap-based tool calling, error rate).
  6. Weights, difficulty and failure penalty are applied — the final score (★ Adjusted Overall) weighs by test difficulty (easy ×0.7, medium ×1.0, hard ×1.3, expert ×1.6) and applies a failure penalty that penalizes models with timeouts or errors. Everything is stored in Elasticsearch with a model field for historical tracking.

3. The Judge: GPT-5.2 and Claim Decomposition

🧑‍⚖️ Why an LLM as judge?

Evaluating AI responses isn't trivial — you can't compare strings, because an answer can be correct in many different ways. We use OpenAI GPT-5.2 as "LLM-as-Judge" because:

  • It can understand semantics — if the answer says the same thing with different words
  • It can evaluate nuances — partially correct, appropriate format but incomplete data
  • It can verify hallucinations — comparing the answer with real tool output
  • It can decompose into claims — extract each atomic claim and evaluate it individually

📦 What does the judge receive?

For each test, GPT-5.2 receives a structured prompt with:

  • The original question asked to the agent
  • Detailed ground truth — the expected correct answer with verified concrete data (exact quantities, names, values)
  • Exact answer (if applicable) — a precise numeric value the agent must provide
  • Actual tool output — the data Elasticsearch returned, truncated to 6,000 chars with truncation notice
  • The agent's response — what it finally told the user

🧩 Claim Decomposition — The key to evaluation

Instead of requesting generic global scores, the judge decomposes the response into atomic claims — each individual factual assertion is evaluated separately. This enables detecting:

  • Partial errors — a response can be 90% correct but have one critical data point wrong
  • Specific hallucinations — identifying exactly which claim has no support in the data
  • Peripheral vs central claims — an error in a minor detail weighs less than an error in the key data point

Each claim receives two independent evaluations: Correctness (compared against ground truth) and Groundedness (compared against actual tool output). This detects both factual errors and hallucinations.

🔍 Anti-hallucination: The judge compares the agent's response against actual tool output. If the output is truncated (explicitly indicated as ⚠️ TRUNCATED), the judge does not penalize data it can't verify — it only penalizes information that clearly contradicts what the tools returned. Additionally, if the agent presents information as general knowledge (without attributing it to tools), it's classified as DISCLOSED_UNGROUNDED and penalized less than a direct hallucination.

4. Claim-based evaluation: Correctness, Groundedness and more

The judge does not assign generic global scores. Instead, it decomposes the response into atomic claims (individual factual assertions) and evaluates each one across two independent dimensions. It also gives simple scores for format and instruction following.

🏷️ Centrality of each claim

Each claim is classified as:

  • Central — Essential to answering the user's question (counts, names, key data). An error here is severe.
  • Peripheral — Additional context, formatting notes, general advice. An error here weighs less.

🎯 Correctness — Is it factually correct? (vs Ground Truth)

Each claim is compared against verified ground truth to determine if it's correct:

| Verdict | Meaning | Score (central) |
| --- | --- | --- |
| FULLY_SUPPORTED | Completely matches the ground truth | 1.0 |
| PARTIALLY_SUPPORTED | Partially correct with minor inaccuracies | 0.70 |
| CONTRADICTED | Directly contradicts the ground truth | 0.0 (critical) |
| NOT_VERIFIABLE | Cannot be verified from the available ground truth | 0.85 |
Aggregation: Geometric mean of all claim scores
→ A single CONTRADICTED claim (critical, central) drives ALL Correctness to 0.0
→ This is intentional: a critical factual error makes the response useless
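The aggregation above can be sketched in a few lines of Python. The verdict-to-score mapping follows the table; the function and constant names are illustrative, not the benchmark's actual code:

```python
import math

# Verdict → score for central claims, per the table above (illustrative names).
CORRECTNESS_SCORES = {
    "FULLY_SUPPORTED": 1.0,
    "PARTIALLY_SUPPORTED": 0.70,
    "CONTRADICTED": 0.0,      # a critical central error zeroes everything
    "NOT_VERIFIABLE": 0.85,
}

def correctness(verdicts):
    """Geometric mean of per-claim scores; 0 if any claim scores 0."""
    scores = [CORRECTNESS_SCORES[v] for v in verdicts]
    if any(s == 0.0 for s in scores):
        return 0.0
    return math.exp(sum(math.log(s) for s in scores) / len(scores))

print(round(correctness(["FULLY_SUPPORTED", "PARTIALLY_SUPPORTED"]), 3))  # 0.837
print(correctness(["FULLY_SUPPORTED", "CONTRADICTED"]))                   # 0.0
```

Note how a single CONTRADICTED claim collapses an otherwise strong response to 0.0, which is exactly the intent described above.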

⚓ Groundedness — Is it based on real data? (vs Tool Output)

Each claim is compared against what the tools actually returned, to detect hallucinations:

| Verdict | Meaning | Score (central) |
| --- | --- | --- |
| GROUNDED | Directly supported by the tool output | 1.0 |
| PARTIALLY_GROUNDED | Partially supported by the tool output | 0.70 |
| DISCLOSED_UNGROUNDED | Not in the output, but the agent explicitly presents it as general knowledge | 0.60 |
| UNGROUNDED | No basis in tool output (potentially hallucinated) | 0.0 |
Aggregation: Arithmetic mean of all claim scores
→ Unlike Correctness, geometric mean is NOT used here
→ Reason: "not verifiable from tool output" ≠ "incorrect". More detailed models generate more claims, and with geometric mean a single peripheral DISCLOSED_UNGROUNDED claim would drive the score to 0 — unfairly penalizing complete responses.
→ Arithmetic mean reflects the proportion of grounded claims, which is the fair metric.
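The contrast with Correctness is easy to see in code. Again a sketch with illustrative names, using the verdict scores from the table above:

```python
# Verdict → score for central claims, per the table above (illustrative names).
GROUNDEDNESS_SCORES = {
    "GROUNDED": 1.0,
    "PARTIALLY_GROUNDED": 0.70,
    "DISCLOSED_UNGROUNDED": 0.60,
    "UNGROUNDED": 0.0,
}

def groundedness(verdicts):
    """Arithmetic mean: the proportion-style score described above."""
    scores = [GROUNDEDNESS_SCORES[v] for v in verdicts]
    return sum(scores) / len(scores)

# One DISCLOSED_UNGROUNDED claim among four grounded ones costs little;
# a geometric mean would have dragged the whole score toward 0.60.
print(round(groundedness(["GROUNDED"] * 4 + ["DISCLOSED_UNGROUNDED"]), 3))  # 0.92
```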

⚠️ Severity — Only for CONTRADICTED or UNGROUNDED

When a claim is incorrect or ungrounded, a severity level modifies the impact:

  • critical — Wrong numbers, wrong names, wrong status. Could cause wrong decisions.
  • major — Significant factual error but less impactful.
  • minor — Small imprecision: rounding, slight paraphrase.

Example: a claim "There are 50 orders from Madrid" when ground truth says 49 → CONTRADICTED + severity "minor" (score 0.50 for central). But a claim "The top customer is John Smith" when it's "Hans Mueller" → CONTRADICTED + severity "critical" (score 0.0 for central).

📐 Format Score and 📝 Instruction Following Score

In addition to claims, the judge assigns two simple scores (0–10):

  • Format Score — Is the response well structured? Does it use markdown, tables, lists when appropriate?
  • Instruction Following Score — Did the agent respect specific constraints? (table format, number of bullets, filters, specific index…)

📊 Relevance — Proportion of central claims

Automatically calculated as the proportion of claims marked as "central" out of total claims. A high score indicates the response is concise and relevant — not much peripheral "filler".

Relevance = n_central_claims / n_total_claims
0.9 → 90% of the response is information relevant to the question
📌 Note on multi-turn: In multi-turn conversational tests, the judge evaluates the final response but has context from all previous turns. This allows evaluating whether the agent maintains coherence and remembers information from previous turns in conversations of up to 5 turns.

5. Automatically calculated metrics

In addition to the judge's claim-based evaluations, these metrics are calculated objectively and programmatically (no LLM):

⏱️ Latency Score

Based on total response time (Time To Last Token). Linear interpolation between thresholds:

< 5s → 10.0 (excellent)
5–15s → 10.0–7.0 (linear interpolation)
15–45s → 7.0–4.0 (linear interpolation)
> 45s → 4.0–1.0 (decreasing penalty)
Timeouts (≥120s) receive score ≈ 1.0 and latency is recorded as 120s (not 0s)
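A sketch of this piecewise-linear scoring. The shape of the >45s "decreasing penalty" is an assumption here (linear down to 1.0 at the 120s timeout); the Cost Score below follows the same interpolation pattern with its own thresholds:

```python
def interp(x, x0, x1, y0, y1):
    """Linear interpolation of x from [x0, x1] onto [y0, y1]."""
    t = (x - x0) / (x1 - x0)
    return y0 + t * (y1 - y0)

def latency_score(seconds):
    """Piecewise-linear latency score per the thresholds above (a sketch;
    the >45s segment is assumed linear down to 1.0 at 120s)."""
    if seconds < 5:
        return 10.0
    if seconds <= 15:
        return interp(seconds, 5, 15, 10.0, 7.0)
    if seconds <= 45:
        return interp(seconds, 15, 45, 7.0, 4.0)
    return max(1.0, interp(seconds, 45, 120, 4.0, 1.0))

print(latency_score(10))   # 8.5
print(latency_score(30))   # 5.5
print(latency_score(120))  # 1.0
```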

💰 Cost Score

Based on real cost in USD per query, obtained from the provider via OpenRouter (real provider price, not base price):

< $0.005 → 10.0 (excellent)
$0.005–$0.02 → 10.0–7.0 (interpolation)
$0.02–$0.08 → 7.0–4.0 (interpolation)
> $0.08 → 4.0–1.0 (expensive)

🔧 Tool Calling Score (Overlap)

Measures if the model used at least one of the expected tools. Based on set overlap:

expected ∩ actual ≠ ∅ → 10.0 (at least one correct tool used)
expected ∩ actual = ∅ → 0.0 (no expected tool was used)
No expected tools → 10.0 (any usage is valid)

Why overlap and not F1? In practice, smart models sometimes use additional tools (e.g., checking the mapping before searching). This is proactive and desirable behavior, not an error. The overlap approach rewards using at least one correct tool without penalizing extra tools.
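The overlap rule fits in a few lines (an illustrative sketch, not the benchmark's actual code):

```python
def tool_calling_score(expected, actual):
    """Overlap-based score: 10 if at least one expected tool was used.
    Extra (proactive) tools are never penalized."""
    expected, actual = set(expected), set(actual)
    if not expected:
        return 10.0  # no expectation: any usage is valid
    return 10.0 if expected & actual else 0.0

# A model that checks the mapping before searching still scores 10:
print(tool_calling_score(
    {"platform.core.search"},
    {"platform.core.get_index_mapping", "platform.core.search"},
))  # 10.0
```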

🚫 Error Rate Score

Penalizes technical errors in tool execution (tool timeouts, API errors, etc.):

0 errors → 10.0
Each error subtracts 3 points: score = max(0, 10 - errors × 3)

✅ Exact Answer Check

For tests with an exact numeric answer (e.g., "How many orders are there?"), it's automatically verified whether the agent's response contains the correct value:

Exact match → ✓ match
Within 0.1% → ✓ match (numeric_close)
Within 5% → ~ approximate
Out of range → ✗ no match
For integers, comparison is strict (49 ≠ 50)
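A sketch of these tolerance bands (a hypothetical helper; the real check may handle extra edge cases, such as an expected value of 0):

```python
def exact_answer_check(expected, found):
    """Tolerance bands from the list above. Integers compare strictly;
    floats allow 0.1% (match) and 5% (approximate) relative error."""
    if isinstance(expected, int) and isinstance(found, int):
        return "match" if expected == found else "no_match"
    rel = abs(found - expected) / abs(expected)
    if rel <= 0.001:
        return "match"        # numeric_close
    if rel <= 0.05:
        return "approximate"
    return "no_match"

print(exact_answer_check(49, 50))          # no_match (strict for integers)
print(exact_answer_check(1234.5, 1234.9))  # match (within 0.1%)
print(exact_answer_check(100.0, 103.0))    # approximate (within 5%)
```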

6. ★ Adjusted Overall — Weights, difficulty and failure penalty

The ★ Adjusted Overall score is the benchmark's primary metric. It's calculated in three phases: first the weighted average by metrics, then difficulty weighting, and finally failure penalty.

📊 Phase 1: Weighted metric average

Each test receives an overall_weighted based on these weights:

| Metric | Weight | Source | What it measures |
| --- | --- | --- | --- |
| 🎯 Correctness | 25% | Claims (geom.) | Is the data factually correct? |
| ⚓ Groundedness | 20% | Claims (arith.) | Is it based on tool data? |
| 🔧 Tool Calling | 15% | Programmatic | Used the correct tools? |
| ⏱️ Latency | 10% | Programmatic | Responded quickly? |
| 📝 Instruction Following | 10% | Judge (simple) | Followed instructions? |
| 🚫 Error Rate | 10% | Programmatic | Had tool execution errors? |
| 💰 Cost | 5% | Programmatic | Is it economical? |
| 📊 Relevance | 5% | Claims (ratio) | Is it concise and relevant? |
overall_weighted = Σ (weight_i × score_i)
Example: 0.25×8.5 + 0.20×9.0 + 0.15×10.0 + 0.10×4.2 + ... = 7.83
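Phase 1 can be sketched as follows (weights from the table above; names and the example scores are illustrative):

```python
# Per-metric weights from the table above (all scores on a 0-10 scale).
WEIGHTS = {
    "correctness": 0.25, "groundedness": 0.20, "tool_calling": 0.15,
    "latency": 0.10, "instruction_following": 0.10, "error_rate": 0.10,
    "cost": 0.05, "relevance": 0.05,
}

def overall_weighted(scores):
    """Weighted average of the eight per-test metrics."""
    return sum(WEIGHTS[m] * scores[m] for m in WEIGHTS)

print(round(overall_weighted({
    "correctness": 8.5, "groundedness": 9.0, "tool_calling": 10.0,
    "latency": 4.2, "instruction_following": 8.0, "error_rate": 10.0,
    "cost": 9.0, "relevance": 8.0,
}), 3))  # 8.495
```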

🎚️ Phase 2: Difficulty weighting

Harder tests weigh more in the model's global score. Each difficulty has a multiplier:

easy × 0.7    medium × 1.0    hard × 1.3    expert × 1.6

model_overall = Σ(score_i × diff_weight_i) / Σ(diff_weight_i)
→ Passing an "expert" test contributes 2.3× more than an "easy" test
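A sketch of the difficulty-weighted mean (illustrative names):

```python
# Multipliers from the difficulty table above.
DIFF_WEIGHT = {"easy": 0.7, "medium": 1.0, "hard": 1.3, "expert": 1.6}

def model_overall(results):
    """Difficulty-weighted mean of per-test scores.
    `results` is a list of (overall_weighted, difficulty) pairs."""
    num = sum(score * DIFF_WEIGHT[d] for score, d in results)
    den = sum(DIFF_WEIGHT[d] for _, d in results)
    return num / den

# An 8.0 on an expert test pulls the mean up more than a 6.0 on an easy
# test pulls it down, because the expert test weighs 1.6 vs 0.7:
print(round(model_overall([(8.0, "expert"), (6.0, "easy")]), 3))  # 7.391
```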

⚠️ Phase 3: Failure Penalty

Models with timeouts or errors receive a penalty proportional to their failure rate:

failure_penalty = pass_rate ^ severity
severity = 1.2 (configurable)

★ Adjusted Overall = model_overall × failure_penalty

Examples:
100% pass → penalty = 1.000 (no penalty)
90% pass → penalty = 0.881 (penalty -11.9%)
80% pass → penalty = 0.765 (penalty -23.5%)
70% pass → penalty = 0.652 (penalty -34.8%)

Why? A model that fails 3 out of 30 tests doesn't just lose those 3 scores — it also loses credibility as a reliable agent. The failure penalty reflects this operational risk.
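The penalty curve follows directly from the formula (a sketch):

```python
def failure_penalty(pass_rate, severity=1.2):
    """pass_rate in [0, 1]; severity > 1 punishes unreliability superlinearly."""
    return pass_rate ** severity

for rate in (1.0, 0.9, 0.8, 0.7):
    print(f"{rate:.0%} pass -> penalty {failure_penalty(rate):.3f}")
```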

Why does Correctness have the most weight (25%)? Because an agent that gives incorrect answers is useless, no matter how fast or cheap it is. Groundedness (20%) is second because an agent that hallucinates data is dangerous. Tool Calling (15%) is third because an agent that can't choose tools doesn't scale to real use cases.

7. Test categories (11 categories)

The 30 tests are grouped into 11 categories covering different agent capabilities:

🔍
Tool Usage

Basic tool usage tests: list indices, get mappings, execute filtered searches. Validates the agent knows how to interact with Elasticsearch.

4 tests · Easy–Hard
📊
Analytics (ES|QL)

Analytical queries: count documents, group by category, calculate averages with filters. The agent can use ES|QL internally via search.

3 tests · Easy–Hard
🧠
Reasoning

Multi-step reasoning: analyze a mapping then build queries, or analyze distributions and provide an interpretive summary.

2 tests · Medium–Hard
📝
Instruction Following

Strict instruction following: markdown table format, exactly 3 bullet points, specific format constraints.

2 tests · Medium
💬
Multi-turn

2–5 turn conversations where each question depends on the previous one. Evaluates context retention, error correction and progressive refinement.

3 tests · Medium–Hard
Edge Cases

Edge cases: non-existent products, vague questions without clear context, indices that don't exist. Does the agent handle errors and ambiguity correctly?

3 tests · Easy–Medium
🎯
Exact Answer

Tests with a verifiable exact numeric answer: filtered counts, amount sums, cardinality counts. Verified programmatically.

3 tests · Easy–Medium
🔀
Cross-Index

Tests requiring data correlation between benchmark-ecommerce and benchmark-customers. Requires multiple tool calls and cross-source reasoning.

2 tests · Hard–Expert
🛡️
Adversarial

Inputs designed to confuse the agent: non-existent fields, contradictory requests (status=cancelled AND delivered), impossible operations (SQL JOINs).

3 tests · Hard–Expert
🏆
Expert

Expert-level tests: complex derived calculations (top 3 by revenue with metrics), Q3 vs Q4 temporal analysis, and deep 5-turn multi-turn with final summary.

3 tests · Expert
📐
Format Strict

Strict response formats: "respond only with JSON, no markdown or explanation" or "respond only with a number, nothing else". Evaluates extreme format adherence.

2 tests · Medium–Hard
🆕 Included in v1.0: The categories Exact Answer, Cross-Index, Adversarial, Expert and Format Strict cover more demanding, realistic scenarios that better differentiate between models of different capability. Additionally, the expert-level tests (temporal analysis, deep 5-turn multi-turn, complex derived calculations) stress even the most powerful models.

8. Difficulty levels (easy → expert)

easy  Easy — 6 tests  (weight ×0.7)

Single step, single tool. Example: "List available indices", "Count how many documents there are", "Search for a non-existent product".

medium  Medium — 9 tests  (weight ×1.0)

Requires choosing the correct tool, applying filters, or following format instructions. Example: "List categories in markdown table format", "Search for orders from customers in Madrid", "Correct the error from my previous query".

hard  Hard — 10 tests  (weight ×1.3)

Multi-step, multi-turn, complex analytical queries, or adversarial inputs. Example: "Based on the mapping, write a query to find the top 3 customers", "Search for orders that are cancelled AND delivered at the same time".

expert  Expert — 5 tests  (weight ×1.6)

Temporal analysis, complex derived calculations, cross-index, 5-turn multi-turn, and impossible operations. Example: "Compare Q3 vs Q4 revenue by category", "Find orders from Gold customers crossing two indices", "Analyze cancellation trends by quarter and payment method in a 5-turn conversation".

💡 Impact of difficulty weighting: An expert test passed with an 8.0 contributes 8.0 × 1.6 = 12.8 to the numerator, while an easy test with the same score contributes 8.0 × 0.7 = 5.6. This clearly differentiates models that pass hard tests from those that only pass easy ones.

9. The 30 benchmark tests

The 30 tests run against two real indices with deterministic data: benchmark-ecommerce (1,000 e-commerce order documents) and benchmark-customers (20 customer profiles with tiers and segmentation).

The dataset is deterministic — generated by a script with a fixed seed so results are comparable between models and between runs. Each test has a detailed ground truth with verified concrete data (exact quantities, names, values), and some tests include a programmatically verifiable exact_answer.

💡 In the report: Click on any test row to expand the complete detail: the original question, all turns (in multi-turn), the ground truth, the agent's response, the claim-by-claim judge evaluation (with Correctness and Groundedness verdicts), expected vs used tools, individual scores and the Exact Answer Check if applicable.

10. Tool Calling: what it is and why it matters

What tools does the agent have?

The Elastic Agent Builder agent (elastic-ai-agent) has access to these tools:

  • platform.core.search — Search documents in Elasticsearch. Can generate and execute ES|QL queries internally.
  • platform.core.execute_esql — Execute pre-prepared ES|QL queries (doesn't write them, only executes).
  • platform.core.generate_esql — Generate an ES|QL query from natural language.
  • platform.core.get_index_mapping — View the schema/mapping of an index.
  • platform.core.list_indices — List available indices.

Important note: In practice, platform.core.search can generate and execute ES|QL internally. This means that when a test asks to "use ES|QL", the agent can use platform.core.search and still execute ES|QL correctly. Therefore, the benchmark accepts both tools as valid for ES|QL tests.

How is the Tool Calling Score calculated?

An overlap approach (not F1) is used to be fair to smart models:

  • If the model used at least one of the expected tools10.0
  • If it didn't use any of the expected tools0.0
  • Extra (proactive) tools → Not penalized. A model that checks the mapping before searching is being smart.

In the report, if you see a 0.0 with a ≠ icon, it means the model used a different tool than expected. Click on the row to see the detail: expected vs used tools and the judge's evaluation.

Additional Tool metrics

  • Tool Exec Rate — % of tool calls that executed without technical error (API errors, tool timeouts).
  • Tokens/Tool Call — Average tokens per tool call (lower = more efficient). Includes the tool response payload.
  • Tool Retries — Number of times the agent retried a failed tool (resilience indicator).

11. Efficiency metrics

💲 Quality / Dollar

How much quality do you get per dollar spent?

Quality/$ = Σ(overall_weighted per test) / total_cost_USD
Higher = better quality-price ratio. A cheap but bad model can have a high ratio; an expensive but excellent model can beat it.

⚡ Quality / Second

How much quality do you get per second of waiting?

Quality/s = Σ(overall_weighted per test) / total_wall_time_seconds
Higher = better time performance. Timeouts (recorded as 120s) penalize this metric.

📊 Dual latency

The report shows two latency values to avoid distortions:

  • Avg Latency (OK) — Average of successful tests only (no timeouts). Reflects the model's real speed when it works.
  • Avg Latency (all) — Average including timeouts (recorded as 120s instead of 0s). Reflects real operational reliability.

Timed-out tests show ≥120s in the latency column and no value in the cost column (since token generation didn't complete).

📈 Consistency (passed only)

Measures score stability across tests that completed successfully:

Std Dev σ = standard deviation of overall_weighted (successful tests only)
σ < 1.0 = very consistent | σ > 3.0 = very erratic

Consistency is shown only for successful tests so timeouts don't distort the metric. High consistency (low σ) indicates the model is predictable — you know what to expect.

🛡️ Reliability Score

Synthetic score combining pass rate, consistency and error absence:

reliability = (pass_rate × 0.5) + (consistency_score × 0.3) + (error_absence × 0.2)
A model that passes all tests, is consistent, and has no errors gets a reliability close to 10.
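As a sketch (assuming all three components are normalized to the same 0–10 scale, which the "close to 10" example implies but the report doesn't state explicitly):

```python
def reliability(pass_rate, consistency_score, error_absence):
    """Weighted blend per the formula above. All components are assumed
    to be on a 0-10 scale (an assumption, not stated in the report)."""
    return pass_rate * 0.5 + consistency_score * 0.3 + error_absence * 0.2

# Full pass rate, good consistency, no errors -> close to 10:
print(round(reliability(10.0, 9.0, 10.0), 2))  # 9.7
```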

12. How to interpret the report

📊 Comparison table

The main table orders models by ★ Adjusted Overall. Click on a model name to jump to its detailed card. Columns show:

  • ★ Adjusted — Final score adjusted by difficulty and failure penalty (0–10)
  • Correct — Claim-based Correctness (geometric mean)
  • Ground — Claim-based Groundedness (arithmetic mean)
  • Tool — Tool calling score (overlap)
  • Lat (OK/all) — Average latency (successful only / including timeouts)
  • Cost — Total cost of all tests
  • Pass — Tests completed without timeout or error

🤖 Model cards

Each model has a detailed card with:

  • Failure badges — ⏱ Timeouts (amber), 🔧 Tool mismatch (purple), ❌ Errors (red) — with specific counts
  • Main KPIs — ★ Adjusted Overall, Pass Rate %, Avg Latency (OK and all), Total Cost
  • Score bars — Visual breakdown of each metric: Correctness (geom.), Groundedness (arith.), Relevance, Tool Calling, Latency, Cost, Instruction Following, Error Rate
  • Efficiency — Quality/$, Quality/s, Tokens/tool, with formulas visible alongside
  • Consistency — Std Dev σ, Min, Max, Median (successful tests only). Shows if the model is predictable or erratic
  • Test table — Each test with all scores, difficulty, and status. Click to expand full detail

🔎 Test detail panel

Clicking any test row expands a panel with:

  • Original question — The question asked to the agent (all turns in multi-turn)
  • Ground Truth — The expected correct answer with concrete data
  • Agent response — What it answered (complete)
  • Judge evaluation — GPT-5.2's complete reasoning
  • Claims Analysis — Each individual claim with Correctness and Groundedness verdicts, centrality and explanation
  • Tools — Expected vs Used, with ≠ icon if they don't match
  • Exact Answer Check — If applicable, expected vs found value
  • 🔗 Permalink — Direct link to each test for sharing

📈 Charts

  • 🎯 Overall Score Comparison — Horizontal bars per model with ★ Adjusted Overall. Quick visual comparison.
  • 🕸️ Multi-Dimensional Radar — 10 dimensions: Correctness, Groundedness, Tool Calling, Latency, Cost, Instruction Following, Error Rate, Relevance, Format, Reliability. Shows each model's complete profile.
  • 📂 Score by Category — Which model is best at analytics? At cross-index? At adversarial?
  • 🎚️ Score by Difficulty — How does performance scale from easy → medium → hard → expert?
  • ⚡ Latency vs Quality — Scatter plot (successful tests only). Ideal: top-left (fast and good).

Have questions about any result?

Every score has complete context. Click on any test to see the question, turns, ground truth, agent's response, individual claims, judge's evaluation and tools used. You can share any individual test with its permalink.

View the results report →