Byviz Analytics

📖 AgentBench for Elastic — Benchmark Methodology

Agentic AI Evaluation Framework for Elastic Agent Builder 9.3 — Complete methodology: evaluation process, claim decomposition, metrics and scoring explained step by step

AgentBench for Elastic is an Agentic AI Evaluation Framework — a specialized AI agent evaluation framework designed specifically for Elastic Agent Builder 9.3. It's not a generic language model benchmark: here we evaluate how each LLM behaves as a real agent, with real tools, real data, and real production constraints.

We use Claim Decomposition — an advanced evaluation system that breaks down each response into atomic claims and evaluates them individually for Correctness (geometric mean) and Groundedness (arithmetic mean). We measure tool calling, multi-step reasoning, adversarial input resistance, multi-turn context retention, accuracy, reliability, latency, and cost across 30 real tests against deterministic data.

📑 Table of Contents

  1. What is AgentBench for Elastic?
  2. How it works — Evaluation flow
  3. The Judge: GPT-5.2 and Claim Decomposition
  4. Claim-based evaluation: Correctness, Groundedness and more
  5. Automatically calculated metrics
  6. ★ Adjusted Overall — Weights, difficulty and failure penalty
  7. Test categories (11 categories)
  8. Difficulty levels (easy → expert)
  9. The 30 benchmark tests
  10. Tool Calling: what it is and why it matters
  11. Efficiency metrics
  12. How to interpret the report

1. What is AgentBench for Elastic?

AgentBench for Elastic (v1.0) is an end-to-end evaluation framework that measures how different language models (LLMs) perform as AI agents within Elastic Agent Builder 9.3. Unlike generic benchmarks (MMLU, HumanEval, etc.) that evaluate isolated model capabilities, AgentBench evaluates real agentic performance — the model integrated into a production stack, with real tools, data, and constraints. We measure:

  • 🔧 Using tools correctly — choosing the right tool (search, ES|QL, mapping...)
  • 🎯 Giving correct answers (Correctness) — verified against detailed ground truth through claim decomposition with geometric mean (a single critical error drives the score to 0)
  • ⚓ Grounding answers in real data (Groundedness) — each claim is compared against actual tool output with arithmetic mean (fair proportion of grounded claims)
  • 📋 Following instructions — format, filters, specific constraints
  • 🧠 Multi-step reasoning — analyze, search, combine results across indices
  • 💬 Maintaining context — remembering previous turns in conversations of up to 5 turns
  • 🛡️ Resisting adversarial inputs — non-existent fields, contradictory requests, impossible operations
  • 🔀 Working across indices — correlating data from multiple sources (cross-index)
  • 📐 Strict format compliance — responding with only JSON, only a number, or exactly N bullet points
  • ⚡ Being efficient — responding quickly and at low cost
⚠️ Important: The 30 tests (organized in 11 categories and 4 difficulty levels) run against two real indices: benchmark-ecommerce (1,000 e-commerce order documents) and benchmark-customers (20 customer profiles with tiers and segmentation). No simulated data or mocks — the agent interacts with real Elasticsearch through the Agent Builder API. The dataset is deterministic: always the same data so results are comparable between models and between runs.

2. How it works — Evaluation flow

Each test follows this 6-step flow:

📝 Test (predefined question) → 🤖 Agent Builder (Kibana API + LLM) → 🔧 Tools (Search, ES|QL, Mapping) → 💬 Response (from agent to user) → 🧑‍⚖️ Judge GPT-5.2 (Claim Decomposition) → 📊 Scoring (Geometric + Arithmetic)

Step by step:

  1. The question is sent to the agent through the Kibana API (POST /api/agent_builder/converse). In multi-turn tests, multiple turns are sent in the same conversation.
  2. The LLM decides what to do — reasons, chooses tools, executes queries against Elasticsearch. The agent has access to search, ES|QL, mappings and index listing.
  3. Everything is captured — the response, tools used (with their actual outputs), latency (TTLT), tokens (input/output), real cost from the provider via OpenRouter.
  4. The judge (GPT-5.2) decomposes the response into claims — each atomic claim is individually evaluated for Correctness (vs ground truth) and Groundedness (vs tool output), assigning verdicts and severity. The judge receives the actual tool output (truncated to 6,000 chars) to verify hallucinations.
  5. Metrics are calculated locally — Correctness with geometric mean (a critical error drives the score to 0), Groundedness with arithmetic mean (fair proportion of grounded claims), plus programmatic metrics (latency, cost, overlap-based tool calling, error rate).
  6. Weights, difficulty and failure penalty are applied — the final score (★ Adjusted Overall) weighs by test difficulty (easy ×0.7, medium ×1.0, hard ×1.3, expert ×1.6) and applies a failure penalty that penalizes models with timeouts or errors. Everything is stored in Elasticsearch with a model field for historical tracking.

3. The Judge: GPT-5.2 and Claim Decomposition

🧑‍⚖️ Why an LLM as judge?

Evaluating AI responses isn't trivial — you can't compare strings, because an answer can be correct in many different ways. We use OpenAI GPT-5.2 as "LLM-as-Judge" because:

  • It can understand semantics — if the answer says the same thing with different words
  • It can evaluate nuances — partially correct, appropriate format but incomplete data
  • It can verify hallucinations — comparing the answer with real tool output
  • It can decompose into claims — extract each atomic claim and evaluate it individually

📦 What does the judge receive?

For each test, GPT-5.2 receives a structured prompt with:

  • The original question asked to the agent
  • Detailed ground truth — the expected correct answer with verified concrete data (exact quantities, names, values)
  • Exact answer (if applicable) — a precise numeric value the agent must provide
  • Actual tool output — the data Elasticsearch returned, truncated to 6,000 chars with truncation notice
  • The agent's response — what it finally told the user

🧩 Claim Decomposition — The key to evaluation

Instead of requesting generic global scores, the judge decomposes the response into atomic claims — each individual factual assertion is evaluated separately. This enables detecting:

  • Partial errors — a response can be 90% correct but have one critical data point wrong
  • Specific hallucinations — identifying exactly which claim has no support in the data
  • Peripheral vs central claims — an error in a minor detail weighs less than an error in the key data point

Each claim receives two independent evaluations: Correctness (compared against ground truth) and Groundedness (compared against actual tool output). This detects both factual errors and hallucinations.

🔍 Anti-hallucination: The judge compares the agent's response against actual tool output. If the output is truncated (explicitly indicated as ⚠️ TRUNCATED), the judge does not penalize data it can't verify — it only penalizes information that clearly contradicts what the tools returned. Additionally, if the agent presents information as general knowledge (without attributing it to tools), it's classified as DISCLOSED_UNGROUNDED and penalized less than a direct hallucination.

4. Claim-based evaluation: Correctness, Groundedness and more

The judge does not assign generic global scores. Instead, it decomposes the response into atomic claims (individual factual assertions) and evaluates each one across two independent dimensions. It also gives simple scores for format and instruction following.

🏷️ Centrality of each claim

Each claim is classified as:

  • Central — Essential to answering the user's question (counts, names, key data). An error here is severe.
  • Peripheral — Additional context, formatting notes, general advice. An error here weighs less.

🎯 Correctness — Is it factually correct? (vs Ground Truth)

Each claim is compared against verified ground truth to determine if it's correct:

| Verdict | Meaning | Score (central) |
| --- | --- | --- |
| FULLY_SUPPORTED | Completely matches the ground truth | 1.0 |
| PARTIALLY_SUPPORTED | Partially correct with minor inaccuracies | 0.70 |
| CONTRADICTED | Directly contradicts the ground truth | 0.0 (critical) |
| NOT_VERIFIABLE | Cannot be verified from the available ground truth | 0.85 |
Aggregation: Geometric mean of all claim scores
→ A single CONTRADICTED claim (critical, central) drives ALL Correctness to 0.0
→ This is intentional: a critical factual error makes the response useless
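The aggregation above can be sketched in a few lines of Python. The verdict-to-score mapping follows the table; the function and constant names are illustrative, not the benchmark's actual code:

```python
import math

# Verdict → score for central claims, per the table above (illustrative names).
CORRECTNESS_SCORES = {
    "FULLY_SUPPORTED": 1.0,
    "PARTIALLY_SUPPORTED": 0.70,
    "CONTRADICTED": 0.0,      # a critical central error zeroes everything
    "NOT_VERIFIABLE": 0.85,
}

def correctness(verdicts):
    """Geometric mean of per-claim scores; 0 if any claim scores 0."""
    scores = [CORRECTNESS_SCORES[v] for v in verdicts]
    if any(s == 0.0 for s in scores):
        return 0.0
    return math.exp(sum(math.log(s) for s in scores) / len(scores))

print(round(correctness(["FULLY_SUPPORTED", "PARTIALLY_SUPPORTED"]), 3))  # 0.837
print(correctness(["FULLY_SUPPORTED", "CONTRADICTED"]))                   # 0.0
```

Note how a single CONTRADICTED claim collapses an otherwise strong response to 0.0, which is exactly the intent described above.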

⚓ Groundedness — Is it based on real data? (vs Tool Output)

Each claim is compared against what the tools actually returned, to detect hallucinations:

| Verdict | Meaning | Score (central) |
| --- | --- | --- |
| GROUNDED | Directly supported by the tool output | 1.0 |
| PARTIALLY_GROUNDED | Partially supported by the tool output | 0.70 |
| DISCLOSED_UNGROUNDED | Not in the output, but the agent explicitly presents it as general knowledge | 0.60 |
| UNGROUNDED | No basis in tool output (potentially hallucinated) | 0.0 |
Aggregation: Arithmetic mean of all claim scores
→ Unlike Correctness, geometric mean is NOT used here
→ Reason: "not verifiable from tool output" ≠ "incorrect". More detailed models generate more claims, and with geometric mean a single peripheral DISCLOSED_UNGROUNDED claim would drive the score to 0 — unfairly penalizing complete responses.
→ Arithmetic mean reflects the proportion of grounded claims, which is the fair metric.
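The contrast with Correctness is easy to see in code. Again a sketch with illustrative names, using the verdict scores from the table above:

```python
# Verdict → score for central claims, per the table above (illustrative names).
GROUNDEDNESS_SCORES = {
    "GROUNDED": 1.0,
    "PARTIALLY_GROUNDED": 0.70,
    "DISCLOSED_UNGROUNDED": 0.60,
    "UNGROUNDED": 0.0,
}

def groundedness(verdicts):
    """Arithmetic mean: the proportion-style score described above."""
    scores = [GROUNDEDNESS_SCORES[v] for v in verdicts]
    return sum(scores) / len(scores)

# One DISCLOSED_UNGROUNDED claim among four grounded ones costs little;
# a geometric mean would have dragged the whole score toward 0.60.
print(round(groundedness(["GROUNDED"] * 4 + ["DISCLOSED_UNGROUNDED"]), 3))  # 0.92
```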

⚠️ Severity — Only for CONTRADICTED or UNGROUNDED

When a claim is incorrect or ungrounded, a severity level modifies the impact:

  • critical — Wrong numbers, wrong names, wrong status. Could cause wrong decisions.
  • major — Significant factual error but less impactful.
  • minor — Small imprecision: rounding, slight paraphrase.

Example: a claim "There are 50 orders from Madrid" when ground truth says 49 → CONTRADICTED + severity "minor" (score 0.50 for central). But a claim "The top customer is John Smith" when it's "Hans Mueller" → CONTRADICTED + severity "critical" (score 0.0 for central).

📐 Format Score and 📝 Instruction Following Score

In addition to claims, the judge assigns two simple scores (0–10):

  • Format Score — Is the response well structured? Does it use markdown, tables, lists when appropriate?
  • Instruction Following Score — Did the agent respect specific constraints? (table format, number of bullets, filters, specific index…)

📊 Relevance — Proportion of central claims

Automatically calculated as the proportion of claims marked as "central" out of total claims. A high score indicates the response is concise and relevant — not much peripheral "filler".

Relevance = n_central_claims / n_total_claims
0.9 → 90% of the response is information relevant to the question
📌 Note on multi-turn: In multi-turn conversational tests, the judge evaluates the final response but has context from all previous turns. This allows evaluating whether the agent maintains coherence and remembers information from previous turns in conversations of up to 5 turns.

5. Automatically calculated metrics

In addition to the judge's claim-based evaluations, these metrics are calculated objectively and programmatically (no LLM):

⏱️ Latency Score

Based on total response time (Time To Last Token). Linear interpolation between thresholds:

< 5s → 10.0 (excellent)
5–15s → 10.0–7.0 (linear interpolation)
15–45s → 7.0–4.0 (linear interpolation)
> 45s → 4.0–1.0 (decreasing penalty)
Timeouts (≥120s) receive score ≈ 1.0 and latency is recorded as 120s (not 0s)
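A sketch of this piecewise-linear scoring. The shape of the >45s "decreasing penalty" is an assumption here (linear down to 1.0 at the 120s timeout); the Cost Score below follows the same interpolation pattern with its own thresholds:

```python
def interp(x, x0, x1, y0, y1):
    """Linear interpolation of x from [x0, x1] onto [y0, y1]."""
    t = (x - x0) / (x1 - x0)
    return y0 + t * (y1 - y0)

def latency_score(seconds):
    """Piecewise-linear latency score per the thresholds above (a sketch;
    the >45s segment is assumed linear down to 1.0 at 120s)."""
    if seconds < 5:
        return 10.0
    if seconds <= 15:
        return interp(seconds, 5, 15, 10.0, 7.0)
    if seconds <= 45:
        return interp(seconds, 15, 45, 7.0, 4.0)
    return max(1.0, interp(seconds, 45, 120, 4.0, 1.0))

print(latency_score(10))   # 8.5
print(latency_score(30))   # 5.5
print(latency_score(120))  # 1.0
```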

💰 Cost Score

Based on real cost in USD per query, obtained from the provider via OpenRouter (real provider price, not base price):

< $0.005 → 10.0 (excellent)
$0.005–$0.02 → 10.0–7.0 (interpolation)
$0.02–$0.08 → 7.0–4.0 (interpolation)
> $0.08 → 4.0–1.0 (expensive)

🔧 Tool Calling Score (Overlap)

Measures if the model used at least one of the expected tools. Based on set overlap:

expected ∩ actual ≠ ∅ → 10.0 (at least one correct tool used)
expected ∩ actual = ∅ → 0.0 (no expected tool was used)
No expected tools → 10.0 (any usage is valid)

Why overlap and not F1? In practice, smart models sometimes use additional tools (e.g., checking the mapping before searching). This is proactive and desirable behavior, not an error. The overlap approach rewards using at least one correct tool without penalizing extra tools.
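The overlap rule fits in a few lines (an illustrative sketch, not the benchmark's actual code):

```python
def tool_calling_score(expected, actual):
    """Overlap-based score: 10 if at least one expected tool was used.
    Extra (proactive) tools are never penalized."""
    expected, actual = set(expected), set(actual)
    if not expected:
        return 10.0  # no expectation: any usage is valid
    return 10.0 if expected & actual else 0.0

# A model that checks the mapping before searching still scores 10:
print(tool_calling_score(
    {"platform.core.search"},
    {"platform.core.get_index_mapping", "platform.core.search"},
))  # 10.0
```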

🚫 Error Rate Score

Penalizes technical errors in tool execution (tool timeouts, API errors, etc.):

0 errors → 10.0
Each error subtracts 3 points: score = max(0, 10 - errors × 3)

✅ Exact Answer Check

For tests with an exact numeric answer (e.g., "How many orders are there?"), it's automatically verified whether the agent's response contains the correct value:

Exact match → ✓ match
Within 0.1% → ✓ match (numeric_close)
Within 5% → ~ approximate
Out of range → ✗ no match
For integers, comparison is strict (49 ≠ 50)
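A sketch of these tolerance bands (a hypothetical helper; the real check may handle extra edge cases, such as an expected value of 0):

```python
def exact_answer_check(expected, found):
    """Tolerance bands from the list above. Integers compare strictly;
    floats allow 0.1% (match) and 5% (approximate) relative error."""
    if isinstance(expected, int) and isinstance(found, int):
        return "match" if expected == found else "no_match"
    rel = abs(found - expected) / abs(expected)
    if rel <= 0.001:
        return "match"        # numeric_close
    if rel <= 0.05:
        return "approximate"
    return "no_match"

print(exact_answer_check(49, 50))          # no_match (strict for integers)
print(exact_answer_check(1234.5, 1234.9))  # match (within 0.1%)
print(exact_answer_check(100.0, 103.0))    # approximate (within 5%)
```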

6. ★ Adjusted Overall — Weights, difficulty and failure penalty

The ★ Adjusted Overall score is the benchmark's primary metric. It's calculated in three phases: first the weighted average by metrics, then difficulty weighting, and finally failure penalty.

📊 Phase 1: Weighted metric average

Each test receives an overall_weighted based on these weights:

| Metric | Weight | Source | What it measures |
| --- | --- | --- | --- |
| 🎯 Correctness | 25% | Claims (geom.) | Is the data factually correct? |
| ⚓ Groundedness | 20% | Claims (arith.) | Is it based on tool data? |
| 🔧 Tool Calling | 15% | Programmatic | Used the correct tools? |
| ⏱️ Latency | 10% | Programmatic | Responded quickly? |
| 📝 Instruction Following | 10% | Judge (simple) | Followed instructions? |
| 🚫 Error Rate | 10% | Programmatic | Had tool execution errors? |
| 💰 Cost | 5% | Programmatic | Is it economical? |
| 📊 Relevance | 5% | Claims (ratio) | Is it concise and relevant? |
overall_weighted = Σ (weight_i × score_i)
Example: 0.25×8.5 + 0.20×9.0 + 0.15×10.0 + 0.10×4.2 + ... = 7.83
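Phase 1 can be sketched as follows (weights from the table above; names and the example scores are illustrative):

```python
# Per-metric weights from the table above (all scores on a 0-10 scale).
WEIGHTS = {
    "correctness": 0.25, "groundedness": 0.20, "tool_calling": 0.15,
    "latency": 0.10, "instruction_following": 0.10, "error_rate": 0.10,
    "cost": 0.05, "relevance": 0.05,
}

def overall_weighted(scores):
    """Weighted average of the eight per-test metrics."""
    return sum(WEIGHTS[m] * scores[m] for m in WEIGHTS)

print(round(overall_weighted({
    "correctness": 8.5, "groundedness": 9.0, "tool_calling": 10.0,
    "latency": 4.2, "instruction_following": 8.0, "error_rate": 10.0,
    "cost": 9.0, "relevance": 8.0,
}), 3))  # 8.495
```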

🎚️ Phase 2: Difficulty weighting

Harder tests weigh more in the model's global score. Each difficulty has a multiplier:

easy × 0.7    medium × 1.0    hard × 1.3    expert × 1.6

model_overall = Σ(score_i × diff_weight_i) / Σ(diff_weight_i)
→ Passing an "expert" test contributes 2.3× more than an "easy" test
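A sketch of the difficulty-weighted mean (illustrative names):

```python
# Multipliers from the difficulty table above.
DIFF_WEIGHT = {"easy": 0.7, "medium": 1.0, "hard": 1.3, "expert": 1.6}

def model_overall(results):
    """Difficulty-weighted mean of per-test scores.
    `results` is a list of (overall_weighted, difficulty) pairs."""
    num = sum(score * DIFF_WEIGHT[d] for score, d in results)
    den = sum(DIFF_WEIGHT[d] for _, d in results)
    return num / den

# An 8.0 on an expert test pulls the mean up more than a 6.0 on an easy
# test pulls it down, because the expert test weighs 1.6 vs 0.7:
print(round(model_overall([(8.0, "expert"), (6.0, "easy")]), 3))  # 7.391
```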

⚠️ Phase 3: Failure Penalty

Models with timeouts or errors receive a penalty proportional to their failure rate:

failure_penalty = pass_rate ^ severity
severity = 1.2 (configurable)

★ Adjusted Overall = model_overall × failure_penalty

Examples:
100% pass → penalty = 1.000 (no penalty)
90% pass → penalty = 0.881 (penalty -11.9%)
80% pass → penalty = 0.765 (penalty -23.5%)
70% pass → penalty = 0.652 (penalty -34.8%)

Why? A model that fails 3 out of 30 tests doesn't just lose those 3 scores — it also loses credibility as a reliable agent. The failure penalty reflects this operational risk.
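The penalty curve follows directly from the formula (a sketch):

```python
def failure_penalty(pass_rate, severity=1.2):
    """pass_rate in [0, 1]; severity > 1 punishes unreliability superlinearly."""
    return pass_rate ** severity

for rate in (1.0, 0.9, 0.8, 0.7):
    print(f"{rate:.0%} pass -> penalty {failure_penalty(rate):.3f}")
```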

Why does Correctness have the most weight (25%)? Because an agent that gives incorrect answers is useless, no matter how fast or cheap it is. Groundedness (20%) is second because an agent that hallucinates data is dangerous. Tool Calling (15%) is third because an agent that can't choose tools doesn't scale to real use cases.

7. Test categories (11 categories)

The 30 tests are grouped into 11 categories covering different agent capabilities:

🔍
Tool Usage

Basic tool usage tests: list indices, get mappings, execute filtered searches. Validates the agent knows how to interact with Elasticsearch.

4 tests · Easy–Hard
📊
Analytics (ES|QL)

Analytical queries: count documents, group by category, calculate averages with filters. The agent can use ES|QL internally via search.

3 tests · Easy–Hard
🧠
Reasoning

Multi-step reasoning: analyze a mapping then build queries, or analyze distributions and provide an interpretive summary.

2 tests · Medium–Hard
📝
Instruction Following

Strict instruction following: markdown table format, exactly 3 bullet points, specific format constraints.

2 tests · Medium
💬
Multi-turn

2–5 turn conversations where each question depends on the previous one. Evaluates context retention, error correction and progressive refinement.

3 tests · Medium–Hard
Edge Cases

Edge cases: non-existent products, vague questions without clear context, indices that don't exist. Does the agent handle errors and ambiguity correctly?

3 tests · Easy–Medium
🎯
Exact Answer

Tests with a verifiable exact numeric answer: filtered counts, amount sums, cardinality counts. Verified programmatically.

3 tests · Easy–Medium
🔀
Cross-Index

Tests requiring data correlation between benchmark-ecommerce and benchmark-customers. Requires multiple tool calls and cross-source reasoning.

2 tests · Hard–Expert
🛡️
Adversarial

Inputs designed to confuse the agent: non-existent fields, contradictory requests (status=cancelled AND delivered), impossible operations (SQL JOINs).

3 tests · Hard–Expert
🏆
Expert

Expert-level tests: complex derived calculations (top 3 by revenue with metrics), Q3 vs Q4 temporal analysis, and deep 5-turn multi-turn with final summary.

3 tests · Expert
📐
Format Strict

Strict response formats: "respond only with JSON, no markdown or explanation" or "respond only with a number, nothing else". Evaluates extreme format adherence.

2 tests · Medium–Hard
🆕 Included in v1.0: The categories Exact Answer, Cross-Index, Adversarial, Expert and Format Strict cover more demanding, realistic scenarios that better differentiate between models of different capability. Additionally, the expert-level tests (temporal analysis, deep 5-turn multi-turn, complex derived calculations) stress even the most powerful models.

8. Difficulty levels (easy → expert)

easy  Easy — 6 tests  (weight ×0.7)

Single step, single tool. Example: "List available indices", "Count how many documents there are", "Search for a non-existent product".

medium  Medium — 9 tests  (weight ×1.0)

Requires choosing the correct tool, applying filters, or following format instructions. Example: "List categories in markdown table format", "Search for orders from customers in Madrid", "Correct the error from my previous query".

hard  Hard — 10 tests  (weight ×1.3)

Multi-step, multi-turn, complex analytical queries, or adversarial inputs. Example: "Based on the mapping, write a query to find the top 3 customers", "Search for orders that are cancelled AND delivered at the same time".

expert  Expert — 5 tests  (weight ×1.6)

Temporal analysis, complex derived calculations, cross-index, 5-turn multi-turn, and impossible operations. Example: "Compare Q3 vs Q4 revenue by category", "Find orders from Gold customers crossing two indices", "Analyze cancellation trends by quarter and payment method in a 5-turn conversation".

💡 Impact of difficulty weighting: An expert test passed with an 8.0 contributes 8.0 × 1.6 = 12.8 to the numerator, while an easy test with the same score contributes 8.0 × 0.7 = 5.6. This clearly differentiates models that pass hard tests from those that only pass easy ones.

9. The 30 benchmark tests

The 30 tests run against two real indices with deterministic data: benchmark-ecommerce (1,000 e-commerce order documents) and benchmark-customers (20 customer profiles with tiers and segmentation).

The dataset is deterministic — generated by a script with a fixed seed so results are comparable between models and between runs. Each test has a detailed ground truth with verified concrete data (exact quantities, names, values), and some tests include a programmatically verifiable exact_answer.

💡 In the report: Click on any test row to expand the complete detail: the original question, all turns (in multi-turn), the ground truth, the agent's response, the claim-by-claim judge evaluation (with Correctness and Groundedness verdicts), expected vs used tools, individual scores and the Exact Answer Check if applicable.

10. Tool Calling: what it is and why it matters

What tools does the agent have?

The Elastic Agent Builder agent (elastic-ai-agent) has access to these tools:

  • platform.core.search — Search documents in Elasticsearch. Can generate and execute ES|QL queries internally.
  • platform.core.execute_esql — Execute pre-prepared ES|QL queries (doesn't write them, only executes).
  • platform.core.generate_esql — Generate an ES|QL query from natural language.
  • platform.core.get_index_mapping — View the schema/mapping of an index.
  • platform.core.list_indices — List available indices.

Important note: In practice, platform.core.search can generate and execute ES|QL internally. This means that when a test asks to "use ES|QL", the agent can use platform.core.search and still execute ES|QL correctly. Therefore, the benchmark accepts both tools as valid for ES|QL tests.

How is the Tool Calling Score calculated?

An overlap approach (not F1) is used to be fair to smart models:

  • If the model used at least one of the expected tools10.0
  • If it didn't use any of the expected tools0.0
  • Extra (proactive) tools → Not penalized. A model that checks the mapping before searching is being smart.

In the report, if you see a 0.0 with a ≠ icon, it means the model used a different tool than expected. Click on the row to see the detail: expected vs used tools and the judge's evaluation.

Additional Tool metrics

  • Tool Exec Rate — % of tool calls that executed without technical error (API errors, tool timeouts).
  • Tokens/Tool Call — Average tokens per tool call (lower = more efficient). Includes the tool response payload.
  • Tool Retries — Number of times the agent retried a failed tool (resilience indicator).

11. Efficiency metrics

💲 Quality / Dollar

How much quality do you get per dollar spent?

Quality/$ = Σ(overall_weighted per test) / total_cost_USD
Higher = better quality-price ratio. A cheap but bad model can have a high ratio; an expensive but excellent model can beat it.

⚡ Quality / Second

How much quality do you get per second of waiting?

Quality/s = Σ(overall_weighted per test) / total_wall_time_seconds
Higher = better time performance. Timeouts (recorded as 120s) penalize this metric.

📊 Dual latency

The report shows two latency values to avoid distortions:

  • Avg Latency (OK) — Average of successful tests only (no timeouts). Reflects the model's real speed when it works.
  • Avg Latency (all) — Average including timeouts (recorded as 120s instead of 0s). Reflects real operational reliability.

Timed-out tests show ≥120s in the latency column and no value in the cost column (since token generation didn't complete).

📈 Consistency (passed only)

Measures score stability across tests that completed successfully:

Std Dev σ = standard deviation of overall_weighted (successful tests only)
σ < 1.0 = very consistent | σ > 3.0 = very erratic

Consistency is shown only for successful tests so timeouts don't distort the metric. High consistency (low σ) indicates the model is predictable — you know what to expect.

🛡️ Reliability Score

Synthetic score combining pass rate, consistency and error absence:

reliability = (pass_rate × 0.5) + (consistency_score × 0.3) + (error_absence × 0.2)
A model that passes all tests, is consistent, and has no errors gets a reliability close to 10.
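As a sketch (assuming all three components are normalized to the same 0–10 scale, which the "close to 10" example implies but the report doesn't state explicitly):

```python
def reliability(pass_rate, consistency_score, error_absence):
    """Weighted blend per the formula above. All components are assumed
    to be on a 0-10 scale (an assumption, not stated in the report)."""
    return pass_rate * 0.5 + consistency_score * 0.3 + error_absence * 0.2

# Full pass rate, good consistency, no errors -> close to 10:
print(round(reliability(10.0, 9.0, 10.0), 2))  # 9.7
```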

12. How to interpret the report

📊 Comparison table

The main table orders models by ★ Adjusted Overall. Click on a model name to jump to its detailed card. Columns show:

  • ★ Adjusted — Final score adjusted by difficulty and failure penalty (0–10)
  • Correct — Claim-based Correctness (geometric mean)
  • Ground — Claim-based Groundedness (arithmetic mean)
  • Tool — Tool calling score (overlap)
  • Lat (OK/all) — Average latency (successful only / including timeouts)
  • Cost — Total cost of all tests
  • Pass — Tests completed without timeout or error

🤖 Model cards

Each model has a detailed card with:

  • Failure badges — ⏱ Timeouts (amber), 🔧 Tool mismatch (purple), ❌ Errors (red) — with specific counts
  • Main KPIs — ★ Adjusted Overall, Pass Rate %, Avg Latency (OK and all), Total Cost
  • Score bars — Visual breakdown of each metric: Correctness (geom.), Groundedness (arith.), Relevance, Tool Calling, Latency, Cost, Instruction Following, Error Rate
  • Efficiency — Quality/$, Quality/s, Tokens/tool, with formulas visible alongside
  • Consistency — Std Dev σ, Min, Max, Median (successful tests only). Shows if the model is predictable or erratic
  • Test table — Each test with all scores, difficulty, and status. Click to expand full detail

🔎 Test detail panel

Clicking any test row expands a panel with:

  • Original question — The question asked to the agent (all turns in multi-turn)
  • Ground Truth — The expected correct answer with concrete data
  • Agent response — What it answered (complete)
  • Judge evaluation — GPT-5.2's complete reasoning
  • Claims Analysis — Each individual claim with Correctness and Groundedness verdicts, centrality and explanation
  • Tools — Expected vs Used, with ≠ icon if they don't match
  • Exact Answer Check — If applicable, expected vs found value
  • 🔗 Permalink — Direct link to each test for sharing

📈 Charts

  • 🎯 Overall Score Comparison — Horizontal bars per model with ★ Adjusted Overall. Quick visual comparison.
  • 🕸️ Multi-Dimensional Radar — 10 dimensions: Correctness, Groundedness, Tool Calling, Latency, Cost, Instruction Following, Error Rate, Relevance, Format, Reliability. Shows each model's complete profile.
  • 📂 Score by Category — Which model is best at analytics? At cross-index? At adversarial?
  • 🎚️ Score by Difficulty — How does performance scale from easy → medium → hard → expert?
  • ⚡ Latency vs Quality — Scatter plot (successful tests only). Ideal: top-left (fast and good).

Have questions about any result?

Every score has complete context. Click on any test to see the question, turns, ground truth, agent's response, individual claims, judge's evaluation and tools used. You can share any individual test with its permalink.

View the results report →