Agentic AI Evaluation Framework for Elastic Agent Builder 9.3 — Complete methodology: evaluation process, claim decomposition, metrics and scoring explained step by step
AgentBench for Elastic is an Agentic AI Evaluation Framework — a specialized AI agent evaluation framework designed specifically for Elastic Agent Builder 9.3. It's not a generic language model benchmark: here we evaluate how each LLM behaves as a real agent, with real tools, real data, and real production constraints.
We use Claim Decomposition — an advanced evaluation system that breaks down each response into atomic claims and evaluates them individually for Correctness (geometric mean) and Groundedness (arithmetic mean). We measure tool calling, multi-step reasoning, adversarial input resistance, multi-turn context retention, accuracy, reliability, latency, and cost across 30 real tests against deterministic data.
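The two aggregation rules above (geometric mean for Correctness, arithmetic mean for Groundedness) can be sketched in a few lines. The per-claim scores and the clamping floor below are illustrative assumptions, not the framework's exact implementation:

```python
import math

# Hypothetical per-claim scores (0.0-1.0) produced by the judge.
correctness_scores = [1.0, 0.85, 0.70]
groundedness_scores = [1.0, 0.60, 1.0]

def geometric_mean(scores, floor=0.01):
    # A small floor keeps a single 0.0 claim from zeroing the whole product;
    # whether the real framework clamps this way is an assumption.
    clamped = [max(s, floor) for s in scores]
    return math.exp(sum(math.log(s) for s in clamped) / len(clamped))

def arithmetic_mean(scores):
    return sum(scores) / len(scores)

# The geometric mean punishes a single bad claim much harder than the
# arithmetic mean, which is why it is used for Correctness.
correctness = geometric_mean(correctness_scores)
groundedness = arithmetic_mean(groundedness_scores)
print(f"correctness={correctness:.3f} groundedness={groundedness:.3f}")
```

The design choice matters: one contradicted claim drags a geometric mean toward zero, while an arithmetic mean merely dilutes it.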
AgentBench for Elastic (v1.0) is an end-to-end evaluation framework that measures how different language models (LLMs) perform as AI agents within Elastic Agent Builder 9.3. Unlike generic benchmarks (MMLU, HumanEval, etc.) that evaluate isolated model capabilities, AgentBench evaluates real agentic performance — the model integrated into a production stack, with real tools, data, and constraints. We measure:
The benchmark runs against two real indices: benchmark-ecommerce (1,000 e-commerce order documents) and benchmark-customers (20 customer profiles with tiers and segmentation).
No simulated data or mocks — the agent interacts with real Elasticsearch through the Agent Builder API.
The dataset is deterministic: always the same data so results are comparable between models and between runs.
Each test follows this 6-step flow:
The agent is invoked through the Agent Builder API (POST /api/agent_builder/converse). In multi-turn tests, multiple turns are sent within the same conversation. Each result is stored with the model field for historical tracking.

Evaluating AI responses isn't trivial: you can't compare strings, because an answer can be correct in many different ways. We use OpenAI GPT-5.2 as "LLM-as-Judge" because:
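The invocation step can be sketched as a minimal HTTP call. The host and the payload field names (`input`, `agent_id`, `conversation_id`) are hypothetical placeholders for illustration; consult the Agent Builder API reference for the actual schema:

```python
import json
import urllib.request

KIBANA_URL = "http://localhost:5601"  # hypothetical host

def build_payload(message, conversation_id=None):
    # Field names here are assumptions, not the documented API contract.
    body = {"input": message, "agent_id": "elastic-ai-agent"}
    if conversation_id:
        # Reusing the conversation id is how multi-turn tests share context.
        body["conversation_id"] = conversation_id
    return body

def converse(message, conversation_id=None):
    req = urllib.request.Request(
        f"{KIBANA_URL}/api/agent_builder/converse",
        data=json.dumps(build_payload(message, conversation_id)).encode(),
        headers={"Content-Type": "application/json", "kbn-xsrf": "true"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```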
For each test, GPT-5.2 receives a structured prompt with:
Instead of requesting generic global scores, the judge decomposes the response into atomic claims — each individual factual assertion is evaluated separately. This enables detecting:
Each claim receives two independent evaluations: Correctness (compared against ground truth) and Groundedness (compared against actual tool output). This detects both factual errors and hallucinations.
When tool output is truncated (marked ⚠️ TRUNCATED), the judge does not penalize data it can't verify: it only penalizes information that clearly contradicts what the tools returned. Additionally, if the agent presents information as general knowledge (without attributing it to tools), it's classified as DISCLOSED_UNGROUNDED and penalized less than a direct hallucination.
The judge does not assign generic global scores. Instead, it decomposes the response into atomic claims (individual factual assertions) and evaluates each one across two independent dimensions. It also gives simple scores for format and instruction following.
Each claim is classified as:
Each claim is compared against verified ground truth to determine if it's correct:
| Verdict | Meaning | Score (central) |
|---|---|---|
| FULLY_SUPPORTED | Completely matches the ground truth | 1.0 |
| PARTIALLY_SUPPORTED | Partially correct with minor inaccuracies | 0.70 |
| CONTRADICTED | Directly contradicts the ground truth | 0.0 (critical) |
| NOT_VERIFIABLE | Cannot be verified from the available ground truth | 0.85 |
Each claim is compared against what the tools actually returned, to detect hallucinations:
| Verdict | Meaning | Score (central) |
|---|---|---|
| GROUNDED | Directly supported by the tool output | 1.0 |
| PARTIALLY_GROUNDED | Partially supported by the tool output | 0.70 |
| DISCLOSED_UNGROUNDED | Not in the output but agent explicitly presents it as general knowledge | 0.60 |
| UNGROUNDED | No basis in tool output (potentially hallucinated) | 0.0 |
When a claim is incorrect or ungrounded, a severity level modifies the impact:
Example: a claim "There are 50 orders from Madrid" when ground truth says 49 → CONTRADICTED + severity "minor" (score 0.50 for central). But a claim "The top customer is John Smith" when it's "Hans Mueller" → CONTRADICTED + severity "critical" (score 0.0 for central).
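The verdict table plus the severity modifier can be sketched as a simple lookup. The values for "minor" and "critical" follow the worked example above; the value for a hypothetical "major" severity is an assumption:

```python
# Central scores from the Correctness verdict table.
CORRECTNESS_SCORE = {
    "FULLY_SUPPORTED": 1.0,
    "PARTIALLY_SUPPORTED": 0.70,
    "NOT_VERIFIABLE": 0.85,
    "CONTRADICTED": 0.0,
}

def claim_score(verdict, severity=None):
    # Severity only softens CONTRADICTED claims; the "major" value is
    # an illustrative assumption between the two documented extremes.
    if verdict == "CONTRADICTED":
        return {"minor": 0.50, "major": 0.25, "critical": 0.0}.get(severity, 0.0)
    return CORRECTNESS_SCORE[verdict]

print(claim_score("CONTRADICTED", "minor"))     # the "49 vs 50" example
print(claim_score("CONTRADICTED", "critical"))  # the wrong-name example
```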
In addition to claims, the judge assigns two simple scores (0–10):
Automatically calculated as the proportion of claims marked as "central" out of total claims. A high score indicates the response is concise and relevant — not much peripheral "filler".
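A minimal sketch of the relevance ratio, using hypothetical claims tagged by the judge:

```python
# Relevance = central claims / total claims, per the description above.
claims = [
    {"text": "There are 49 orders from Madrid", "importance": "central"},
    {"text": "Madrid is the capital of Spain", "importance": "peripheral"},
    {"text": "The top customer is Hans Mueller", "importance": "central"},
]

central = sum(1 for c in claims if c["importance"] == "central")
relevance = central / len(claims)
print(f"relevance={relevance:.2f}")  # 2 of 3 claims are central
```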
In addition to the judge's claim-based evaluations, these metrics are calculated objectively and programmatically (no LLM):
Based on total response time (Time To Last Token). Linear interpolation between thresholds:
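A sketch of the linear interpolation, assuming full score at or below 5 s and zero at the 120 s benchmark timeout. The exact threshold values are assumptions; the source states only that interpolation is linear:

```python
FAST_S, SLOW_S = 5.0, 120.0  # assumed thresholds (120 s matches the timeout)

def latency_score(ttlt_seconds):
    # Time To Last Token below FAST_S earns full credit, above SLOW_S none;
    # in between, the score falls off linearly.
    if ttlt_seconds <= FAST_S:
        return 1.0
    if ttlt_seconds >= SLOW_S:
        return 0.0
    return 1.0 - (ttlt_seconds - FAST_S) / (SLOW_S - FAST_S)
```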
Based on real cost in USD per query, obtained from the provider via OpenRouter (real provider price, not base price):
Measures if the model used at least one of the expected tools. Based on set overlap:
Why overlap and not F1? In practice, smart models sometimes use additional tools (e.g., checking the mapping before searching). This is proactive and desirable behavior, not an error. The overlap approach rewards using at least one correct tool without penalizing extra tools.
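The overlap rule reduces to a set intersection. The binary 1.0/0.0 outcome here is an assumption consistent with the "0.0 with a ≠ icon" behavior described later in the report section:

```python
def tool_score(expected, used):
    # Full credit if at least one expected tool was used; extra tools
    # (e.g. a proactive mapping check) are never penalized.
    return 1.0 if set(expected) & set(used) else 0.0

# A smart model checking the mapping before searching still scores 1.0:
print(tool_score(
    expected=["platform.core.search"],
    used=["platform.core.get_index_mapping", "platform.core.search"],
))
```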
Penalizes technical errors in tool execution (tool timeouts, API errors, etc.):
For tests with an exact numeric answer (e.g., "How many orders are there?"), it's automatically verified whether the agent's response contains the correct value:
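A sketch of the programmatic exact-answer check. Tolerating thousands separators is an assumption about the normalization step:

```python
import re

def contains_exact_answer(response, expected):
    # Strip thousands separators, then compare every number in the text
    # against the expected value.
    normalized = response.replace(",", "")
    numbers = re.findall(r"-?\d+(?:\.\d+)?", normalized)
    return any(float(n) == float(expected) for n in numbers)

print(contains_exact_answer("There are 1,000 orders in total.", 1000))  # True
```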
The ★ Adjusted Overall score is the benchmark's primary metric. It's calculated in three phases: first the weighted average by metrics, then difficulty weighting, and finally failure penalty.
Each test receives an overall_weighted based on these weights:
| Metric | Weight | Source | What it measures |
|---|---|---|---|
| 🎯 Correctness | 25% | Claims (geom.) | Is the data factually correct? |
| ⚓ Groundedness | 20% | Claims (arith.) | Is it based on tool data? |
| 🔧 Tool Calling | 15% | Programmatic | Used the correct tools? |
| ⏱️ Latency | 10% | Programmatic | Responded quickly? |
| 📝 Instruction Following | 10% | Judge (simple) | Followed instructions? |
| 🚫 Error Rate | 10% | Programmatic | Had tool execution errors? |
| 💰 Cost | 5% | Programmatic | Is it economical? |
| 📊 Relevance | 5% | Claims (ratio) | Is it concise and relevant? |
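Under these weights, a per-test overall_weighted is a straightforward weighted sum. In this sketch each metric score is normalized to 0-1:

```python
# Weights from the table above; they sum to 1.0.
WEIGHTS = {
    "correctness": 0.25, "groundedness": 0.20, "tool_calling": 0.15,
    "latency": 0.10, "instruction_following": 0.10, "error_rate": 0.10,
    "cost": 0.05, "relevance": 0.05,
}

def overall_weighted(scores):
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(WEIGHTS[m] * scores[m] for m in WEIGHTS)

# A hypothetical test: perfect everywhere except latency and cost.
example = {m: 1.0 for m in WEIGHTS} | {"latency": 0.5, "cost": 0.8}
print(f"{overall_weighted(example):.3f}")  # 0.940
```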
Harder tests weigh more in the model's global score. Each difficulty has a multiplier:
Models with timeouts or errors receive a penalty proportional to their failure rate:
Why? A model that fails 3 out of 30 tests doesn't just lose those 3 scores — it also loses credibility as a reliable agent. The failure penalty reflects this operational risk.
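A sketch of a penalty proportional to the failure rate. The strength constant `k` is purely illustrative; the source states only that the penalty is proportional:

```python
def apply_failure_penalty(score, failed, total, k=0.5):
    # k is an assumed constant controlling how hard failures bite.
    failure_rate = failed / total
    return score * (1.0 - k * failure_rate)

# The "3 out of 30 tests" case from the text, applied to a score of 8.0:
print(f"{apply_failure_penalty(8.0, failed=3, total=30):.2f}")  # 7.60
```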
The 30 tests are grouped into 11 categories covering different agent capabilities:
Basic tool usage tests: list indices, get mappings, execute filtered searches. Validates the agent knows how to interact with Elasticsearch.
Analytical queries: count documents, group by category, calculate averages with filters. The agent can use ES|QL internally via search.
Multi-step reasoning: analyze a mapping then build queries, or analyze distributions and provide an interpretive summary.
Strict instruction following: markdown table format, exactly 3 bullet points, specific format constraints.
2–5 turn conversations where each question depends on the previous one. Evaluates context retention, error correction and progressive refinement.
Edge cases: non-existent products, vague questions without clear context, indices that don't exist. Does the agent handle errors and ambiguity correctly?
Tests with a verifiable exact numeric answer: filtered counts, amount sums, cardinality counts. Verified programmatically.
Tests requiring data correlation between benchmark-ecommerce and benchmark-customers. Requires multiple tool calls and cross-source reasoning.
Inputs designed to confuse the agent: non-existent fields, contradictory requests (status=cancelled AND delivered), impossible operations (SQL JOINs).
Expert-level tests: complex derived calculations (top 3 by revenue with metrics), Q3 vs Q4 temporal analysis, and deep 5-turn multi-turn with final summary.
Strict response formats: "respond only with JSON, no markdown or explanation" or "respond only with a number, nothing else". Evaluates extreme format adherence.
Single step, single tool. Example: "List available indices", "Count how many documents there are", "Search for a non-existent product".
Requires choosing the correct tool, applying filters, or following format instructions. Example: "List categories in markdown table format", "Search for orders from customers in Madrid", "Correct the error from my previous query".
Multi-step, multi-turn, complex analytical queries, or adversarial inputs. Example: "Based on the mapping, write a query to find the top 3 customers", "Search for orders that are cancelled AND delivered at the same time".
Temporal analysis, complex derived calculations, cross-index, 5-turn multi-turn, and impossible operations. Example: "Compare Q3 vs Q4 revenue by category", "Find orders from Gold customers crossing two indices", "Analyze cancellation trends by quarter and payment method in a 5-turn conversation".
For example, with a ×1.6 multiplier a hard test scored 8.0 contributes 8.0 × 1.6 = 12.8 to the numerator, while an easy test with the same score and a ×0.7 multiplier contributes only 8.0 × 0.7 = 5.6. This clearly differentiates models that pass hard tests from those that only pass easy ones.
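The difficulty weighting amounts to a weighted average with the multipliers as weights. A sketch using the ×1.6 and ×0.7 multipliers from the example above:

```python
def difficulty_weighted(tests):
    # tests: list of (score, difficulty_multiplier) pairs.
    numerator = sum(score * mult for score, mult in tests)
    denominator = sum(mult for _, mult in tests)
    return numerator / denominator

# A hard test scored 8.0 and an easy test scored 4.0: the hard test
# dominates, pulling the average well above the plain mean of 6.0.
print(f"{difficulty_weighted([(8.0, 1.6), (4.0, 0.7)]):.2f}")  # 6.78
```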
The 30 tests run against two real indices with deterministic data:
- benchmark-ecommerce — 1,000 e-commerce order documents (20 customers, 6 product categories, 6 statuses, prices, dates, payment methods)
- benchmark-customers — 20 customer profiles with name, email, city, country, loyalty tier (Standard, Premium, Gold, Platinum) and registration date

The dataset is deterministic — generated by a script with a fixed seed so results are comparable between models and between runs. Each test has a detailed ground truth with verified concrete data (exact quantities, names, values), and some tests include a programmatically verifiable exact_answer.
The Elastic Agent Builder agent (elastic-ai-agent) has access to these tools:
- platform.core.search — Search documents in Elasticsearch. Can generate and execute ES|QL queries internally.
- platform.core.execute_esql — Execute pre-prepared ES|QL queries (doesn't write them, only executes).
- platform.core.generate_esql — Generate an ES|QL query from natural language.
- platform.core.get_index_mapping — View the schema/mapping of an index.
- platform.core.list_indices — List available indices.
Important note: In practice, platform.core.search can generate and execute ES|QL internally.
This means that when a test asks to "use ES|QL", the agent can use platform.core.search and still execute ES|QL correctly.
Therefore, the benchmark accepts both tools as valid for ES|QL tests.
An overlap approach (not F1) is used to be fair to smart models:
In the report, if you see a 0.0 with a ≠ icon, it means the model used a different tool than expected. Click on the row to see the detail: expected vs used tools and the judge's evaluation.
How much quality do you get per dollar spent?
How much quality do you get per second of waiting?
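Both efficiency questions reduce to simple ratios. Dividing the overall score by cost or latency is an assumed formulation of these metrics:

```python
def quality_per_dollar(overall, cost_usd):
    # Guard against free/zero-cost runs; treating them as infinite
    # efficiency is an assumption.
    return overall / cost_usd if cost_usd else float("inf")

def quality_per_second(overall, latency_s):
    return overall / latency_s if latency_s else float("inf")

# Hypothetical model: overall 8.5, $0.02 per query, 12 s per response.
print(f"{quality_per_dollar(8.5, 0.02):.1f}")   # 425.0
print(f"{quality_per_second(8.5, 12.0):.2f}")   # 0.71
```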
The report shows two latency values to avoid distortions:
Timed-out tests show ≥120s in the latency column and "—" in the cost column (since token generation didn't complete).
Measures score stability across tests that completed successfully:
Consistency is shown only for successful tests so timeouts don't distort the metric. High consistency (low σ) indicates the model is predictable — you know what to expect.
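A sketch of the consistency computation, excluding failed tests before taking the standard deviation (population vs sample standard deviation is an assumption):

```python
import statistics

# Hypothetical per-test results; the timed-out test is excluded so it
# doesn't distort the metric.
results = [
    {"score": 8.1, "status": "ok"},
    {"score": 7.9, "status": "ok"},
    {"score": 0.0, "status": "timeout"},  # excluded
    {"score": 8.0, "status": "ok"},
]

scores = [r["score"] for r in results if r["status"] == "ok"]
sigma = statistics.pstdev(scores)  # population std dev (assumed variant)
print(f"σ={sigma:.3f} over {len(scores)} successful tests")
```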
Synthetic score combining pass rate, consistency and error absence:
The main table orders models by ★ Adjusted Overall. Click on a model name to jump to its detailed card. Columns show:
Each model has a detailed card with:
Clicking any test row expands a panel with:
Have questions about any result?
Every score has complete context. Click on any test to see the question, turns, ground truth, agent's response, individual claims, judge's evaluation and tools used. You can share any individual test with its permalink.
View the results report →