Building an Evaluation Harness for Production AI Agents: A 12-Metric Framework From 100+ Deployments

During an AI deployment, our client’s compliance officer asked us a question we couldn’t answer.

“How do you know your agent isn’t hallucinating patient symptoms?”

We had unit tests. We had integration tests. We had a model that performed beautifully on the demo dataset. What we didn’t have was an evaluation harness that could measure hallucination rate, context faithfulness, or tool-selection accuracy in production.

That gap nearly killed the project. Six weeks later, we had a 12-metric evaluation framework running against every agent response, every tool call, every retrieval operation. The compliance team signed off. The agent shipped.

Across the 100+ enterprise AI agent deployments we’ve shipped since then, that framework has evolved into the playbook below. If you’re building production AI agents, this is the evaluation harness we wish we’d had on day one.

The 12-Metric Framework at a Glance

Category	Metric	What It Measures	Critical Threshold
Retrieval	Context Relevance	Are retrieved chunks relevant to the query?	>0.85
Retrieval	Context Recall	Did we retrieve all relevant information available?	>0.90
Retrieval	Context Precision	Are top-ranked chunks the most relevant?	>0.80
Retrieval	Retrieval Latency	How fast did the retrieval complete?	<200ms p95
Generation	Answer Faithfulness	Does the answer match retrieved context?	>0.95
Generation	Answer Relevance	Does the answer address the user’s question?	>0.90
Generation	Hallucination Rate	How often does the model invent facts?	<2%
Agent	Tool Selection Accuracy	Did the agent pick the right tool?	>0.92
Agent	Tool Execution Success	Did tool calls succeed?	>0.98
Agent	Multi-Step Coherence	Did the agent maintain logical flow?	>0.85
Production	Cost per Query	Token + infra cost per request	<$0.05 typical
Production	P99 Latency	End-to-end response time	<3s

Three categories cover the agent’s internal operations (retrieval, generation, and agent behavior). The fourth category measures what production cares about (cost and latency). Skip any one of these categories at your own risk.
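One way to make the table actionable is to encode the thresholds as data and check measured values against them. The following is a minimal sketch, not the harness described in this article: the metric keys, the THRESHOLDS layout, and the find_breaches function are illustrative names, and only the numeric limits come from the table.

```python
# Thresholds copied from the table above. The key names, the dict layout,
# and the function name are illustrative assumptions.
THRESHOLDS = {
    "context_relevance": (">", 0.85),
    "context_recall": (">", 0.90),
    "context_precision": (">", 0.80),
    "retrieval_latency_p95_ms": ("<", 200),
    "answer_faithfulness": (">", 0.95),
    "answer_relevance": (">", 0.90),
    "hallucination_rate": ("<", 0.02),
    "tool_selection_accuracy": (">", 0.92),
    "tool_execution_success": (">", 0.98),
    "multi_step_coherence": (">", 0.85),
    "cost_per_query_usd": ("<", 0.05),
    "p99_latency_s": ("<", 3.0),
}


def find_breaches(measured: dict[str, float]) -> list[str]:
    """Return the names of measured metrics that miss their threshold."""
    breaches = []
    for name, value in measured.items():
        direction, limit = THRESHOLDS[name]
        ok = value > limit if direction == ">" else value < limit
        if not ok:
            breaches.append(name)
    return breaches


if __name__ == "__main__":
    sample = {
        "answer_faithfulness": 0.97,
        "hallucination_rate": 0.031,
        "p99_latency_s": 2.4,
    }
    print(find_breaches(sample))  # ['hallucination_rate']
```

Keeping the limits in one place means the same table can drive both CI checks and production alerts.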

Why Most Teams Skip Evaluation (and Pay for It Later)

Across the projects we’ve audited, three patterns explain why teams ship AI agents without proper evaluation infrastructure.

Pattern 1: “We’ll add evaluation after the MVP.”

This is the most common and most expensive pattern. By the time the MVP ships, the team has built a UI, an API, and integrations, and has onboarded customers. Now they have to add evaluation infrastructure to a system that’s already in production, with users sending unpredictable queries. The retrofit takes 4-6 weeks. The data collection lag means they can’t catch a regression for days. By then, the trust damage is done.

Pattern 2: “Accuracy is enough.”

Accuracy on a held-out test set is necessary but not sufficient. A RAG agent can have 95% accuracy on benchmark questions and still hallucinate 30% of the time on real user queries that fall outside the benchmark distribution. Production traffic is always different from your eval set. Without faithfulness, hallucination rate, and tool-selection metrics, you’re flying blind.

Pattern 3: “Manual spot-checks are fine.”

Manual review works at 100 queries per day. It breaks at 10,000. The teams that try to scale manual review either burn out their engineers or accept that they’re not actually reviewing the volume they claim to. Automated evaluation isn’t optional once you cross a few thousand queries per day.

The framework below addresses all three patterns. Build it before you ship, instrument every layer, and let the metrics tell you what your manual reviews can’t.

For teams building AI agents for business automation, the evaluation harness often determines whether the project ships to production at all.

The 12-Metric Framework

The framework groups 12 metrics into four categories. Each category answers a different question about how your agent is performing.

Category 1: Retrieval Metrics (4)

If your agent uses retrieval (RAG, knowledge base lookup, document search), retrieval quality is the foundation. Bad retrieval upstream means no amount of clever prompting downstream can save the response.

1. Context Relevance

What it measures: What fraction of the retrieved chunks are actually relevant to the user’s query?

Why it matters: Most RAG failures we see in production trace back to retrieval rather than generation. The model can only work with what you feed it. If you retrieve 10 chunks and only 3 are relevant, you’ve polluted the context and forced the model to filter signal from noise.

How we measure it: For each query, an LLM-as-judge evaluator scores each retrieved chunk on a 0-1 relevance scale relative to the query. We average across the top-k retrieved chunks.

Target threshold: >0.85 average relevance across top-10 chunks. Below 0.7 indicates a retrieval problem worth investigating before chasing model improvements.

Production note: When we see context relevance drop below 0.75 in production, the cause is almost always one of three things: index drift (new documents not chunked properly), query intent shift (users asking different questions than the eval set), or chunking strategy mismatch (chunks too large or too small for the query type).

2. Context Recall

What it measures: Did we retrieve ALL the information needed to answer the query, or did we miss relevant chunks?

Why it matters: Recall is the silent killer of RAG systems. Low recall means the answer is incomplete or wrong, but the model has no way to signal “I don’t have enough context.” It will confidently generate from partial information.

How we measure it: This requires a labeled eval set in which human evaluators have identified all chunks containing information relevant to a benchmark query. We then compute the fraction of those “ground truth relevant” chunks that our retrieval actually returned.

Target threshold: >0.90 recall on benchmark queries. Below 0.80 means you’re systematically missing information, which leads to confident-but-wrong answers.

Production note: Recall drops are usually a symptom of an embedding model mismatch (your embedding model isn’t capturing the semantics of your domain) or of chunk-size issues (information is split across chunks in ways that defeat similarity search). The fix is often re-chunking, not re-modeling.
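The recall computation described above is a set intersection per query, averaged over the eval set. This is a minimal sketch assuming each eval row carries a list of retrieved chunk IDs and a set of human-labeled relevant chunk IDs; the field names "retrieved" and "relevant" are hypothetical.

```python
def context_recall(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Fraction of ground-truth relevant chunks that retrieval returned."""
    if not relevant_ids:
        raise ValueError("each benchmark query needs at least one labeled chunk")
    found = relevant_ids.intersection(retrieved_ids)
    return len(found) / len(relevant_ids)


def mean_recall(eval_set: list[dict]) -> float:
    """Average per-query recall over a labeled eval set."""
    scores = [context_recall(row["retrieved"], row["relevant"]) for row in eval_set]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical chunk IDs, for illustration only.
    eval_set = [
        {"retrieved": ["c1", "c2", "c9"], "relevant": {"c1", "c2", "c3"}},  # 2 of 3
        {"retrieved": ["c4", "c7"], "relevant": {"c4"}},                    # 1 of 1
    ]
    print(round(mean_recall(eval_set), 3))  # 0.833
```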

3. Context Precision

What it measures: Of the retrieved chunks, are the most relevant ones ranked at the top?

Why it matters: Most production RAG systems pass only the top 3-5 chunks to the LLM context window due to token budgets. If your top-1 chunk is irrelevant but the relevant one is at position 7, you’ve effectively retrieved nothing useful.

How we measure it: We compute Mean Reciprocal Rank (MRR): the average, across queries, of the reciprocal of the rank at which the first relevant chunk appears in the ranked retrieval results.

Target threshold: MRR >0.80, meaning your first relevant chunk should be in position 1 or 2 most of the time.

Production note: Precision improves dramatically when you add a reranker after the initial vector search. We’ve seen MRR jump from 0.55 to 0.92 by adding a BGE reranker on top of pgvector retrieval. The latency cost is ~50ms; the precision gain is worth it.
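For reference, MRR can be computed from per-query relevance flags in ranked order. This sketch assumes the flags have already been produced by labels or a judge; the function names are illustrative.

```python
def reciprocal_rank(relevance_flags: list[bool]) -> float:
    """1/rank of the first relevant chunk, or 0.0 if none is relevant."""
    for rank, is_relevant in enumerate(relevance_flags, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(results: list[list[bool]]) -> float:
    """Average reciprocal rank over all queries."""
    return sum(reciprocal_rank(flags) for flags in results) / len(results)


if __name__ == "__main__":
    ranked_results = [
        [True, False, False],         # first relevant chunk at rank 1 -> 1.0
        [False, True, False],         # rank 2 -> 0.5
        [False, False, False, True],  # rank 4 -> 0.25
    ]
    print(round(mean_reciprocal_rank(ranked_results), 3))  # 0.583
```

A query whose first relevant chunk sits at position 7 contributes only 1/7 (about 0.14), which is why a single badly ranked query pulls the average down sharply.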

4. Retrieval Latency

What it measures: Time from query receipt to when retrieved chunks are ready, measured at p95.

Why it matters: End-to-end agent response time is dominated by retrieval at scale. If retrieval takes 800ms, the user waits 800ms before the LLM even starts thinking.

How we measure it: Standard application performance monitoring on the retrieval service. We log the retrieval time for every query and report p50, p95, and p99.

Target threshold: p95 retrieval latency <200ms. p99 <500ms.

Production note: Latency spikes usually correlate with one of the following: index size growth without re-tuning HNSW parameters, network hops between the embedding service and the vector DB, or cold-start cache misses. Check which of the three applies before assuming you need a faster vector DB.
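If you are computing percentiles from raw logs rather than reading them off an APM dashboard, a nearest-rank percentile is enough for a first pass. This is a sketch under that assumption; APM tools may use interpolated percentiles that differ slightly, and the function names are illustrative.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_report(latencies_ms: list[float]) -> dict[str, float]:
    """p50, p95, and p99 for one batch of logged retrieval latencies."""
    return {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}


if __name__ == "__main__":
    # Synthetic latencies in milliseconds, for illustration only.
    logged = [120.0] * 94 + [180.0] * 4 + [450.0, 900.0]
    report = latency_report(logged)
    print(report)  # {'p50': 120.0, 'p95': 180.0, 'p99': 450.0}
    print(report["p95"] < 200 and report["p99"] < 500)  # True
```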

Category 2: Generation Metrics (3)

Once the right context is retrieved, the quality of the generation determines whether the user receives a useful response. Three metrics matter here.

5. Answer Faithfulness

What it measures: Does the generated answer accurately reflect the retrieved context, or does it contradict or fabricate information?

Why it matters: This is the most important metric for any AI agent serving regulated industries. An unfaithful answer in healthcare, fintech, or legal contexts is a compliance failure. Even outside regulation, faithfulness directly determines user trust.

How we measure it: For each generated answer, an LLM-as-judge evaluator extracts atomic claims from the answer, then checks each claim against the retrieved context. The faithfulness score is the fraction of claims supported by the context.

Target threshold: >0.95 faithfulness in regulated industries. >0.90 in general use cases. Anything below 0.85 needs immediate investigation.

Production note: Faithfulness drops usually indicate one of three causes: temperature settings too high (turn it down to 0.0-0.3 for production), context window overflow (your retrieved chunks plus prompt exceed context limits and the model hallucinates from training data), or a prompt template that encourages extrapolation (“Based on the context, what do you think…”).
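The scoring step can be sketched independently of any particular judge model. In the example below, claim extraction is assumed to have happened already, and substring_judge is a deliberately naive stand-in for the LLM-as-judge call; both function names are hypothetical.

```python
from typing import Callable


def faithfulness_score(
    claims: list[str],
    context: str,
    is_supported: Callable[[str, str], bool],
) -> float:
    """Fraction of atomic claims that the judge marks as supported."""
    if not claims:
        # Convention chosen for this sketch: an answer with no claims is
        # treated as fully faithful.
        return 1.0
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)


def substring_judge(claim: str, context: str) -> bool:
    """Placeholder for an LLM judge: case-insensitive containment only."""
    return claim.lower() in context.lower()


if __name__ == "__main__":
    context = ("Metformin is taken twice daily with meals. "
               "Common side effects include nausea.")
    claims = [
        "metformin is taken twice daily",
        "common side effects include nausea",
        "metformin cures diabetes",  # not in the context
    ]
    print(round(faithfulness_score(claims, context, substring_judge), 3))  # 0.667
```

Passing the judge in as a callable keeps the arithmetic testable without network calls and lets you swap judge models later.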

6. Answer Relevance

What it measures: Does the generated answer actually address what the user asked, or does it wander off-topic?

Why it matters: Relevance is distinct from faithfulness. An answer can be 100% faithful to context yet not address the user’s actual question. Both metrics must be high simultaneously for a good response.

How we measure it: An LLM-as-judge evaluator generates 3-5 questions that the answer would be a good response to, then computes semantic similarity between those generated questions and the original user query.

Target threshold: >0.90 relevance. Below 0.80, the agent is answering adjacent questions, not the user’s question.

Production note: Relevance issues often trace back to query rewriting steps in agentic flows. If your agent rewrites “How do I cancel my subscription?” into “What’s the cancellation policy?” and then answers the rewritten query, the original intent gets lost.
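The similarity step can be sketched as an average cosine similarity between the user query and the judge-generated questions. The embed function here is a toy bag-of-words vector used only so the example runs without a model; in practice you would substitute a real embedding model, and the generated questions would come from the judge.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def answer_relevance(user_query: str, generated_questions: list[str]) -> float:
    """Mean similarity between the user query and the questions the judge
    says the answer would be a good response to."""
    query_vec = embed(user_query)
    sims = [cosine(query_vec, embed(q)) for q in generated_questions]
    return sum(sims) / len(sims)


if __name__ == "__main__":
    query = "How do I cancel my subscription?"
    on_topic = ["How do I cancel my subscription?", "How can I cancel a subscription?"]
    drifted = ["What's the cancellation policy?"]
    print(answer_relevance(query, on_topic) > answer_relevance(query, drifted))  # True
```

Because the toy vectors only see word overlap, the drifted question scores zero here; a real embedding model would give it a middling score, which is exactly the band where the 0.80 floor matters.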

7. Hallucination Rate

What it measures: How often does the model generate facts, names, numbers, or claims that have no basis in the retrieved context or in verifiable reality?

Why it matters: Hallucination rate is the metric your CTO will ask about. Faithfulness measures fidelity to context; hallucination rate measures fabrication beyond context. They overlap but aren’t identical: a model can be faithful to bad context, or unfaithful in benign ways.

How we measure it: We sample 5% of production queries daily and run them through a dedicated hallucination detection pipeline that flags claims requiring a fact-check; humans then review the flagged subset.

Target threshold: <2% hallucination rate for production agents. <0.5% for regulated industry deployments.

Production note: Hallucination rates spike by query type. Open-ended questions hallucinate more than yes/no questions. Numeric questions hallucinate more than categorical ones. Build query-type classification into your eval pipeline so you can target investigation.
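Two pieces of this are easy to sketch: choosing the 5% sample reproducibly, and breaking the reviewed results down by query type. The hashing approach and the field names "query_type" and "hallucinated" are assumptions made for illustration; the detection pipeline and the human review themselves are out of scope here.

```python
import hashlib
from collections import defaultdict


def in_daily_sample(query_id: str, rate: float = 0.05) -> bool:
    """Deterministically select roughly `rate` of queries by hashing the ID,
    so the same query is always in or out of the sample."""
    digest = hashlib.sha256(query_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1]
    return bucket < rate


def hallucination_rate_by_type(reviewed: list[dict]) -> dict[str, float]:
    """Share of reviewed queries confirmed as hallucinated, per query type."""
    totals: dict[str, int] = defaultdict(int)
    flagged: dict[str, int] = defaultdict(int)
    for row in reviewed:
        totals[row["query_type"]] += 1
        flagged[row["query_type"]] += int(row["hallucinated"])
    return {qtype: flagged[qtype] / totals[qtype] for qtype in totals}


if __name__ == "__main__":
    # Hypothetical review results, for illustration only.
    reviewed = [
        {"query_type": "numeric", "hallucinated": True},
        {"query_type": "numeric", "hallucinated": False},
        {"query_type": "yes_no", "hallucinated": False},
        {"query_type": "yes_no", "hallucinated": False},
    ]
    print(hallucination_rate_by_type(reviewed))  # {'numeric': 0.5, 'yes_no': 0.0}
```

Hash-based sampling avoids storing a random seed and keeps the sample stable across reruns of the pipeline.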

Category 3: Agent-Specific Metrics (3)

If your AI system is an agent (multi-step, tool-using, goal-directed) rather than a simple RAG pipeline, three additional metrics matter.

8. Tool Selection Accuracy

What it measures: When the agent has a choice of tools, does it pick the right one for the user’s intent?

Why it matters: Modern agents have access to dozens of tools: search, calculators, calendars, database queries, and API calls. Wrong tool selection cascades: the agent then tries to force a square peg into a round hole, producing incorrect results downstream.

How we measure it: Build a labeled eval set of (query, correct_tool) pairs. Run the agent against the queries and compute the accuracy of tool selection at the first decision point.

Target threshold: >0.92 for binary tool choices. >0.85 for choices among 5+ tools.

Production note: Tool selection accuracy drops as the number of available tools grows. We’ve seen 95% accuracy with 3 tools collapse to 70% with 12 tools. The fix is usually clearer tool descriptions, fewer tools per agent (decompose into specialized sub-agents), or fine-tuning on tool-use traces from production.
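A minimal version of that measurement follows, assuming the agent's first tool decision can be exposed as a function from query to tool name. The keyword_router below is a toy stand-in for the agent, and the tool names are hypothetical.

```python
from typing import Callable


def tool_selection_accuracy(
    eval_set: list[tuple[str, str]],
    select_tool: Callable[[str], str],
) -> float:
    """Share of (query, correct_tool) pairs where the first tool choice matches."""
    correct = sum(1 for query, expected in eval_set if select_tool(query) == expected)
    return correct / len(eval_set)


def keyword_router(query: str) -> str:
    """Toy stand-in for the agent's first tool decision."""
    lowered = query.lower()
    if "weather" in lowered:
        return "weather_api"
    if any(ch.isdigit() for ch in lowered):
        return "calculator"
    return "search"


if __name__ == "__main__":
    labeled = [
        ("What is 17 * 23?", "calculator"),
        ("Weather in Oslo tomorrow", "weather_api"),
        ("Who wrote Dune?", "search"),
        ("Top 5 restaurants near me", "search"),  # the router misfires on the digit
    ]
    print(tool_selection_accuracy(labeled, keyword_router))  # 0.75
```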

9. Tool Execution Success

What it measures: Of the tool calls the agent makes, what fraction execute successfully (correct arguments, valid responses, no errors)?

Why it matters: An agent can pick the right tool and still call it incorrectly: wrong argument format, missing required fields, malformed input. Tool execution success isolates this failure mode.

How we measure it: Track every tool call in production with success/failure status, error categorization, and retry attempts. Compute success rate per tool, per query type, and per time window.

Target threshold: >0.98 tool execution success rate. Below 0.95 indicates systematic argument-construction problems.

Production note: The most common failure mode is the agent confidently constructing arguments in a format that doesn’t match the tool’s actual schema (e.g., passing a free-form date string when the API expects ISO 8601). The fix is structured-output enforcement (function calling, JSON Schema validation) at the tool boundary.
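As an illustration of checking arguments at the tool boundary, the sketch below validates a hypothetical booking tool's arguments with the standard library only. The field names customer_id and start_date are invented for the example, and a production setup would more likely use JSON Schema validation or the model provider's function-calling schema.

```python
from datetime import date


def validate_tool_args(args: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the call may proceed."""
    errors = []
    for field in ("customer_id", "start_date"):
        if field not in args:
            errors.append(f"missing required field: {field}")
    if "start_date" in args:
        try:
            date.fromisoformat(str(args["start_date"]))
        except ValueError:
            errors.append("start_date is not an ISO 8601 date (YYYY-MM-DD)")
    return errors


def execution_success_rate(outcomes: list[bool]) -> float:
    """Share of tool calls that executed successfully."""
    return sum(outcomes) / len(outcomes)


if __name__ == "__main__":
    good = {"customer_id": "c-1", "start_date": "2026-03-03"}
    bad = {"start_date": "March 3rd"}
    print(validate_tool_args(good))  # []
    print(validate_tool_args(bad))
    print(execution_success_rate([True, True, True, False]))  # 0.75
```

Rejecting a malformed call before it reaches the API also gives you a clean error category to log, which feeds the per-tool success rates described above.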

10. Multi-Step Coherence

What it measures: When the agent executes a multi-step plan, does the logical flow remain coherent across steps?

Why it matters: Single-step accuracy is necessary but not sufficient for agentic behavior. An agent that picks the right tool in step 1, gets a good result, then forgets that result by step 4 has failed, even though every individual step succeeded.

How we measure it: Trace-level evaluation. For each multi-step trace, an LLM-as-judge evaluator scores whether each step builds on prior steps coherently and whether the final output reflects the full reasoning chain.

Target threshold: >0.85 coherence on traces of 4+ steps. Below 0.75, your agent is essentially doing multiple disconnected single-step queries.

Production note: Coherence drops with trace length. We see 95%+ coherence on 2-step traces collapse to 60% on 6-step traces. The fix is either decomposition (splitting a 6-step task into 2 separate 3-step tasks with an explicit handoff) or memory architecture (persistent state across steps rather than re-prompting with the full history each time).

Category 4: Production Metrics (2)

The first ten metrics measure what the agent does. These two metrics measure what production cares about.

11. Cost per Query

What it measures: Total cost (token cost + infrastructure cost + tool call costs) per user query, averaged across production traffic.

Why it matters: AI agents have a unique cost profile: a single user query can trigger 5-15 LLM calls (rewriting, retrieval grading, tool selection, generation, verification). Token sprawl turns a $0.02 query into a $0.30 query without anyone noticing until the monthly bill arrives.

How we measure it: Instrument every LLM call with token usage logging, every tool call with associated API costs, and every infrastructure dependency with prorated cost. Aggregate per query, then per query type, then per time window.

Target threshold: Varies by use case. Internal employee tools: <$0.10/query is acceptable. Customer-facing products: <$0.05/query for sustainable economics. In regulated industries, cost matters less than other metrics.

Production note: Cost spikes usually trace to one of: prompt length growth (your system prompt grew over time), retry storms (failures triggering re-execution loops), or context window inflation (retrieved chunks getting longer as your knowledge base grows). All three are easy to instrument and fix.

For teams seeing cost-per-query trending upward unsustainably, the build-vs-buy decision often shifts toward custom infrastructure with capped costs rather than per-token API pricing.
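The per-query aggregation described under "How we measure it" can be sketched as follows. The token prices are hypothetical placeholders, not any provider's actual rates, and the LlmCall record is an assumed log shape; infrastructure cost is left out for brevity.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical prices in USD per 1,000 tokens; substitute your provider's rates.
INPUT_PRICE_PER_1K = 0.003
OUTPUT_PRICE_PER_1K = 0.015


@dataclass
class LlmCall:
    query_id: str
    input_tokens: int
    output_tokens: int


def cost_per_query(
    llm_calls: list[LlmCall],
    tool_costs: list[tuple[str, float]],
) -> dict[str, float]:
    """Sum token costs and tool-call costs for each user query."""
    totals: dict[str, float] = defaultdict(float)
    for call in llm_calls:
        totals[call.query_id] += (
            call.input_tokens / 1000 * INPUT_PRICE_PER_1K
            + call.output_tokens / 1000 * OUTPUT_PRICE_PER_1K
        )
    for query_id, cost in tool_costs:
        totals[query_id] += cost
    return dict(totals)


if __name__ == "__main__":
    calls = [
        LlmCall("q1", input_tokens=2000, output_tokens=500),  # e.g. generation
        LlmCall("q1", input_tokens=1000, output_tokens=200),  # e.g. verification
        LlmCall("q2", input_tokens=500, output_tokens=100),
    ]
    tools = [("q1", 0.01)]  # one paid API call attributed to q1
    for query_id, cost in cost_per_query(calls, tools).items():
        print(query_id, round(cost, 4))
```

Keying everything on a query ID is the important design choice: it is what lets you attribute 5-15 downstream LLM calls back to the single user request that triggered them.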

12. P99 Latency

What it measures: End-to-end time from user query to final response, measured at the 99th percentile.

Why it matters: Average latency hides the failure modes that frustrate users. A system with 1-second average latency but 15-second p99 has users abandoning sessions after 4-5 slow responses. P99 is what users remember.

How we measure it: Standard application performance monitoring. We log the end-to-end latency for every query and report p50, p95, p99, and max. We track these per query type because conversational queries should be much faster than analytical queries.

Target threshold: p99 <3 seconds for conversational agents. p99 <10 seconds for analytical agents that perform multi-step reasoning. Beyond 10 seconds, users disengage.

Production note: P99 latency is almost always dominated by one of three things: retrieval (vector DB cold cache), tool calls (external API timeouts), or LLM generation for long outputs (hitting token-by-token streaming bottlenecks). Identify the dominant cause before optimizing the wrong layer.

A Decision Tree: Which Metrics to Prioritize First

Twelve metrics are a lot to instrument simultaneously. Here’s how we sequence implementation across project phases.

Phase 1 (Pre-launch, Week 0-2): Implement retrieval metrics (context relevance, recall, precision) plus answer faithfulness. These four catch the most common pre-launch failure modes.

Phase 2 (Soft launch, Week 3-6): Add hallucination rate, answer relevance, and tool selection accuracy. These catch issues that only emerge with real user traffic.

Phase 3 (Production stable, Week 7+): Add cost per query, P99 latency, tool execution success, multi-step coherence, and retrieval latency. These optimize the running system rather than catch launch-blocking failures.

Use case modifiers:

Regulated industry (healthcare, fintech, legal): Prioritize faithfulness and hallucination rate above everything else. Aim for >0.97 faithfulness and <0.5% hallucination rate from day one.
High-volume consumer product: Prioritize cost per query and P99 latency. Faithfulness matters but not at the expense of unit economics.
Internal employee tools: Prioritize tool execution success and multi-step coherence. Employees forgive slow responses but not broken workflows.

How This Framework Compares to Existing Tools

You don’t have to build all 12 metrics from scratch. Several open-source and commercial tools cover subsets of this framework.

Ragas covers context relevance, recall, precision, faithfulness, and answer relevance well. It’s the strongest open-source starting point for RAG-specific metrics. It doesn’t cover agent-specific metrics or production health.

TruLens covers similar RAG metrics with stronger observability tooling. It integrates better with LangChain and LlamaIndex but requires more setup than Ragas.

DeepEval offers a broader metric library with good agent-specific support (tool selection, faithfulness). It is newer than Ragas and has a smaller community.

LangSmith provides production monitoring and evaluation for LangChain-based agents. It is strong on traces and observability, weaker on offline benchmark evaluation.

Why we built our own framework on top: None of the existing tools cover all 12 metrics in one place, and the agent-specific metrics (tool selection accuracy, multi-step coherence) are particularly underserved. We use Ragas for RAG metrics, custom evaluators for agent metrics, and standard APM tools (Datadog, OpenTelemetry) for production health metrics. The framework above is the unified view across all three.

Implementation Reality: What It Actually Costs to Build This

Setting up the full 12-metric framework takes 2-3 weeks of focused engineering effort, assuming you have an LLM-judge evaluator already configured.

Time breakdown:

Eval set construction (labeled queries + ground truth): 4-6 days
Metric implementation (Ragas or custom): 3-5 days
CI/CD integration (run eval on each PR): 2-3 days
Production monitoring instrumentation: 3-5 days
Dashboards and alerting: 2-3 days

Tooling we use across deployments:

Eval orchestration: Ragas + custom evaluators in Python
LLM-as-judge: GPT-4 for high-stakes evaluation, Claude Sonnet for cost-sensitive eval, Llama 3 70B for fully self-hosted compliance environments
Storage: PostgreSQL for eval results, S3 for raw traces
Dashboards: Grafana for production metrics, Streamlit for offline eval reports
Alerting: PagerDuty integration for threshold breaches

Common pitfalls we’ve watched teams hit:

Using the same model for generation and judging. This produces inflated scores. Use a different model family for the judge than for the generator.
Skipping the labeled eval set. Without ground truth labels, you can’t compute recall or measure regression. The labeling cost is real but pays back within the first month.
Running eval only on success cases. You need failure cases in your eval set, or you’ll never catch regressions. Sample production failures aggressively.
Treating eval scores as absolute. Track trends and deltas, not absolute scores. A 0.85 score that drops to 0.78 over a week is more meaningful than the absolute number.
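The last pitfall, tracking deltas rather than absolute scores, can be sketched as a simple drop check over a window of daily scores. The 0.05 alert limit is an assumption chosen for the example, not a figure from the deployments described here.

```python
def score_drop(daily_scores: list[float]) -> float:
    """Drop from the first to the last score in the window (positive = worse)."""
    return daily_scores[0] - daily_scores[-1]


def regression_alert(daily_scores: list[float], max_drop: float = 0.05) -> bool:
    """True when the score fell by more than `max_drop` across the window."""
    return score_drop(daily_scores) > max_drop


if __name__ == "__main__":
    # The 0.85 -> 0.78 example from the pitfall above, spread over a week.
    week = [0.85, 0.85, 0.84, 0.82, 0.81, 0.79, 0.78]
    print(regression_alert(week))  # True

    steady = [0.80, 0.80, 0.81, 0.80, 0.79, 0.80, 0.80]
    print(regression_alert(steady))  # False
```

Note that the steady series sits below the declining one's starting point in absolute terms yet raises no alert, which is the point of the pitfall.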

Frequently Asked Questions

What Is the Minimum Eval Setup for a New AI Agent Project?

For a new project, implement context relevance, answer faithfulness, and tool selection accuracy. These three catch 70% of pre-launch failures with minimal setup overhead. Skip the production metrics until you have actual production traffic.

How Often Should We Run the Full Eval Suite?

Run offline eval (against the labeled benchmark set) on every code change that affects retrieval, prompts, or agent logic. Run online eval (sampled production traffic) continuously, with daily rollup reports. Full benchmark re-runs are expensive but should happen at least weekly to catch regressions.

Should We Use LLM-as-Judge or Human Evaluation?

Both, in sequence. Use LLM-as-judge for scale (evaluate 100% of production traffic at low cost), and use human evaluation for calibration (evaluate a 1-2% sample to verify the LLM judge agrees with human consensus). When the LLM judge and human evaluation diverge, revise the judge prompt.

What Is the Difference Between Offline and Online Evaluation?

Offline evaluation runs against a labeled benchmark dataset with known correct answers. Online evaluation runs against real production traffic, where you don’t know the ground truth answer in advance, so you measure proxy signals (faithfulness, relevance, hallucination) rather than accuracy. Both are necessary. Offline catches regressions before they ship. Online catches issues that emerge from real user behavior.

How Do We Handle Evaluation for Non-Deterministic Agents?

Run each evaluation query 3-5 times and report the mean and variance of scores. High variance indicates the agent’s behavior is unstable, which is itself a signal worth investigating. For production traffic, sample sufficiently to overcome the noise from variance.
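A small sketch of the repeated-run summary follows, using the standard library's statistics module. The max_variance cutoff of 0.0025 (a standard deviation of 0.05) is an assumed value for illustration; pick one that matches the noise you actually observe.

```python
import statistics


def score_stability(run_scores: list[float], max_variance: float = 0.0025) -> dict:
    """Mean and sample variance of one query's scores across repeated runs."""
    mean = statistics.fmean(run_scores)
    variance = statistics.variance(run_scores) if len(run_scores) > 1 else 0.0
    return {"mean": mean, "variance": variance, "unstable": variance > max_variance}


if __name__ == "__main__":
    # Five runs of the same eval query; scores are illustrative.
    stable_runs = [0.90, 0.92, 0.88, 0.91, 0.89]
    erratic_runs = [0.95, 0.60, 0.90]
    print(score_stability(stable_runs)["unstable"])   # False
    print(score_stability(erratic_runs)["unstable"])  # True
```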

What Metrics Matter Most for RAG Versus Agentic Systems?

Pure RAG systems: prioritize the four retrieval metrics plus faithfulness. Agentic systems: add tool selection accuracy, tool execution success, and multi-step coherence on top of the RAG metrics. The production metrics (cost, latency) matter equally for both.

How Do We Measure User Satisfaction in Eval?

User satisfaction is downstream of the 12 metrics above. If your faithfulness, relevance, and latency metrics are all in target range, satisfaction will track. Direct satisfaction signals (thumbs-up/down, follow-up questions, session abandonment) are useful as production health indicators but lag behind the metrics that cause them.

What Is the Eval Cost? Is It Worth It?

LLM-as-judge evaluation costs roughly 30-50% of your inference cost (every production query also gets evaluated by an LLM). For a $4K/month inference budget, expect $1,200-$2,000/month in eval costs. The ROI is preventing a single production incident that would cost engineer-weeks to debug or trust damage to recover from. After the first prevented incident, eval pays for itself indefinitely.

Closing Thought

The teams shipping AI agents successfully in 2026 aren’t the ones with the best models. They’re the ones with the best evaluation infrastructure. Models are commodities. Evaluation is differentiation.

If you’re building production AI agents and want a second opinion on your evaluation framework grounded in 100+ deployments, the Intuz team is happy to help.

Pratik K Rupareliya is the Co-Founder and Head of Strategy at Intuz, where he leads enterprise AI strategy across 100+ deployments spanning healthcare, fintech, manufacturing, and retail. Connect with him on LinkedIn.
