Building an LLM Evaluation Harness: The Metrics That Actually Predict User Satisfaction

An LLM evaluation harness is the system that runs structured tests against your model and answers a single question: is this version better than the one in production, for the users who actually use it? Most mid-market teams ship LLM features without one. They rely on vibes: a few examples that look good in a Slack thread, a thumbs-up from product, ship it. Then production drift hits three weeks later and nobody can say whether the model got worse, the inputs shifted, or users got pickier. This post covers the metrics that correlate with user satisfaction, the ones that do not, and how to build a harness that catches regressions before they reach users.

Why vibes-based QA fails the moment you have real traffic

We work with engineering teams shipping LLM features at 50-500 person companies. The pattern is consistent. The first version of an AI feature gets a careful manual review. Three engineers spend an afternoon prompting it with a curated list of inputs and the outputs look reasonable. The feature ships.

Six weeks later something feels off. A few users complain. Support escalates a handful of bad outputs. The team starts iterating on the prompt. Each prompt change improves the examples in the Slack thread and quietly degrades performance on the 80% of inputs nobody is looking at.

This is the failure mode an evaluation harness exists to prevent. The harness is the answer to three production questions:

Is the version we are about to ship better or worse than what is in production right now?
Has the live model started drifting — behaving worse on the same kinds of inputs it used to handle well?
When users report a bad output, is it a regression or a long-tail edge case that has always been there?

Without a harness, you cannot distinguish these. With one, each question has a measurable answer and every regression can be traced to the specific cases it broke. We covered the case for systematic evaluation in LLM Evaluation in Production; this piece focuses specifically on the harness architecture and the metric choices that determine whether the harness is worth running.

The three layers of a working eval harness

A production eval harness has three layers and they answer different questions. Skipping any of them creates a blind spot.

Layer 1: offline benchmark on a held-out test set

This is what most teams build first. A curated dataset of inputs with either expected outputs or grading rubrics, run on demand or in CI before deployment. It answers: did this prompt change regress on the cases we already know matter.

The test set is the asset. Start with 100-200 examples sourced from production logs (PII stripped), weighted toward inputs that previously caused problems. Add a new case every time a user reports an issue. The test set is the institutional memory of every bug the team has ever fixed.

What this layer cannot tell you: whether your test set still represents the distribution of inputs you are actually seeing in production. Test sets get stale.
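As a rough illustration of the Layer 1 gate described above, here is a minimal sketch. The `EvalCase` record, the `generate` and `score_fn` callables, and the 0.02 tolerance are all hypothetical names and values chosen for the example; your own scoring function (a rubric judge, an exact-match check, anything returning a number) would plug in where `score_fn` is.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    input_text: str
    rubric: str                     # grading guidance for this case
    source: str = "production_log"  # or "user_report" for cases added after a bug


def run_offline_eval(
    cases: List[EvalCase],
    generate: Callable[[str], str],
    score_fn: Callable[[EvalCase, str], float],
) -> Dict[str, float]:
    """Score one candidate (prompt + model) on every case in the test set."""
    return {c.case_id: score_fn(c, generate(c.input_text)) for c in cases}


def regression_report(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.02,  # assumed noise allowance, tune for your judge
) -> dict:
    """Compare candidate scores to baseline scores, case by case and on average."""
    regressed = sorted(
        cid for cid in baseline if candidate[cid] < baseline[cid] - tolerance
    )
    baseline_mean = mean(baseline.values())
    candidate_mean = mean(candidate.values())
    return {
        "baseline_mean": baseline_mean,
        "candidate_mean": candidate_mean,
        "regressed_cases": regressed,
        "passed": candidate_mean >= baseline_mean - tolerance,
    }
```

In CI, the `passed` flag would decide whether the deploy proceeds, and `regressed_cases` gives reviewers the specific inputs to look at.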

Layer 2: online proxy metrics

These are signals you can compute on every production response, in real time, without waiting for user feedback. They are weaker than direct user feedback but they are fast and they cover 100% of traffic.

The proxy metrics that have actually predicted user satisfaction in our engagements:

Response length distribution. A sudden shift in mean output tokens almost always means something upstream changed: the prompt is producing terser or longer outputs than before, the model is being routed differently, or the input characteristics moved.
Refusal rate. What fraction of requests does the model decline or hedge on. Sudden movement here is usually a prompt change or a model update.
Format adherence rate. For structured outputs (JSON, classification labels), what percentage parse successfully on the first attempt. This catches schema regressions in seconds.
Latency p95 and p99. Slow responses correlate with user abandonment. If p95 doubled, quality is irrelevant — users left.
Citation grounding rate (for RAG). What fraction of claims in the output can be traced back to a retrieved chunk. A drop here is the leading indicator of hallucination.

These run in your application code or your observability layer. They are cheap and they catch the unglamorous failures.
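Most of these proxies are a few lines of code each. The sketch below computes four of them over a batch of logged responses. The `ResponseRecord` shape and the refusal phrase list are assumptions made for the example (real refusal detection usually needs a broader list or a classifier), and citation grounding is left out because it depends on how your RAG pipeline extracts claims.

```python
import json
import math
from dataclasses import dataclass
from typing import List

# Assumed phrase list, purely illustrative.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable", "i am unable")


@dataclass
class ResponseRecord:
    text: str
    latency_ms: float
    output_tokens: int


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile for pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def is_valid_json(text: str) -> bool:
    """True if the response parses as JSON on the first attempt."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def is_refusal(text: str) -> bool:
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def proxy_metrics(records: List[ResponseRecord]) -> dict:
    n = len(records)
    latencies = [r.latency_ms for r in records]
    return {
        "mean_output_tokens": sum(r.output_tokens for r in records) / n,
        "refusal_rate": sum(is_refusal(r.text) for r in records) / n,
        "format_adherence_rate": sum(is_valid_json(r.text) for r in records) / n,
        "latency_p95_ms": percentile(latencies, 95),
        "latency_p99_ms": percentile(latencies, 99),
    }
```

Computed per hour or per day and plotted against the previous week, these numbers are usually enough to spot the sudden shifts described above.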

Layer 3: user feedback

The ground truth. Thumbs up/down, regenerate clicks, copy-to-clipboard events, downstream conversion (did the user complete the task the AI was helping with). This is what you actually want to optimize for, but it is sparse, biased toward the loud minority, and lagging.

The harness's job is to use Layers 1 and 2 to predict Layer 3 changes before they manifest, and to use Layer 3 to validate that Layers 1 and 2 are still tracking reality.

The metrics that actually predict user satisfaction

This is where most teams burn months. The literature on text generation evaluation is decades old and most of it does not apply to production LLM features.

Metrics that look reasonable and rarely predict satisfaction

BLEU, ROUGE, METEOR. These come from machine translation and summarization research. They measure n-gram overlap between the model output and a reference. For open-ended generation (chatbots, agents, RAG) they correlate poorly with user perception of quality, because many differently worded responses can be equally useful. Two responses with identical BLEU scores can have wildly different usefulness. We have not seen a mid-market team where BLEU scores correlated with retention.

Embedding similarity to a gold answer. Better than BLEU because it captures semantic equivalence, but it still treats your reference answer as the only good answer. For tasks with multiple valid responses (most of them), embedding similarity rewards conservatism and penalizes diverse-but-correct outputs.

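Perplexity. A measure of how well a language model predicts a given text; lower perplexity means the model was less surprised by it. Useful for model training diagnostics, near-useless for production quality, since a fluent and confident answer can still be wrong or unhelpful.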

Metrics that do predict satisfaction

LLM-as-judge with a structured rubric. A separate LLM call that scores the output against criteria you define — helpfulness, factual accuracy, tone, instruction-following. The G-Eval paper (Liu et al., 2023, https://arxiv.org/abs/2303.16634) demonstrated that GPT-4-based judges with chain-of-thought rubrics correlated with human judgment significantly better than ROUGE or BLEU on summarization tasks. We have seen the same pattern hold across customer support, content generation, and code review use cases.

The structure of the rubric matters. Free-form "rate this 1 to 5" prompts produce noisy scores. Rubric-based prompts — "score 1 if the response is factually wrong, 3 if partially correct, 5 if fully grounded in the provided context" — produce scores stable enough to detect 5-10% quality shifts.
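One way the rubric prompt above might be wired up is sketched below. The prompt wording, the `SCORE:` output convention, and the `call_judge` callable (whatever client function sends a prompt to your judge model and returns its text) are assumptions for illustration, not a prescribed format.

```python
import re
from typing import Callable, Optional

RUBRIC_PROMPT = """You are grading a response against a rubric.

Context:
{context}

Response to grade:
{response}

Rubric:
- 1: the response is factually wrong
- 3: the response is partially correct
- 5: the response is fully grounded in the provided context

Explain your reasoning in two or three sentences, then finish with a
final line in exactly this form: SCORE: <1-5>
"""

# Matches a line that contains only "SCORE: n" with n between 1 and 5.
_SCORE_RE = re.compile(r"^SCORE:\s*([1-5])\s*$", re.MULTILINE)


def parse_score(judge_output: str) -> Optional[int]:
    """Return the last well-formed score in the judge output, or None."""
    matches = _SCORE_RE.findall(judge_output)
    return int(matches[-1]) if matches else None


def judge_with_rubric(
    context: str,
    response: str,
    call_judge: Callable[[str], str],  # hypothetical: prompt in, judge text out
) -> Optional[int]:
    prompt = RUBRIC_PROMPT.format(context=context, response=response)
    return parse_score(call_judge(prompt))
```

Returning `None` for unparseable judge output, rather than guessing a score, keeps judge formatting failures visible as their own failure rate.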

Pairwise preference judgments. Show the judge model two responses to the same input and ask which is better. This is the gold standard. Pairwise comparisons sidestep the calibration problem of absolute scoring — the judge does not need to know what "4 out of 5" means in your domain, only which of these two is better. Pairwise win rate against the production baseline is the single most useful metric we measure during prompt iteration.
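A minimal sketch of pairwise win rate against the production baseline follows. The `judge` callable and its `"first"`/`"second"`/`"tie"` return values are hypothetical. The sketch also asks each comparison twice with the order swapped and keeps only consistent verdicts; that is a common guard against judges favoring a position, added here as an assumption rather than something the approach above requires.

```python
from typing import Callable, List, Tuple

# Hypothetical judge signature: judge(input_text, first, second)
# returns "first", "second", or "tie".
Judge = Callable[[str, str, str], str]


def pairwise_verdict(judge: Judge, input_text: str, baseline: str, candidate: str) -> str:
    """Ask twice with the order swapped; keep only consistent verdicts."""
    forward = judge(input_text, baseline, candidate)  # candidate shown second
    reverse = judge(input_text, candidate, baseline)  # candidate shown first
    if forward == "second" and reverse == "first":
        return "candidate"
    if forward == "first" and reverse == "second":
        return "baseline"
    return "tie"  # includes cases where the two orderings disagree


def win_rate(judge: Judge, triples: List[Tuple[str, str, str]]) -> float:
    """Candidate win rate over (input, baseline_output, candidate_output) triples.

    Ties count as half a win, so 0.5 means no measurable difference.
    """
    verdicts = [pairwise_verdict(judge, i, b, c) for i, b, c in triples]
    wins = verdicts.count("candidate") + 0.5 * verdicts.count("tie")
    return wins / len(verdicts)
```

With this convention, a prompt change is worth shipping when its win rate sits clearly above 0.5 on the test set, not merely when it wins the handful of examples someone happened to look at.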

Task completion rate. For agentic systems or multi-step workflows, did the system do the thing it was asked to do. Binary, expensive to instrument, deeply correlated with user satisfaction. If you can only measure one thing, measure this.

Time-to-first-acceptable-answer. Latency-weighted quality. Users do not just want correct answers; they want them fast. A response that takes 12 seconds and scores 5/5 is often worse than one that takes 2 seconds and scores 4/5.

LLM-as-judge model selection

The judge model choice is one of the more consequential and least discussed decisions. Mid-2026 reality:

Haiku 4.5 ($1/MTok input, $5/MTok output) is the default for routine pairwise comparisons and rubric scoring on straightforward tasks. Fast enough to run on full test sets without budget pain.
GPT-5-mini is the cost-comparable alternative if your stack is already OpenAI-leaning. Treat it as functionally equivalent to Haiku for most judging.
Sonnet 4.6 ($3/MTok input, $15/MTok output) for harder judgments — complex code review, multi-paragraph reasoning quality, anything where the judge needs to actually understand domain context. Save it for the test cases where Haiku gives noisy or inconsistent scores.

Two rules we apply across every engagement:

Cross-family judging. Do not use the same model family to generate and judge. If you are serving Claude in production, judge with GPT-5-mini or Gemini 2.5 Flash. Self-preference bias is well documented, and in our experience it is material: a model rates its own family's outputs roughly 10-15% higher than human raters would.

Length-bias controls. LLM judges reward longer outputs, regardless of quality. If you are comparing a verbose v2 prompt against a concise v1 prompt, the judge will favor v2 even when users prefer v1. Either include explicit length-bias instructions in the judge prompt ("ignore output length when scoring") or normalize by length before comparing.
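Before adding length controls, it helps to check whether your judge actually shows the bias on your data. The diagnostic below is a sketch under assumptions: the record shape is hypothetical, and character count stands in for token count. It reports how often the longer response won among decided comparisons.

```python
from typing import List, Optional, Tuple

# Hypothetical record: (baseline_text, candidate_text, verdict)
# where verdict is "baseline", "candidate", or "tie".
Record = Tuple[str, str, str]


def longer_wins_rate(records: List[Record]) -> Optional[float]:
    """Fraction of decided comparisons in which the longer response won.

    Ties and equal-length pairs are skipped. Returns None if nothing is left.
    """
    decided = 0
    longer_won = 0
    for baseline, candidate, verdict in records:
        if verdict == "tie" or len(baseline) == len(candidate):
            continue
        decided += 1
        candidate_won = verdict == "candidate"
        candidate_is_longer = len(candidate) > len(baseline)
        if candidate_won == candidate_is_longer:
            longer_won += 1
    return longer_won / decided if decided else None
```

A rate far above 0.5 does not prove bias on its own, since longer answers are sometimes genuinely better. It becomes a warning sign when human raters on the same pairs do not show the same preference.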

Building the harness: tooling considerations

The current landscape has four credible options. None are clearly best; the choice depends on your stack and your procurement constraints.

Langfuse (MIT core + Enterprise Edition for the ee/, web/src/ee/, worker/src/ee/ directories). v3 went GA in December 2024. Self-hostable, generous OSS feature set, evaluation integrated with tracing. Our default recommendation for teams with a platform team comfortable running Postgres, bearing in mind that a v3 self-hosted deployment also needs ClickHouse, Redis, and S3-compatible blob storage. The Enterprise Edition gates SSO, RBAC, and some audit features.

Arize Phoenix (Elastic License 2.0). Source-available, not OSI-approved open source — the EL2.0 license restricts offering Phoenix as a hosted service to others. For most internal mid-market use this restriction is irrelevant, but procurement teams that filter on OSI-approved licenses will reject it. Strong on tracing and dataset management; the evaluation tooling has matured significantly through 2025-2026.

OpenAI Evals (MIT). Stripped-down framework, good if your stack is already OpenAI-centric and you want eval-as-code without a UI dependency. Lower operational overhead, fewer features.

DeepEval (Apache 2.0). pytest-style assertions for LLM outputs. Best fit for teams that want eval to live entirely in CI alongside unit tests. Less useful as an ongoing observability platform.

License precision matters here because the SME procurement teams in 50-500 person companies will ask. Langfuse Enterprise Edition is the most common point of confusion — the core is MIT, the ee/ directories are not. If you are deploying self-hosted Langfuse and disabling the EE features, the MIT license governs and you are fine. Once you enable RBAC or SSO, you need a license. Read the LICENSE file before assuming.

Helicone, which used to be on this list, was acquired by Mintlify on 2026-03-03. Roadmap uncertainty is real. If you are evaluating it now, factor in the acquisition.

Eval drift detection: when the harness lies

A test set drifts when production inputs diverge from the inputs the test set was built on. When this happens, your offline metrics keep saying everything is fine while users have a worse experience. This is the most common silent failure we see.

Three signals catch drift:

Test set coverage of production inputs. Periodically (we recommend monthly), embed a sample of production inputs and your test set inputs, then measure how far each production input sits from its nearest test set input. If that average nearest-neighbor distance is growing month over month, your test set is becoming less representative.
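One way to turn the coverage check into a single number is the mean distance from each production input to its nearest test set input. The sketch below assumes you already have embedding vectors from whatever embedding model you use, and that none of them is the zero vector; the function names are illustrative.

```python
import math
from statistics import mean
from typing import List, Sequence

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity. Assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def coverage_gap(production: List[Vector], test_set: List[Vector]) -> float:
    """Mean distance from each production input to its nearest test set input."""
    return mean(
        min(cosine_distance(p, t) for t in test_set) for p in production
    )
```

The absolute value depends on the embedding model, so the useful signal is the trend: record `coverage_gap` each month and investigate when it climbs. This brute-force version is quadratic in sample size, which is fine for a few thousand inputs.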

Disagreement between Layer 2 proxy metrics and Layer 1 offline scores. If your offline LLM-as-judge scores are stable but your production refusal rate is climbing, or format adherence is dropping, the test set is missing something real.

User feedback divergence from judge scores. Track thumbs-down rate against rolling LLM-as-judge scores on the same time-aligned window. If they decorrelate, either the judge is wrong or the test set no longer reflects what users care about.
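A simple way to watch for that decorrelation is a Pearson correlation between per-window mean judge scores and per-window thumbs-down rates. When the judge is tracking reality the correlation should be clearly negative (higher scores, fewer thumbs-down). The sketch below is illustrative; the -0.3 cutoff is an assumed starting point, not a recommended value.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (spread_x * spread_y)


def judge_tracks_feedback(
    judge_scores: Sequence[float],       # mean judge score per window
    thumbs_down_rates: Sequence[float],  # thumbs-down rate per window
    max_corr: float = -0.3,              # assumed threshold
) -> bool:
    """True while higher judge scores still line up with fewer thumbs-down."""
    return pearson(judge_scores, thumbs_down_rates) <= max_corr
```

With only a handful of weekly windows the estimate is noisy, so treat a `False` result as a reason to look at samples by hand rather than as proof that the judge is broken.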

The fix for drift is not to rebuild the test set from scratch. It is to add the cases that the harness is missing. When a user reports a bad output, the task is not just to fix that output. Ask whether it is a case the test set should cover, and if so, add it.

The budget conversation: evals are not free

This is the conversation that surprises engineering leaders. A well-instrumented LLM eval harness routinely costs 10-30% of production inference budget. Sometimes more.

The math: a test set of 200 cases, evaluated with an LLM-as-judge using Haiku 4.5, run on every prompt change (maybe 30 times a month during active development) is 6,000 judgments per month. Each judgment is ~2,000 input tokens (the case + output + rubric) and ~200 output tokens, so about 12M input tokens and 1.2M output tokens in total. At Haiku 4.5 pricing that is about $12 of input plus $6 of output, roughly $18 per month per evaluated feature. Trivial.
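The arithmetic is easy to rerun with your own numbers. This small calculator uses the per-million-token list prices quoted earlier in this post; the function name and parameters are just for illustration.

```python
def monthly_judging_cost(
    cases: int,
    runs_per_month: int,
    input_tokens_per_judgment: int,
    output_tokens_per_judgment: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Monthly judge spend in dollars for one evaluated feature."""
    judgments = cases * runs_per_month
    input_cost = judgments * input_tokens_per_judgment / 1_000_000 * input_price_per_mtok
    output_cost = judgments * output_tokens_per_judgment / 1_000_000 * output_price_per_mtok
    return input_cost + output_cost


# 200 cases x 30 runs, ~2,000 input and ~200 output tokens per judgment.
haiku = monthly_judging_cost(200, 30, 2_000, 200, 1.0, 5.0)
sonnet = monthly_judging_cost(200, 30, 2_000, 200, 3.0, 15.0)

print(f"Haiku 4.5:  ${haiku:.2f}/month")   # Haiku 4.5:  $18.00/month
print(f"Sonnet 4.6: ${sonnet:.2f}/month")  # Sonnet 4.6: $54.00/month
```

Swapping in your own test set size, run frequency, and token counts shows quickly which of the add-ons below will dominate the bill.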

But that is the floor. Production teams add:

Pairwise comparisons that double the input cost per judgment
Sonnet 4.6 judging for hard cases (3x the per-judgment cost of Haiku at the list prices above)
Continuous online judging on a sample of production responses (this is what gets expensive fast — if you judge 10% of production traffic and your production volume is non-trivial, the eval bill scales linearly with usage)
Multi-judge agreement (running 3 judges and taking majority vote to reduce judge noise)
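Multi-judge agreement is simple to implement, which is part of why it creeps onto the bill. A minimal sketch, assuming each judge returns a verdict string such as "candidate", "baseline", or "tie" (the `"no_majority"` label is an illustrative choice):

```python
from collections import Counter
from typing import List


def majority_vote(verdicts: List[str]) -> str:
    """Return the verdict a strict majority of judges agree on.

    Returns "no_majority" when no verdict has more than half the votes.
    """
    top_verdict, count = Counter(verdicts).most_common(1)[0]
    return top_verdict if count > len(verdicts) / 2 else "no_majority"
```

Three judges means three times the judging cost for every case, so one reasonable compromise is to use a single judge by default and escalate to a vote only for cases where that judge has been inconsistent.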

A team-of-6+ AI org running rigorous evals on 4-5 production features will spend $2,000-5,000/month on judging alone. Worth it. But you have to budget for it explicitly because it is going to show up on the same bill as your production inference and someone is going to ask.

Team-of-2 vs team-of-6+ ambition

The harness scope should match team size. Two patterns:

Team-of-2 (an engineer and a PM, maybe with a half-time data scientist). Build Layer 1 (offline test set) and Layer 3 (basic thumbs up/down). Skip Layer 2 except for the cheapest proxies (refusal rate, format adherence, latency). Use Haiku-only LLM-as-judge with a 3-criterion rubric. Run evals manually before deployment; do not invest in CI integration yet. Test set size: 50-100 cases. Total eval budget: under $200/month. Most of the value of evals is the discipline of having a test set at all; the sophistication can come later.

Team-of-6+ (dedicated AI platform team, a data scientist, multiple LLM features in production). All three layers, automated. CI-gated deployment for Layer 1 regressions. Online Layer 2 dashboards. Sampling-based Layer 3 collection with explicit annotation queues. Pairwise judging for prompt iteration, rubric scoring for ongoing quality monitoring, Sonnet 4.6 for hard cases. Drift detection running monthly. This is the harness configuration that pays for itself by catching regressions before users see them.

The mistake we see most often: a team of 2 trying to build the team-of-6 harness and shipping nothing for four months. The minimum viable harness ships in a week.

Where this framework breaks

A few honest limits:

Evals do not catch novel failure modes. A test set codifies what you know to test for. The most embarrassing failures are the ones you did not think to test. Pair the harness with adversarial testing and red-teaming for high-stakes use cases.

LLM-as-judge has a ceiling. For very high-quality outputs, judge models cannot reliably distinguish between "good" and "excellent." If your application's value depends on the difference between Sonnet-quality and Opus-quality outputs, you need human judges.

Online proxy metrics can be gamed. If you optimize for response length distribution stability, you can ship a regression that produces equally long but worse outputs. The proxies are signals, not targets.

Evaluating agentic systems is materially harder. Multi-step workflows where the failure mode is in the action sequence, not the text generation, do not fit cleanly into this framework. We covered agentic eval considerations in the broader LLM POC-to-production guide and the governance dimension in our LLM governance framework for mid-market companies.

FAQ

Do we need an eval harness if our LLM feature is internal-only?

Less critical but still useful. Internal users complain less loudly, which means regressions persist longer. The case for evals is weaker if the feature is genuinely low-stakes, but most internal LLM features end up being more business-critical than expected within 12 months. Build the test set now; it is cheap and it is the asset that pays off later.

Should we use the same LLM for production and for judging?

No. Cross-family judging is the discipline. If you serve Claude, judge with GPT-5-mini or Gemini 2.5 Flash. The self-preference bias is real and material — around 10-15% in our experience — and it will make your evals systematically too optimistic.

Is LLM-as-judge good enough to replace human evaluation?

For detecting regressions and supporting iteration, yes. For absolute quality calibration and high-stakes go/no-go decisions, no. The pattern that works: LLM-as-judge runs continuously and triggers human review when scores move significantly. Humans handle the cases the judge flags.

How long does it take to build a useful eval harness?

For a team-of-2 with a single LLM feature, a week to first usable version. A couple of months to get to something the team genuinely trusts. For a team-of-6+ ambition, plan for a quarter of focused platform work, then ongoing investment.

What is G-Eval and is it still relevant in 2026?

G-Eval (Liu et al., 2023, https://arxiv.org/abs/2303.16634) is the LLM-as-judge methodology paper that showed structured-rubric judges with chain-of-thought correlate with human judgment better than n-gram metrics. The specific GPT-4 prompts in the paper are dated, but the methodology — structured rubric, chain-of-thought reasoning before scoring, calibration against human labels — is the foundation of every credible LLM-as-judge implementation in production today. Read it.

How do we know our evals are themselves any good?

Calibrate against humans. Periodically (we recommend quarterly), have humans score a sample of 50-100 cases that your harness has also scored. Measure agreement. If your harness agrees with humans 80%+ of the time on directional changes, it is useful. If agreement is below 70%, the rubric needs work or the judge model is wrong for your domain.
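Directional agreement can be computed as follows. This is a sketch under assumptions: each element is the score change (new version minus old version) on one sampled case, as rated by humans and by the harness, and the "borderline" label for the 70-80% band is our own placeholder since the thresholds above leave that range open.

```python
from typing import Sequence


def direction(delta: float) -> int:
    """+1 for an improvement, -1 for a regression, 0 for no change."""
    return (delta > 0) - (delta < 0)


def directional_agreement(
    human_deltas: Sequence[float],
    judge_deltas: Sequence[float],
) -> float:
    """Fraction of cases where humans and the harness agree on direction."""
    agree = sum(
        direction(h) == direction(j) for h, j in zip(human_deltas, judge_deltas)
    )
    return agree / len(human_deltas)


def calibration_status(agreement: float) -> str:
    if agreement >= 0.80:
        return "useful"
    if agreement < 0.70:
        return "rework rubric or judge"
    return "borderline"  # assumed label for the unspecified 70-80% band
```

With a sample of 50-100 cases, a swing of a few percentage points between quarters is within noise, so look at which cases disagree before concluding the judge has degraded.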

If you are designing the eval harness for a production LLM feature and want a review of the architecture, the metric choices, and the budget plan, we offer that review through our AI enablement practice.