LLM Evaluation in Production: Moving Beyond Vibe Checks to Measurable Quality Gates

What LLM Evaluation Actually Solves

LLM evaluation is the practice of systematically measuring whether an AI feature is doing what you want it to do. Without evaluation infrastructure, teams can only tell whether a change made the system better through anecdote — a few examples that look good or bad in manual review, or user complaints after a regression reaches production.

Anecdote-based quality management creates predictable problems: teams iterate on prompts for weeks without knowing whether they are improving or regressing on the full distribution of inputs. A prompt that improves the 5 examples the team checked may degrade performance on 200 edge cases they did not check.

At 50-200 person companies, the evaluation problem is real, but the solution does not need to be one of the complex evaluation platforms built for large ML organizations. This post describes a four-tier evaluation framework calibrated to mid-market constraints.

The Four-Tier LLM Evaluation Framework

Tier 1: Unit Tests on Deterministic Outputs

Some LLM tasks have outputs that should be deterministic or near-deterministic. A classification system should always classify "Your order has shipped" as an order status message. An extraction system should always extract the contract start date from a document that clearly states it.

Write unit tests for these deterministic cases the same way you write unit tests for application code: a test input, an expected output, and a pass/fail assertion. Run them in CI on every prompt change.

The collection of deterministic test cases is your regression safety net. Start with 20-30 cases covering the most important behaviors and known failure modes. Add a new test case every time a user reports an incorrect output that should have been deterministic.

These tests run fast (1-2 seconds per case at API rates), are fully automated, and catch the class of regressions that are most embarrassing: cases where the model produces a clearly wrong output on an input whose correct answer is unambiguous.
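A minimal sketch of what such a suite can look like, written in pytest style. The `classify_message` function is a hypothetical stand-in for your LLM-backed classifier, and the label names are assumptions made for illustration; in a real suite it would call the model with the prompt under test.

```python
def classify_message(text: str) -> str:
    """Hypothetical stand-in for the LLM-backed classifier under test.

    A real suite would call your model with the current prompt here;
    a keyword rule keeps this sketch runnable offline.
    """
    lowered = text.lower()
    if "shipped" in lowered or "delivered" in lowered:
        return "order_status"
    if "refund" in lowered:
        return "refund_request"
    return "other"


# (input, expected label) pairs. Append a case for every user-reported miss.
DETERMINISTIC_CASES = [
    ("Your order has shipped", "order_status"),
    ("I would like a refund for my last purchase", "refund_request"),
    ("What are your opening hours?", "other"),
]


def run_deterministic_cases(classify, cases):
    """Return (input, expected, actual) for every case that fails."""
    failures = []
    for text, expected in cases:
        actual = classify(text)
        if actual != expected:
            failures.append((text, expected, actual))
    return failures


def test_deterministic_cases():
    # pytest-style test: any non-empty failure list fails the CI run.
    assert run_deterministic_cases(classify_message, DETERMINISTIC_CASES) == []
```

Returning the full list of failures, rather than stopping at the first one, makes it easier to see in a single CI run how widely a prompt change broke things.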

Tier 2: Statistical Evaluation on a Held-Out Test Set

For tasks where outputs are not deterministic — summarization quality, response helpfulness, answer accuracy for open-ended questions — you need statistical evaluation on a representative test set.

Building the test set: Collect 100-200 examples of real inputs from production logs (with PII removed). For each input, define the expected output or a grading rubric. For RAG systems, the test set should include inputs whose answers are in the knowledge base and inputs whose answers are not (to test for hallucination on out-of-scope queries).
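One possible shape for a test-set record, as a sketch. The `EvalCase` fields and the `coverage_summary` helper are hypothetical names, not part of any library; the point is only that each case carries either a reference answer or a rubric, plus a flag for whether the knowledge base actually contains the answer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalCase:
    case_id: str
    input_text: str                   # real production input, PII removed
    reference_answer: Optional[str]   # None when no single answer exists
    rubric: str = ""                  # grading notes for open-ended cases
    in_knowledge_base: bool = True    # False for out-of-scope RAG queries


def coverage_summary(cases):
    """Count in-scope and out-of-scope cases so gaps are visible early."""
    in_scope = sum(1 for case in cases if case.in_knowledge_base)
    return {
        "total": len(cases),
        "in_scope": in_scope,
        "out_of_scope": len(cases) - in_scope,
    }


cases = [
    EvalCase("c1", "When does the contract start?", "1 March 2024"),
    EvalCase("c2", "What is the notice period?", "30 days"),
    EvalCase(
        "c3",
        "What is the CEO's favourite colour?",
        None,
        rubric="Should say the information is not available.",
        in_knowledge_base=False,
    ),
]
print(coverage_summary(cases))
# {'total': 3, 'in_scope': 2, 'out_of_scope': 1}
```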

Metrics to track:
- Answer correctness: Does the output contain the correct information? For factual tasks, this can be automated with string matching or embedding similarity to a reference answer.
- Faithfulness (for RAG): Does the output only include claims supported by the retrieved context? LLM-as-judge evaluation (using a separate LLM call to evaluate the output against the retrieved context) works well for this.
- Task completion: For multi-step tasks, did the model complete the specified task? Binary pass/fail per case, with overall pass rate as the metric.
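As a rough illustration of the string-matching variant of answer correctness and of pass rate as the summary metric, here is a small sketch. The helper names and the sample cases are made up for the example, and plain substring matching is deliberately crude (for instance, "30 days" would also match inside "130 days"), so treat it as a starting point only.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def contains_reference(output: str, reference: str) -> bool:
    """True if the normalized reference appears in the normalized output."""
    return normalize(reference) in normalize(output)


def pass_rate(results) -> float:
    """Fraction of True values; 0.0 for an empty result list."""
    results = list(results)
    return sum(results) / len(results) if results else 0.0


graded = [
    contains_reference("The contract starts on 1 March 2024.", "1 March 2024"),
    contains_reference("Paris is the capital of France.", "paris"),
    contains_reference("I could not find that information.", "42 days"),
    contains_reference("The notice period is 30 days.", "30 days"),
]
print(pass_rate(graded))  # 0.75
```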

Run statistical evaluation before every prompt deployment to production. A 5% regression in answer correctness is a meaningful quality signal that warrants investigation before release.

Tier 3: LLM-as-Judge Evaluation

For quality dimensions that are difficult to measure programmatically — writing quality, tone appropriateness, instruction-following on complex tasks — LLM-as-judge evaluation uses a separate LLM call to score the output.

Implementation: Write a structured evaluation prompt that asks the judge model to score the output on a 1-5 scale for specific dimensions. Include the original input, the system context, and the output to evaluate. Require the judge model to provide a brief justification alongside the score.
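A sketch of the prompt-building and response-parsing halves of this, under stated assumptions: the template wording, the JSON response format, and the function names are illustrative choices, and the actual call to the judge model is left out because it depends on your provider.

```python
import json

# Doubled braces are literal braces; single braces are format fields.
JUDGE_TEMPLATE = """You are grading the output of an AI assistant.

System context given to the assistant:
{system_context}

User input:
{user_input}

Assistant output to evaluate:
{output}

Score the output from 1 (poor) to 5 (excellent) on this dimension: {dimension}.
Judge quality, not length.
Respond with JSON only: {{"score": <integer 1-5>, "justification": "<one or two sentences>"}}
"""


def build_judge_prompt(system_context, user_input, output, dimension):
    return JUDGE_TEMPLATE.format(
        system_context=system_context,
        user_input=user_input,
        output=output,
        dimension=dimension,
    )


def parse_judgment(raw: str) -> dict:
    """Validate the judge's reply; raise ValueError if it is unusable."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")
    score = data.get("score")
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError(f"score must be an integer from 1 to 5, got {score!r}")
    justification = str(data.get("justification", "")).strip()
    if not justification:
        raise ValueError("judge reply is missing a justification")
    return {"score": score, "justification": justification}
```

Rejecting replies without a justification keeps the scores auditable: when a score looks wrong, the reviewer can see what the judge thought it was grading.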

LLM-as-judge evaluation is more expensive than automated metrics (each judgment is an API call) but significantly faster than human evaluation. Use it for qualitative dimensions where automated metrics are insufficient, not as a replacement for unit tests and statistical evaluation.

Important limitation: LLM-as-judge evaluation is susceptible to length bias (longer outputs tend to score higher) and self-preference bias (a model tends to rate its own outputs favorably). Use a judge from a different model family than the model generating the outputs, and include explicit instructions to avoid length bias in the judge prompt.

Tier 4: Human Evaluation on Edge Cases

The first three tiers catch most regressions automatically. Human evaluation is reserved for two scenarios: validating the initial test set (human judgment defines what "correct" means for your task) and investigating systematic failures discovered through automated evaluation.

Build a lightweight human evaluation interface (a spreadsheet or a simple web form) that presents an input and an output and asks a reviewer to rate quality on a 5-point scale. Sample 50-100 outputs per week from production logs for human review. Track the human evaluation scores over time as a leading indicator of quality drift.
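For the spreadsheet route, the weekly sample can be produced with a few lines of standard-library Python. This is a sketch: the record fields (`id`, `input`, `output`) and the column name `rating_1_to_5` are assumptions about how your logs and review sheet are laid out.

```python
import csv
import io
import random


def sample_for_review(records, k=50, seed=None):
    """Pick up to k records uniformly at random; a seed makes it repeatable."""
    records = list(records)
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))


def write_review_sheet(samples, fileobj):
    """Write a CSV with an empty rating column for the reviewer to fill in."""
    fields = ["id", "input", "output", "rating_1_to_5"]
    writer = csv.DictWriter(fileobj, fieldnames=fields)
    writer.writeheader()
    for record in samples:
        writer.writerow({
            "id": record["id"],
            "input": record["input"],
            "output": record["output"],
            "rating_1_to_5": "",
        })


# Stand-in for a week of production logs.
logs = [{"id": i, "input": f"question {i}", "output": f"answer {i}"} for i in range(500)]
sheet = io.StringIO()
write_review_sheet(sample_for_review(logs, k=50, seed=7), sheet)
```

Uniform random sampling keeps the weekly score comparable over time; hand-picking interesting outputs is useful for debugging but would bias the trend line.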

Connecting Evaluation to the Deployment Pipeline

Evaluation has no value if it does not gate deployment. Configure your CI/CD pipeline to run Tier 1 unit tests and Tier 2 statistical evaluation on every prompt change, and block deployment if:
- Any Tier 1 unit test fails
- Tier 2 answer correctness drops below your defined threshold (typically: no regression worse than 3% from the current production baseline)

Tier 3 evaluation runs on a schedule (daily or weekly) rather than blocking deployment — the cost of running LLM-as-judge on your full test set for every PR is too high to be practical.
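The blocking rules for Tier 1 and Tier 2 can be expressed as one small function that a CI step calls. This is a sketch, not a prescribed implementation: `evaluate_gate` and `GateDecision` are hypothetical names, and the 3% threshold is interpreted here as absolute percentage points of answer correctness, which is an assumption you may want to change.

```python
from dataclasses import dataclass, field


@dataclass
class GateDecision:
    allow_deploy: bool
    reasons: list = field(default_factory=list)


def evaluate_gate(tier1_failures: int,
                  candidate_correctness: float,
                  baseline_correctness: float,
                  max_regression: float = 0.03) -> GateDecision:
    """Block deployment on any Tier 1 failure or an excessive Tier 2 drop.

    Correctness values are fractions in [0, 1]; max_regression is treated
    as absolute points (0.03 means three percentage points).
    """
    reasons = []
    if tier1_failures > 0:
        reasons.append(f"{tier1_failures} Tier 1 unit test(s) failed")
    regression = baseline_correctness - candidate_correctness
    # Small tolerance so floating-point noise at the boundary does not block.
    if regression > max_regression + 1e-9:
        reasons.append(
            f"answer correctness fell {regression:.1%} "
            f"(limit {max_regression:.1%})"
        )
    return GateDecision(allow_deploy=not reasons, reasons=reasons)


decision = evaluate_gate(tier1_failures=0,
                         candidate_correctness=0.85,
                         baseline_correctness=0.90)
print(decision.allow_deploy)  # False
```

In a pipeline, the calling script would typically exit with a non-zero status when `allow_deploy` is false and print `reasons` so the author of the prompt change can see why the gate closed.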

Evaluating RAG Systems Specifically

RAG systems have two components that require separate evaluation: retrieval quality and generation quality.

Retrieval evaluation: Does the retrieval step surface relevant context? Measure precision (fraction of retrieved chunks that are relevant) and recall (fraction of relevant chunks that are retrieved) on your test set. Poor retrieval quality is the most common cause of RAG hallucination — the model correctly uses the context it receives, but the context is wrong because retrieval failed.
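Both metrics reduce to set arithmetic once each test case is labeled with the chunk IDs that count as relevant. A minimal sketch, with made-up chunk IDs and a hypothetical helper name:

```python
def retrieval_precision_recall(retrieved_ids, relevant_ids):
    """Return (precision, recall) for one query.

    precision = relevant chunks retrieved / all chunks retrieved
    recall    = relevant chunks retrieved / all relevant chunks
    """
    retrieved = set(retrieved_ids)
    relevant = set(relevant_ids)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


precision, recall = retrieval_precision_recall(
    retrieved_ids=["c1", "c4", "c7", "c9"],
    relevant_ids=["c1", "c2", "c4"],
)
print(round(precision, 2), round(recall, 2))  # 0.5 0.67
```

Averaging these per-query values over the test set gives a single retrieval score to track alongside the generation metrics.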

Tools like RAGAS and TruLens provide structured frameworks for RAG-specific evaluation, including retrieval quality metrics. Both are open source and integrate with standard Python ML tooling.

Generation evaluation: Given correct context, does the model generate a correct and faithful response? This is where LLM-as-judge faithfulness evaluation is most useful.

Evaluate retrieval and generation separately. A system with poor retrieval but good generation will look like it has a generation quality problem when diagnosed through end-to-end evaluation only. Separating the two makes root cause analysis significantly faster.

Frequently Asked Questions

How large does the test set need to be for meaningful statistical evaluation?

100-200 examples is sufficient for most mid-market LLM features to detect regressions of 5% or more with reasonable statistical confidence. Smaller test sets (50 examples) can detect larger regressions (10%+) but may miss subtle quality degradation. Start with 100 examples and expand as you collect more real-world inputs.
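One rough way to sanity-check these sizes for your own feature is the normal-approximation margin of error on a measured pass rate. The sketch below assumes test cases are independent and uses an 80% pass rate purely as an example; the function name is hypothetical.

```python
import math


def margin_of_error(pass_rate: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a pass rate measured on n cases."""
    return z * math.sqrt(pass_rate * (1 - pass_rate) / n)


for n in (50, 100, 200):
    print(n, round(margin_of_error(0.80, n), 3))
# 50 0.111
# 100 0.078
# 200 0.055
```

These are margins on a single measurement, so a difference close to the margin deserves a rerun or a closer look before it is treated as a real regression. Comparing two prompt versions case by case on the same test set can be more sensitive than these single-run margins suggest.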

Is LLM-as-judge evaluation reliable enough to replace human evaluation?

For catching significant quality regressions, yes. LLM-as-judge evaluation is reliable for detecting directional changes in quality: whether a prompt change made things better or worse. It is less reliable for absolute quality scores or fine-grained comparisons between similar quality levels. Use it as a fast signal that triggers human investigation rather than as a definitive quality arbiter.

What is RAGAS and should we use it?

RAGAS is an open-source framework for RAG evaluation that provides structured metrics for context precision, context recall, faithfulness, and answer relevancy. It is a practical starting point for teams building their first RAG evaluation infrastructure. The main limitation: it requires LLM calls for several of its metrics, so evaluation at scale incurs meaningful API costs. Evaluate whether the metrics it provides are the right ones for your use case before adopting it wholesale.

How do we build the initial ground truth for our test set?

For the first test set, use a combination of: engineer-constructed examples covering known important cases and edge cases, real inputs from early users or pilot testers with outputs reviewed by subject matter experts, and examples specifically constructed to test failure modes your team has hypothesized. Do not rely on LLM-generated ground truth, because it introduces the same biases you are trying to evaluate against.

If you are building evaluation infrastructure for a production LLM system and want a review of your approach, we offer a free AI systems review for mid-market engineering teams.

Mohak Deep Singh
