An AI Observability Stack for Production: Tracing, Evals, and Drift Detection for Mid-Market Teams

An AI observability stack for production is the combination of request-level tracing, ongoing evaluation pipelines, and drift detection that lets an engineering team answer three questions at any time: did the application work for this specific request, is overall quality holding steady or regressing, and are the inputs and outputs the application sees today still the inputs and outputs it was tested against. The three layers map to three operational concerns — debugging, regression detection, and behavioral change — and to three incident severities. Mid-market teams running LLM features rarely need all three layers from day one, but they need the framework before they need the third layer.

The three-month problem

The pattern is consistent enough that we now warn clients about it during scoping: a team ships an LLM-backed feature, monitors it for the first few weeks with a mix of error dashboards and manual spot checks, and then — somewhere between month two and month four — realizes they have no idea whether quality is regressing. The application is up. Latency looks fine. Token costs are tracked. But the quality signal is anecdotal. A support engineer mentions that responses "feel different" this week. A product manager notices a few outputs that would not have shipped in week one. There is no system to confirm or refute the suspicion.

This is the observability gap. Standard application observability (uptime, latency, error rates) tells you whether the application is running. It does not tell you whether the application is producing useful outputs. For LLM features, those are different questions, and the second one is harder.

This post describes the three-layer stack we deploy across mid-market AI engagements. It is the operational complement to our LLM evaluation framework and the production-readiness piece of taking LLM applications from POC to production.

Why mid-market teams need a different shape of stack

The observability content from the LLM tooling vendors is written for two audiences: large ML organizations with dedicated platform teams, and AI-native startups whose entire product is the LLM. Mid-market teams sit in between and have a different constraint set.

We observe three properties that shape the stack:

The AI team is small. A mid-market AI feature is often owned by a team of two to four engineers, sometimes inside a broader platform group. Stack complexity costs them disproportionately because there is no platform team to absorb it.
The LLM feature is one of many. Unlike an AI-native startup where the LLM is the product, a mid-market LLM feature sits next to twenty other application features. Observability for the LLM piece has to integrate with the existing observability stack rather than replace it.
The budget is real but not unlimited. A $40,000-per-year managed observability spend is typically defensible if the AI feature drives meaningful revenue or cost reduction. A $200,000 spend is harder to justify at a 200-person company on a single LLM feature — though revenue-critical LLM features (a core search or recommendations product at a 200-person fin-services firm, for example) can defensibly exceed that ceiling. Treat the $40k-$200k band as a typical-case heuristic, not a hard limit.

These constraints push toward a stack that starts open-source-first, uses managed services only where the operational savings are clear, and scales by adding layers rather than replacing them.

The three observability layers

Layer 1: Request-level tracing (debugging)

Tracing is the layer that lets you answer "what happened on this specific request." For a single-call LLM feature, tracing captures the prompt, the model, the parameters, the response, the latency, and the token counts. For a multi-step or agentic system, tracing captures all of the above for every step, linked by a trace ID, so the full execution path is reconstructable.

Tracing is the foundation of the stack. Without it, debugging LLM behavior is guesswork; with it, the bug-to-fix loop is comparable to standard application debugging. It is also the first layer most teams already have in some form, even if only as application logs that include LLM inputs and outputs.

What good tracing captures:

The full prompt as it was sent (system prompt, user message, retrieved context, conversation history)
The model identifier and version, including provider routing if applicable
Generation parameters (temperature, max_tokens, top_p, structured output schema if used)
The full response, including any tool calls and tool results
Latency broken down by phase (retrieval, generation, post-processing)
Token counts (input, output, cached if the provider supports prompt caching)
A trace ID and parent-span ID that link multi-step workflows
The application context (user ID with appropriate anonymization, feature name, request ID from upstream)

Tooling landscape, vendor-neutral view:

Langfuse is the open-source default for mid-market teams building on Python or TypeScript. The core repository is MIT-licensed, with a separately licensed Enterprise Edition that covers code under the ee/, web/src/ee/, and worker/src/ee/ directories (SSO enforcement, certain RBAC features, and data retention controls live there); the MIT core is sufficient for most mid-market deployments, but the EE boundary is worth understanding before you self-host (source: github.com/langfuse/langfuse). Self-hostable on a single Postgres instance for early-stage use, with a managed offering when the operational overhead of self-hosting exceeds the cost difference. Trace visualization, prompt management, and evaluation integration in one product. The Langfuse v3 architecture (released December 2024) added ClickHouse for high-volume trace storage, which matters once you exceed ~50,000 traces per day; by mid-2026, ClickHouse-backed Langfuse v3 has been GA for over a year and is the default for self-host installs.

Arize Phoenix is the open-source-adjacent observability product from Arize. The repository is published under the Elastic License 2.0, which is source-available rather than OSI-approved open source — the license permits internal use and modification but restricts offering Phoenix as a hosted competing service. For mid-market teams self-hosting Phoenix for their own application, the license is not a practical constraint, but procurement and legal review processes that gate on OSI-approved licenses should know this in advance. Phoenix is strong on the evaluation side, integrates cleanly with notebooks for iterative development, and pairs with the Arize commercial platform when you need enterprise-grade drift monitoring at scale. We use Phoenix on engagements where the team is doing significant offline evaluation work and wants the same tool to cover production tracing.

LangSmith is the observability product from LangChain. Primarily managed, with a self-hosted Enterprise add-on that deploys on Kubernetes against AWS, GCP, or Azure for organizations with data residency or air-gap requirements (source: docs.langchain.com/langsmith/architectural-overview). Tightest integration with LangChain and LangGraph; less flexible for teams that built their orchestration directly on provider SDKs. Reasonable choice if you are already committed to LangChain and want one vendor for the orchestration framework and the observability layer.

OpenLLMetry from Traceloop is an instrumentation library that emits OpenTelemetry-compatible traces from LLM SDK calls. It is not a backend; it is the layer that sits between your application and any OpenTelemetry-compatible backend you already operate. We recommend it for teams that already have an OpenTelemetry pipeline (Tempo, Honeycomb, Datadog, New Relic, Grafana Cloud) and do not want to operate a parallel LLM-specific backend. The tradeoff is real: generic APM backends will not give you the LLM-specific UX that Langfuse or Phoenix ship out of the box (side-by-side prompt diffs, trace trees rendered for tool-calling agents, evaluation result overlays), so the OpenTelemetry path saves operational overhead at the cost of debugging ergonomics.

Helicone and the Mintlify acquisition. Helicone is a widely adopted open-source LLM observability product and is mentioned in much of the existing comparison content. Helicone was acquired by Mintlify in March 2026 (source: helicone.ai/blog). At time of writing, the open-source repository remains available and the proxy continues to function, but the strategic roadmap is held by Mintlify and the direction has shifted toward Mintlify's documentation-first product surface. For teams selecting a tool today, we recommend treating Helicone as a higher-risk choice for new production deployments until the post-acquisition roadmap is clearer. Existing Helicone deployments do not require immediate migration, but a migration evaluation is reasonable at the next planned tooling review.

The OpenTelemetry GenAI semantic conventions. The OpenTelemetry community has been standardizing semantic conventions for GenAI telemetry under the gen_ai.* attribute namespace — attributes like gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.response.id, and span events for each model interaction. These conventions are still in Development (experimental) status as of mid-2026; the OpenTelemetry specification explicitly notes that the GenAI conventions have not yet transitioned to Stable, and a stability transition is planned but not yet shipped (source: opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/). The practical implication for mid-market teams: instrumentation against gen_ai.* attributes is portable across backends today, but the attribute names and shapes may change before the conventions stabilize. If you instrument with OpenLLMetry or with a backend SDK that emits gen_ai.* attributes, plan for a re-instrumentation window when the stability transition lands. We recommend this path for teams that already run OpenTelemetry for application observability and can absorb a future migration; for teams that do not, the simpler Langfuse or Phoenix SDKs are a reasonable starting point.

Stack scope by team size.

For a team of 2 ML/AI engineers: pick one tool, deploy it on day one of production launch, do not overthink it. Langfuse self-hosted or Phoenix are both defensible defaults. The decision matters less than the discipline of using it.

For a team of 6+ AI/platform engineers: assume you will need OpenTelemetry conventions for portability, a dedicated evaluation pipeline that runs against the same data your tracing layer captures, and a drift detection layer that operates on tracing data. Pick the tracing tool with the evaluation and drift integrations you will need at month nine, not the one that is fastest to install at month one.

Layer 2: Ongoing evaluation pipelines (quality regression detection)

Tracing answers "what happened on this request." Evaluation answers "is overall quality holding steady or regressing." These are different questions and require different infrastructure.

The evaluation layer is a pipeline that runs your evaluation test set against the current production prompt and model configuration on a defined cadence, captures the results, and alerts when quality metrics drop below a defined threshold. It is the safety net that catches quality regressions before users do.

What the evaluation pipeline does:

Runs your held-out evaluation test set (typically 100-200 examples) against the current production configuration on a schedule — daily or weekly depending on change velocity
Records the per-case results and the aggregate quality scores (correctness, faithfulness, task completion, whatever metrics fit the feature)
Compares the aggregate scores against the rolling baseline (previous 7 or 30 days)
Alerts on regressions beyond a defined threshold (we typically use 3-5% as the action threshold)
Stores the per-case results so a regression can be diagnosed — which cases failed today that passed last week

The evaluation pipeline is also the gate for prompt and model changes. On a pull request that modifies a prompt or changes the active model, the evaluation pipeline runs against the proposed configuration and blocks the merge if quality regresses. The full mechanics of this evaluation work — test set construction, metric selection, LLM-as-judge versus deterministic metrics — are covered in our LLM evaluation framework piece; this post focuses on the operational integration.

Two evaluation cadences to keep distinct:

Pre-deployment evaluation. Runs on every change to prompts, models, or pipeline configuration. Blocks deployment on regression. Fast (2-10 minutes); subset of the full evaluation suite is acceptable.
Production evaluation. Runs on a schedule against the production-deployed configuration. Detects regressions caused by factors outside your control — model provider changes, vendor model updates, gradual changes in input distribution. Slower (10-60 minutes); runs the full evaluation suite.

The production evaluation cadence is the one most teams skip and the one that catches the failure mode the headline of this post describes. A model provider can change behavior under the same model identifier — OpenAI's gpt-4-turbo alias, for example, has historically pointed to different underlying snapshots (the alias rolled from gpt-4-0125-preview to gpt-4-turbo-2024-04-09 and subsequent updates), and similar aliasing behavior applies to most provider "latest" or family names. A prompt change can have effects that the pre-deployment evaluation suite missed; an input shift can degrade quality without changing the model or prompt. Production evaluation against a stable test set catches all three.

Implementation note. The evaluation pipeline does not need to be a new piece of infrastructure. A scheduled job in your existing CI system (GitHub Actions, GitLab CI, Jenkins, Buildkite) that runs your evaluation script and posts results to your tracing tool is sufficient for most mid-market deployments. Langfuse and Phoenix both have evaluation integration that captures evaluation runs alongside tracing data, which makes the per-case diagnosis significantly easier when a regression alert fires.

Layer 3: Drift detection on inputs and outputs (behavioral change)

The third layer is the one most mid-market teams add last and only after they have been bitten. Drift detection answers "are the inputs and outputs the application sees today the same as the inputs and outputs it was tested against."

Three drift types matter in production LLM systems:

Input drift. The distribution of user queries changes over time. A customer support feature that handled refund queries 60% of the time at launch is handling refund queries 30% of the time and shipping-delay queries 50% of the time six months later. The evaluation test set built at launch no longer represents the current input distribution. The application can pass the evaluation suite while degrading on the queries users actually send.

Output drift. The distribution of model outputs changes. Average output length increases or decreases significantly. The fraction of outputs that include certain phrases or refuse to respond changes. Detected via statistical monitoring of output characteristics over time; useful as a leading indicator of underlying model or prompt changes.

Semantic drift. The meaning of outputs changes in ways that statistical monitoring does not catch. A summarization feature begins emphasizing different aspects of source documents. A classification feature begins drawing the boundary between two categories slightly differently. Detected via embedding-based clustering of outputs or periodic human review of sampled outputs against historical samples.

What drift detection looks like in practice:

A pipeline that samples 1-5% of production requests and computes input embeddings, output embeddings, and basic statistical features (length, language, structured output validity)
A baseline distribution captured during a defined reference period (the first 30 days of stable production, typically)
Statistical tests (KS test, population stability index, or simpler quantile comparisons) that compare current distributions against baseline
Embedding-based clustering that surfaces clusters of inputs or outputs that did not appear in the baseline
Alerts when divergence exceeds a defined threshold

Arize, Phoenix, and Fiddler are the established tools for this; Langfuse has been expanding drift monitoring capabilities through 2025 and 2026 and now covers basic input and output distribution checks for teams that want a single tool for tracing, evaluation, and drift. The build-versus-buy decision depends on whether you have a data platform team that can operate the pipeline; if not, the managed options are cheaper than the engineering time to build it.

Honest tradeoff: drift detection is the layer with the lowest signal-to-noise ratio. Drift alerts fire frequently for benign reasons (a marketing campaign drives new query patterns, a seasonal product cycle changes input distribution). Tuning the thresholds to avoid alert fatigue is ongoing work. Teams that add drift detection too early often turn it off within a quarter; teams that add it after they have been burned by a silent quality degradation tend to keep it.

Mapping the layers to incident response

The three layers map to three incident severities, and the mapping clarifies what each layer is for.

P1: Application down. The LLM application is returning errors, hanging, or has an outage at the model provider level. Detection: standard application monitoring (error rates, latency, provider status). Response: incident process, provider failover if you have multi-provider abstraction, rollback if the trigger is a recent deployment. Tracing layer is the diagnostic tool: filter traces by error to find the failure pattern.

P2: Quality regression. The application is up and returning responses, but the responses are measurably worse than they were last week. Detection: the production evaluation pipeline alerts when the quality metric drops below the rolling baseline by more than the threshold. Response: identify the change — recent deployment, model provider update, prompt change — and roll back the change. The evaluation pipeline is the diagnostic tool: per-case results show which categories of input regressed.

P3: Data drift. The application is up, the evaluation suite is passing, but the input distribution or output distribution has shifted such that the evaluation suite is no longer representative. Detection: drift monitoring alerts on distribution changes. Response: refresh the evaluation test set to reflect the current input distribution; verify quality on the new distribution; update the rolling baseline. This is rarely a same-day incident; it is a slower process that runs over days or weeks.

The mapping matters because it tells you which layer needs to exist before you launch and which can wait. Layer 1 (tracing) is non-negotiable — without it, P1 incidents are intractable. Layer 2 (evaluation pipeline) should exist before the application is in front of users at any meaningful scale — without it, P2 regressions are invisible. Layer 3 (drift detection) is a six-month-plus addition for most mid-market deployments — the operational maturity to act on drift signals takes time to build.

Where the framework breaks

Three scenarios where the three-layer framework needs adjustment:

Conversational features with long sessions. A multi-turn conversation feature accumulates context across many turns. Per-request tracing captures each turn but loses the cross-turn picture. Add session-level aggregation to the tracing layer; treat the session, not the individual request, as the unit of analysis for evaluation and drift detection.

Agentic systems with high step counts. A multi-step agent that calls 10-20 tools per request generates an order of magnitude more trace volume than a single-call LLM feature. The cost of full trace retention can become significant. Consider sampling at the request level (capture full traces for 10-20% of production requests) rather than discarding intermediate steps; you need the full execution path to diagnose any individual failure, but you do not need to retain it for every request.

High-stakes decision features. Features where a single bad output has significant consequences (medical, legal, financial decisions) need a fourth layer: human review of a sampled fraction of outputs, with the sample weighted toward edge cases identified by drift detection. The evaluation pipeline catches the aggregate quality signal; human review catches the high-impact individual failures that aggregate metrics do not surface.

Implementation sequencing

For a mid-market team launching an LLM feature into production, the sequencing we recommend:

Pre-launch (week zero). Tracing layer is instrumented. The application emits trace data for every LLM call to the chosen backend. Tracing dashboards are reviewed and the on-call engineer knows how to find traces by user ID, by error, and by latency outlier.

Pre-launch (week zero). Evaluation test set exists. 50-100 representative cases with expected outputs or grading rubrics. Pre-deployment evaluation runs in CI on every prompt or model change.

Month one post-launch. Production evaluation cadence is established. The evaluation pipeline runs daily (or weekly for low-change-velocity features) against the production configuration. Alerts route to the on-call engineer. The rolling baseline is captured.

Month three post-launch. Trace volume and storage are reviewed. Sampling or retention policies are tuned to the volume. The evaluation test set is reviewed and expanded with cases from production logs that surfaced edge cases.

Month six to nine. Drift detection is added if the feature volume justifies it. Input embedding drift is the first signal to add; output drift and semantic drift can follow once input drift is operational and tuned.

This sequence keeps the operational complexity proportional to the maturity of the deployment. Adding all three layers at week zero is a common pattern in teams that over-tool, and it slows the launch without catching anything the pre-launch evaluation suite would not have caught.

Frequently asked questions

What is the minimum viable observability stack for an LLM feature launching to production?

Tracing (Langfuse self-hosted or Phoenix) plus an evaluation test set of 50-100 cases running pre-deployment in CI. That is the minimum that lets you debug individual failures and catch quality regressions on prompt changes. The production evaluation cadence and drift detection can follow in months one and six respectively.

Should we use OpenTelemetry-based instrumentation or vendor SDKs for LLM tracing?

If you already run OpenTelemetry for application observability, instrument LLM calls against the gen_ai.* semantic conventions and route to your existing OpenTelemetry backend — but plan for a re-instrumentation window when the conventions transition out of their current Development status. If you do not, the vendor SDKs (Langfuse, Phoenix, LangSmith) are faster to set up and provide LLM-specific visualizations out of the box. The portability argument for OpenTelemetry matters more as the deployment grows; for the first 12 months of a small mid-market deployment, the vendor SDK path is usually the right tradeoff.

Helicone was the tool we were planning to deploy. What changed after the Mintlify acquisition?

Helicone was acquired by Mintlify in March 2026 (source). The open-source proxy and the existing managed product continue to function, but the strategic direction has shifted toward integration with Mintlify's documentation product and the roadmap is held by Mintlify. For new deployments today, we recommend Langfuse or Phoenix as lower-risk choices. Existing Helicone deployments do not require immediate migration; build the migration into your next tooling review cycle so you have time to evaluate options.

How does drift detection differ from evaluation?

Evaluation measures whether the model produces correct outputs on a fixed test set. Drift detection measures whether the production input distribution and output distribution have changed in ways that may make the test set unrepresentative. Evaluation tells you whether quality is regressing on what you tested; drift detection tells you whether what you tested is still what you should be testing.

Can we use the same tool for tracing and evaluation?

Yes, and we recommend it. Langfuse, Phoenix, and LangSmith all support tracing and evaluation in the same product, which makes per-case diagnosis of evaluation regressions significantly faster. The trace for a failed evaluation case is one click from the evaluation result rather than a manual correlation across two tools.

At what point does a mid-market team need a dedicated AI platform engineer for observability?

When the LLM feature footprint reaches three or more production features, the operational load of maintaining tracing, evaluation, and drift detection across them justifies dedicated ownership. Below that, the AI feature team can own observability as part of their feature work. Above that, observability fragments without an owner.

If you are building the observability stack for a production LLM application and want a review of the design before you commit to tooling, we offer an AI enablement systems review for mid-market engineering teams. The review covers the three-layer stack, the integration with your existing application observability, and the staffing model for ongoing operation. Our governance counterpart — the LLM governance framework for mid-market companies — pairs with the observability stack to cover both behavior measurement and behavior accountability.

An AI Observability Stack for Production: Tracing, Evals, and Drift Detection for Mid-Market Teams

The three-month problem

Why mid-market teams need a different shape of stack

The three observability layers

Layer 1: Request-level tracing (debugging)

Layer 2: Ongoing evaluation pipelines (quality regression detection)

Layer 3: Drift detection on inputs and outputs (behavioral change)

Mapping the layers to incident response

Where the framework breaks

Implementation sequencing

Frequently asked questions

Mohak Deep Singh

Stay Updated

Related Articles

Multi-Region Deployment Strategies for Low-Latency Indian Applications

Ultimate Cloud FinOps Savings Guide for 2026

Ready to Transform Your Cloud Infrastructure?