The AI pilot graveyard: why most mid-market POCs never reach production -- and what stops the pattern

The AI pilot failure rate at mid-market is not a strategy problem. It is a project-selection problem with predictable failure modes. The most-cited recent data point comes from MIT's NANDA initiative, whose mid-2025 "State of AI in Business" report described a "GenAI Divide" in which roughly 95% of enterprise GenAI pilots show no measurable P&L impact, which is a pilot-to-revenue gap and not necessarily outright technical failure (NANDA report summarized in Fortune: https://fortune.com/2025/08/18/mit-report-95-percent-generative-ai-pilots-at-companies-failing-cfo/). The distinction matters: NANDA is counting pilots that fail to produce rapid revenue acceleration, not pilots whose models broke. Industry estimates we have seen for the broader share of pilots that never make it to durable production range from 70% to 95%, depending on the analyst, the cohort, and the success definition.

The operator-level observation is the same regardless of which number you trust: a small share of POCs ship, and the rest sit in internal wiki pages and demo URLs nobody visits. We have seen seven specific patterns kill pilots before they reach production at 50-500 person companies. Each has an early diagnostic. This piece names them and ends with the project-selection rubric we use to bias toward pilots that ship.

What "pilot failure" actually means at mid-market

The vendor pitch decks make AI pilot failure sound dramatic — bad models, hallucinations, unsafe outputs. Almost none of the failures we see look like that.

The pilots that die in mid-market AI deployments mostly die quietly. The demo works. A few internal users try it. Sponsorship shifts to the next priority. The engineer who built it gets pulled to a customer escalation. Six months later, someone asks "whatever happened to that thing?" and nobody has a good answer. The infrastructure runs idle. The cost shows up on the AWS bill as a line item nobody owns.

This is not a model problem. It is the gap between "we built something interesting" and "we built something that crossed the production-readiness threshold the company actually enforces." We covered the technical version of this gap in our piece on LLM applications from POC to production. This piece is the organizational layer underneath it — the seven reasons mid-market POCs never reach a place where the technical work would even matter.

The seven failure modes

We see these across engagements at 50-500 person companies. They are not exhaustive and they overlap. But if you can rule out all seven, the pilot's odds of shipping improve materially.

Failure mode 1: The pilot has no production owner

The most common failure. An innovation team, a curious engineer, or a product manager builds a pilot. It works. There is genuine excitement. Then comes the question: who runs this in production?

Nobody had to answer that question to start the pilot. It now has to be answered to ship it. And the honest answer is usually that no team has capacity, no team has the right skill mix, and the team most naturally suited (the platform engineering team, usually) was not consulted before the build began.

The pilot does not die from a decision to kill it. It dies from a decision nobody made.

Diagnostic that catches it early: Before the pilot starts, name the production owner in writing and confirm they have capacity in the relevant quarter — not in principle, in their planning artifact. If the answer is "we will figure that out later," the pilot is already in trouble. If the answer is "the team that built it," ask whether that team's roadmap has room for ongoing operation. The diagnostic question we ask in the kickoff: "If this pilot succeeds, what does the on-call rotation for it look like in month four?" Teams that cannot answer should not start.

Failure mode 2: The pilot was picked for technical interestingness, not business value

This one is uncomfortable to name because most of the people building AI pilots are smart, curious engineers who got into this work because the technology is interesting. The pull toward the technically interesting POC is real.

The pattern: someone reads about agents, multi-agent orchestration, fine-tuning, or a new RAG technique. They want to try it. A use case gets reverse-engineered onto the technique. The pilot demonstrates the technique successfully. The business impact is murky, because business impact was not the starting point.

We see this most often in engineering-led organizations where the AI initiative was sponsored by a CTO or VP Engineering. The pilots that get picked optimize for what engineering finds interesting, not what the business needs.

Diagnostic that catches it early: Ask the sponsor to articulate the business outcome in one sentence, without mentioning the AI technology. "We will reduce ticket-handle-time by 15% in tier-1 support" is a business outcome. "We will demonstrate agent-based orchestration" is a technical outcome dressed as one. Pilots in the second category should be re-scoped or moved to an R&D budget where the success criteria are honest about what they are.

Failure mode 3: The pilot hits production-readiness walls it was never designed to clear

This is the largest category by volume. The pilot works in development. Then someone asks the questions a production deployment must answer:

How does authentication work for users of this thing?
Where does the data the AI accesses come from, and what are the access controls?
What is the observability story when an answer is wrong?
What happens when the model provider has an outage?
How do we evaluate that the AI is doing what it should over time?
Who reviews the prompts before they change, and how?

The pilot was never designed to clear these walls because the team building it did not know they existed, or knew but deferred them. Now the walls block the path to production, and re-designing the pilot to clear them is a six-month engineering project that nobody has budget for.

We saw this at a Series C logistics-SaaS company whose document-extraction pilot worked beautifully on a developer laptop and then stalled for four months because nobody had scoped the SSO integration, the audit log, or the per-tenant data isolation that their enterprise contract required.

Diagnostic that catches it early: Run a production-readiness checklist against the pilot scope before week one of building. The checklist does not have to be perfect; ours is roughly the production architecture patterns from the POC-to-production piece plus three governance items. Pilots whose authors look at the checklist and say "we will handle that later" are pilots that will hit the wall.

Failure mode 4: The pilot has no evaluation framework

Most pilots are evaluated by demo. Someone shows the application to a stakeholder, the stakeholder asks two or three questions, the answers are reasonable, the demo gets a thumbs-up.

This evaluation method is sufficient to start the pilot. It is not sufficient to ship it. Production deployment requires answers to questions the demo cannot address: what is the accuracy on a representative sample of real inputs? How does it perform on the edge cases? When the prompt changes, is quality better or worse than before? When the model provider releases a new version, will we know if quality degrades?

Pilots without an evaluation framework have no defense against the question "how do you know it works?" The honest answer ("we asked it some things and the answers seemed okay") is correct but not deployable.

We saw this at a 180-person fintech whose customer-support copilot reached internal launch with zero offline evaluation; the first week of real tickets surfaced an accuracy regression nobody had a way to quantify or correlate to a prompt change, and the rollout paused for six weeks while the team built the eval harness that should have existed at week three.

Diagnostic that catches it early: Require a golden dataset and an evaluation script before week three of the pilot. Not a perfect one — 30 to 50 representative input-output pairs is enough to start. If the team cannot produce that, they do not understand the use case well enough to ship. The conversation about evaluation surfaces problems with the use case definition before they become problems with the pilot.
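
A minimal sketch of what that week-three evaluation script could look like, in Python. Everything named here is hypothetical: GOLDEN_SET, run_pilot (a stand-in for whatever actually calls your model), and the exact-match scoring rule, which most real use cases would replace with something more forgiving.

```python
from typing import Callable

# Hypothetical golden dataset: (input, expected output) pairs.
# A real one would hold 30 to 50 pairs drawn from representative inputs.
GOLDEN_SET = [
    ("Where is my invoice?", "billing"),
    ("I cannot log in", "account_access"),
    ("Cancel my subscription", "cancellation"),
]


def normalize(text: str) -> str:
    """Lowercase and trim so trivial formatting differences do not count as errors."""
    return text.strip().lower()


def evaluate(pilot: Callable[[str], str], golden_set: list[tuple[str, str]]) -> dict:
    """Run the pilot on every golden input and report accuracy plus the failures."""
    failures = []
    for prompt, expected in golden_set:
        actual = pilot(prompt)
        if normalize(actual) != normalize(expected):
            failures.append({"input": prompt, "expected": expected, "actual": actual})
    total = len(golden_set)
    passed = total - len(failures)
    return {
        "total": total,
        "passed": passed,
        "accuracy": passed / total if total else 0.0,
        "failures": failures,
    }


def run_pilot(prompt: str) -> str:
    """Placeholder for the real model call: a keyword rule so the sketch runs offline."""
    lowered = prompt.lower()
    if "invoice" in lowered:
        return "Billing"
    if "log in" in lowered:
        return "account_access"
    return "other"


report = evaluate(run_pilot, GOLDEN_SET)
print(f"{report['passed']}/{report['total']} passed")
for failure in report["failures"]:
    print("FAILED:", failure)
```

Re-running the same script after every prompt edit or model-version change gives a before-and-after number, which is the thing a demo cannot provide.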

Failure mode 5: The change-management work was never scoped

The pilot is technical. The change management is not. And the team that built the pilot is typically not the team that has to convince users to adopt it.

We have seen this pattern repeatedly: the AI works, the production deployment is technically complete, the rollout happens — and three weeks later, usage data shows that 4% of the intended users are engaging with it. The rest continue working the way they did before, because the new tool requires a change in habit that nobody helped them make.

Sales teams, support agents, ops staff — these groups do not adopt new tools because the tools exist. They adopt them because someone in their management chain made adoption part of the job, because the workflow visibly improved their day, or because peers showed them it was worth using. None of that is technical work. All of it has to happen for the pilot to deliver value.

Diagnostic that catches it early: In the pilot kickoff, ask the sponsor: "Who in the user community owns adoption?" If the answer is "the product manager" or "the engineering lead," the pilot has no real adoption owner. The right answer names someone in the user organization — a sales operations lead, a support team lead, a customer success manager — who has agreed to drive adoption in their team and who has a specific usage or outcome target written into their own goals. No such person, no adoption.

Failure mode 6: The pilot's data dependencies were never honestly assessed

The pilot works in development because the development data was hand-curated. Maybe 200 documents. Maybe one customer's data. Maybe a snapshot from January that the team cleaned up.

Production deployment touches the real data. The real data is messier, has access controls the development data did not, includes PII the development data did not, and updates continuously in ways the development data did not. The retrieval that worked beautifully on 200 documents returns noise on 200,000. The prompt that produced clean outputs on tidy data produces garbage on the actual messy ticketing system.

The pilot was building toward a data world that does not exist in production. The cost to bridge that gap was never estimated because the team did not realize the gap was there.

We saw this at a Series C health-tech whose clinical-summarization pilot was trained on a sanitized 500-document sample; the first contact with the live records lake exposed nested PHI fields, broken access scopes, and tenfold noisier inputs, and the production retrieval quality collapsed until the data engineering work was redone.

Diagnostic that catches it early: Within the first two weeks, point the pilot at a sample of real production data — access-controlled, messy, current. Whatever survives that contact is the actual baseline. Whatever does not survive surfaces the real engineering work that has to happen before production. The pilots that defer this contact are pilots that hit it as a surprise in month four.

Failure mode 7: The pilot's success criteria were never agreed in writing

This is the failure mode that compounds all the others. The pilot starts with vague enthusiasm ("we want to use AI for support"). The kickoff meeting produces broad alignment but no documented success criteria. The team builds. Three months later, the sponsor wants to know whether the pilot succeeded.

The sponsor's criteria turn out to be different from the team's. The sponsor expected a 30% reduction in handle time; the team understood the goal as "demonstrate that AI can help with support." Both are reasonable interpretations of what was said. Only one is achievable in a 12-week pilot. The conversation about whether the pilot succeeded becomes a conversation about what was agreed, and the absence of writing means the conversation has no resolution.

We saw this at a 220-person B2B SaaS where the CRO sponsor expected a measurable lift in lead-qualification throughput, the engineering team had been building toward "an internal demo of agentic outreach," and the post-pilot review burned a full quarter of trust on a misalignment that a one-page signed brief would have prevented.

Even pilots that survive this conversation, with both sides reluctantly satisfied, do not get promoted to a Stage 2. The sponsor remains uncertain whether to invest more. The team feels misunderstood. The next pilot is harder to fund.

Diagnostic that catches it early: Before any code is written, produce a one-page document with: the business outcome, the measurable success criterion, the timeline, the production owner, and the named adoption owner. Get the sponsor to sign it. If the sponsor will not sign it, the pilot does not have real sponsorship and should be re-scoped or deferred.
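
For teams that track pilots in a repository or an intake form, the one-page document can be sketched as a small record with a completeness check. This is an illustration under our own assumptions: the PilotBrief class, its field names, and the blocking_gaps helper are hypothetical, and the example values are invented.

```python
from dataclasses import dataclass

# The five items the one-page document must carry.
REQUIRED_FIELDS = (
    "business_outcome",
    "success_criterion",
    "timeline",
    "production_owner",
    "adoption_owner",
)


@dataclass
class PilotBrief:
    """Hypothetical record for the one-page pilot brief plus the sponsor sign-off."""
    business_outcome: str = ""
    success_criterion: str = ""
    timeline: str = ""
    production_owner: str = ""
    adoption_owner: str = ""
    signed_by_sponsor: bool = False


def blocking_gaps(brief: PilotBrief) -> list[str]:
    """Return the reasons the brief is not ready; an empty list means it is."""
    gaps = [
        f"missing {name}"
        for name in REQUIRED_FIELDS
        if not getattr(brief, name).strip()
    ]
    if not brief.signed_by_sponsor:
        gaps.append("sponsor has not signed")
    return gaps


# Invented example: everything filled in except the adoption owner and the signature.
draft = PilotBrief(
    business_outcome="Reduce tier-1 ticket handle time",
    success_criterion="15% reduction against the current baseline",
    timeline="12 weeks",
    production_owner="Platform engineering",
)
print(blocking_gaps(draft))
```

The check is deliberately dumb. It cannot tell whether the success criterion is a good one, only whether somebody wrote one down and the sponsor signed.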

The pattern under the patterns

Looking at the seven failure modes together, a thread runs through them. The failures are not technical. The failures are organizational decisions deferred — ownership, scope, success criteria, change management, data access, evaluation. Each deferral feels reasonable in the moment ("we will figure that out once we know it works") and each deferral is the seed of a pilot that does not ship.

The successful pilots we see at mid-market are not the ones with the best models or the most sophisticated architecture. They are the ones where the organizational decisions were made up front, in writing, by people with the authority to commit. The technical work is comparatively easy. The organizational work is what separates the POCs that ship from the ones that quietly stop.

This is consistent with the broader pattern we cover in our AI enablement roadmap for mid-market companies — the framework's pilot stage is structured the way it is precisely because of these failure modes. The five-stage roadmap is, in part, a defense mechanism against the patterns above.

The project-selection rubric

This is the rubric we use to evaluate candidate AI pilots before greenlight. Each of the seven criteria is scored 0-3, for a maximum of 21. Pilots scoring below 14 of 21 are restructured or deferred.

Criterion	0 (high risk)	1	2	3 (low risk)
Production ownership (capacity-signed)	No team identified	Team identified, capacity not assessed	Team identified, capacity discussed informally	Team identified and their planning artifact confirms capacity for the relevant quarter
Business outcome clarity	Can only be stated with AI terminology	Stated in business terms but unmeasured	Measurable but no baseline	Measurable, baseline established, target agreed
Production-readiness scope	No checklist consulted	Checklist consulted, gaps unaddressed	Gaps named, mitigations planned	Mitigations resourced and scheduled
Evaluation approach	None defined	Manual review only	Golden dataset planned	Golden dataset + automated evaluation in build plan
Change management	No adoption owner	Adoption owner in product/eng	Adoption owner in user org, informal	Adoption owner in user org, with a documented usage or outcome target in their own goals
Data realism	Pilot built on synthetic or curated data only	Real data sample identified, not yet tested	Real data sample tested, gaps known	Real data integration complete, access controls validated
Written success criteria (criteria-signed)	None	Verbal alignment, no document	Document drafted, unsigned	Document signed by sponsor and team, with the specific business metric and threshold named

Note the deliberate split between row 1 and row 7. Row 1 is about whether a team has signed up to the capacity required to run the thing — it answers "who carries the pager." Row 7 is about whether the sponsor has signed up to the criteria by which success will be judged — it answers "what does shipped mean." Pilots often score well on one and badly on the other; both have to clear.

The criteria are not equally weighted in practice. The two we have learned to weight most heavily: production ownership and written success criteria. A pilot scoring 0 on either of those almost never ships, regardless of how it scores elsewhere. We will sometimes start a pilot scoring 14 of 21 if those two specific criteria are at 3; we will rarely start one scoring 18 if either of them is at 0.
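
The scoring arithmetic can be sketched in a few lines of Python. The criterion identifiers are our own shorthand for the seven rubric rows, and the sketch makes one simplifying assumption: it treats a 0 on either heavily weighted criterion as a hard stop, where in practice that is a strong default rather than an absolute rule.

```python
# Shorthand identifiers for the seven rubric rows, in table order.
CRITERIA = (
    "production_ownership",
    "business_outcome_clarity",
    "production_readiness_scope",
    "evaluation_approach",
    "change_management",
    "data_realism",
    "written_success_criteria",
)
# The two criteria weighted most heavily in practice.
GATING_CRITERIA = ("production_ownership", "written_success_criteria")
THRESHOLD = 14  # pilots scoring below 14 of 21 are restructured or deferred


def assess(scores: dict[str, int]) -> dict:
    """Total the seven 0-3 scores, then apply the threshold and the gating criteria."""
    for name in CRITERIA:
        if name not in scores:
            raise ValueError(f"missing score for {name}")
        if scores[name] not in (0, 1, 2, 3):
            raise ValueError(f"{name} must be scored 0-3, got {scores[name]}")
    total = sum(scores[name] for name in CRITERIA)
    blocked_by = [name for name in GATING_CRITERIA if scores[name] == 0]
    return {
        "total": total,
        "max": 3 * len(CRITERIA),
        "below_threshold": total < THRESHOLD,
        "blocked_by": blocked_by,
        # Assumption: a 0 on a gating criterion is a hard stop in this sketch.
        "proceed": total >= THRESHOLD and not blocked_by,
    }


# A borderline pilot with both gating criteria at 3.
borderline = dict(zip(CRITERIA, (3, 2, 1, 2, 1, 2, 3)))
# A pilot that is strong everywhere except that nobody owns it in production.
unowned = dict(zip(CRITERIA, (0, 3, 3, 3, 3, 3, 3)))

for label, scores in (("borderline", borderline), ("unowned", unowned)):
    result = assess(scores)
    print(label, result["total"], "of", result["max"], "proceed:", result["proceed"])
```

The two examples mirror the cases in the text: the first totals 14 and proceeds because ownership and written criteria are both at 3, while the second totals 18 and is still held back by the 0 on production ownership.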

The rubric is not a gate against ambition. The point is not to require certainty before starting. The point is to surface the decisions that will determine whether the pilot ships, before the team spends three months building something they cannot deploy.

What this means for the next pilot you fund

If you are a CTO, VP Engineering, or AI sponsor at a 50-500 person company about to greenlight a new pilot:

The most useful 90 minutes you will spend on the project is the kickoff conversation that names the production owner, the adoption owner, the measurable success criterion, and the production-readiness scope — and produces a written document signed by both the sponsor and the build team. We have seen this conversation save 12-week pilots. We have seen the absence of this conversation kill them.

This is also the work that nobody on the engineering side typically initiates, because engineers are oriented toward building. Sponsors who push for the conversation up front substantially raise the odds that the pilot ships. Sponsors who skip it are the sponsors who end up with another stalled pilot.

For mid-market companies running their first 1-3 AI pilots, this discipline matters disproportionately. The pattern we have observed at 200-person companies: the first successful pilot establishes that AI initiatives can ship, and that proof unlocks the next round of investment. A first pilot that stalls sets the program back by quarters, not weeks.

We cover the production deployment patterns themselves — once the pilot is set up to succeed — in our GenAI deployment patterns for B2B SaaS piece. The patterns assume the pilot is structured to ship. This piece is the prerequisite work.

Where the rubric breaks

A few honest limitations:

Pure R&D pilots. Some AI work is genuinely research — exploring a technique to understand whether it could be useful later. The rubric does not fit R&D, and applying it forces R&D to be dressed as production work. Better to give R&D its own budget, its own success criteria ("we learned X, the technique is/isn't worth pursuing"), and let it run separately.

Vendor-driven evaluations. Pilots that exist to evaluate a specific vendor's product are different from pilots that exist to deliver a business outcome. The rubric can be adapted, but the criteria around production ownership and adoption need different framing.

Very early-stage companies. Below about 50 employees, the rubric's overhead may exceed its value. The coordination costs that justify the discipline at 200 people are smaller at 30.

Pilots tied to specific regulatory deadlines. When the pilot must ship by a date set externally, the rubric's "should we start?" framing becomes "what do we have to fix to start?" The conversations change shape.

FAQ

Q: Our pilot failure rate is closer to 50% than 85%. Is that good?

It depends on what you are counting. If a pilot counts as a success once it has produced a working demo, the 50% figure is probably optimistic about what actually shipped. If success means "delivered measurable business outcomes within 12 weeks of production deployment," a 50% success rate would be unusually good for mid-market. In our engagements we mostly see that success rate in the 25-40% range across mid-market clients before any intervention, and we have seen it rise into the 60-70% range once the rubric above is applied consistently. These are our observed numbers, not industry research.

Q: How many pilots should a 200-person company run at once?

Two to four, in our observation. Fewer than two does not produce enough learning to refine the practice. More than four exceeds the capacity for any of them to receive the production-readiness discipline they need. The right number depends on capacity, not ambition.

Q: Can we run pilots without a production owner if we know it will be transferred later?

Sometimes. The pattern works when the receiving team is identified up front and is consulted on architectural choices during the build. The pattern fails when the build team and the receiving team have never spoken until the transfer is attempted. If you are going to do this, do it explicitly — weekly check-ins between the build team and the receiving team, and a written transfer plan from week one.

Q: What if our sponsor refuses to sign written success criteria?

The refusal is information. It usually means the sponsor is not yet ready to commit to what they want, or that the use case is too poorly defined to commit to. Both are reasons to slow down rather than push forward. We have seen pilots restart productively after a two-week pause to re-scope; we have seen pilots fail expensively that pushed forward despite the unsigned criteria.

Q: Does the rubric apply to AI tools the company buys versus AI applications it builds?

Yes, with adjustments. Bought tools lighten the production-readiness criterion (the vendor handles much of it) but raise the stakes on the change-management and data-realism criteria (because the tool's defaults may not match your use case). The rubric's structure stays; the weights shift.

Q: We are an engineering-led organization and the engineering team finds the rubric heavy. How do we introduce it?

Frame it as a pre-mortem rather than a gate. Walk the team through "if this pilot fails to ship, why will it fail?" using the seven failure modes as prompts. The conversation tends to produce the rubric's outputs naturally without requiring the rubric to be imposed. Once the team has run the conversation a few times, the formal rubric becomes faster than not having one.

If you are evaluating a current portfolio of AI pilots against these failure modes, or scoping a new one, our team can help. We work with mid-market companies on AI enablement engagements that include pilot selection, production-readiness, and the operational work to keep pilots from stalling between demo and production. The rubric is open — you can run it yourself from this post without engaging us.