Why AI Pilots Fail to Reach Production

Roughly seven in ten AI pilots never make it to production. The number gets cited so often it's started to feel like the weather — a fact of life, beyond intervention. It isn't. The pilots that fail are failing for predictable reasons, and the pilots that succeed are succeeding for predictable reasons.

We've worked through both sides of this with clients across financial services, manufacturing, healthtech, and SaaS in 2025–2026. The patterns are consistent. This is the framework we use to diagnose where a pilot is on the production-readiness curve and what the next intervention should be.

The four failure modes

Pilots don't die from one big cause. They die in four distinct ways, often in combination.

Failure mode 1: The pilot was never meaningful. It was set up to prove the technology worked at a small scale, not to solve a business problem at the real scale. When the time comes to roll out, there's no business sponsor with skin in the game and no clear ROI argument. The pilot achieves its narrow goal and dies politely.

Failure mode 2: The infrastructure gap. The pilot ran on a Jupyter notebook, a developer's local machine, or a cloud sandbox with no security controls. Productionising means building everything that wasn't built — auth, audit, monitoring, retraining pipelines, change management. The estimate for "make this production-ready" comes back at 5x the original pilot cost. The CFO declines.

Failure mode 3: Change management was an afterthought. The model works, the integration works, but the humans whose workflow it's about to change weren't consulted, weren't trained, and don't want it. The rollout gets resisted at the operator level and quietly retired by the third month.

Failure mode 4: The model degrades on real data. The pilot was tested on a clean, curated subset. In production the data is messier — missing fields, schema variations, edge cases the pilot never saw. Performance drops, confidence in the system collapses, the rollback comes faster than the rollout.

Most failed pilots have at least two of these going on. The pilots that succeed have addressed all four before the pilot even started.

“We stopped calling them 'pilots' on principle. Pilots imply that flying is the hard part. In our world the hard part is taxiing — the security review, the workflow change, the operator training. Once you've done that, the model itself is the cheap bit.”
Director of ML Platform, Series C enterprise SaaS

The five-gate framework

We use a five-gate framework for taking AI work from idea to production. Each gate is a discrete decision point with a binary outcome: pass and continue, fail and either redesign or stop. Most failed pilots failed because they walked past one of these gates without stopping.

Gate 1: Business case sharpness

Before any model work, the business case has to be sharp enough to defend. The questions:

What specifically does this AI feature change in the business — costs reduced, revenue added, errors avoided, speed gained?
What's the realistic upside if it works at the scale we'd deploy?
Who's the named business sponsor whose KPI moves if this works?
What does failure look like, and what's the cost?

If any of those answers is squishy ("we want to be more AI-forward", "we'll figure out the metrics during the pilot"), the pilot is unlikely to survive scrutiny later. A pilot is too expensive a way to discover the use case mid-stream.

The pilots that pass this gate cleanly are nearly always tied to a specific metric a specific executive owns. The pilots that don't are nearly always somebody's pet project.

Gate 2: Data readiness

The question isn't "do we have data." It's: do we have data that's labeled, accessible, representative, and refreshable?

Labeled: Can we evaluate model output against ground truth? If not, what's the substitute (human review, downstream signal, A/B test)?
Accessible: Is the data in a place the AI system can read in production? Or is it locked behind a VPN, a snapshot from six months ago, or a reporting system that only refreshes nightly?
Representative: Does the pilot data match the distribution the model will face in production? Or is it a clean subset that will mislead?
Refreshable: When the underlying world changes (new product, new region, new regulation), will the model see the new data automatically, or does someone need to re-export and re-deploy?
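One way to make the "representative" question concrete is to compare how often each field is missing in the pilot sample versus a recent production sample. The sketch below is illustrative only: the `Row` shape, the helper names, and the 5-point tolerance are all assumptions, not part of any particular tool.

```typescript
// Hypothetical record shape: a flat object whose fields may be missing.
type Row = Record<string, unknown>;

// Fraction of rows in which `field` is null, undefined, or an empty string.
function missingRate(rows: Row[], field: string): number {
  if (rows.length === 0) return 0;
  const missing = rows.filter(
    (r) => r[field] === null || r[field] === undefined || r[field] === "",
  ).length;
  return missing / rows.length;
}

// Fields that are missing noticeably more often in production than in the pilot.
// `tolerance` is an assumed cut-off (5 percentage points), not a standard.
function representativenessGaps(
  pilot: Row[],
  production: Row[],
  fields: string[],
  tolerance = 0.05,
): string[] {
  return fields.filter(
    (f) => missingRate(production, f) - missingRate(pilot, f) > tolerance,
  );
}

// Made-up samples: the pilot set is clean, the production set is not.
const pilotRows: Row[] = [
  { amount: 10, region: "EU" },
  { amount: 20, region: "US" },
];
const productionRows: Row[] = [
  { amount: 10, region: null },
  { amount: null, region: "US" },
  { amount: 30, region: "" },
  { amount: 40, region: "EU" },
];

const gaps = representativenessGaps(pilotRows, productionRows, ["amount", "region"]);
console.log(gaps); // both fields are flagged for these samples
```

Missing-field rate is only one dimension of representativeness. The same comparison can be repeated for category mix or value ranges before the pilot starts.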

This gate eliminates a meaningful percentage of pilots before they start. That's the right outcome — better to discover the data problem now than after you've spent six months building.

Gate 3: Pilot architecture that maps to production

The pilot should be built on a slimmer version of the production architecture, not on a different one. If the pilot runs on a notebook calling an API, and production needs to run inside the firewall with audit logging and SSO, you're going to rebuild from scratch. That's not productionising — that's redoing.

The discipline: at the start of the pilot, sketch the production architecture. Build the pilot on a subset of that architecture — the same auth, the same logging, the same deployment path — just with smaller data, fewer tenants, looser SLAs. When the pilot succeeds, you don't rebuild. You scale.
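A minimal sketch of that discipline, assuming the architecture can be summarised as one config object: the pilot and production share every structural setting and differ only in scale. The field names, the `structuralGaps` helper, and the example values are hypothetical.

```typescript
interface DeploymentConfig {
  // Structural settings: these should be identical in pilot and production.
  auth: "sso" | "api-key" | "none";
  auditLogging: boolean;
  deployPipeline: string;
  // Scale settings: these are allowed to be smaller or looser in the pilot.
  maxTenants: number;
  latencySloMs: number;
}

const STRUCTURAL_KEYS = ["auth", "auditLogging", "deployPipeline"] as const;

// Structural settings on which the pilot diverges from the production target.
function structuralGaps(pilot: DeploymentConfig, production: DeploymentConfig): string[] {
  return STRUCTURAL_KEYS.filter((k) => pilot[k] !== production[k]);
}

const production: DeploymentConfig = {
  auth: "sso",
  auditLogging: true,
  deployPipeline: "ci-cd",
  maxTenants: 500,
  latencySloMs: 800,
};

// A slim pilot: same architecture, smaller scale, looser SLA.
const slimPilot: DeploymentConfig = { ...production, maxTenants: 1, latencySloMs: 5000 };

// A notebook pilot: a different architecture that will need rebuilding.
const notebookPilot: DeploymentConfig = {
  auth: "none",
  auditLogging: false,
  deployPipeline: "manual",
  maxTenants: 1,
  latencySloMs: 5000,
};

console.log(structuralGaps(slimPilot, production)); // no structural gaps
console.log(structuralGaps(notebookPilot, production)); // all three structural keys differ
```

An empty result is the signal that success means scaling rather than rebuilding.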

This is the gate where infrastructure teams should be involved from day one, not introduced at the end. The cost of involving them is much lower than the cost of finding out at week ten that the pilot architecture won't pass internal security review.

Gate 4: Human-in-the-loop design

Almost every AI feature in 2026 has a human-in-the-loop dimension in production. Either the model assists a human, the human reviews the model, or the model handles routine cases and escalates to a human on exception.

The pilot has to figure out this surface, because it determines everything about adoption. The questions:

Who's the operator the model is built to assist or replace?
What's their workflow today? What does it look like with the model? Where do they accept, where do they override, where do they escalate?
What's the confidence threshold below which the model defers to the human?
How do they audit and correct the model's outputs over time?
What does training look like — for them and for the model based on their feedback?
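The confidence-threshold and audit questions above can be sketched in a few lines. This is a simplified illustration, assuming the model returns a label with a confidence score between 0 and 1. The type names, the `route` and `overrideRate` helpers, and the 0.8 threshold are hypothetical.

```typescript
interface Prediction {
  label: string;
  confidence: number; // assumed to be in the range [0, 1]
}

type Route = "auto_accept" | "human_review";

// Below the threshold the model defers to the operator.
function route(p: Prediction, threshold: number): Route {
  return p.confidence >= threshold ? "auto_accept" : "human_review";
}

interface Review {
  modelLabel: string;
  operatorLabel: string;
}

// Share of reviewed cases in which the operator overrode the model.
// A rising override rate suggests the threshold or the model needs attention.
function overrideRate(reviews: Review[]): number {
  if (reviews.length === 0) return 0;
  const overridden = reviews.filter((r) => r.modelLabel !== r.operatorLabel).length;
  return overridden / reviews.length;
}

const THRESHOLD = 0.8; // illustrative value, to be agreed with the operators

console.log(route({ label: "approve", confidence: 0.93 }, THRESHOLD)); // "auto_accept"
console.log(route({ label: "approve", confidence: 0.55 }, THRESHOLD)); // "human_review"

const reviews: Review[] = [
  { modelLabel: "approve", operatorLabel: "approve" },
  { modelLabel: "approve", operatorLabel: "reject" },
  { modelLabel: "reject", operatorLabel: "reject" },
  { modelLabel: "approve", operatorLabel: "approve" },
];
console.log(overrideRate(reviews)); // 0.25
```

The threshold is a workflow decision as much as a modelling one, so it is worth setting with the operators rather than for them.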

Pilots that skip this gate end up at the rollout meeting facing the operators who weren't consulted, whose workflow is about to change, and who often have de-facto veto power over adoption. Pilots that address it have the operators on the project from week one, contributing to design, and those operators become advocates rather than obstacles.

“Every model I've shipped died because the operators weren't in the room. Every one I've kept alive had them in the room from week one. The model quality was a third-order factor next to that.”
Former CTO, healthtech, two AI deployments

Gate 5: Observability and continuous evaluation

The production system has to answer, every day: is the model still working? The pilot has to have built the answer.

The required surfaces:

Output logging: every model call, every input, every output, with retention.
Quality metrics: continuous scoring against ground truth where available, against proxy signals where not.
Drift detection: monitoring the distribution of inputs and outputs over time, with alerts when they drift past thresholds.
Feedback ingestion: operator corrections fed back into a labeled dataset for the next training cycle.
A retraining pipeline: even if you don't retrain often, you have to be able to.
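For drift detection, one common choice (our pick for illustration, not the only option) is the Population Stability Index over a categorical distribution such as the model's output labels. The sketch below assumes labels arrive as plain strings. The function names, the smoothing floor, and the 0.25 alert level (a widely used rule of thumb) are assumptions to tune for your own data.

```typescript
// Share of each category in a sample, floored so empty categories do not
// produce log(0) or division by zero.
function shares(values: string[], categories: string[], floor = 1e-6): number[] {
  if (values.length === 0) throw new Error("cannot compute shares of an empty sample");
  const counts = new Map<string, number>();
  for (const v of values) counts.set(v, (counts.get(v) ?? 0) + 1);
  return categories.map((c) => Math.max((counts.get(c) ?? 0) / values.length, floor));
}

// Population Stability Index: sum over categories of
// (actual - expected) * ln(actual / expected).
function psi(baseline: string[], current: string[]): number {
  const categories = Array.from(new Set(baseline.concat(current)));
  const expected = shares(baseline, categories);
  const actual = shares(current, categories);
  let total = 0;
  for (let i = 0; i < categories.length; i++) {
    total += (actual[i] - expected[i]) * Math.log(actual[i] / expected[i]);
  }
  return total;
}

const DRIFT_ALERT = 0.25; // rule-of-thumb level for a significant shift

// Made-up output labels: the mix moves from 50/50 to 75/25.
const lastMonth = ["approve", "approve", "reject", "reject"];
const thisWeek = ["approve", "approve", "approve", "reject"];

const score = psi(lastMonth, thisWeek); // roughly 0.27 for these samples
if (score > DRIFT_ALERT) {
  console.log(`output drift detected, PSI=${score.toFixed(3)}`);
}
```

The same function works on input features once they are bucketed into categories, which is where most drift shows up first.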

This is the gate most teams skip because it's not exciting. It's also the gate that most reliably determines whether the production system survives the first 90 days after rollout.

evals/daily.ts

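```typescript
// Continuous quality scoring. Runs nightly against a labeled hold-out set
// and flags the run when the score drops below threshold. Cheap insurance
// against silent regressions when an upstream model or prompt changes.
import { Eval } from "braintrust";
import { classify } from "@/lib/pilot";
import { getLabeledExamples } from "@/lib/eval-set";

// Minimum acceptable mean exact-match score on the hold-out set.
const BASELINE = 0.85;

// Scorer: 1 when the predicted label equals the ground-truth label, else 0.
const exactMatch = ({ output, expected }: { output: unknown; expected?: unknown }) => ({
  name: "exact_match",
  score: output === expected ? 1 : 0,
});

const result = await Eval("pilot-classifier", {
  data: () => getLabeledExamples(),
  task: async (input) => classify(input),
  scores: [exactMatch],
});

// Assumption: the resolved result exposes per-scorer means under
// summary.scores, keyed by scorer name. Verify against your SDK version.
const score = result.summary.scores["exact_match"]?.score ?? 0;
if (score < BASELINE) {
  // Swap in a pager or chat alert here; a non-zero exit fails the nightly job.
  console.error(`exact_match dropped to ${score} (baseline ${BASELINE})`);
  process.exitCode = 1;
}
```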

What 2026 has made easier — and what it hasn't

Easier: foundation-model APIs have made the model itself a much smaller part of the project. You don't train, you don't host, you don't tune (usually). The model is a stable input.

Easier: tooling for evals, prompt versioning, output monitoring (Braintrust, LangSmith, Helicone, Phoenix) has matured. The "we don't know if it's working" problem is now tooling-solvable.

Easier: MCP and similar tool-integration standards have reduced the glue-code cost of plugging models into existing systems.

Not easier: change management. The human dimension of AI rollout is unchanged from 2022. The training, the resistance, the workflow design, the operator buy-in — all the same.

Not easier: data quality. Whatever your data problems are, they're still your problems. The model can't fix them.

Not easier: ROI defensibility. The CFO conversation about whether the AI feature is paying for itself is harder, not easier, because inference costs are non-trivial.
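The CFO conversation gets easier when the arithmetic is written down. The sketch below is a back-of-envelope model under stated assumptions: every number is invented for illustration, the per-token prices are hypothetical rather than any vendor's rate card, and "savings" is reduced to operator minutes saved per call.

```typescript
interface UnitEconomics {
  callsPerMonth: number;
  inputTokensPerCall: number;
  outputTokensPerCall: number;
  inputUsdPerMillionTokens: number; // hypothetical price
  outputUsdPerMillionTokens: number; // hypothetical price
  minutesSavedPerCall: number;
  loadedHourlyCostUsd: number; // fully loaded cost of the operator's time
  fixedMonthlyCostUsd: number; // hosting, monitoring, maintenance
}

function monthlyInferenceCost(u: UnitEconomics): number {
  const tokenCostPerCall =
    u.inputTokensPerCall * u.inputUsdPerMillionTokens +
    u.outputTokensPerCall * u.outputUsdPerMillionTokens;
  return (u.callsPerMonth * tokenCostPerCall) / 1_000_000;
}

function monthlySavings(u: UnitEconomics): number {
  return (u.callsPerMonth * u.minutesSavedPerCall * u.loadedHourlyCostUsd) / 60;
}

function monthlyNet(u: UnitEconomics): number {
  return monthlySavings(u) - monthlyInferenceCost(u) - u.fixedMonthlyCostUsd;
}

// Illustrative inputs only.
const example: UnitEconomics = {
  callsPerMonth: 100_000,
  inputTokensPerCall: 2_000,
  outputTokensPerCall: 500,
  inputUsdPerMillionTokens: 3,
  outputUsdPerMillionTokens: 15,
  minutesSavedPerCall: 0.5,
  loadedHourlyCostUsd: 60,
  fixedMonthlyCostUsd: 10_000,
};

console.log(monthlyInferenceCost(example)); // 1350
console.log(monthlySavings(example)); // 50000
console.log(monthlyNet(example)); // 38650
```

With these inputs inference is a small slice of the total. Longer prompts, higher call volumes, or smaller time savings can flip that quickly, which is why the model is worth re-running with real usage numbers.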

A practical recovery playbook for stalled pilots

If your pilot is stuck — built, working, but not progressing to production — run this diagnostic:

Step 1: Re-do the business case as if you were pitching it for the first time. If you can't get to "this saves $X / generates $Y / avoids Z" with specific numbers, the pilot doesn't have a business case. That's why it isn't progressing.

Step 2: Map the production architecture. Compare it to the pilot architecture. List the gap. Estimate the cost honestly.

Step 3: Talk to the operators who would actually use this in production. Not the executive sponsor. The people at the keyboard. Ask them what they think. The answer tells you whether the rollout will succeed.

Step 4: Score the pilot against the five gates. Each gate it failed is a specific intervention. Don't try to fix all of them at once — fix the most fundamental one first (usually Gate 1 or Gate 2) and see whether the others stop mattering.

Step 5: If you can't honestly close the gap with the resources available, retire the pilot cleanly. Document what you learned, free up the capacity, and start the next pilot at Gate 1 properly. This is not a failure — this is the discipline that lets the next pilot succeed.
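Step 4 can be captured as a small scorecard. The sketch below assumes that "most fundamental" means the lowest-numbered failed gate, which matches the "usually Gate 1 or Gate 2" guidance but is still a simplification. The `GateResult` shape and the evidence strings are hypothetical.

```typescript
interface GateResult {
  gate: 1 | 2 | 3 | 4 | 5;
  name: string;
  passed: boolean; // binary outcome, as in the five-gate framework
  evidence: string; // one line on why it passed or failed
}

// Failed gates, ordered from most to least fundamental (lowest number first).
function failedGates(results: GateResult[]): GateResult[] {
  return results.filter((r) => !r.passed).sort((a, b) => a.gate - b.gate);
}

// The single gate to work on next, or undefined when every gate has passed.
function nextIntervention(results: GateResult[]): GateResult | undefined {
  return failedGates(results)[0];
}

// Made-up scorecard for a stalled pilot.
const stalledPilot: GateResult[] = [
  { gate: 5, name: "Observability and continuous evaluation", passed: false, evidence: "no drift alerts" },
  { gate: 1, name: "Business case sharpness", passed: true, evidence: "sponsor owns the KPI" },
  { gate: 4, name: "Human-in-the-loop design", passed: true, evidence: "operators on the project" },
  { gate: 2, name: "Data readiness", passed: false, evidence: "pilot data is a six-month-old snapshot" },
  { gate: 3, name: "Pilot architecture that maps to production", passed: true, evidence: "same auth and logging" },
];

const next = nextIntervention(stalledPilot);
console.log(next ? `Fix Gate ${next.gate} first: ${next.name}` : "All gates passed");
// prints "Fix Gate 2 first: Data readiness"
```

Re-scoring after each fix shows whether the later failures were real or just symptoms of the earlier one.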

What "production-ready" actually means

Production-ready means the system keeps running without the person who built it. That implies:

Someone other than the builder can deploy and roll back.
Someone other than the builder can interpret the metrics dashboard.
Someone other than the builder can debug a degradation incident at 2 a.m.
Someone other than the builder can retrain or update the model.
The dependencies on specific third-party APIs are documented and substitutable.
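The "substitutable" item is the easiest to make concrete in code. A minimal sketch, assuming the feature is a text classifier: the rest of the system depends on a small interface, and each vendor sits behind an adapter. The `ModelProvider` interface, the `classifyWithFallback` helper, and both stub providers are hypothetical stand-ins rather than any real SDK.

```typescript
// Provider-agnostic contract. Application code depends on this interface,
// never on a vendor SDK directly.
interface ModelProvider {
  readonly name: string;
  classify(text: string): Promise<string>;
}

// Try providers in order and return the first successful answer,
// recording which provider produced it.
async function classifyWithFallback(
  providers: ModelProvider[],
  text: string,
): Promise<{ provider: string; label: string }> {
  const errors: string[] = [];
  for (const p of providers) {
    try {
      return { provider: p.name, label: await p.classify(text) };
    } catch (err) {
      errors.push(`${p.name}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  throw new Error(`all providers failed: ${errors.join("; ")}`);
}

// Stubs standing in for real vendor adapters.
const primary: ModelProvider = {
  name: "primary",
  classify: async () => {
    throw new Error("rate limited"); // simulate an outage
  },
};
const backup: ModelProvider = {
  name: "backup",
  classify: async (text) => (text.includes("refund") ? "billing" : "other"),
};

classifyWithFallback([primary, backup], "refund please").then((r) =>
  console.log(`${r.provider} answered: ${r.label}`),
);
```

Swapping providers rarely gives identical behaviour, so the substitute should be run against the same evaluation set as the primary before it is trusted as a fallback.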

Most pilots that "work" do not meet this bar. Closing the gap is half of what production-ready means.

Frequently asked questions

Is the 70% number really right?

The exact number varies by survey methodology, sector, and how generously "production" is defined. The point is the order of magnitude — most pilots don't make it. That's been stable across multiple research firms for several years.

Are LLM pilots more or less likely to make it than classical-ML pilots?

About the same, for different reasons. LLM pilots have lower technical risk (the model works) but higher productionisation risk (cost, governance, hallucination management). Classical-ML pilots have higher technical risk (the model might not work well enough) but cleaner productionisation paths (the patterns are well-trodden).

How long should a good pilot take?

For most enterprise contexts: 8–12 weeks for the technical work, with 2–4 weeks before for the business case and data readiness gates, and 2–4 weeks after for operator preview and rollout planning. That puts the full arc at roughly 12–20 weeks. The teams who run 4-week pilots usually skip gates 4 and 5 and pay for it at productionisation time.

Who should run the pilot?

A team that includes (1) a senior AI engineer or applied scientist, (2) a product person who understands the operator workflow, (3) the named business sponsor as an active reviewer, and (4) someone from your infra/security team from week one. Pilots run by AI specialists in isolation are the highest-failure category.

Closing thought

The 70% failure rate isn't a feature of AI. It's a feature of how AI work has been structured in organisations that haven't yet built the institutional muscle for it. The companies that have built it — the ones that pass pilots through to production reliably — aren't smarter or richer. They're more disciplined about gating.

If you have a pilot that's stalled and you want a structured second opinion on which gate it's failing and what to do about it, we offer a focused pilot-recovery diagnostic. Most of our recovery engagements start with one stalled pilot and end with the process change that prevents the next three.
