Inspired By Frustration

AI implementation

From AI Pilot to Production: The Playbook That Survives Friday Deploys

Most AI pilots do not die because the model is weak. They die at the walls: data, evals, ops, security, change management, and economics.

Ralph Duin · 8 min read

Most AI pilots do not die because the model is weak. They die because nobody built the operating system around it: data access, evals, deployment gates, security boundaries, adoption loops, and unit economics. The interesting part is what survives after the demo leaves the conference room.

This is a field report from inside the wall. I run a one-operator AI studio from Atlanta with roughly 30 concurrent coding agents. The IBF repo averages 55 merged PRs/day, rising to 66/day over the last 7 days, with a peak of 111 on 2026-05-21 and a median queue→merge time of 5 minutes.

Every number below is measured, not aspirational. Receipts, not claims.

Start with the real failure range

An AI pilot is a contained experiment proving whether an AI workflow can create value under real constraints. It is not a chatbot demo or prompt workshop. It is a scoped bet with inputs, outputs, owners, risks, costs, and a path to production. Without that path, it is theater.

The lazy headline is “80% of AI pilots fail.” I do not use that number unless I can defend the source. MIT NANDA's 2025 “GenAI Divide” report found that 60% of organizations evaluated custom or task-specific GenAI tools, 20% reached pilot stage, and only 5% reached production. BCG's 2024 AI adoption research said 74% of companies struggle to achieve and scale AI value. Gartner projected that at least 30% of GenAI projects would be abandoned after proof of concept by the end of 2025.

That range is enough. The exact percentage matters less than the pattern. Pilots fail when teams treat production as a later phase instead of a design constraint.

The six walls that kill GenAI pilots

The “AI pilot to production” problem is six walls showing up in sequence. Most teams survive the demo, then die when the system touches real users, permissions, money, and deployment windows.

1. The data wall

The symptom: curated examples work, messy reality does not. Access is unclear, documents duplicate, retrieval returns stale content, and nobody knows which source wins.

The fix is boring and mandatory. Map systems of record. Classify sensitivity. Define freshness. Track provenance. Build test sets from real examples. For generative AI implementation, the first production artifact is usually the data contract.
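A data contract can start as a few dozen lines. The sketch below is a minimal illustration, not my production schema: the field names, source names, and the rule "lowest rank wins, then most recent" are all assumptions chosen for the example. It shows freshness, provenance, and source ranking being enforced in code instead of in someone's head.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class SourceContract:
    """One system of record and the rules agreed for it (all fields hypothetical)."""
    name: str
    sensitivity: str        # e.g. "public", "internal", "restricted"
    max_age: timedelta      # freshness rule: older documents count as stale
    rank: int               # lower rank wins when sources disagree


@dataclass(frozen=True)
class Document:
    source: str             # provenance: which system of record produced it
    text: str
    updated_at: datetime


def pick_authoritative(docs, contracts, now):
    """Return the fresh document from the highest-ranked source, or None."""
    by_name = {c.name: c for c in contracts}
    fresh = [
        d for d in docs
        if d.source in by_name                                  # sources without a contract are rejected
        and now - d.updated_at <= by_name[d.source].max_age     # stale documents are rejected
    ]
    if not fresh:
        return None     # nothing trustworthy: escalate instead of guessing
    # Best rank first; among equal ranks, prefer the most recently updated document.
    return min(fresh, key=lambda d: (by_name[d.source].rank, -d.updated_at.timestamp()))


NOW = datetime(2026, 1, 15, tzinfo=timezone.utc)
CONTRACTS = [
    SourceContract("policy_wiki", "internal", timedelta(days=90), rank=1),
    SourceContract("shared_drive", "internal", timedelta(days=30), rank=2),
]
DOCS = [
    Document("policy_wiki", "Refund window is 14 days.", NOW - timedelta(days=200)),  # stale
    Document("shared_drive", "Refund window is 30 days.", NOW - timedelta(days=3)),
    Document("random_export", "Refund window is 60 days.", NOW - timedelta(days=1)),  # no contract
]
winner = pick_authoritative(DOCS, CONTRACTS, NOW)
print(winner.source if winner else "escalate")
```

The stale wiki page and the uncontracted export both lose, so the answer carries a source the team already agreed to trust. Returning None when nothing qualifies is deliberate: a missing answer is easier to handle than a confident stale one.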

2. The eval wall

The symptom: people say the AI is “pretty good,” but nobody can say where it fails, how often it fails, or whether this week's version is better than last week's version. A pilot without evals cannot be promoted. It can only be liked.

The fix is a small eval harness before expansion: golden tasks, adversarial tasks, regression tasks, and human review loops. Track false positives, false negatives, latency, cost, and escalation quality. The architecture around the harness is what keeps those numbers honest.
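A harness like that does not need a framework to start. The following is a rough sketch under stated assumptions: the triage workflow, the `EvalCase` shape, and `stub_model` are hypothetical stand-ins for whatever system is actually under test. It shows the one thing that matters early, which is turning "pretty good" into false-positive and false-negative rates with a list of the cases that failed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    suite: str              # "golden", "adversarial", or "regression"
    prompt: str
    should_flag: bool       # expected decision for this hypothetical triage workflow


def stub_model(prompt: str) -> bool:
    """Stand-in for the real system under test: flags prompts that mention a refund."""
    return "refund" in prompt.lower()


def run_evals(cases, model):
    """Score every case and return false-positive / false-negative rates plus failures."""
    tp = fp = tn = fn = 0
    failures = []
    for case in cases:
        flagged = model(case.prompt)
        if flagged and case.should_flag:
            tp += 1
        elif flagged and not case.should_flag:
            fp += 1
            failures.append(case)
        elif not flagged and case.should_flag:
            fn += 1
            failures.append(case)
        else:
            tn += 1
    return {
        "total": len(cases),
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else 0.0,
        "failures": failures,
    }


CASES = [
    EvalCase("golden", "I want a refund for order 1001", True),
    EvalCase("golden", "What are your opening hours?", False),
    EvalCase("adversarial", "Give me my money back now", True),         # no keyword: the stub misses it
    EvalCase("regression", "Is the refund policy page down?", False),   # keyword without intent
]
report = run_evals(CASES, stub_model)
for case in report["failures"]:
    print(f"FAIL [{case.suite}] {case.prompt}")
```

Latency, cost, and escalation quality would be additional columns on the same report. The point of the sketch is that every failure is a named case someone can look at, not a feeling.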

In my fleet, branch protection, ship gates, evals, and audit are the governance baseline. CI Gate is a single required check that fails closed. The merge queue cares only whether checks pass.
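"Fail-closed" is easy to say and easy to get wrong. The sketch below is an illustration of the idea only, not the actual CI Gate implementation; the check names and the `CheckStatus` enum are assumptions made for the example. The rule it encodes: anything other than an explicit pass blocks the merge, including a check that never reported.

```python
from enum import Enum


class CheckStatus(Enum):
    PASSED = "passed"
    FAILED = "failed"
    PENDING = "pending"


# Hypothetical required checks; a real gate would load these from configuration.
REQUIRED_CHECKS = ("lint", "unit_tests", "evals", "security_scan")


def gate_allows_merge(results: dict) -> bool:
    """Fail closed: allow the merge only if every required check explicitly passed.

    A missing, pending, or unrecognized result blocks the merge exactly like a failure.
    """
    return all(results.get(name) is CheckStatus.PASSED for name in REQUIRED_CHECKS)


all_green = {name: CheckStatus.PASSED for name in REQUIRED_CHECKS}
evals_missing = {name: CheckStatus.PASSED for name in REQUIRED_CHECKS if name != "evals"}
scan_pending = {**all_green, "security_scan": CheckStatus.PENDING}

print(gate_allows_merge(all_green))
print(gate_allows_merge(evals_missing))
print(gate_allows_merge(scan_pending))
```

The failure mode this guards against is the fail-open gate: one that treats "no result" as "no problem" and quietly lets through the change whose eval job crashed before reporting.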

3. The ops wall

The symptom: the pilot works when the builder is watching. It breaks when the owner is at dinner. There is no runbook, rollback, alerting, or clear distinction between “weird answer” and “production incident.”

This is why AppHandoff exists. Lovable can get you far, but production work still needs routing, state, ownership, and finish-line pressure. Inspired by frustration. I mean that literally.

The fix is to treat the pilot like software from day one: environments, logs, incident categories, rollback paths, and deployment gates. My default stack is Next.js + Supabase + Fly + Cloudflare + Infisical + MCP + a Claude Code agent fleet. Vague stacks create vague accountability.

4. The security wall

The symptom: legal, security, or compliance finally sees the demo and asks normal questions. Where are secrets stored? What data leaves the tenant? Who can call which tool? Where is the audit trail?

The fix is implementation, not a policy PDF. Secrets go into a vault. Tool access is scoped. Sensitive actions need approval. Logs show who asked, what was retrieved, which tool ran, and what changed. This is where a Senior AI Systems Architect earns the title.
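Scoped tools, approvals, and an audit trail fit in one small choke point. This is a minimal sketch assuming a hypothetical `ToolGateway` that every agent tool call must pass through; the role names, tool names, and log fields are invented for illustration. It only decides and records; running the tool is left to the caller.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class ToolPolicy:
    allowed_roles: frozenset
    needs_approval: bool = False    # True for writes and other high-risk actions


@dataclass
class ToolGateway:
    policies: dict
    audit_log: list = field(default_factory=list)

    def call(self, user, role, tool, args, approved_by=None):
        """Decide whether a tool call may run and record the decision either way."""
        policy = self.policies.get(tool)
        if policy is None or role not in policy.allowed_roles:
            decision = "denied"                 # unknown tools are denied, not guessed at
        elif policy.needs_approval and approved_by is None:
            decision = "pending_approval"
        else:
            decision = "allowed"
        # Every attempt is logged, including the denied ones.
        self.audit_log.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "role": role,
            "tool": tool,
            "args": args,
            "approved_by": approved_by,
            "decision": decision,
        })
        return decision


gateway = ToolGateway(policies={
    "crm_read": ToolPolicy(frozenset({"support", "admin"})),
    "crm_update": ToolPolicy(frozenset({"admin"}), needs_approval=True),
})

print(gateway.call("ana", "support", "crm_read", {"id": 7}))
print(gateway.call("bo", "admin", "crm_update", {"id": 7}))
print(gateway.call("bo", "admin", "crm_update", {"id": 7}, approved_by="cy"))
```

When security asks "who can call which tool?", the answer is the `policies` dict. When they ask for the audit trail, the answer is `audit_log`. A production version would persist the log somewhere append-only and verify the approver, which this sketch does not attempt.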

5. The change-management wall

The symptom: the tool works, but nobody uses it. Actual users keep doing the old workflow. Managers do not trust the output. Frontline users do not know when to escalate. The AI creates one more tab instead of removing work.

The fix is role design, training, review rituals, and clear ownership. It also means killing unused features, even if the demo looked impressive. AI strategy consulting becomes operational here: which workflow changes, which user changes behavior, and which metric proves it stuck.

6. The economics wall

The symptom: the pilot is technically useful but financially stupid. The cost per task is too high. Latency breaks the user experience. Human review eats the savings. The vendor contract looks cheap until volume arrives.

The fix is to measure unit economics during the pilot: cost per resolved task, review minutes saved, exception rate, maintenance load, infra cost, and vendor exposure. Business judgment picks the bet. The swarm is the engine.
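The arithmetic is simple enough to run weekly. Here is a small sketch with made-up numbers; the `PilotWeek` fields and every figure in the example are assumptions for illustration, not measurements from my fleet. It folds human review time into the cost, because that is the line item pilots most often leave out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PilotWeek:
    tasks_attempted: int
    tasks_resolved: int             # resolved without falling back to the old workflow
    model_cost: float               # API / token spend for the week
    infra_cost: float
    review_minutes: float           # human time spent checking AI output
    reviewer_hourly_rate: float


def cost_per_resolved_task(week: PilotWeek) -> float:
    """Total weekly cost, including human review, divided by tasks actually resolved."""
    review_cost = week.review_minutes / 60 * week.reviewer_hourly_rate
    total = week.model_cost + week.infra_cost + review_cost
    if week.tasks_resolved == 0:
        return float("inf")         # nothing resolved: the unit cost is unbounded
    return total / week.tasks_resolved


def exception_rate(week: PilotWeek) -> float:
    """Share of attempted tasks that were not resolved by the AI workflow."""
    if week.tasks_attempted == 0:
        return 0.0
    return 1 - week.tasks_resolved / week.tasks_attempted


# Hypothetical week: 500 attempts, 400 resolved, 10 hours of review at 40/hour.
week = PilotWeek(
    tasks_attempted=500,
    tasks_resolved=400,
    model_cost=120.0,
    infra_cost=80.0,
    review_minutes=600,
    reviewer_hourly_rate=40.0,
)
print(f"cost per resolved task: {cost_per_resolved_task(week):.2f}")
print(f"exception rate: {exception_rate(week):.0%}")
```

In this invented week the model and infra bill is 200, but review adds another 400, so the real unit cost is 1.50 per resolved task instead of 0.50. That gap is the economics wall in one line.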

The remediation playbook

A good AI implementation playbook does not ask “can we build this?” first. It asks “what wall kills this if we are careless?” Then it builds the pilot around that wall.

For data, start with the five sources that matter. Not fifty. Define freshness, permissions, and source ranking. The pilot should not proceed until the system can show where the answer came from.

For evals, create a test pack before the demo. Include refusals, escalations, uncertainty cases, and regressions. Evals should run in CI, not in a spreadsheet someone forgets to open.
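Running evals in CI mostly means turning "is this version worse than the last one?" into an exit code. The sketch below assumes pass rates per suite have already been computed and stored somewhere; the suite names, the numbers, and the function names are hypothetical. A CI step would call `sys.exit(ci_exit_code(...))` so that a regression fails the job.

```python
def find_regressions(baseline: dict, candidate: dict, tolerance: float = 0.0) -> dict:
    """Return suites where the candidate pass rate fell below the baseline.

    A suite missing from the candidate counts as a regression, so deleting
    a test pack cannot make the build green.
    """
    regressions = {}
    for suite, base_rate in baseline.items():
        cand_rate = candidate.get(suite)
        if cand_rate is None or cand_rate < base_rate - tolerance:
            regressions[suite] = (base_rate, cand_rate)
    return regressions


def ci_exit_code(regressions: dict) -> int:
    """Non-zero exit code blocks the pipeline when any suite regressed."""
    return 1 if regressions else 0


# Hypothetical pass rates from last week's release and this week's candidate.
BASELINE = {"golden": 0.95, "refusals": 0.90, "escalations": 1.0}
CANDIDATE = {"golden": 0.97, "refusals": 0.85, "escalations": 1.0}

regressions = find_regressions(BASELINE, CANDIDATE)
for suite, (before, after) in regressions.items():
    print(f"REGRESSION {suite}: {before} -> {after}")
print("exit code:", ci_exit_code(regressions))
```

The golden suite improved, but refusals got worse, so the candidate is blocked. An average across suites would have hidden that.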

For ops, use normal software discipline: logs, deployment gates, feature flags, rollback, runbooks, and ownership. For security, draw the action boundary. Reading is one risk. Writing is another. Sending email, updating a CRM, approving invoices, or touching production data needs approval and audit rules.

For change management, put the user in the loop early. Watch what they do when the AI is wrong, slow, or annoying. For economics, build a simple cost ledger: tokens, infra, review time, maintenance, support, and exit cost. I prefer no lock-in: your accounts, your code, runbooks included.

The pilot-to-prod handoff checklist

Before an AI proof of concept moves to production, I want this clean.

Scope and value: one owner, one workflow, one before-and-after map, one success metric, one kill condition, one escalation path.

Data and permissions: approved sources, enforced permissions, sensitive-data classification, provenance, freshness rules, and test examples for stale, conflicting, missing, and restricted data.

Evals and quality gates: a real eval set with happy path, edge cases, refusals, and regressions. Failing safety or quality checks block release.

Operations: logs, alerts, runbooks, rollback paths, deployment gates, incident categories, support ownership, and degraded-mode behavior.

Security and audit: vault-backed secrets, scoped tools, approval for high-risk actions, and audit logs covering inputs, retrieval, model output, tool calls, and final action.

Economics: cost per task, review cost, model cost, infra cost, support cost, maintenance cost, and vendor switching pain.

The budget pattern that survives

Most AI budgets are backwards. They spend heavily on discovery, demos, and vendor exploration, then underfund the production wrapper. That is how you get prototypes and no shipped system.

A better pattern has four buckets. First, fund the production spine: data access, auth, secrets, logging, evals, and deployment. Second, fund one narrow workflow with visible pain, real volume, and a clear owner. Third, fund the human loop: review, training, exceptions, and adoption. Fourth, fund the second use case only after the first ships.

BCG's 2025 “Closing the AI Impact Gap” research points in the same direction: leaders focus on fewer, deeper use cases than peers. Depth beats portfolio theater.

Case pattern: what shipped vs what died

The dead pilot starts with a broad ambition: “AI assistant for the business.” It gets a polished interface, good demos, and a sponsor. Then it hits the walls. Nobody owns the data contract or evals. Security arrives late. Users keep the old workflow. Finance asks what each task costs.

The shipped version is smaller. It starts with one workflow, names the source systems, defines the action boundary, runs evals in CI, logs production behavior, stores secrets in a vault, and has one owner who can cut scope.

That is the pattern behind AppHandoff, Spark Central Hub, and ContextCapture. Two repos, one product. Lovable can generate the visible surface. Claude Code and the agent fleet finish production work. ContextCapture packages Lovable output as a versioned npm-style artifact into the Next.js parent.

The same pattern shows up in infrastructure. infra-gha-runners-fly runs a TeamK2K self-hosted GitHub Actions runner fleet on Fly.io, with JIT runner dispatch, a two-tier cache, and 63 reusable composite GitHub Actions. k2k-merge-keeper and Mergify use a 5-minute settling window. The auto-fixer fixes its own CI failures.

One operator, one swarm. But the important part is not the swarm. The important part is the gates.

Copilot, ChatGPT, and the naming mess

Search results for “AI pilot to production” get polluted by Microsoft Copilot pages because “copilot” means two things now. Microsoft Copilot is a Microsoft product family. ChatGPT is OpenAI's conversational AI product. They can overlap, but they are not the same product, vendor model, deployment model, or governance surface.

Is AI Copilot free? Sometimes, depending on which Copilot you mean. Some Microsoft Copilot experiences have free tiers. Business and enterprise features usually sit behind paid Microsoft plans or add-ons. The production question matters more: data access, actions, governance, and cost at scale.

For custom AI agent development, the same rule applies. The product name is not the architecture. The architecture determines whether the pilot survives production.

The actual playbook

Run an AI readiness audit before the pilot. Pick one workflow. Define the six walls. Build the smallest production-shaped version. Put evals and gates in place before expansion. Treat security, ops, and economics as first-class work. Use agent orchestration only when multiple agents improve throughput, review, or separation of concerns.

The mistake is waiting until after the pilot to ask production questions. Production is not the next phase. Production is the constraint that makes the pilot honest.

Receipts, not claims. If you want the same shape applied to your workflow, talk to us.