AI Agent Fleet Economics: What 100 PRs/Day Actually Costs
What does an AI coding agent fleet actually cost? A field report from a one-operator AI studio running high-volume PR throughput.

This is a field report from inside a one-operator AI studio running roughly 30 concurrent coding agents across a real product system. The headline is not “AI writes code.” That is the surface. The interesting part is what it costs to keep the swarm honest when the output target is 100 pull requests a day.
Every number below is measured, not aspirational. Most AI coding ROI posts collapse into vibes. Faster developer. Happier team. More output. Nice charts. Then nobody can tell you the actual AI agent cost per merged PR, the CI burn, the revert rate, or what the human still does.
Receipts, not claims. So here is the headline the rest of the post defends: the fleet runs on about $60 a day, and a fully loaded merged PR costs roughly 60 cents. The expensive part is not the machines. It is the one human who decides what ships.
The current operating baseline on the TeamK2K/inspiredbyfrustration repo is 55 merged PRs/day average over the last 30 days, 66/day over the last 7 days, and a peak of 111 merged PRs on 2026-05-21. Median queue-to-merge is 5 minutes.
One operator, one swarm.
The stack behind it is boring on purpose: Next.js, Supabase, Fly, Cloudflare, Infisical, MCP, Claude Code, Cursor, Codex, Copilot, GitHub Actions, Mergify, and a self-hosted runner fleet. AppHandoff is the agent-orchestration MCP server that finishes the Lovable 80%. infra-gha-runners-fly runs the TeamK2K self-hosted GitHub Actions runner fleet on Fly.io. CI Gate is the single fail-closed aggregate required check.
The architecture is what keeps the agents honest.
The number people ask for first: cost per PR
The clean number is this:
Cost per merged PR = (model spend + runner spend + infra spend + human review time cost) ÷ merged PRs.
Here is the actual ledger. The throughput is measured from GitHub. The tooling and infra lines are real monthly spend. The runner cost is already inside the infra line, because the runners are self-hosted on the same Fly account.
| Line | Value | Source |
|---|---|---|
| 30-day merged PR average (IBF repo) | 55/day | GitHub PR history |
| Fleet-wide merged PRs, all repos | ~100/day | GitHub PR history |
| Peak merged PR day | 111 on 2026-05-21 | GitHub PR history |
| Median queue→merge | 5 minutes | GitHub merge timing |
| Concurrent coding agents | ~30 | operator |
| AI tooling spend (Claude Code, Cursor, Codex, Copilot) | ~$1,500/mo | provider billing |
| Infra spend (Fly: apps + self-hosted runners, all apps) | ~$300/mo | Fly Cost Explorer ($11.82/day actual) |
| Total operating run cost | ~$1,800/mo (~$60/day) | tooling + infra |
| Fully loaded $/merged PR | ~$0.60 | $60/day ÷ ~100 PRs/day, operator time excluded |
| Revert rate | ~0% (1 in 162 merges, a CI config revert) | git history |
| CI failure rate | ~2% (2 of 87 recent runs) | GitHub Actions |
The machine cost is almost a rounding error. Sixty cents a PR. That is the part everyone fixates on, and it is the part that matters least.
The mistake is treating model subscription cost as the whole cost. It is not. The real bill is the path from task claim to merged code: context load, branch churn, CI, flaky test recovery, runner capacity, merge queue time, human arbitration, rollback risk, and cleanup. And the single most expensive input does not appear on any invoice: the operator's attention. That line is covered later, and it dwarfs the $60.
For a normal team, the unit is developer-month. For an agent fleet, the unit should be merged production change.
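The arithmetic behind the 60-cent figure can be checked in a few lines. This is a minimal sketch using the ledger values above. The 30-day month is an assumption, and `operator_cost_per_day` is a hypothetical parameter left at zero because the ledger puts no dollar figure on operator time.

```python
# Ledger values from the table above. DAYS_PER_MONTH is an assumption.
AI_TOOLING_PER_MONTH = 1500.0  # Claude Code, Cursor, Codex, Copilot
INFRA_PER_MONTH = 300.0        # Fly apps plus self-hosted runners
DAYS_PER_MONTH = 30
MERGED_PRS_PER_DAY = 100       # fleet-wide, all repos


def cost_per_merged_pr(tooling, infra, days, prs_per_day, operator_cost_per_day=0.0):
    """Daily run cost divided by merged PRs per day.

    operator_cost_per_day is hypothetical: set it to price in human time.
    """
    daily_cost = (tooling + infra) / days + operator_cost_per_day
    return daily_cost / prs_per_day


daily_run_cost = (AI_TOOLING_PER_MONTH + INFRA_PER_MONTH) / DAYS_PER_MONTH
cost_per_pr = cost_per_merged_pr(
    AI_TOOLING_PER_MONTH, INFRA_PER_MONTH, DAYS_PER_MONTH, MERGED_PRS_PER_DAY
)

print(f"daily run cost: ${daily_run_cost:.2f}")
print(f"machine cost per merged PR: ${cost_per_pr:.2f}")
```

Passing a non-zero `operator_cost_per_day` shows how quickly the human line overtakes the machine line.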
What 100 PRs/day really means
A 100 PR day does not mean one human read 100 giant diffs and thoughtfully debated every line. That would be theater.
It means the work was shaped into small, reviewable, low-blast-radius lanes. Agents claimed narrow slices. Each slice had a contract. The contract defined what the agent was allowed to touch, what output it had to publish, and what checks had to pass before the work moved forward.
The human did not become 30 developers. The human became the dispatcher, judge, and escalation path.
When the system is working, most PRs are boring: copy fix, type cleanup, component extraction, test repair, migration step, route fix, dashboard state issue, ContextCapture patch, runner config improvement, AppHandoff guardrail.
That shape is the point. If every PR requires deep human reasoning, the fleet is not a fleet. Business judgment picks the bet. The swarm is the engine.
The fleet architecture: lane, claim, contract, publish, broker
This is a walkthrough of the actual infrastructure.
The basic loop is simple: lane, claim, contract, publish, broker.
The lane constrains the work. The claim gives one agent ownership. The contract defines allowed files, expected output, and tests. The publish step opens the PR with evidence. The broker decides whether the work can ship.
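As an illustration of the claim and contract steps, here is a minimal sketch. It is not the AppHandoff implementation: `LaneContract`, `claim`, `out_of_lane_files`, the lane name, and the file paths are all hypothetical, and glob matching is an assumed way of expressing "allowed files".

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class LaneContract:
    """Hypothetical contract: what one lane may touch and what must pass."""
    lane: str
    allowed_globs: tuple[str, ...]
    required_checks: tuple[str, ...]


def claim(claims: dict[str, str], lane: str, agent: str) -> bool:
    """Give one agent ownership of a lane; refuse a second claimant."""
    if lane in claims and claims[lane] != agent:
        return False
    claims[lane] = agent
    return True


def out_of_lane_files(contract: LaneContract, changed_files: list[str]) -> list[str]:
    """Return changed files that match none of the contract's allowed globs."""
    return [
        path for path in changed_files
        if not any(fnmatchcase(path, glob) for glob in contract.allowed_globs)
    ]


contract = LaneContract(
    lane="ui-copy-fix",
    allowed_globs=("apps/web/components/*", "apps/web/copy/*.json"),
    required_checks=("CI Gate",),
)

claims: dict[str, str] = {}
first = claim(claims, contract.lane, "agent-01")   # gets the lane
second = claim(claims, contract.lane, "agent-02")  # refused, lane is taken

violations = out_of_lane_files(
    contract, ["apps/web/components/Hero.tsx", "apps/web/lib/auth.ts"]
)
print(first, second, violations)
```

A non-empty `violations` list is the signal that a small UI fix has wandered into auth, which is exactly the drift the contract is meant to stop before a PR is published.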
Inspired by frustration. I mean that literally.
AI coding demos usually fail after the first 80%. Lovable can make an impressive surface. Claude Code can push a feature far. Then the mess appears: context gaps, duplicate logic, broken build steps, stale branches, mystery auth, missing env vars, half-created migrations, and PRs that pass locally but fail where it matters.
AppHandoff exists because of that gap. ContextCapture takes Lovable output and ships it as a versioned npm-style artifact into the Next.js parent.
Two repos, one product.
Spark Central Hub is a good example: Lovable and Claude Code co-edited the repo behind leadingmomentum.lovable.app. The Lovable surface could move fast, but the production parent still needed contracts, artifact boundaries, CI, branch protection, and merge rules. That is where agent orchestration becomes plumbing.
The broker is the part most teams skip. They let agents push branches, open PRs, and ask a human to sort the pile. That breaks around the moment the system becomes useful.
In my setup, k2k-merge-keeper and Mergify run the merge queue with a 5-minute settling window. CI Gate acts as the single fail-closed aggregate required check.
One green check matters more than twenty optimistic ones.
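"Fail-closed" has a precise meaning that is easy to get wrong, so a small sketch helps. This is not the CI Gate source. `ci_gate` and the check names are hypothetical, and the result strings are assumed to follow GitHub's check conclusion values such as `success`, `failure`, and `skipped`.

```python
def ci_gate(required: list[str], results: dict[str, str]) -> bool:
    """Pass only if every required check explicitly reported 'success'.

    Missing, skipped, cancelled, or failed checks all count as a failure.
    An empty required list also fails, so a misconfigured gate cannot pass.
    """
    return bool(required) and all(results.get(name) == "success" for name in required)


required = ["lint", "typecheck", "unit-tests"]

all_green = ci_gate(required, {"lint": "success", "typecheck": "success", "unit-tests": "success"})
one_missing = ci_gate(required, {"lint": "success", "typecheck": "success"})
one_skipped = ci_gate(required, {"lint": "success", "typecheck": "skipped", "unit-tests": "success"})
no_requirements = ci_gate([], {})

print(all_green, one_missing, one_skipped, no_requirements)
```

The design choice is that absence of evidence counts as failure. A fail-open gate would treat the missing `unit-tests` result as fine, which is how twenty optimistic checks let broken work through.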
CI is the hidden tax
CI is where the bill becomes real.
A coding agent can create code faster than your CI system can validate it. Once that happens, throughput is no longer limited by tokens. It is limited by queue depth, runner supply, cache quality, test stability, merge conflicts, and whether failed checks get fixed or ignored.
The fleet uses 63 reusable composite GitHub Actions shared across the system. That matters because every repeated YAML pattern becomes a tax at this volume. If the same setup bug exists in 20 workflows, you do not have 20 bugs. You have one platform bug with 20 invoices.
The runner layer uses fly-gha-status and fly-gha-medium JIT runner dispatch through infra-gha-runners-fly. The goal is not “more runners” as a blanket answer. The goal is the right runner class at the right moment, with enough elasticity to absorb bursts without turning every PR into idle cost.
There is also a two-tier cache: node_modules plus Turbo remote cache. Agents create many small validation events, and every minute saved there repeats all day.
The runner bill is not a separate invoice here. The runners are self-hosted on the same Fly account as the apps, so their cost is already inside the ~$300/mo infra line. That is the point of folding them into the product system instead of renting hosted minutes by the job.
What the runner layer buys at this volume is a low failure rate that stays low. About 2% of recent CI runs end red, and the reverts that reach main are near zero: one revert in the last 162 merges, and that one was a CI config change, not a product rollback. A fleet that ships 100 PRs a day and almost never reverts is not lucky. It is gated.
That failure rate is the real cost model. Cheap runners that let broken work through are expensive. Slightly pricier validation that holds the line is the bargain.
The cost breakdown that actually matters
Model spend
This includes Claude Code, Cursor, Codex, Copilot, and any API-driven model calls used by agents, MCP servers, review tools, or auto-fixers.
The trap is averaging this by calendar month. That hides productive versus wasteful use. Model spend should be tied to accepted output, failed output, and repeated attempts.
Useful cuts are model spend per merged PR, per closed-unmerged PR, per reverted PR, per lane type, per agent class, and per successful auto-fix.
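Those cuts are simple group-bys once spend is attributed to individual attempts. The sketch below assumes that attribution already exists. The records, outcome labels, lane names, and dollar amounts are invented for illustration and are not fleet data.

```python
from collections import defaultdict

# Hypothetical attempt records. Outcomes and amounts are made up.
attempts = [
    {"lane": "test-repair", "outcome": "merged", "spend_usd": 0.50},
    {"lane": "copy-fix", "outcome": "merged", "spend_usd": 0.25},
    {"lane": "copy-fix", "outcome": "closed_unmerged", "spend_usd": 0.75},
    {"lane": "migration", "outcome": "reverted", "spend_usd": 1.50},
    {"lane": "test-repair", "outcome": "merged", "spend_usd": 0.25},
]


def spend_by(records, key):
    """Total and per-attempt average spend, grouped by the given field."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for record in records:
        totals[record[key]] += record["spend_usd"]
        counts[record[key]] += 1
    return {k: {"total": totals[k], "avg": totals[k] / counts[k]} for k in totals}


by_outcome = spend_by(attempts, "outcome")
by_lane = spend_by(attempts, "lane")

# Share of spend that went to work that did not stay merged.
total_spend = sum(r["spend_usd"] for r in attempts)
wasted_share = (total_spend - by_outcome["merged"]["total"]) / total_spend
print(f"spend on work that did not stick: {wasted_share:.0%}")
```

The same function covers agent class or auto-fix success by adding those fields to each record.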
A high token bill is not automatically bad. A cheap workflow that ships broken work is expensive. That is the point of AI agent development as an operating system, not just a prompt habit.
Runner spend
Runner spend is the first hard infrastructure bill that shows up when the swarm starts moving.
GitHub-hosted runners are convenient. Self-hosted runners can be cheaper or faster for the right workload. Both can be wasteful if the queue is badly shaped. The question is not “hosted or self-hosted?” It is “what is the cost per validated PR under real burst conditions?”
infra-gha-runners-fly exists because runner supply needed to become part of the product system. fly-gha-status and fly-gha-medium JIT dispatch give the fleet an elastic runner layer. The runner bill should be measured against merge rate, not test count. Tests that never protect a merge are just ritual.
Platform infra
This includes Fly, Supabase, Cloudflare, Infisical, remote cache, preview infrastructure, logs, and any MCP server runtime.
This number is usually smaller than the human cost, but it matters because agent fleets multiply environment touches: branches, previews, secrets, status checks, short-lived compute, and logs.
Infisical matters because secret handling cannot be a Slack message and a prayer. MCP servers matter because tool access needs a controlled path.
This is where the question "what is an MCP server?" becomes relevant for operators. MCP is not magic. It is a way to expose controlled tools and context to agents. The control is the value.
Human operator time
This is the most important cost line and the easiest one to lie about.
One human still has to decide what matters. The human shapes the lanes, names the contracts, reviews evidence, catches product drift, resolves conflicts, kills bad branches, watches the queue, and decides when the system is optimizing the wrong thing.
That work is not “prompting.” It is operating.
The operator is doing product management, architecture, release management, QA triage, DevEx, and incident prevention in one loop. That is why this model fits my public positioning as both Fractional AI CTO and Senior AI Systems Architect.
When teams ask whether they need AI consulting services or generative AI implementation, this is the divide. Tool selection is easy. Operating 10 to 30 agents without chaos is architecture.
What one human actually does
The operator does not sit there approving every line.
The operator manages the shape of work: choosing the bet, slicing lanes, assigning agents, forcing contracts before code, watching queue health, reviewing evidence, reading failures as feedback, tuning branch protection, killing bad patterns, and keeping the product coherent.
That last one matters most. AI can make a repo busier while making the product worse.
The operator’s job is to prevent fake productivity.
This is also why DORA-style team metrics miss a lot of the story. Deployment frequency, lead time, change failure rate, and recovery time are useful for normal engineering orgs. They are not enough for an agent fleet. They were designed around human team flow. They do not see whether 30 agents are duplicating work, fighting over files, retrying bad prompts, overfitting tests, or generating PRs that look productive but carry no product value.
For a fleet, I want extra metrics:
| Metric | Why it matters |
|---|---|
| merged PRs per operator-hour | Measures human amplification |
| CI minutes per merged PR | Shows validation cost |
| closed-unmerged PR ratio | Shows waste |
| revert rate | Shows quality debt |
| median queue→merge | Shows flow health |
| auto-fix success rate | Shows self-healing quality |
| conflicts per lane | Shows work slicing quality |
| repeated failure signatures | Shows platform debt |
| human escalations per agent | Shows autonomy boundary |
| product-accepted changes per day | Shows output that mattered |
That is how you measure AI engineering productivity. Not by counting generated lines. Not by screenshots of an IDE. You measure the whole path from intention to shipped, accepted change.
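Several rows of that table fall out of plain PR history. A minimal sketch follows, assuming each closed PR carries a queue time, an optional merge time, and a reverted flag. The `PullRequest` shape and the four sample PRs are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class PullRequest:
    queued_at: datetime
    merged_at: Optional[datetime]  # None means closed without merging
    reverted: bool = False


def fleet_metrics(prs: list[PullRequest], operator_hours: float) -> dict[str, float]:
    """Compute a few fleet metrics over a window of closed PRs."""
    merged = [p for p in prs if p.merged_at is not None]
    if not merged or operator_hours <= 0:
        raise ValueError("need at least one merged PR and positive operator hours")
    queue_minutes = [(p.merged_at - p.queued_at).total_seconds() / 60 for p in merged]
    return {
        "merged_prs_per_operator_hour": len(merged) / operator_hours,
        "closed_unmerged_ratio": (len(prs) - len(merged)) / len(prs),
        "revert_rate": sum(p.reverted for p in merged) / len(merged),
        "median_queue_to_merge_min": median(queue_minutes),
    }


# Sample data, invented for illustration.
t0 = datetime(2026, 5, 21, 9, 0)
sample = [
    PullRequest(t0, t0 + timedelta(minutes=4)),
    PullRequest(t0, t0 + timedelta(minutes=5)),
    PullRequest(t0, t0 + timedelta(minutes=7), reverted=True),
    PullRequest(t0, None),
]

metrics = fleet_metrics(sample, operator_hours=2.0)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

The remaining rows, such as conflicts per lane and human escalations per agent, need event data that PR history alone does not hold.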
Failure modes and guardrails
The failure modes are boring. That is why they are dangerous.
Agents write across boundaries
An agent starts with a small UI fix and ends up changing auth, layout, data fetching, and a shared type. The diff looks energetic. The work is bad.
Guardrail: file ownership, lane contracts, small PRs, and fail-closed checks.
Agents duplicate existing patterns
They solve the same problem again because they did not find the existing abstraction.
Guardrail: better context retrieval, repo maps, reusable actions, and MCP tools that point to the right source of truth.
Agents treat green tests as product truth
A PR can pass tests and still be wrong. It can satisfy the prompt and miss the business reason.
Guardrail: human product judgment stays in the loop. Business judgment picks the bet. The swarm is the engine.
CI becomes the product manager
If agents optimize only for passing CI, they will make changes that satisfy the gate while drifting from the product.
Guardrail: PR evidence needs to include why the change exists, what changed, how it was tested, and what risk remains.
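One way to enforce that guardrail is a check that rejects a PR whose description lacks any of the four parts. The sketch below assumes the evidence lives in the PR body under level-two markdown headings. The heading names and `missing_evidence` are hypothetical conventions, not the fleet's actual template.

```python
import re

# Hypothetical section names mirroring the four evidence items above.
REQUIRED_SECTIONS = ("Why", "What changed", "How tested", "Remaining risk")


def missing_evidence(pr_body: str) -> list[str]:
    """Return required sections that are absent or have no content."""
    sections: dict[str, list[str]] = {}
    current = None
    for line in pr_body.splitlines():
        heading = re.match(r"^##\s+(.*\S)\s*$", line)
        if heading:
            current = heading.group(1).lower()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return [
        name for name in REQUIRED_SECTIONS
        if not "".join(sections.get(name.lower(), [])).strip()
    ]


body = """## Why
Dashboard showed stale state after reconnect.

## What changed
Reset the cache key when the session id changes.

## How tested
"""

gaps = missing_evidence(body)
print(gaps)
```

Here the "How tested" heading exists but is empty and "Remaining risk" is absent, so both are reported. A check like this only proves the sections exist. Whether the reasons are good is still the operator's call.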
Secrets and tool access get messy
Agents need access to useful tools, but useful tools can also damage real systems.
Guardrail: Infisical, scoped credentials, MCP boundaries, audit trails, and branch protection as a baseline. Branch protection + ship gates + evals + audit are governance, not paperwork.
That is also why an AI readiness audit should include repo hygiene, CI maturity, secret management, and release controls. If those are weak, agent adoption just reveals the weakness faster.
Does AI actually improve developer productivity?
Yes, but the answer is narrower than the hype.
AI improves developer productivity when work can be sliced, context can be supplied, validation is fast, and the human knows what good looks like. It performs badly when the work is ambiguous, the repo is messy, tests are slow, ownership is unclear, and nobody can tell whether the output matters.
The real productivity gain from Copilot, Claude, Cursor, Codex, or any other tool depends less on the tool and more on the operating model around it.
A single developer can ship far more with AI when the system is built for it. In my current setup, one operator can coordinate ~30 concurrent agents and average 55 merged PRs/day over 30 days. But I would not sell that as a universal promise. I would sell it as proof that the bottleneck moved.
The bottleneck is no longer “can the model write code?” The bottleneck is whether the architecture can turn model output into safe, accepted, production-shaped changes.
ROI is not a seat calculation
The ROI of AI coding tools is not:
developer salary ÷ tool subscription = magic savings
That is spreadsheet fiction.
The better model is:
accepted product output ÷ total operating cost
Then subtract the cost of rework, reverts, support load, incidents, and human attention.
A fleet with high model spend and low rework can beat a cheap setup that creates mess. A fleet with fast PR volume and poor product judgment can lose money while looking impressive. This is why I do not care much about generic “AI makes developers 30% faster” claims.
I care about whether the team can point to named products, real numbers, real dates.
AppHandoff. infra-gha-runners-fly. CI Gate. k2k-merge-keeper. Mergify. 63 reusable composite GitHub Actions. fly-gha-status. fly-gha-medium. ContextCapture. Spark Central Hub. 55 merged PRs/day average. Peak 111 on 2026-05-21. Median queue→merge 5 minutes.
That is the ledger.
Why team-level metrics miss the fleet
DORA metrics helped engineering teams stop arguing from feelings. That was good.
But agent fleets need a deeper ledger because the unit of labor changes. A human team has limited parallelism. An agent fleet has cheap parallelism and expensive coordination. That changes what can go wrong.
With humans, too much work in progress is visible. Meetings get crowded. Standups get painful. People complain. With agents, work in progress can explode quietly. Thirty branches can appear before the operator feels the pain.
DORA will eventually show pain if bad work reaches production. It may not show 40 PRs closed unmerged, 200 wasted CI runs, three agents solving the same issue, or a week of product drift hidden inside green checks.
The fleet needs operational metrics closer to air traffic control: what is in flight, who claimed it, what failed, what merged, what reverted, and what repeated.
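Two of those questions, who claimed what and what repeated, can be answered from a flat list of in-flight claims and failure signatures. This is a minimal sketch with invented agent names, issue ids, and signature strings. How the fleet actually records claims is not described here.

```python
from collections import Counter, defaultdict


def duplicate_claims(in_flight: list[dict[str, str]]) -> dict[str, list[str]]:
    """Issues that more than one agent is working on, with the agents involved."""
    by_issue: dict[str, set[str]] = defaultdict(set)
    for entry in in_flight:
        by_issue[entry["issue"]].add(entry["agent"])
    return {issue: sorted(agents) for issue, agents in by_issue.items() if len(agents) > 1}


def repeated_failures(signatures: list[str], threshold: int = 2) -> dict[str, int]:
    """Failure signatures seen at least `threshold` times."""
    return {sig: n for sig, n in Counter(signatures).items() if n >= threshold}


# Hypothetical snapshot of the fleet.
in_flight = [
    {"agent": "agent-03", "issue": "dashboard-state"},
    {"agent": "agent-11", "issue": "route-fix"},
    {"agent": "agent-17", "issue": "dashboard-state"},
]
failures = ["install-timeout", "type-error", "install-timeout", "install-timeout"]

overlaps = duplicate_claims(in_flight)
platform_debt = repeated_failures(failures)
print(overlaps, platform_debt)
```

Neither number shows up in DORA metrics, because neither involves anything reaching production.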
The governance baseline
My baseline for serious fleet work is simple:
- branch protection
- ship gates
- evals where they make sense
- audit trail
- scoped secrets
- merge queue
- fail-closed aggregate check
- small PRs
- lane contracts
- reusable CI primitives
- runbooks included
No lock-in — your accounts, your code, runbooks included.
That last line matters. A vendor-owned black box is the wrong shape for this work. If the fleet touches your product, your repo, your CI, your secrets, and your release path, then the operating system needs to be understandable by your team.
The goal is to build an operating model that can survive contact with your real codebase.
The whole ledger, in one place
Here is the full picture, machine cost and human cost side by side.
| Input | Value | Source |
|---|---|---|
| Merged PRs, fleet-wide | ~100/day | GitHub PR history |
| Merged PRs, IBF repo | 55/day avg, peak 111 | GitHub PR history |
| Median queue→merge | 5 minutes | GitHub merge timing |
| AI tooling spend | ~$1,500/mo | provider billing |
| Infra spend (Fly, runners included) | ~$300/mo | Fly Cost Explorer |
| Total operating run cost | ~$1,800/mo (~$60/day) | tooling + infra |
| Fully loaded $/merged PR | ~$0.60 | $60/day ÷ ~100 PRs |
| Revert rate | ~0% (1 in 162 merges) | git history |
| CI failure rate | ~2% (2 of 87 runs) | GitHub Actions |
| Human operators | 1 | the whole point |
The machine side of that table totals about $60 a day. The human side is one person. Which means the real economics are not "what does a PR cost in dollars," they are "how far can one operator's judgment stretch across a swarm before the quality breaks." At this fleet the answer is roughly 100 PRs a day at a ~0% revert rate. The $60 is trivia. The operator is the business.
That is the honest version most AI ROI posts never reach: once the tooling is this cheap per unit of output, the cost question stops being about tokens and starts being about whether one human can keep 30 agents pointed at work that actually matters.
The practical takeaway
The real AI agent cost is not the model bill.
The real cost is the operating system required to make agents useful: lane design, context, CI, runners, cache, merge control, secret boundaries, audit, and one human who knows when to say no.
If you only buy the coding tool, you get faster code creation. If you build the fleet architecture, you get a shot at faster accepted change.
Those are not the same thing.
The model writes. The system validates. The operator decides. The ledger tells the truth.
That is the field report from inside the swarm. Receipts, not claims. If you want the operating model applied to your repo, talk to us.