MCP Servers in Production: A Benchmark of What Survives Real Traffic
A field report on production MCP servers: how we benchmarked latency, schema failures, auth, cost, sandbox safety, and what survived real traffic.

Every number below is measured, not aspirational. The hard data comes from one production MCP server we instrumented in depth — AppHandoff, our agent-orchestration server — plus the architectural verdicts on the rest of the fleet's tool servers. MCP is full of explainers already. The missing thing is production evidence.
This is a field report from inside a one-operator AI studio running roughly 30 concurrent coding agents across Claude Code, Cursor, Codex, and Copilot. The operating system around that fleet averages 55 merged pull requests per day on the IBF repo, 66 per day over the last 7 days, with a recorded peak of 111 on 2026-05-21 and a median queue-to-merge time of 5 minutes.
That is the surface. The interesting part is what survives underneath it.
MCP servers look clean in demos because demos punish nothing. Real traffic punishes everything: slow tools, vague schemas, token-heavy responses, broken auth, unsafe filesystem access, flaky network calls, and servers that work only when the operator is watching. This benchmark is not about whether MCP is useful. It is about which MCP server patterns stay useful after the happy path is gone.
Receipts, not claims.
The short version
MCP is a protocol layer that lets an AI client call tools, read resources, and interact with systems through a shared interface. Anthropic introduced the Model Context Protocol as a way to connect models to external data and tools. modelcontextprotocol.io documents the protocol shape, transports, primitives, and implementation guidance.
An MCP server is not magic. It is a server exposing capabilities to an AI client. Sometimes those capabilities wrap APIs. Sometimes they expose files, repositories, databases, queues, browsers, calendars, or custom internal systems. The protocol gives the model a consistent tool surface. It does not make the underlying system safe, fast, or production-ready.
That last sentence is the benchmark.
We tested MCP servers against the work they actually need to survive: agent orchestration, repo inspection, CI repair, GitHub automation, artifact handoff, infrastructure dispatch, context capture, and controlled access to production-adjacent systems. The stack behind the fleet defaults to Next.js, Supabase, Fly, Cloudflare, Infisical, MCP, and a Claude Code agent fleet. The governance baseline is branch protection, ship gates, evals, and audit.
If you are looking for a basic primer, start with what is an MCP server. If you are deciding what belongs in production, this is the more useful layer.
What we benchmarked
The benchmark covers our production MCP surface: AppHandoff alone registers 69 native MCP tools and brokers out to 7 external provider types (Supabase, Sentry, GitHub Actions, Cloudflare, PlanetScale, custom HTTP, and custom MCP), plus the repository, CI, infrastructure, and artifact servers the fleet runs against every day. The servers fall into five groups.
First, orchestration servers. AppHandoff sits here. It is the agent-orchestration MCP server that finishes the Lovable 80%, turns product intent into implementation flow, and helps route work across a coding swarm.
Second, repository and CI servers. These touch GitHub state, branch protection, required checks, merge queues, CI failures, and repair loops. CI Gate is the important reference point: a single fail-closed aggregate required check on GitHub-hosted runners. If the gate fails, the system does not ship.
Third, infrastructure servers. infra-gha-runners-fly and the fly-gha-status / fly-gha-medium JIT runner dispatch belong here. These exist because a coding fleet that ships constantly needs runner capacity, status visibility, and predictable execution.
Fourth, context and artifact servers. ContextCapture packages Lovable output as a versioned npm-style artifact into a Next.js parent. Spark Central Hub, the Lovable + Claude Code co-edited repo behind leadingmomentum.lovable.app, is the kind of repo shape this supports.
Fifth, general system-access servers. These include filesystem, database, browser, docs, and API wrappers. They are useful, but they are also where most unsafe production MCP setups start leaking risk.
Benchmark methodology
The methodology matters more than the leaderboard. A benchmark without the harness is just a vibe with numbers attached.
We measured each server across four dimensions: performance, correctness, operating cost, and safety. The full score combines latency, timeout behavior, schema adherence, response usefulness, token cost, auth, permission boundaries, sandbox behavior, auditability, and blast radius.
Load
The benchmark load came from real agent work, not synthetic chat prompts. The fleet runs around 30 concurrent AI coding agents, coordinated by one operator. Those agents create pull requests, inspect failures, update branches, call tools, read resources, and push work through k2k-merge-keeper, Mergify merge queue, and a 5-minute settling window.
The benchmark included normal development windows, CI failure recovery, queue transitions, and artifact handoff flows. Over the last 30 days, the orchestration server's own audit log recorded 222 tool-call errors and zero timeouts across all calls. That is the real shape of the load: a server that rejects bad input fast rather than hanging on it.
Harness
Each server was evaluated with the same operating questions: can the agent discover it, call it correctly, receive compact output, recover from failure, stay inside permission boundaries, and leave an audit trail? The harness tracked p50 latency, p95 latency, timeout rate, schema failure rate, retry rate, error rate, response size, estimated token cost, auth model, permission model, and audit quality.
If I can't defend it in a sales call, it doesn't go on the page.
Duration
The measured telemetry window is 30 days; the survival judgment runs longer. A server did not pass because it worked once. It passed if it kept being useful while agents shipped work, CI broke, queues backed up, permissions changed, and real repos moved underneath it.
Some servers looked fine during the first week and then became operational debt. That is normal. MCP makes it easy to expose tools. It does not make those tools worth exposing.
Benchmark dimensions
Latency
Latency matters because one slow MCP call is annoying, but ten slow calls inside a repair loop become a dead agent, a wasted context window, or a human takeover. Average latency hides that pain. The useful numbers are p50 and p95, because production pain usually shows up in GitHub API waits, database cold starts, browser automation, cloud runners, or servers that re-fetch large context on every call.
On the instrumented server, the measured shape is clear. Fast validation and metadata calls land in the 50 to 350 millisecond band. Calls that reach through to GitHub run slower, up to roughly 1.7 seconds when the upstream is doing real work. The server computes p50, p95, and p99 from the recorded duration of every call, over rolling 24-hour, 7-day, and 30-day windows. The hard ceilings are set in code, not hope: a 10-second timeout on every brokered external call, a 25-second cap on internal batch dispatch, and a 3-second bound on each dependency health check.
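To make the percentile claim concrete, here is a minimal sketch of computing p50, p95, and p99 over a rolling window of recorded call durations. It is not AppHandoff's code: the function names, the `(timestamp, duration_ms)` sample format, and the nearest-rank method are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

def percentile(sorted_ms: List[float], pct: int) -> float:
    """Nearest-rank percentile over an already sorted, non-empty list."""
    # Integer ceiling division avoids floating-point rank errors.
    rank = max(1, (pct * len(sorted_ms) + 99) // 100)
    return sorted_ms[rank - 1]

def latency_percentiles(
    calls: List[Tuple[float, float]], now: float, window_s: float
) -> Dict[str, float]:
    """calls holds (unix_timestamp, duration_ms) pairs; only calls inside
    the window [now - window_s, now] are counted."""
    durations = sorted(d for ts, d in calls if now - window_s <= ts <= now)
    if not durations:
        return {}
    return {f"p{p}": percentile(durations, p) for p in (50, 95, 99)}
```

Calling it with `window_s` set to 86,400, 604,800, or 2,592,000 gives the 24-hour, 7-day, and 30-day views from the same list of samples.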
There is one honest nuance behind the zero-timeout number. The production transport records each call as either success or error and never as a timeout, because the timeout class is only emitted on a separate internal dispatch path. So "zero timeouts" means no call hit the 25-second internal ceiling, not that latency is magically flat. The latency band above is the real story.
My practical line is simple. Read-only metadata calls should feel boring. Tool calls that mutate state can be slower, but they need clearer output and stronger idempotency. If a server is slow and vague, it gets dropped.
Error rate
Raw error rate is not enough. A 2% error rate can be acceptable if the failure is explicit, typed, and recoverable. A 0.5% error rate can be unacceptable if the error returns ambiguous text that sends the agent into the wrong repair path. The best MCP servers fail like infrastructure: specific cause, attempted action, missing authority, and next safe step.
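A failure that carries those four pieces can be sketched as a small typed payload. This is an illustration under assumptions, not the shape any of these servers actually returns: the class name `ToolError`, the `category` field, and the example values are hypothetical.

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass(frozen=True)
class ToolError:
    category: str                 # assumption: a coarse label such as "permission"
    cause: str                    # the specific thing that went wrong
    attempted_action: str         # what the tool was trying to do
    next_safe_step: str           # what the agent can do without guessing
    missing_authority: Optional[str] = None  # set only for permission failures

    def to_payload(self) -> dict:
        """Structured result an agent can branch on instead of parsing prose."""
        return {"ok": False, "error": asdict(self)}
```

The point of the shape is that the agent never has to infer the repair path from free text. Every field is either filled in or explicitly absent.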
For production, I score errors in four buckets: validation errors, upstream errors, permission errors, and state-conflict errors. State-conflict errors matter heavily in multi-agent systems. Two agents touching the same repo, branch, issue, queue, or artifact can both be individually correct and collectively stupid.
This is why CI Gate exists. It collapses many checks into one fail-closed required gate. The agent does not need to interpret 15 partial signals. It gets one shipping answer.
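The fail-closed idea fits in a few lines. This is a sketch of the pattern, not CI Gate's implementation: the function name, the check names, and the `"success"` conclusion string are assumptions.

```python
from typing import Dict, Iterable

def aggregate_gate(required: Iterable[str], results: Dict[str, str]) -> bool:
    """Return True only if every required check reported 'success'.

    Missing, pending, skipped, or unrecognized conclusions all count as
    failure. That is what fail-closed means here.
    """
    required = list(required)
    if not required:
        # Assumption: an empty requirement list proves nothing, so it fails.
        return False
    return all(results.get(name) == "success" for name in required)
```

For example, `aggregate_gate(["lint", "test"], {"lint": "success"})` returns `False` because `test` never reported. The absence of a signal is treated the same as a red one.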
Cost per call
The expensive part of an MCP server is not always the server. Often it is the text the server returns. A tool that returns 20,000 tokens of raw logs for a failing check is not cheap. A tool that returns the 40 relevant lines, the failed command, the detected package manager, and the suggested next action is cheaper and more useful.
Here is a finding that surprised people when I said it out loud: on this server, calling an MCP tool is not metered per call at all. There is no per-call charge on the tool path. The only real cost line sits on the handful of tools backed by a language model — the ones that classify tickets, analyze a merge, or answer a natural-language question. Those carry actual token pricing: roughly $1 per million input tokens and $5 per million output on the cheap model, $3 and $15 on the stronger one, with a 24-hour per-project spend guardrail so one runaway analysis cannot drain the budget.
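The arithmetic behind those prices is simple enough to check by hand. The sketch below uses the per-million-token rates quoted above. The tier names, function names, and guardrail limit are hypothetical, since the text does not state the actual spend cap.

```python
from decimal import Decimal

# USD per million tokens as quoted in the text: (input, output).
PRICES = {
    "cheap": (Decimal("1"), Decimal("5")),
    "strong": (Decimal("3"), Decimal("15")),
}

def call_cost_usd(tier: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Token cost of one LLM-backed tool call."""
    price_in, price_out = PRICES[tier]
    return (price_in * input_tokens + price_out * output_tokens) / Decimal(1_000_000)

def within_guardrail(spent_24h: Decimal, next_call: Decimal, limit: Decimal) -> bool:
    """True if the call fits under a rolling 24-hour per-project limit.
    The limit is passed in because its real value is not given in the text."""
    return spent_24h + next_call <= limit
```

Feeding 20,000 input tokens and 500 output tokens through the cheap tier costs $0.0225, and the same call on the stronger tier costs $0.0675. The input side dominates, which is why trimming what the tool sends to the model matters more than the model choice.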
The main lesson is not “MCP is expensive.” The lesson is that sloppy context is expensive. The cost you pay is the tokens the response burns in the model's context window, not a meter on the server. Production MCP servers should summarize aggressively, return handles for deeper reads, and avoid dumping raw state unless the agent explicitly asks for it.
Auth model
Auth is where many MCP demos quietly lie.
A local demo server often runs with the operator’s full machine authority. It can read files, call CLIs, touch tokens, and access repos because the human already has those permissions. That is convenient. It is not a production auth model.
In production, I want scoped credentials, environment separation, secret isolation, and clear identity. Infisical is part of the default stack for a reason. Secrets should not live in prompts, repo files, shell history, or agent-readable docs. Tool authority should match the job.
The strongest servers had boring auth. That is a compliment. They had narrow scopes, predictable permission failures, and no need for the agent to know secrets directly.
Per-server scorecard
One server here is instrumented to the call: AppHandoff, with measured latency, a live error count, and circuit-breaker constants pulled straight from its code. The rest are scored on role, authority, blast radius, and survival, because they are operational integrations rather than separately metered MCP endpoints. I would rather show one row of real numbers and eight honest verdicts than nine rows of invented latency.
| Server / pattern | Production role | Latency (observed) | Errors / timeouts (30d) | Cost per call | Auth model | Status |
|---|---|---|---|---|---|---|
| AppHandoff | Agent orchestration, Lovable-to-implementation handoff | 50–350ms typical, up to ~1.7s on GitHub-backed calls | 222 errors / 0 timeouts | Not metered (token cost only on LLM-backed tools) | Scoped service auth; 60 calls/min/project; 2MB response cap | Survived |
| CI Gate tools | Aggregate ship gate, fail-closed PR status | GitHub-API bound | Not separately metered | None | GitHub scoped token | Survived |
| k2k-merge-keeper / Mergify integration | Merge queue coordination, 5-minute settling window | Queue-bound | Not separately metered | None | GitHub + queue permissions | Survived |
| infra-gha-runners-fly | Self-hosted runner fleet control on Fly.io | Dispatch-bound | Not separately metered | Fly compute (folded into infra) | Fly + GitHub scoped credentials | Survived |
| fly-gha-status / fly-gha-medium | JIT runner dispatch and status visibility | Sub-second status reads | Not separately metered | Negligible | Scoped infra token | Survived |
| ContextCapture | Lovable artifact capture into Next.js parent | Build-bound | Not separately metered | None | Repo/package permissions | Survived |
| Generic filesystem server | Local file read/write | Fast (local) | Not metered | None | Local process authority | Dropped or restricted |
| Generic browser server | Web interaction and scraping | Slow, flaky | Not metered | Session compute | Session-bound auth | Restricted |
| Generic database server | Query and mutation access | Query-bound | Not metered | Query compute | Environment-scoped DB credentials | Restricted |
The status column is the part most teams skip. Production readiness is not a feature list. It is what remained after the fleet ran against these servers for real.
AppHandoff, the instrumented case
Because this is the one server I can open all the way up, it is worth the detail. Its broker layer — the part that calls out to other systems on the agent's behalf — runs a real circuit breaker. Five errors inside a five-minute window trip it open. It stays open for fifteen minutes, then enters a half-open trial that requires three consecutive successes before it closes again. Every brokered call has a ten-second timeout and, deliberately, zero automatic retries: one attempt, then a clean typed failure. A per-project rate limit of sixty calls a minute and a two-megabyte response cap keep one noisy agent from starving the rest.
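Those breaker constants translate into a small state machine. The following is a minimal sketch using the thresholds described above, not the server's source. The class and method names are assumptions, and so is the choice to reopen on any failure during the half-open trial.

```python
from collections import deque

class CircuitBreaker:
    def __init__(self, max_errors=5, window_s=300, open_s=900, trial_successes=3):
        self.max_errors, self.window_s = max_errors, window_s
        self.open_s, self.trial_successes = open_s, trial_successes
        self.state = "closed"
        self.errors = deque()   # timestamps of recent errors while closed
        self.opened_at = 0.0
        self.successes = 0      # consecutive successes while half-open

    def allow(self, now: float) -> bool:
        """Should a brokered call be attempted at time `now`?"""
        if self.state == "open" and now - self.opened_at >= self.open_s:
            self.state, self.successes = "half_open", 0
        return self.state != "open"

    def record(self, ok: bool, now: float) -> None:
        """Record the outcome of one attempted call."""
        if self.state == "half_open":
            if not ok:
                self._trip(now)  # assumption: any trial failure reopens
            else:
                self.successes += 1
                if self.successes >= self.trial_successes:
                    self.state = "closed"
            return
        if self.state == "open" or ok:
            return
        self.errors.append(now)
        # Drop errors that have aged out of the rolling window.
        while self.errors and now - self.errors[0] > self.window_s:
            self.errors.popleft()
        if len(self.errors) >= self.max_errors:
            self._trip(now)

    def _trip(self, now: float) -> None:
        self.state, self.opened_at = "open", now
        self.errors.clear()
```

A caller would check `allow(now)` before each brokered call and call `record(ok, now)` afterward. With zero automatic retries, each attempt produces exactly one `record`.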
The most honest data point is a bad one. On June 9, a single tool — the lane-claim call — threw 148 errors in about thirteen minutes. Same stack, same cause: a bug that dereferenced an undefined object and then auto-retried itself into the log. It was not a slow leak. It was one defect, loud and brief, fixed and gone. That is what real production telemetry looks like: not a flat green wall, but a spike with a name and a date.
And while reading the server's own code to write this, the audit turned up two small lies the system was telling about itself. One tool's documentation still claims a 30-per-minute rate limit while the real limit is 60. The health dashboard reports a "degraded" count that is hard-wired to zero — it only ever distinguishes operational from down. Neither breaks anything. Both are exactly the kind of drift you only find when you actually instrument the thing instead of trusting the README. Which is the whole point of this post.
What survived
The servers that survived had five traits.
First, they were narrow. AppHandoff does not try to be a general everything interface. It exists to move product intent through a specific agent orchestration path. That narrowness is why it can be judged. The same applies to CI Gate. It answers a shipping question. It does not pretend to be a project manager.
Second, they returned structured outputs. Not pretty prose. Not giant logs. Structured outputs that an agent can use in the next step. Good MCP output is closer to a machine-readable runbook than a chat answer.
Third, they had bounded authority. The architecture is what makes them honest. A server that can only perform the few actions required for its job is easier to trust, test, and recover from. A server with full repo, filesystem, database, browser, and secret access becomes a loaded weapon with friendly branding.
Fourth, they were observable. When an agent called a tool, the system could answer what happened. Which server ran? Which input was provided? Which external system changed? Which branch, PR, artifact, runner, or check was touched?
Fifth, they sat behind operational gates. Branch protection, ship gates, evals, audit, Mergify, CI Gate, and the 5-minute settling window are not bureaucracy. They are how one operator can run a swarm without pretending the agents are perfect.
One operator, one swarm. Business judgment picks the bet. The swarm is the engine.
What got dropped and why
The dropped servers were not always broken. Some were worse: they were useful in a way that created hidden risk.
Broad filesystem MCP servers are seductive because they make the agent feel capable. The agent can inspect files, write files, search locally, and move quickly. That is fine inside a disposable workspace. It is not fine when the same authority touches secrets, generated artifacts, local config, or unrelated repos. The fix is scoped workspaces, allowlisted paths, explicit write modes, and disposable environments.
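An allowlisted-path check with an explicit write mode might look like the sketch below. The function name and parameters are assumptions, and it relies on `Path.is_relative_to`, which needs Python 3.9 or newer.

```python
from pathlib import Path
from typing import Iterable

def is_allowed(
    requested: str,
    readable_roots: Iterable[str],
    write: bool = False,
    writable_roots: Iterable[str] = (),
) -> bool:
    """Allow a path only if it resolves inside an allowlisted root.

    Writes are checked against a separate, explicitly granted list, so
    read access never implies write access.
    """
    # resolve() collapses ".." segments and follows symlinks before the
    # comparison, so traversal tricks are judged on the real target.
    target = Path(requested).resolve()
    roots = writable_roots if write else readable_roots
    return any(target.is_relative_to(Path(root).resolve()) for root in roots)
```

With `readable_roots=["/tmp/ws"]`, a request for `/tmp/ws/../secrets/.env` is rejected because it resolves outside the workspace. Any write is rejected until a writable root is granted explicitly.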
Database MCP servers are useful until they become a shadow admin panel. Read-only analytics queries are one category. Production mutation access is another. Schema inspection is one category. Customer data access is another. A server that blurs those categories will eventually create a bad day.
Browser MCP servers are powerful, but they are expensive and flaky compared with direct APIs. They also inherit session risk. If the agent is driving a logged-in browser, the permission model is often “whatever the human was allowed to do.” That can be acceptable for targeted QA and visual checks. It is a bad default for operational workflows.
Some MCP servers were dropped because their schemas were too loose. The model could call them, but the arguments were underspecified, the output varied too much, or the error states were not actionable.
MCP is not just JSON. JSON is a serialization format. MCP is the contract around tool discovery, tool invocation, resources, prompts, transports, and client-server behavior. A vague schema is still vague even when it is wrapped in a protocol.
Integration patterns that broke at scale
The first broken pattern is the one giant tool that does everything through a single command field. It looks flexible. It is actually lazy interface design. Validation becomes weak, audit becomes muddy, and permissioning becomes almost impossible. Production tools should be boring and specific: get PR status, summarize failing check, request runner, capture artifact, update queue state.
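The difference shows up as soon as you try to validate a call. Below is a minimal sketch of a registry of narrow tools with per-tool argument specs. The tool names and fields are hypothetical, and a real server would more likely use JSON Schema than this hand-rolled check.

```python
from typing import Any, Dict, List

# Hypothetical tool names and argument specs, for illustration only.
TOOLS: Dict[str, Dict[str, type]] = {
    "get_pr_status": {"repo": str, "pr_number": int},
    "summarize_failing_check": {"repo": str, "check_run_id": int},
}

def validate_call(tool: str, args: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the call is well-formed."""
    spec = TOOLS.get(tool)
    if spec is None:
        return [f"unknown tool: {tool}"]
    problems = [f"missing argument: {k}" for k in spec if k not in args]
    problems += [f"unexpected argument: {k}" for k in args if k not in spec]
    problems += [
        f"wrong type for {k}: expected {t.__name__}"
        for k, t in spec.items()
        if k in args and not isinstance(args[k], t)
    ]
    return problems
```

A single `run(command=...)` tool has nothing comparable to check against. Every string is a valid argument, so validation, audit, and permissioning all collapse into parsing the command.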
The second broken pattern is returning the whole object, log, issue, page, or repo summary every time. Agents need the next useful piece, not everything. The better pattern is layered reads: summary first, handles second, deep read only when needed. This is how 63 reusable composite GitHub Actions can stay manageable across a fleet.
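Layered reads can be sketched as two functions: one that returns a compact summary plus a handle, and one that serves a bounded slice on request. Everything here is an assumption for illustration, including the in-memory store, the `"error"` substring match, and the function names.

```python
import hashlib
from typing import Dict, List

_STORE: Dict[str, List[str]] = {}  # in-memory stand-in for real storage

def summarize_log(lines: List[str], max_excerpt: int = 40) -> dict:
    """Layer 1: counts, a short excerpt, and a handle for deeper reads."""
    handle = hashlib.sha256("\n".join(lines).encode()).hexdigest()[:12]
    _STORE[handle] = list(lines)
    hits = [i for i, line in enumerate(lines) if "error" in line.lower()]
    return {
        "total_lines": len(lines),
        "error_lines": len(hits),
        "excerpt": [lines[i] for i in hits[:max_excerpt]],
        "handle": handle,
    }

def read_range(handle: str, start: int, end: int) -> List[str]:
    """Layer 2: a bounded slice, fetched only when the agent asks for it."""
    return _STORE[handle][start:end]
```

The agent pays for the summary on every call and for the deep read only when the summary is not enough.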
The third broken pattern is letting agents decide their own authority. Agents can request and operate inside a scope. They should not grant themselves broader access because a task got harder. Permission belongs in architecture, not vibes.
The fourth broken pattern is treating a local demo as production. Local tools often depend on installed CLIs, cached auth, developer-specific paths, untracked environment variables, and implicit machine state.
Production needs repeatability. Two repos, one product only works when the handoff is explicit. ContextCapture is useful because it turns Lovable output into a versioned artifact that the Next.js parent can consume. That is infrastructure, not a screen recording.
Sandbox and security findings
Sandbox posture decided more outcomes than raw latency.
The safest servers shared a few traits: allowlisted operations, scoped credentials, environment separation, explicit write boundaries, typed errors, and auditable calls. The riskiest servers shared the opposite: broad authority, vague tool names, hidden session state, no meaningful logs, and outputs that encouraged the agent to guess.
For AI agent development, sandboxing is not an enterprise checkbox. It is what lets the system run without constant human fear. The goal is not to make agents harmless. The goal is to make their blast radius smaller than the value they create.
Read tools and write tools should not share the same mental model. Reading CI status is not the same as retrying jobs, updating branch protection, dispatching a runner, or merging a PR. Those are different authority levels. Treating them as one tool is how accidents get designed into the system.
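One way to keep those levels apart is to give every operation a required tier and deny anything unlisted. The tiers, operation names, and mapping below are hypothetical. The sketch only shows the shape of the check.

```python
from enum import IntEnum

class Authority(IntEnum):
    READ = 1       # status and metadata reads
    MUTATE = 2     # bounded, reversible changes
    HIGH_RISK = 3  # merges, protection changes, anything hard to undo

# Hypothetical operation names mapped to the tier each one requires.
REQUIRED = {
    "read_ci_status": Authority.READ,
    "retry_job": Authority.MUTATE,
    "dispatch_runner": Authority.MUTATE,
    "merge_pr": Authority.HIGH_RISK,
    "update_branch_protection": Authority.HIGH_RISK,
}

def authorize(operation: str, granted: Authority) -> bool:
    """Deny by default: operations not in the table are never allowed."""
    required = REQUIRED.get(operation)
    return required is not None and granted >= required
```

An agent holding `Authority.READ` can call `read_ci_status` and nothing else. The tier is granted by the system around the agent, never by the agent itself.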
Secrets do not belong in model context. Not as pasted tokens. Not as .env output. Not as debug logs. Not as helpful setup notes. The server should use secrets without showing them to the model. This is where Infisical and scoped service credentials matter. The agent gets a capability. It does not get the underlying secret.
Trust is not a control. Audit is. Every meaningful MCP call should leave behind enough evidence to reconstruct the action: operation, target, result, actor, timestamp, and external identifier where appropriate.
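Those six fields fit in one record per call. A minimal sketch, assuming JSON lines as the log format and hypothetical names for the record type and helper:

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Optional

@dataclass(frozen=True)
class AuditRecord:
    operation: str
    target: str
    result: str
    actor: str
    timestamp: str                      # ISO 8601, UTC
    external_id: Optional[str] = None   # e.g. an upstream run or PR identifier

def audit_line(
    operation: str,
    target: str,
    result: str,
    actor: str,
    external_id: Optional[str] = None,
    now: Optional[datetime] = None,
) -> str:
    """Serialize one call as a single JSON line for an append-only log."""
    now = now or datetime.now(timezone.utc)
    record = AuditRecord(operation, target, result, actor, now.isoformat(), external_id)
    return json.dumps(asdict(record), sort_keys=True)
```

The `now` parameter exists so the timestamp can be fixed in tests. In normal use it is left unset and the current UTC time is recorded.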
MCP vs API
An API exposes application functionality through endpoints. MCP exposes tool and resource capabilities in a model-friendly protocol so an AI client can discover and call them.
That distinction matters. MCP often wraps APIs, but it is not a replacement for API design. Bad APIs make bad MCP servers. Bad permission models make dangerous MCP servers. Bad response shapes make expensive MCP servers.
The useful way to think about it is this: API design is for software-to-software contracts. MCP design is for model-to-system contracts. The model needs clearer affordances, tighter schemas, better errors, and smaller context than a normal developer might tolerate.
Does ChatGPT use MCP?
ChatGPT has its own tool and connector systems, and OpenAI has supported tool-calling patterns across its platform. MCP itself came from Anthropic and is documented by Anthropic and modelcontextprotocol.io. The practical answer is that MCP has become a common pattern for connecting AI clients to tools, but every client has its own support model, security model, and integration path.
Do not buy architecture based on protocol branding alone. Ask where the tool runs, how auth works, what gets logged, what the model can see, and what happens when a call fails.
A protocol can make integrations easier to reason about. It cannot make your production environment safe by itself.
Which MCP servers are production-ready?
The production-ready shortlist is not a vendor list. It is a pattern list.
Narrow internal workflow servers are the strongest candidates. They wrap a known workflow, expose a limited set of operations, and return structured outputs. AppHandoff fits here. So do CI Gate-style tools and queue-state tools. Narrow tools can be tested, permissioned, explained, and removed without taking the whole system down.
Read-heavy observability and status servers are also strong. PR status, CI status, runner capacity, deployment status, queue position, package version, and artifact state all belong here. fly-gha-status and fly-gha-medium are examples from the runner side. The server does not need to own the whole infrastructure layer. It needs to answer the operational question cleanly.
Controlled mutation servers can be production-ready, but only when the guardrails are real. Retrying a failed check, dispatching a runner, creating a branch, posting a comment, or capturing an artifact can be safe with scoped permissions and audit. Merging code, changing secrets, altering production data, or modifying branch protection sits in a higher-risk tier.
Broad filesystem, database, browser, and shell servers are not automatically banned. They are just not production-ready by default. Use them in disposable workspaces. Restrict them in shared environments. Keep them away from secrets. Assume the default configuration is too broad until proven otherwise.
Recommended production-tier shortlist
For my own stack, the shortlist is boring by design.
Start with a repo-status server. Agents need to know branch, PR, review, check, and queue state without reading ten pages of GitHub output.
Add a CI summary server. Not raw logs. A failure summary with command, package, affected path, likely owner, and the smallest useful excerpt. This pairs well with an auto-fixer that fixes its own CI failures.
Add a ship-gate server. CI Gate is the model. One aggregate required check that fails closed and tells the system whether it can ship.
Add a queue coordination server. k2k-merge-keeper plus Mergify with a 5-minute settling window exists because merge state is not instant truth. Agents need to respect that.
Add an artifact handoff server. ContextCapture is the reference pattern: Lovable output becomes a versioned npm-style artifact, then enters the parent app through a controlled boundary.
Add infrastructure status, not infrastructure omnipotence. infra-gha-runners-fly and Fly runner dispatch are useful because runner availability affects throughput. That does not mean every agent gets full infrastructure admin rights.
Then stop. Add more only when the missing tool has a clear job.
How this maps to real delivery
This is why I position the work as both Fractional AI CTO and Senior AI Systems Architect. The strategy and the infrastructure are not separate. If the system cannot survive real traffic, the strategy was theatre.
The operating side is where 12 years of experience matters. Betty Blocks public-sector and no-code platform experience, Dutch National Police government project experience, and Amsterdam to Atlanta operator scars all point to the same lesson: software fails at the boundaries. MCP is boundary design.
AI consulting services that stop at diagrams will miss this. The value is in the path from tool schema to branch protection to CI to queue to audit to deploy. No lock-in, your accounts, your code, runbooks included.
Inspired by frustration. I mean that literally.
How this benchmark is bounded, in plain terms
I want to be precise about what is measured and what is judged, because a benchmark that hides its own edges is just marketing.
The hard numbers all come from one server, AppHandoff, because it is ours and we instrumented it: 69 native tools, 7 brokered providers, 222 errors and 0 timeouts over 30 days, the 50-to-350-millisecond latency band, the 10-second timeout, the circuit breaker (five errors in five minutes, then fifteen minutes open), and the 60-per-minute rate limit. The other servers are scored on architecture and survival, not per-call telemetry, and the table says so plainly rather than inventing latency for them.
The definitions matter too. An "error" here is a recorded non-success on the tool path: a validation rejection, an upstream failure, or a model-invalid call. A "timeout" is a call that hit a hard time ceiling, and on the production transport that class is structurally rare for the reason explained earlier. "Retries" are zero by design on the broker. "Survived" means the server stayed useful while agents shipped real work against it, not that it passed a one-time test.
That is how this stays useful to other builders. Named server, real numbers, real dates, and a clear line between what was metered and what was judged.
Final take
MCP is useful because it gives agents a cleaner way to touch systems. MCP is dangerous when teams confuse a cleaner interface with a safer system.
The servers that survived real traffic were narrow, observable, scoped, and tied to actual operating gates. The ones that failed were broad, vague, over-authorized, or too expensive in latency and context. That is the benchmark result.
If you want the operator version of this built around your actual repos and ship gates, talk to us.