Four layers every production MCP server needs — transport, JSON-RPC shape, tool definitions, and guardrails. With real code for rate limiting, circuit breakers, and structured errors.
TL;DR — A production MCP server is not "an AI plugin." It's a tool runtime with four hard-to-get-right layers: transport (stdio vs HTTP), JSON-RPC 2.0 routing, tool definitions (schemas + handlers), and guardrails (rate limits, circuit breakers, response caps). Most MCP servers you find on GitHub nail layer 3 and punt on layer 4 — which is exactly the layer that matters when Claude hits your server 60 times a minute at 2am.
I've shipped two production MCP servers: AppHandoff and MCP Beast. This post is what I'd tell the version of me who started the first one — the architecture decisions that actually matter once real agents start hammering your endpoints.
Use stdio when the server needs local files or CLI credentials. Use HTTP when multiple users share the server or you want a single cloud deployment.
My actual ~/.cursor/mcp.json:
{
  "mcpServers": {
    "apphandoff": { "type": "http", "url": "https://api.apphandoff.com/api/mcp-bot", "autoConnect": true },
    "context7": { "type": "http", "url": "https://mcp.context7.com/mcp", "headers": { "X-API-Key": "[REDACTED]" } },
    "fly": { "type": "stdio", "command": "fly", "args": ["mcp", "server"] },
    "mailgun": {
      "type": "stdio",
      "command": "npx",
      "args": ["-y", "@mailgun/mcp-server"],
      "env": { "MAILGUN_API_KEY": "[REDACTED]" }
    }
  }
}
The choice has real operational consequences. stdio servers inherit the parent process's file descriptors and credentials, which is fantastic for local dev and terrible for supply-chain safety — every npx -y some-mcp-server is running untrusted code with your shell's permissions. I only use stdio for servers I've audited or vendors I trust (Fly, Mailgun, my own). Anything else goes over HTTP behind a bearer token so the blast radius is bounded by what the token can do.
HTTP has its own failure modes. SSE streams drop silently when a load balancer's 60-second idle timeout fires, and your agent hangs waiting for a tools/call response that will never arrive. Either keep idle timeouts above your longest tool's p99 latency, or use plain HTTP POST and design tools to return fast (long work goes to a job queue with a separate get_job_status tool). AppHandoff's proxy uses POST-only for this reason; the two times I tried SSE in production, I regretted it.
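Here is a minimal sketch of the job-queue pattern. It assumes an in-memory Map as the job store, which is not durable: jobs vanish on restart, so a real deployment needs something else. startJob, getJobStatus, and the JobStatus shape are hypothetical names. Only get_job_status comes from the text above. The idea is that the tool handler returns a job_id in milliseconds, and the agent then polls a cheap, idempotent status tool.

```typescript
import { randomUUID } from 'node:crypto'

// Hypothetical in-memory job store; swap for a durable queue or table in production.
type JobStatus =
  | { state: 'pending' }
  | { state: 'done'; result: unknown }
  | { state: 'failed'; error: string }

const jobs = new Map<string, JobStatus>()

// Handler for the "start" tool: schedule the slow work and return immediately,
// so the tools/call response never waits on it.
export function startJob(work: () => Promise<unknown>): { job_id: string } {
  const jobId = randomUUID()
  jobs.set(jobId, { state: 'pending' })
  // Promise.resolve().then(work) also catches synchronous throws from work().
  Promise.resolve()
    .then(work)
    .then(
      (result) => jobs.set(jobId, { state: 'done', result }),
      (err) => jobs.set(jobId, { state: 'failed', error: String(err) }),
    )
  return { job_id: jobId }
}

// Handler for get_job_status: read-only, so the agent can poll it safely.
export function getJobStatus(jobId: string): JobStatus | { state: 'unknown' } {
  return jobs.get(jobId) ?? { state: 'unknown' }
}
```

Because the status tool has no side effects, its description can say it is safe to retry.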
When AppHandoff proxies tool calls to user-configured remote MCP servers:
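// apphandoff/apps/web/lib/mcp-foreign.ts
// fetchWithTimeout(url, init, ms) is a helper defined elsewhere (not shown) that aborts after ms.
type ForeignTool = { name: string; config: { mcp_server_url: string; remote_tool_name?: string } }

async function execCustomMcp(
  tool: ForeignTool,
  args: Record<string, unknown>,
  secrets: { MCP_AUTH_TOKEN?: string },
  dryRun: boolean,
) {
  const serverUrl = tool.config.mcp_server_url
  const headers = {
    'Content-Type': 'application/json',
    ...(secrets.MCP_AUTH_TOKEN ? { Authorization: `Bearer ${secrets.MCP_AUTH_TOKEN}` } : {}),
  }
  // Dry run: call tools/list instead of a real tool, to check that the URL and token work.
  if (dryRun) {
    const res = await fetchWithTimeout(serverUrl, {
      method: 'POST', headers,
      body: JSON.stringify({ jsonrpc: '2.0', id: 1, method: 'tools/list', params: {} }),
    }, 10_000)
    if (!res.ok) return { error: `HTTP ${res.status}`, status: 'error' }
    const json = await res.json()
    return { data: { tools_count: (json?.result?.tools ?? []).length }, status: 'success' }
  }
  const res = await fetchWithTimeout(serverUrl, {
    method: 'POST', headers,
    body: JSON.stringify({
      jsonrpc: '2.0', id: 1, method: 'tools/call',
      params: { name: tool.config.remote_tool_name ?? tool.name, arguments: args },
    }),
  }, 10_000)
  // A 401 or a proxy's 502 page usually isn't JSON, so check the status before parsing.
  if (!res.ok) return { error: `HTTP ${res.status}`, status: 'error' }
  const json = await res.json()
  // Protocol-level failures (unknown tool, bad params) arrive as a JSON-RPC error object.
  if (json.error) return { error: json.error.message, status: 'error' }
  // Tool-level failures arrive as a normal result with isError: true.
  if (json.result?.isError) return { error: 'remote tool returned isError', data: json.result, status: 'error' }
  return { data: json.result, status: 'success' }
}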
Every outbound call is wrapped in a 10-second timeout. If a remote MCP server blocks for 60 seconds, the agent gives up anyway and marks the tool broken. Failing fast with a clear error is better.
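The post doesn't show fetchWithTimeout. Below is one plausible shape, assuming Node 18+ with the global fetch and AbortController. The name matches the call sites above, but the body is a sketch, not AppHandoff's code.

```typescript
// Hypothetical helper: abort the request if no response arrives within timeoutMs.
export async function fetchWithTimeout(
  url: string,
  init: RequestInit,
  timeoutMs: number,
): Promise<Response> {
  const controller = new AbortController()
  const timer = setTimeout(() => controller.abort(), timeoutMs)
  try {
    // fetch rejects with an abort error once controller.abort() fires.
    return await fetch(url, { ...init, signal: controller.signal })
  } finally {
    // Cleared once headers arrive; reading the body is not covered by this timeout.
    clearTimeout(timer)
  }
}
```

Because the timer is cleared when headers arrive, a server that sends headers quickly and then stalls mid-body is not covered. Extending the guard means keeping the timer alive until the body has been fully read.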
Three rules for tool design:
1. search_github_issues beats tool_1. Include units, ranges, and enum values in the description.
2. list_handoff_tickets is safe to retry; send_email is not. Say so in the description.
3. {"status": "error", "code": "rate_limited", "retry_after_ms": 30000} beats Error: 429.
The third rule is the one I missed on AppHandoff v1. Once I made errors structured, the agent started waiting and retrying on its own. That single change cut support tickets roughly in half.
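Here is one way to apply the third rule inside the MCP protocol, as a sketch. The tool handler returns a normal CallToolResult with isError: true and puts the structured payload in a text content block, so the model sees code and retry_after_ms verbatim. toolError is a hypothetical helper, and the message field is an addition to the payload shown above.

```typescript
// Shape of an MCP tools/call result, trimmed to the text-content case.
type CallToolResult = {
  content: { type: 'text'; text: string }[]
  isError?: boolean
}

// Hypothetical helper: wrap a machine-readable error in a tool result.
export function toolError(code: string, message: string, retryAfterMs?: number): CallToolResult {
  const payload: Record<string, unknown> = { status: 'error', code, message }
  // Only include retry_after_ms when waiting would actually help.
  if (retryAfterMs !== undefined) payload.retry_after_ms = retryAfterMs
  return { isError: true, content: [{ type: 'text', text: JSON.stringify(payload) }] }
}

// What a rate-limited handler might return instead of a bare "Error: 429".
export const rateLimited = toolError('rate_limited', 'Project exceeded 60 calls/min', 30_000)
```

Leaving retry_after_ms out of non-retryable errors (a missing ticket, a bad argument) tells the agent not to loop on them.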
MCP exposes three primitive types and most teams conflate them. Tools are verbs the agent calls — they have side effects or compute answers, and the agent decides when to invoke them. Resources are nouns the agent reads — a file, a row, a rendered page, addressable by URI, fetched on demand. Prompts are parameterized templates the user picks from a menu. If you model everything as a tool, the agent has to guess from descriptions which ones are safe to call speculatively, and it will guess wrong on the expensive ones. My rule: anything idempotent and cheap that the agent might want to prefetch goes on the resource surface; anything with side effects or cost stays a tool. MCP Beast got this wrong on the first pass — we shipped read_config as a tool, and Claude called it on every turn "just in case." Moving it to a resource cut per-session tool calls by about a third.
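To make the tool/resource split concrete, here is a minimal JSON-RPC 2.0 dispatcher. It assumes one hypothetical config resource at config://current and one hypothetical update_config tool. The method names and result shapes follow the MCP spec; the initialize handshake, notifications, and transport are left out, so this is a sketch of the routing layer only.

```typescript
type JsonRpcRequest = { jsonrpc: '2.0'; id: number | string; method: string; params?: any }
type JsonRpcResponse =
  | { jsonrpc: '2.0'; id: number | string; result: unknown }
  | { jsonrpc: '2.0'; id: number | string; error: { code: number; message: string } }

const CONFIG_URI = 'config://current' // hypothetical resource URI
const config: Record<string, unknown> = { region: 'iad', max_retries: 3 } // hypothetical contents

export function handle(req: JsonRpcRequest): JsonRpcResponse {
  const ok = (result: unknown): JsonRpcResponse => ({ jsonrpc: '2.0', id: req.id, result })
  const fail = (code: number, message: string): JsonRpcResponse =>
    ({ jsonrpc: '2.0', id: req.id, error: { code, message } })

  switch (req.method) {
    // Cheap, idempotent reads live on the resource surface, so clients can prefetch them...
    case 'resources/list':
      return ok({ resources: [{ uri: CONFIG_URI, name: 'Current config', mimeType: 'application/json' }] })
    case 'resources/read':
      if (req.params?.uri !== CONFIG_URI) return fail(-32602, 'Unknown resource') // invalid params
      return ok({ contents: [{ uri: CONFIG_URI, mimeType: 'application/json', text: JSON.stringify(config) }] })
    // ...while anything with side effects stays a tool the agent must choose to call.
    case 'tools/list':
      return ok({ tools: [{
        name: 'update_config',
        description: 'Overwrites config keys. Has side effects; not safe to call speculatively.',
        inputSchema: { type: 'object' },
      }] })
    case 'tools/call':
      if (req.params?.name !== 'update_config') return fail(-32602, `Unknown tool: ${req.params?.name}`)
      Object.assign(config, req.params.arguments ?? {})
      return ok({ content: [{ type: 'text', text: 'config updated' }] })
    default:
      return fail(-32601, 'Method not found') // standard JSON-RPC 2.0 error code
  }
}
```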
The schema in a tool definition isn't just documentation — it's the contract the model generates against. Three things bite people: (1) additionalProperties: false is almost always what you want, otherwise the model invents fields and you silently drop them; (2) prefer enum over free-form strings for anything with a known set — the model picks from the list instead of hallucinating; (3) put units in the field name, not just the description — timeout_ms beats timeout because the description gets summarized away when the tool list is large.
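Here is a tool definition that applies all three rules. search_github_issues and its fields are hypothetical. The unknownFields helper is also hypothetical: it mimics what additionalProperties: false gives you when a server doesn't run a full JSON Schema validator, naming the invented fields so the error can tell the model what to drop.

```typescript
export const searchGithubIssuesTool = {
  name: 'search_github_issues',
  description:
    'Search issues in one GitHub repo. Read-only and safe to retry. ' +
    'Returns at most max_results issues (1-50).',
  inputSchema: {
    type: 'object',
    properties: {
      repo: { type: 'string', description: 'owner/name, e.g. "octocat/hello-world"' },
      state: { type: 'string', enum: ['open', 'closed', 'all'] }, // enum instead of free text
      max_results: { type: 'integer', minimum: 1, maximum: 50 },
      timeout_ms: { type: 'integer', minimum: 100, maximum: 10_000 }, // unit lives in the name
    },
    required: ['repo'],
    additionalProperties: false, // reject invented fields instead of silently dropping them
  },
} as const

// Return argument names the schema doesn't declare, so the error can list them.
export function unknownFields(args: Record<string, unknown>): string[] {
  const allowed = Object.keys(searchGithubIssuesTool.inputSchema.properties)
  return Object.keys(args).filter((k) => !allowed.includes(k))
}
```

For example, a model that sends timeout instead of timeout_ms gets back an error naming timeout, rather than a call that quietly runs with the default.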
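Rate limiting is a trailing 60-second count per project, read from the call log in Postgres:
// mcp-foreign.ts: checkRateLimit
// `supa` is the Supabase client, created elsewhere in this module.
const RATE_LIMIT_PER_MIN = 60
const RATE_LIMIT_CACHE_TTL_MS = 5_000

type RateLimitResult = { allowed: boolean; retryAfterMs?: number }
const rateLimitCache = new Map<string, RateLimitResult & { expiresAt: number }>()

export async function checkRateLimit(projectId: string): Promise<RateLimitResult> {
  // Serve a recent decision from memory so busy projects don't hit Postgres on every call.
  const cached = rateLimitCache.get(projectId)
  if (cached && Date.now() < cached.expiresAt) {
    return { allowed: cached.allowed, retryAfterMs: cached.retryAfterMs }
  }
  // Count this project's calls in the last 60 seconds; head: true returns the count without rows.
  const oneMinuteAgo = new Date(Date.now() - 60_000).toISOString()
  const { count } = await supa
    .from('foreign_tool_calls')
    .select('*', { count: 'exact', head: true })
    .eq('project_id', projectId)
    .gte('created_at', oneMinuteAgo)
  // count is null if the query fails, so this check fails open.
  const result: RateLimitResult = (count ?? 0) >= RATE_LIMIT_PER_MIN
    ? { allowed: false, retryAfterMs: 30_000 }
    : { allowed: true }
  rateLimitCache.set(projectId, { ...result, expiresAt: Date.now() + RATE_LIMIT_CACHE_TTL_MS })
  return result
}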
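The 5-second cache cuts Postgres reads per call from 2 to ~0.03. The trade-off: an "allowed" answer is reused for up to 5 seconds, so a burst can briefly overshoot 60 calls per minute.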
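If the server runs as a single process, the same trailing-60-second count can live in memory instead of Postgres. This sketch assumes exactly that, and tryAcquire and callLog are hypothetical names. With several instances behind a load balancer, each would keep its own count, which is why a shared store like the foreign_tool_calls table is needed there. Unlike checkRateLimit, this version records the call itself. It also computes retryAfterMs from the oldest timestamp in the window instead of returning a fixed 30 seconds.

```typescript
const LIMIT_PER_MIN = 60
const WINDOW_MS = 60_000
const callLog = new Map<string, number[]>() // projectId -> call timestamps in ms, oldest first

export function tryAcquire(
  projectId: string,
  now: number = Date.now(),
): { allowed: boolean; retryAfterMs?: number } {
  // Keep only the timestamps still inside the trailing window.
  const recent = (callLog.get(projectId) ?? []).filter((t) => t > now - WINDOW_MS)
  if (recent.length >= LIMIT_PER_MIN) {
    callLog.set(projectId, recent)
    // The oldest call in the window is the next one to expire and free a slot.
    return { allowed: false, retryAfterMs: recent[0] + WINDOW_MS - now }
  }
  recent.push(now)
  callLog.set(projectId, recent)
  return { allowed: true }
}
```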
Five errors in a 5-minute window open the circuit for 15 minutes. Three successful trial calls in the half-open state close it:
const CIRCUIT_ERROR_THRESHOLD = 5
const CIRCUIT_WINDOW_MS = 5 * 60_000
const CIRCUIT_COOLDOWN_MS = 15 * 60_000
const HALF_OPEN_TRIAL_SUCCESSES = 3
Without this, a downstream outage turns into cascading timeouts that hold Claude's context hostage.
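The post shows only the constants. A minimal state machine that uses them might look like the sketch below; the CircuitBreaker class is hypothetical. One detail is an assumption rather than something stated above: any failure during half-open reopens the circuit immediately, which is the conventional behavior. Time is passed in explicitly so the transitions are easy to test.

```typescript
const CIRCUIT_ERROR_THRESHOLD = 5
const CIRCUIT_WINDOW_MS = 5 * 60_000
const CIRCUIT_COOLDOWN_MS = 15 * 60_000
const HALF_OPEN_TRIAL_SUCCESSES = 3

type CircuitState = 'closed' | 'open' | 'half_open'

export class CircuitBreaker {
  private state: CircuitState = 'closed'
  private errorTimes: number[] = []
  private openedAt = 0
  private trialSuccesses = 0

  // Call before each outbound request; false means fail fast without contacting the server.
  canRequest(now: number = Date.now()): boolean {
    if (this.state === 'open' && now - this.openedAt >= CIRCUIT_COOLDOWN_MS) {
      this.state = 'half_open' // cooldown elapsed: let trial calls through
      this.trialSuccesses = 0
    }
    return this.state !== 'open'
  }

  recordSuccess(): void {
    if (this.state === 'half_open' && ++this.trialSuccesses >= HALF_OPEN_TRIAL_SUCCESSES) {
      this.state = 'closed'
      this.errorTimes = []
    }
  }

  recordFailure(now: number = Date.now()): void {
    if (this.state === 'half_open') return this.trip(now) // assumption: one trial failure reopens
    // Keep only errors inside the 5-minute window, then add this one.
    this.errorTimes = this.errorTimes.filter((t) => now - t < CIRCUIT_WINDOW_MS)
    this.errorTimes.push(now)
    if (this.errorTimes.length >= CIRCUIT_ERROR_THRESHOLD) this.trip(now)
  }

  get current(): CircuitState {
    return this.state
  }

  private trip(now: number): void {
    this.state = 'open'
    this.openedAt = now
    this.errorTimes = []
  }
}
```

While the circuit is open, the proxy can return a structured error immediately instead of spending 10 seconds per call on a server that is known to be down.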
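Response bodies are read as a stream and capped at 2 MiB:
const MAX_FOREIGN_BODY_BYTES = 2 * 1024 * 1024 // 2 MiB
// Reads the body chunk by chunk and aborts as soon as it passes maxBytes,
// so an oversized response is never fully buffered in memory.
async function readBodyCapped(res: Response, maxBytes: number): Promise<string> {
  if (!res.body) return '' // e.g. a 204 with no body
  const reader = res.body.getReader()
  let totalBytes = 0
  const chunks: Uint8Array[] = []
  while (true) {
    const { done, value } = await reader.read()
    if (done) break
    totalBytes += value.byteLength
    if (totalBytes > maxBytes) {
      await reader.cancel() // stop pulling the rest of the stream from the server
      throw new Error(`Response exceeded ${maxBytes} bytes`)
    }
    chunks.push(value)
  }
  return new TextDecoder().decode(Buffer.concat(chunks))
}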
Agents don't know what's big. A 40 MB JSON payload will push the context over the limit and crash the session.
Log every call with a status you can aggregate on: success | error | timeout | auth_failure.
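// rows: recent call-log rows, each with a status from the list above.
const total = rows.length
const totalErrors = rows.filter(r => r.status === 'error').length
const totalTimeouts = rows.filter(r => r.status === 'timeout').length
const totalAuthFailures = rows.filter(r => r.status === 'auth_failure').length
// Percentage rounded to one decimal place (e.g. 12.5); an empty window counts as 0%.
const errorRatePct = total === 0 ? 0 : Math.round(
  ((totalErrors + totalTimeouts + totalAuthFailures) / total) * 1000
) / 10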
When the error rate on a single tool jumps from 2% to 40%, you need to know within 5 minutes.
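Here is a sketch of the per-tool check that would catch that jump. It assumes call-log rows with tool, status, and createdAt fields. The 20% threshold and the 10-call minimum are placeholder values, not numbers from the post. The minimum keeps one failure out of two calls from paging anyone. The rate math matches the snippet above.

```typescript
type CallStatus = 'success' | 'error' | 'timeout' | 'auth_failure'
type CallRow = { tool: string; status: CallStatus; createdAt: number } // assumed row shape

const ALERT_WINDOW_MS = 5 * 60_000
const ALERT_ERROR_RATE_PCT = 20 // placeholder threshold
const MIN_CALLS_FOR_ALERT = 10 // placeholder minimum sample size

export function toolsToAlert(
  rows: CallRow[],
  now: number = Date.now(),
): { tool: string; errorRatePct: number }[] {
  // Tally calls and failures per tool inside the 5-minute window.
  const byTool = new Map<string, { total: number; failed: number }>()
  for (const r of rows) {
    if (now - r.createdAt > ALERT_WINDOW_MS) continue
    const s = byTool.get(r.tool) ?? { total: 0, failed: 0 }
    s.total += 1
    if (r.status !== 'success') s.failed += 1 // error, timeout, and auth_failure all count
    byTool.set(r.tool, s)
  }
  const alerts: { tool: string; errorRatePct: number }[] = []
  for (const [tool, { total, failed }] of byTool) {
    if (total < MIN_CALLS_FOR_ALERT) continue
    const errorRatePct = Math.round((failed / total) * 1000) / 10
    if (errorRatePct >= ALERT_ERROR_RATE_PCT) alerts.push({ tool, errorRatePct })
  }
  return alerts
}
```

Running this on a 1-minute schedule against the last 5 minutes of the call log meets the "within 5 minutes" bar.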
Every production MCP server ends up being multi-tenant whether you planned for it or not — the moment a second customer connects, you need per-tenant rate limits, per-tenant audit logs, and per-tenant blast radius on a leaked credential. AppHandoff issues bearer tokens scoped to a single project; the token resolves to a project_id on every call, and every query (foreign_tool_calls, handoff_tickets, rate-limit checks) is filtered by it at the database layer via Supabase RLS. That means a compromised token can't read another tenant's data even if a tool handler forgets to filter — the policy catches it.
The common failure mode I see in other people's MCP servers: a single shared API key per server, passed through the Authorization header unchanged to every downstream call. When that key leaks (and it will — agents paste tool output into logs, screenshots, support tickets) you're rotating credentials across every tenant at once. Issue scoped tokens, log which token made each call, and make revocation a single SQL update instead of a deploy.
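Here is a sketch of project-scoped tokens. An in-memory Map stands in for a tokens table with hypothetical columns token_hash, project_id, and revoked_at. Storing only a SHA-256 hash means a database leak doesn't hand out working tokens, and revocation is a single field update. issueToken, resolveProject, revokeToken, and the mcp_ prefix are all illustrative names.

```typescript
import { createHash, randomBytes } from 'node:crypto'

// Stand-in for a tokens table keyed by token hash (hypothetical schema).
type TokenRow = { projectId: string; revokedAt: Date | null }
const tokensByHash = new Map<string, TokenRow>()

const sha256 = (s: string): string => createHash('sha256').update(s).digest('hex')

// Issue a random token bound to one project; only its hash is stored.
export function issueToken(projectId: string): string {
  const token = 'mcp_' + randomBytes(32).toString('base64url')
  tokensByHash.set(sha256(token), { projectId, revokedAt: null })
  return token // shown to the user once
}

// Resolve an Authorization header to a project_id, or null if missing, unknown, or revoked.
export function resolveProject(authHeader: string | undefined): string | null {
  const match = /^Bearer (.+)$/.exec(authHeader ?? '')
  if (!match) return null
  const row = tokensByHash.get(sha256(match[1]))
  return row && row.revokedAt === null ? row.projectId : null
}

// Revocation is one row update, e.g. UPDATE tokens SET revoked_at = now() WHERE token_hash = $1.
export function revokeToken(token: string): void {
  const row = tokensByHash.get(sha256(token))
  if (row) row.revokedAt = new Date()
}
```

Every handler then receives the resolved project_id rather than the raw token. That project_id is also what per-tenant rate limits and call logs key on.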
See Fly.io + Supabase: the reliability stack that scales.
If the only requirement is "let Claude read our docs," a REST endpoint + retrieval is simpler. MCP earns its overhead when multiple AI clients hit the same surface, the agent needs to take actions (not just read), and per-tool audit logs matter.
See the full decision framework in MCP Server vs REST API.
Need a production MCP server built? Describe the system you want to make AI-native and I'll tell you what the first version should look like.