The Gap Between Demo and Production
The first MCP server we built took ninety minutes. The second one — same scope, but ready for production — took two weeks. The gap between those numbers is what this article is about.
MCP (Model Context Protocol) lets AI agents call tools through a standard interface. The protocol itself is elegant. The operational concerns around running it in production are not.
If you're building an MCP server that real customers will use, plan for three problems the spec doesn't solve: authentication, rate limiting, and observability.
Problem 1 — Authentication
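The MCP spec treats authorization as optional and leaves most deployment decisions (who issues credentials, how they rotate, what gets audited) to the implementer. That's the right design decision for a protocol. It's the wrong starting point for production.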
The three patterns in the wild
Pattern A: bearer token per agent. Each agent gets a long-lived token. Simple to implement, easy to revoke. Fits most internal-use cases.
Pattern B: OAuth2 with refresh. The user authorizes the agent once; the agent gets a refresh token and uses it to rotate short-lived access tokens. Required if the MCP server accesses user-scoped data on a third-party system (Gmail, GitHub, etc.).
Pattern C: mTLS between trusted nodes. Used when MCP servers and agents are in the same controlled network. Strongest guarantees, highest setup cost.
We default to Pattern A for internal tools and Pattern B for anything customer-facing. Pattern C only when compliance requires it.
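As a rough illustration of Pattern A, here is a minimal sketch of issuing, checking, and revoking per-agent bearer tokens. The in-memory `TOKEN_DIGESTS` store and the function names are hypothetical; a real deployment would likely keep digests in a database or secrets manager and add expiry.

```python
import hashlib
import hmac
import secrets
from typing import Dict, Optional

# Hypothetical in-memory store: SHA-256 digest of a token -> agent identifier.
# Only digests are stored, so a leaked store does not leak usable tokens.
TOKEN_DIGESTS: Dict[str, str] = {}


def _digest(token: str) -> str:
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


def issue_token(agent_id: str) -> str:
    """Create a random token for an agent and store only its digest."""
    token = secrets.token_urlsafe(32)
    TOKEN_DIGESTS[_digest(token)] = agent_id
    return token


def revoke_agent(agent_id: str) -> None:
    """Revoke every token that belongs to an agent."""
    for digest in [d for d, a in TOKEN_DIGESTS.items() if a == agent_id]:
        del TOKEN_DIGESTS[digest]


def authenticate(authorization_header: Optional[str]) -> Optional[str]:
    """Return the agent id for a valid 'Bearer <token>' header, else None."""
    if not authorization_header:
        return None
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer" or not token.strip():
        return None
    presented = _digest(token.strip())
    # Each individual comparison runs in constant time.
    for stored, agent_id in TOKEN_DIGESTS.items():
        if hmac.compare_digest(stored, presented):
            return agent_id
    return None
```

The agent identifier returned by `authenticate` is what the logging and rate-limiting layers later key on, which is one reason to issue a separate token per agent rather than sharing one.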
What to log on every authenticated call
- Agent identifier (which agent is calling)
- Tool name (which method was invoked)
- Truncated arguments (full args may contain PII — store hashes or summaries)
- Latency
- Result status (success / business-error / system-error)
This is the trace data you'll need when something goes wrong. Skipping it during the initial build is, in our experience, the most common production regret.
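One way to produce such a record is sketched below. The field names, the `PII_FIELDS` set, and the 64-character truncation limit are assumptions chosen for illustration, not a fixed schema.

```python
import hashlib
import json
import time
from typing import Any, Dict

# Hypothetical set of argument names treated as PII; adjust per tool.
PII_FIELDS = {"email", "phone", "name"}


def summarize_args(args: Dict[str, Any], max_len: int = 64) -> Dict[str, str]:
    """Hash PII fields and truncate long values before they reach the logs."""
    summary: Dict[str, str] = {}
    for key, value in args.items():
        text = str(value)
        if key in PII_FIELDS:
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            summary[key] = "sha256:" + digest[:16]
        elif len(text) > max_len:
            summary[key] = text[:max_len] + "...(truncated)"
        else:
            summary[key] = text
    return summary


def build_log_record(agent_id: str, tool: str, args: Dict[str, Any],
                     started_at: float, status: str) -> str:
    """Return one JSON log line with the five fields listed above.

    `started_at` is a time.monotonic() reading taken when the call began.
    """
    if status not in {"success", "business-error", "system-error"}:
        raise ValueError(f"unknown status: {status}")
    record = {
        "agent_id": agent_id,
        "tool": tool,
        "args": summarize_args(args),
        "latency_ms": round((time.monotonic() - started_at) * 1000, 2),
        "status": status,
    }
    return json.dumps(record, sort_keys=True)
```

A plain SHA-256 of a low-entropy value such as an email address can be reversed by guessing, so a keyed hash (for example HMAC with a secret key) is probably the safer choice where that matters.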
Problem 2 — Rate Limiting
An MCP server with no rate limits is one buggy agent away from exhausting downstream APIs or your database.
Two layers, not one
Layer 1 — Per-agent rate limits. Each agent identity has a budget (calls/minute, calls/day). Catches runaway agents.
Layer 2 — Per-tool rate limits. Each tool has its own budget (especially expensive tools — database writes, third-party API calls). Catches expensive misuse.
These layers compose. An agent can be under its global budget but be blocked from an expensive tool because that tool's budget is exhausted.
The right algorithm
Token bucket. It's been the right answer for twenty years and it's still right. A sliding window log enforces limits more exactly at window boundaries, but it has to store a timestamp per request and is harder to operate. Stick with token bucket unless you have a specific reason not to.
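To make the algorithm and the two-layer composition concrete, here is a minimal single-process sketch. The class and function names are hypothetical, and the injectable `clock` exists only to make the behavior easy to test; a multi-instance deployment would need shared state instead of in-memory counters.

```python
import time
from typing import Callable, Optional


class TokenBucket:
    """Holds up to `capacity` tokens, refilled at `rate` tokens per second."""

    def __init__(self, capacity: float, rate: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.rate = rate
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def _refill(self) -> None:
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now

    def available(self, cost: float = 1.0) -> bool:
        self._refill()
        return self.tokens >= cost

    def take(self, cost: float = 1.0) -> None:
        self.tokens -= cost

    def retry_after(self, cost: float = 1.0) -> float:
        """Seconds until `cost` tokens will be available."""
        self._refill()
        return max(0.0, (cost - self.tokens) / self.rate)


def check_call(agent_bucket: TokenBucket, tool_bucket: TokenBucket) -> Optional[str]:
    """Return None if the call is allowed, else the name of the exhausted layer.

    Tokens are deducted only when both layers have budget, so a call blocked
    by one layer does not consume the other layer's budget.
    """
    if not agent_bucket.available():
        return "per-agent"
    if not tool_bucket.available():
        return "per-tool"
    agent_bucket.take()
    tool_bucket.take()
    return None
```

The sketch is not thread-safe; under concurrency the check and the deduction would need to happen atomically.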
Return codes that agents actually understand
Don't just return 429. Return:
- 429 with a Retry-After header that says when the agent should retry
- 429 with a structured body explaining why (per-agent vs per-tool budget)
- A different error code for "permanently denied" so the agent doesn't retry forever
Agents that don't get retry guidance retry aggressively. Aggressive retries can take down your downstream services faster than the original request would have.
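A sketch of what those responses might look like, assuming an HTTP transport. The body field names and the choice of 403 for "permanently denied" are assumptions for illustration; the point is only that retryable and non-retryable refusals are distinguishable.

```python
import json
import math
from typing import Dict, Tuple

Response = Tuple[int, Dict[str, str], str]  # (status code, headers, JSON body)


def rate_limited(layer: str, retry_after_seconds: float) -> Response:
    """Build a 429 that tells the agent when to retry and which budget it hit."""
    if layer not in {"per-agent", "per-tool"}:
        raise ValueError(f"unknown layer: {layer}")
    # Retry-After takes whole seconds; round up so agents never retry early.
    seconds = max(1, math.ceil(retry_after_seconds))
    body = {"error": "rate_limited", "limit": layer, "retry_after_seconds": seconds}
    headers = {"Retry-After": str(seconds), "Content-Type": "application/json"}
    return 429, headers, json.dumps(body)


def permanently_denied(reason: str) -> Response:
    """Build a 403 for calls that will never succeed, so agents stop retrying."""
    body = {"error": "permanently_denied", "reason": reason, "retryable": False}
    return 403, {"Content-Type": "application/json"}, json.dumps(body)
```

Repeating the retry delay in the body as well as the header is a hedge: some agent frameworks surface only the response body to the model.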
Problem 3 — Observability
Logs aren't enough. You need traces.
What "trace" means for MCP
A single user request often crosses:
- The agent (LLM call to decide what to do)
- The MCP server (handles the tool call)
- A downstream service (the actual API or database)
Without a trace ID propagated across all three, debugging a slow request is archaeology. With it, you find the bottleneck in five minutes.
We use OpenTelemetry. The MCP server accepts a trace context header, propagates it to downstream calls, and emits spans for each tool invocation. The agent emits spans for each LLM call. Together they form a single trace.
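In practice the OpenTelemetry SDK's propagators handle this for you. The standard-library sketch below only shows the mechanics, assuming the context arrives as a W3C Trace Context `traceparent` header; the function names are hypothetical.

```python
import re
import secrets
from typing import Optional, Tuple

# traceparent format: version "00", 32-hex trace id, 16-hex parent span id, 2-hex flags.
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(header: Optional[str]) -> Optional[Tuple[str, str, str]]:
    """Return (trace_id, parent_span_id, flags), or None if missing or invalid."""
    match = _TRACEPARENT.fullmatch(header.strip()) if header else None
    if not match:
        return None
    trace_id, span_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None  # all-zero ids are invalid in W3C Trace Context
    return trace_id, span_id, flags


def start_tool_span(incoming: Optional[str]) -> Tuple[str, str, str]:
    """Continue the caller's trace, or start a new one if there is no context.

    Returns (trace_id, span_id, outgoing traceparent header value).
    """
    parsed = parse_traceparent(incoming)
    trace_id = parsed[0] if parsed else secrets.token_hex(16)
    flags = parsed[2] if parsed else "01"
    span_id = secrets.token_hex(8)  # new span id for this tool invocation
    return trace_id, span_id, f"00-{trace_id}-{span_id}-{flags}"
```

The outgoing value is what the MCP server would attach to its downstream call, so the agent, the server, and the downstream service all report spans under the same trace id.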
What to alert on
- Error rate above baseline — but only after the baseline is established. Alerting on day one produces noise.
- P99 latency spikes — sudden jumps usually mean a downstream API is degraded
- Tool-specific budget exhaustion — early warning of a misbehaving agent
- Spikes in auth failures, which can indicate a compromise or a misconfigured deployment
Avoid alerting on every error. Too many false alerts erode your team's ability to respond when a real one fires.
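The "only after the baseline is established" rule can be sketched as a small gate in front of the alert. The class name, the default thresholds, and the 5% floor are all hypothetical values for illustration; most monitoring backends can express the same idea in their own query language.

```python
from collections import deque


class ErrorRateAlert:
    """Fire only when the recent error rate clearly exceeds an established baseline."""

    def __init__(self, baseline_calls: int = 1000, window: int = 100,
                 factor: float = 3.0) -> None:
        self.baseline_calls = baseline_calls  # calls needed before alerting starts
        self.factor = factor                  # how far above baseline counts as a spike
        self.recent = deque(maxlen=window)    # 1 = error, 0 = ok, newest last
        self.baseline_total = 0
        self.baseline_errors = 0

    def record(self, is_error: bool) -> bool:
        """Record one call; return True if an alert should fire."""
        if len(self.recent) == self.recent.maxlen:
            # The oldest recent call ages out into the long-run baseline.
            self.baseline_total += 1
            self.baseline_errors += self.recent[0]
        self.recent.append(1 if is_error else 0)
        if self.baseline_total < self.baseline_calls:
            return False  # baseline not established yet: stay quiet
        baseline_rate = self.baseline_errors / self.baseline_total
        recent_rate = sum(self.recent) / len(self.recent)
        # The floor keeps a near-zero baseline from turning isolated errors into alerts.
        return recent_rate > max(baseline_rate * self.factor, 0.05)
```

Because the baseline is built only from calls that have aged out of the recent window, a sudden burst of errors does not immediately raise the baseline it is being compared against.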
The Tooling Stack We Run
For a production MCP server, our default stack:
| Concern | Choice | Why |
|---|---|---|
| Language | TypeScript or Python | MCP SDK quality is highest here |
| Auth | Pattern A (bearer) for internal, Pattern B (OAuth2) for user-scoped | Match the use case |
| Rate limiting | Redis-backed token bucket | Survives restarts, scales to multiple instances |
| Tracing | OpenTelemetry → Honeycomb or Datadog | Vendor-neutral, queryable |
| Logging | Structured JSON → Loki or CloudWatch | Searchable, correlatable with traces |
| Deployment | Container + horizontal autoscaling | Stateless servers scale easily |
The tools are interchangeable. The categories are not.
Mistakes We Made (So You Don't Have To)
Mistake 1 — Treating MCP servers as internal microservices
We initially gave MCP servers the same operational treatment as internal HTTP APIs. That was a mistake — MCP servers are accessed by autonomous agents that retry, escalate, and combine tool calls in ways human callers don't. They need stronger guardrails.
Mistake 2 — Skipping rate limits on the first version
Our first internal MCP server had no rate limits. A misbehaving agent hit it 200 times in 30 seconds and burned through our database connection pool. Rate limits before launch, not after.
Mistake 3 — Logging full tool arguments
One of our tools accepted email addresses. Logging full arguments meant PII landed in our log aggregator. We had to delete and reindex. Hash or summarize PII fields in logs from day one.
Mistake 4 — One trace per LLM call instead of per user request
We started with one OpenTelemetry trace per LLM call. Debugging a multi-step workflow meant correlating across traces by timestamp. Painful. Now we propagate one trace ID for the entire user request.
When to Build vs Buy
If your MCP server's purpose is unique business logic (your data, your APIs), build it. There's no off-the-shelf option that knows your domain.
If your MCP server wraps standard integrations (GitHub, Slack, Google Calendar, Gmail), use existing community servers and focus your engineering on the parts that are actually yours.
Closing
MCP is the right standard at the right time, but the protocol stops where production starts. Authentication, rate limiting, and observability aren't optional — they're what separates a demo from a system real customers can rely on.
If you're building MCP servers in production and want a second pair of eyes on the operational design, we offer free 30-minute reviews for teams in Korea and globally.
Last updated: 2026-05-18. Production patterns evolve with the protocol — we update as we learn.