MCP Servers in Production — Authentication, Rate Limits, and Observability

Hypemarc AI Team
May 18, 2026

The Gap Between Demo and Production

The first MCP server we built took ninety minutes. The second one — same scope, but ready for production — took two weeks. The gap between those numbers is what this article is about.

MCP (Model Context Protocol) lets AI agents call tools through a standard interface. The protocol itself is elegant. The operational concerns around running it in production are not.

If you're building an MCP server that real customers will use, plan for three problems the spec doesn't solve: authentication, rate limiting, and observability.

Problem 1 — Authentication

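The MCP spec keeps authorization optional. It describes an OAuth-based flow for HTTP transports, but which pattern you adopt and how you operate it is left to you. That's the right design decision for a protocol. It's the wrong starting point for production.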

The three patterns in the wild

Pattern A: bearer token per agent. Each agent gets a long-lived token. Simple to implement, easy to revoke. Fits most internal use cases.

Pattern B: OAuth2 with refresh. The user authorizes the agent once; the agent gets a refresh token and uses it to rotate its access tokens. Required if the MCP server accesses user-scoped data on a third-party system (Gmail, GitHub, etc.).

Pattern C: mTLS between trusted nodes. Used when MCP servers and agents are in the same controlled network. Strongest guarantees, highest setup cost.

We default to Pattern A for internal tools and Pattern B for anything customer-facing. Pattern C only when compliance requires it.
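As a rough illustration of Pattern A, here is a minimal Python sketch of a bearer-token check. It assumes tokens are stored server-side as SHA-256 hashes keyed by agent ID. The `TOKEN_HASHES` table, the `authenticate` and `revoke` functions, and the agent names are all hypothetical and not part of any MCP SDK.

```python
import hashlib
import hmac
from typing import Optional


def _hash_token(token: str) -> str:
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


# Hypothetical token store: agent ID -> SHA-256 hash of its bearer token.
# Storing hashes means a leaked table does not leak usable tokens.
TOKEN_HASHES = {
    "billing-agent": _hash_token("example-token-1"),
    "support-agent": _hash_token("example-token-2"),
}


def authenticate(authorization_header: Optional[str]) -> Optional[str]:
    """Return the agent ID for a valid 'Bearer <token>' header, else None."""
    if not authorization_header:
        return None
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer" or not token.strip():
        return None
    presented = _hash_token(token.strip())
    for agent_id, stored in TOKEN_HASHES.items():
        # compare_digest keeps the comparison time independent of where
        # the two strings first differ.
        if hmac.compare_digest(presented, stored):
            return agent_id
    return None


def revoke(agent_id: str) -> None:
    """Revoking an agent is a single delete from the store."""
    TOKEN_HASHES.pop(agent_id, None)
```

The returned agent ID is the identity that the logging and rate-limiting layers would key on. In a real deployment the table would live in a database or secrets manager rather than in module state.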

What to log on every authenticated call

  • Agent identifier (which agent is calling)
  • Tool name (which method was invoked)
  • Truncated arguments (full args may contain PII — store hashes or summaries)
  • Latency
  • Result status (success / business-error / system-error)

This is the trace data you'll need when something goes wrong. Skipping it during initial build is the single most common production regret.
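The fields above could be assembled into a structured log line along these lines. This is a sketch only: the field names, the `LOG_HASH_KEY` constant, and both helper functions are assumptions for illustration. Argument values are replaced with keyed hashes so identical inputs can still be correlated without storing the raw values.

```python
import hashlib
import hmac
import json
import time
from typing import Any, Dict

# Hypothetical secret; in practice load it from configuration.
LOG_HASH_KEY = b"replace-with-a-secret-from-your-config"

VALID_STATUSES = ("success", "business-error", "system-error")


def summarize_args(args: Dict[str, Any]) -> Dict[str, str]:
    """Replace each argument value with a truncated keyed hash."""
    summary = {}
    for name, value in args.items():
        encoded = json.dumps(value, sort_keys=True, default=str).encode("utf-8")
        digest = hmac.new(LOG_HASH_KEY, encoded, hashlib.sha256).hexdigest()
        summary[name] = "hmac-sha256:" + digest[:16]
    return summary


def build_log_record(agent_id: str, tool: str, args: Dict[str, Any],
                     started_at: float, status: str) -> str:
    """Return one JSON log line; started_at is a time.monotonic() reading."""
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status}")
    record = {
        "agent_id": agent_id,
        "tool": tool,
        "args": summarize_args(args),
        "latency_ms": round((time.monotonic() - started_at) * 1000, 1),
        "status": status,
    }
    return json.dumps(record, sort_keys=True)
```

A keyed hash is used rather than a plain one because low-entropy values such as email addresses can be recovered from unkeyed hashes by guessing.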

Problem 2 — Rate Limiting

An MCP server with no rate limits is one buggy agent away from exhausting downstream APIs or your database.

Two layers, not one

Layer 1 — Per-agent rate limits. Each agent identity has a budget (calls/minute, calls/day). Catches runaway agents.

Layer 2 — Per-tool rate limits. Each tool has its own budget (especially expensive tools — database writes, third-party API calls). Catches expensive misuse.

These layers compose. An agent can be under its global budget but be blocked from an expensive tool because that tool's budget is exhausted.
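One way to picture the composition, sketched in Python: check both layers before charging either, so a call rejected by the tool layer does not also burn the agent's budget. The `Budget` class is a deliberately simple stand-in for a real limiter, and every name here is hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Budget:
    """Stand-in for a real limiter: a plain remaining-call counter."""
    remaining: int


def check_call(agent_budgets: Dict[str, Budget],
               tool_budgets: Dict[str, Budget],
               agent_id: str, tool: str) -> Optional[str]:
    """Return None if the call is allowed, else the layer that blocked it."""
    agent_budget = agent_budgets[agent_id]
    tool_budget = tool_budgets[tool]
    # Check both layers first; only charge when both have room.
    if agent_budget.remaining <= 0:
        return "per-agent"
    if tool_budget.remaining <= 0:
        return "per-tool"
    agent_budget.remaining -= 1
    tool_budget.remaining -= 1
    return None
```

The returned layer name is what lets the server tell the agent which budget ran out.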

The right algorithm

Token bucket. It has been a standard answer for decades and it still holds up. A sliding window log gives stricter control over bursts, but it stores a timestamp per request, which makes it more expensive and harder to operate. Stick with token bucket unless you have a specific reason not to.
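For reference, a minimal in-memory token bucket in Python. This is a single-process sketch, not a production implementation; the class name, method names, and injectable `clock` parameter are assumptions made for illustration and testing.

```python
import time
from typing import Callable


class TokenBucket:
    """Holds up to `capacity` tokens, refilled continuously over time."""

    def __init__(self, capacity: float, refill_per_second: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self._clock = clock
        self._tokens = capacity  # start full so an initial burst is allowed
        self._last = clock()

    def _refill(self) -> None:
        now = self._clock()
        elapsed = now - self._last
        # Add tokens for the elapsed time, never exceeding capacity.
        self._tokens = min(self.capacity,
                           self._tokens + elapsed * self.refill_per_second)
        self._last = now

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Spend `cost` tokens if available; return whether the call may run."""
        self._refill()
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False

    def seconds_until_available(self, cost: float = 1.0) -> float:
        """How long a caller should wait before `cost` tokens are available."""
        self._refill()
        missing = cost - self._tokens
        return max(0.0, missing / self.refill_per_second)
```

`capacity` sets the largest burst and `refill_per_second` sets the sustained rate. This version is not thread-safe and loses its state on restart; a shared store would be needed to run it across multiple instances.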

Return codes that agents actually understand

Don't just return 429. Return:

  • 429 with a Retry-After header, so the agent knows when to retry
  • 429 with a structured body that says which budget was exhausted (per-agent or per-tool)
  • A different status code, such as 403, for "permanently denied", so the agent doesn't retry forever

Agents that don't get retry guidance retry aggressively. Aggressive retries can take down your downstream services faster than the original request would have.
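A sketch of what such responses might look like, assuming the MCP server is exposed over HTTP. The body field names (`error`, `scope`, `retryable`, and so on) and the choice of 403 for permanent denial are illustrative assumptions, not something the MCP spec prescribes.

```python
import json
import math
from typing import Dict, Tuple

Response = Tuple[int, Dict[str, str], str]  # (status, headers, body)


def rate_limited_response(scope: str, retry_after_seconds: float) -> Response:
    """Build a 429 that says which budget ran out and when to retry."""
    if scope not in ("per-agent", "per-tool"):
        raise ValueError(f"unknown scope: {scope}")
    # Retry-After takes whole seconds; round up so agents never retry early.
    retry_after = max(1, math.ceil(retry_after_seconds))
    headers = {
        "Retry-After": str(retry_after),
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "error": "rate_limited",
        "scope": scope,
        "retry_after_seconds": retry_after,
        "retryable": True,
    })
    return 429, headers, body


def permanently_denied_response(reason: str) -> Response:
    """Build a non-retryable denial so the agent stops trying."""
    body = json.dumps({
        "error": "forbidden",
        "reason": reason,
        "retryable": False,
    })
    return 403, {"Content-Type": "application/json"}, body
```

The explicit `retryable` flag gives an agent a signal that does not depend on it interpreting status codes correctly.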

Problem 3 — Observability

Logs aren't enough. You need traces.

What "trace" means for MCP

A single user request often crosses:

  1. The agent (LLM call to decide what to do)
  2. The MCP server (handles the tool call)
  3. A downstream service (the actual API or database)

Without a trace ID propagated across all three, debugging a slow request is archaeology. With it, you find the bottleneck in five minutes.

We use OpenTelemetry. The MCP server accepts a trace context header (the W3C traceparent header, which OpenTelemetry propagates by default), passes it on to downstream calls, and emits a span for each tool invocation. The agent emits a span for each LLM call. Together they form a single trace.
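To make the propagation step concrete, here is a standard-library Python sketch that parses an incoming W3C `traceparent` header and builds the header to send downstream. In practice the OpenTelemetry SDK's propagators do this for you; the two functions below are hypothetical helpers that only handle version `00` of the header.

```python
import re
import secrets
from typing import Dict, Optional

# W3C Trace Context, version 00: version-traceid-parentid-flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: Optional[str]) -> Optional[Dict[str, str]]:
    """Return the header's fields, or None if it is missing or invalid."""
    if not header:
        return None
    match = _TRACEPARENT.match(header.strip().lower())
    if not match:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are defined as invalid.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {"trace_id": trace_id, "parent_id": parent_id, "flags": flags}


def child_traceparent(incoming: Optional[str]) -> str:
    """Build the header for a downstream call.

    Keeps the caller's trace ID so all spans join one trace, and starts
    a new trace only when no valid header arrived.
    """
    context = parse_traceparent(incoming)
    trace_id = context["trace_id"] if context else secrets.token_hex(16)
    flags = context["flags"] if context else "01"
    span_id = secrets.token_hex(8)  # new ID for this server's span
    return f"00-{trace_id}-{span_id}-{flags}"
```

The important property is that the trace ID survives every hop while each hop contributes its own span ID.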

What to alert on

  • Error rate above baseline, but only once the baseline is established. Alerting on day one produces noise.
  • P99 latency spikes. Sudden jumps usually mean a downstream API is degraded.
  • Tool-specific budget exhaustion. This is an early warning of a misbehaving agent.
  • Spikes in auth failures. These point to a possible compromise or a misconfigured deployment.

Avoid alerting on every error. A steady stream of false alerts dulls your team's ability to respond when a real one fires.
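The "above baseline" rule can be sketched as a small Python check. The thresholds here (a minimum of 100 requests, a 3x multiplier, a 1% floor) are placeholder assumptions to tune per service, and `should_alert` is a hypothetical helper rather than a feature of any monitoring product.

```python
from typing import Optional


def should_alert(errors: int, total: int, baseline_rate: Optional[float],
                 min_requests: int = 100, multiplier: float = 3.0,
                 min_rate: float = 0.01) -> bool:
    """Alert only when the error rate is clearly above an established baseline."""
    if baseline_rate is None:
        return False  # no baseline yet: stay quiet
    if total < min_requests:
        return False  # too few requests for the rate to mean anything
    current_rate = errors / total
    # The floor stops a near-zero baseline from alerting on a single error.
    threshold = max(baseline_rate * multiplier, min_rate)
    return current_rate > threshold
```

Passing `None` as the baseline during the first days of a deployment encodes the "not on day one" rule directly.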

The Tooling Stack We Run

For a production MCP server, our default stack:

Concern       | Choice                                                                 | Why
--------------|------------------------------------------------------------------------|------------------------------------------------
Language      | TypeScript or Python                                                   | MCP SDK quality is highest here
Auth          | Pattern A (bearer) for internal, Pattern B (OAuth2) for user-scoped    | Match the use case
Rate limiting | Redis-backed token bucket                                              | Survives restarts, scales to multiple instances
Tracing       | OpenTelemetry → Honeycomb or Datadog                                   | Vendor-neutral, queryable
Logging       | Structured JSON → Loki or CloudWatch                                   | Searchable, correlatable with traces
Deployment    | Container + horizontal autoscaling                                     | Stateless servers scale easily

The tools are interchangeable. The categories are not.

Mistakes We Made (So You Don't Have To)

Mistake 1 — Treating MCP servers as internal microservices

We initially gave MCP servers the same operational treatment as internal HTTP APIs. That was a mistake — MCP servers are accessed by autonomous agents that retry, escalate, and combine tool calls in ways human callers don't. They need stronger guardrails.

Mistake 2 — Skipping rate limits on the first version

Our first internal MCP server had no rate limits. A misbehaving agent hit it 200 times in 30 seconds and burned through our database connection pool. Rate limits before launch, not after.

Mistake 3 — Logging full tool arguments

One of our tools accepted email addresses. Logging full arguments meant PII landed in our log aggregator. We had to delete and reindex. Hash or summarize PII fields in logs from day one.

Mistake 4 — One trace per LLM call instead of per user request

We started with one OpenTelemetry trace per LLM call. Debugging a multi-step workflow meant correlating across traces by timestamp. Painful. Now we propagate one trace ID for the entire user request.

When to Build vs Buy

If your MCP server's purpose is unique business logic (your data, your APIs), build it. There's no off-the-shelf option that knows your domain.

If your MCP server wraps standard integrations (GitHub, Slack, Google Calendar, Gmail), use existing community servers and focus your engineering on the parts that are actually yours.

Closing

MCP is the right standard at the right time, but the protocol stops where production starts. Authentication, rate limiting, and observability aren't optional — they're what separates a demo from a system real customers can rely on.

If you're building MCP servers in production and want a second pair of eyes on the operational design, we offer free 30-minute reviews for teams in Korea and globally.

Last updated: 2026-05-18. Production patterns evolve with the protocol — we update as we learn.
