
RFC: Rate limiting for MCP servers#57

Draft
jerm-dro wants to merge 6 commits into main from jerm/2026-03-18-rate-limiting

Conversation


@jerm-dro jerm-dro commented Mar 18, 2026

Summary

Adds an RFC for configurable rate limiting in ToolHive's MCPServer and VirtualMCPServer.

Scope:

  • Per-user and global rate limits at the server level
  • Per-operation limits for individual tools, prompts, and resources
  • Consistent config across MCPServer and VirtualMCPServer
  • Redis-backed shared counters for multi-replica correctness

This is an early draft focused on problem definition, user-facing configuration, and open questions — not implementation details.

🤖 Generated with Claude Code

jerm-dro and others added 4 commits March 18, 2026 15:05
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jerm-dro jerm-dro requested a review from Derek2Tu March 18, 2026 22:53

@JAORMX JAORMX left a comment


So, the windowing section is the one part of this RFC that I think needs to be pinned down before we move forward. Right now it reads as "we'll figure it out during implementation"... but the algorithm choice affects burst behavior, storage, and how operators should reason about the limits they configure. Worth deciding here.

I'd suggest going with a token bucket. Here's why:

Your config schema (requestsPerWindow + windowSeconds) already maps to it naturally. A token bucket has two parameters: a refill rate (requestsPerWindow / windowSeconds) and a burst capacity (requestsPerWindow). Each token represents a single allowed request. The bucket starts full, refills at a steady rate, and when a request comes in it takes one token out. Empty bucket? Request rejected.

So "100 requests per 60 seconds" means: bucket holds 100 tokens, refills at ~1.67 tokens/second. The operator doesn't need to think about buckets or tokens at all... the config they already write just works.

Per-user limiting works the same way. Each user just gets their own bucket, keyed by identity. The algorithm is identical, only the Redis key changes:

  • Global limit: thv:rl:{server}:global
  • Per-user limit: thv:rl:{server}:user:{userId}
  • Per-user per-tool: thv:rl:{server}:user:{userId}:tool:{toolName}

In Redis, each bucket is a hash with two fields: token count and last refill timestamp. The check-and-decrement runs as a single atomic operation in Redis, so no race conditions across replicas.

Storage is O(1) per counter (two fields per hash). Total footprint is O(users × limits). So 500 users with 10 per-operation limits = 5,000 tiny hashes... basically nothing for Redis. Keys auto-expire when inactive too, so no garbage collection needed.

Compare that to a sliding window log, which stores a sorted set entry per request and is O(requests) per counter. That's the one to avoid.

Fixed window counters are simpler (single integer, also O(1)), but they have the classic boundary burst problem where a user can fire 2x their limit by timing requests across a window edge. Token bucket avoids that without adding real complexity.
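The boundary burst is easy to see numerically. A toy fixed-window counter (hypothetical helper, not proposed code) lets 2x the limit through in about two seconds:

```go
package main

import "fmt"

// countAllowed pushes 100 requests at t=59s and 100 more at t=61s through
// a fixed-window counter (60-second windows, limit 100 per window) and
// returns how many get through. Toy model of the boundary problem only.
func countAllowed() int {
	const limit = 100
	counts := map[int]int{} // window index -> requests seen in that window
	allowed := 0
	times := make([]int, 0, 200)
	for i := 0; i < 100; i++ {
		times = append(times, 59) // just before the window edge
	}
	for i := 0; i < 100; i++ {
		times = append(times, 61) // just after the window edge
	}
	for _, t := range times {
		w := t / 60 // each 60-second window has its own counter
		if counts[w] < limit {
			counts[w]++
			allowed++
		}
	}
	return allowed
}

func main() {
	// All 200 requests pass within ~2 seconds: double the intended limit.
	fmt.Println(countAllowed())
}
```

A token bucket in the same scenario would admit the first 100 and then only ~3 more over those two seconds, since refill is continuous.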

A few things in the RFC that would need updating if you go with this:

  • The "Windowing" section should be reframed. Token bucket is continuous refill, not windowing. The current language about "approximate windowing" wouldn't be accurate anymore.
  • Burst behavior should be documented. An idle user can send a full burst of requestsPerWindow requests at once, since their bucket is full. That's a feature (it handles legitimate traffic spikes), but operators should understand it so they set capacity accordingly.
  • Retry-After becomes easy to compute. With token bucket you know exactly when the next token arrives (1 / refill_rate). The rejection behavior section is vague on this right now... worth noting that the algorithm gives you precise Retry-After values for free.
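The Retry-After point in that last bullet is just arithmetic on the refill rate; a sketch (function name is mine, not from the RFC):

```go
package main

import (
	"fmt"
	"math"
)

// retryAfterSeconds is the time until the next token arrives for a bucket
// that just rejected a request: 1 / refill_rate, which works out to
// windowSeconds / requestsPerWindow. Illustrative only.
func retryAfterSeconds(requestsPerWindow, windowSeconds float64) float64 {
	return windowSeconds / requestsPerWindow
}

func main() {
	secs := retryAfterSeconds(100, 60) // "100 requests per 60 seconds"
	// An HTTP Retry-After header takes whole seconds, so round up.
	fmt.Println(secs, int(math.Ceil(secs))) // 0.6 1
}
```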

jerm-dro and others added 2 commits March 19, 2026 08:27
- Adopt token bucket algorithm per Ozz's review (replaces vague windowing section)
- Document burst behavior, Redis key structure, and precise Retry-After
- Clarify scope is Kubernetes-based deployments (Derek's question)
- Replace "operators" with "administrators" throughout (Derek's question)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jerm-dro

@JAORMX Thank you for the suggestion. I've updated the algorithm / solution section to use the proposed algorithm.


JAORMX commented Mar 20, 2026

Hey @jerm-dro, two things I'd like to see before we move this forward:

The security considerations section is missing entirely, and it's a required one per the template. Rate limiting sits in the request hot-path and makes accept/reject decisions on every request... we need a threat model here. Think: Redis as an attack surface, input validation on operation names used in Redis keys, audit logging for rejections (especially given agent exfiltration is a motivating threat), and Redis unavailability semantics (probably should be configurable, like THV-0017's failure_policy).

Also, there's no alternatives considered section. At minimum, let's capture the algorithm discussion we already had. You originally proposed a sliding window approach and we went with token bucket instead. That decision and the reasoning behind it should live in the RFC so future readers have context.

Everything else is looking good. The token bucket write-up and additive limit semantics are solid.


@yrobla yrobla left a comment


Reviewed RFC-0057. Two blockers, two suggestions, two questions, and one nitpick — comments inline below.


### High-Level Design

Rate limiting is implemented as a new middleware in ToolHive's middleware chain. When a request arrives, the middleware checks the applicable limits (global, per-user, per-operation) and either allows the request to proceed or returns an appropriate error response.

question: What counts as one "request" for rate limiting purposes?

The RFC doesn't define the unit being counted. Does tools/list consume a token? What about tools/call vs tools/list — are they treated the same? And for streaming SSE connections, does the connection itself count, each event, or neither?

This matters especially for per-operation limits. Defining the unit here (e.g. "one token per MCP method invocation") would remove ambiguity for the implementation.

rateLimiting:
  # Global limit: total requests across all users
  global:
    requestsPerWindow: 1000

suggestion: The field names requestsPerWindow / windowSeconds imply a sliding or fixed window algorithm, not a token bucket.

An operator reading requestsPerWindow: 100, windowSeconds: 60 will expect no more than 100 requests in any 60-second window. But with a token bucket, an idle user accumulates up to the full capacity and can fire all 100 requests instantly — which is correct behavior for the algorithm but surprising given the name.

Consider renaming to match the algorithm:

  • requestsPerWindow → capacity (or burstCapacity)
  • windowSeconds → refillPeriodSeconds

This also makes the burst behavior section easier to understand without needing a separate explanation.
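For instance, the global block from the example above might become (hypothetical rename, not the RFC's current schema):

```yaml
rateLimiting:
  global:
    capacity: 1000           # maximum burst; the bucket starts full
    refillPeriodSeconds: 60  # time to go from empty back to full capacity
```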


#### VirtualMCPServer

The same `rateLimiting` configuration is available on `VirtualMCPServer`. Limits configured here apply to the proxied traffic for each backend server independently.

suggestion: The semantics of perUser at the vMCP level are ambiguous.

"Limits apply to the proxied traffic for each backend server independently" reads like it only describes operation-level limits. For a server-level perUser limit on a VirtualMCPServer: is that per-user per backend, or per-user across all backends in the vMCP? These are meaningfully different enforcement models.

Please make this explicit — e.g. "The perUser server-level limit on a VirtualMCPServer is applied per user per backend, not aggregated across backends."

perUser:
  requestsPerWindow: 5
  windowSeconds: 60
prompts:

nitpick: This entry appears under tools: but is named "backend_b/heavy_prompt", which suggests it belongs under prompts: instead. Looks like a copy-paste error from the server-level example.

Per-user limits work identically — each user gets their own bucket, keyed by identity. Redis keys follow a structure like:

- Global: `thv:rl:{server}:global`
- Per-user: `thv:rl:{server}:user:{userId}`

blocker: The Redis key structure is incomplete — the global per-operation case is missing.

The config already supports global limits on individual tools (e.g. shared_resource with a global: limit in the per-operation example), but there is no corresponding key pattern documented. Add:

  • Global per-tool: thv:rl:{server}:global:tool:{toolName}
  • And the analogous patterns for prompts and resources.

Without this, the key structure does not cover all the cases the config allows, which will cause confusion during implementation.

**Burst behavior**: An idle user accumulates tokens up to the bucket capacity. This means they can send a full burst of `requestsPerWindow` requests at once after a period of inactivity. This is intentional — it handles legitimate traffic spikes — but administrators should understand it when setting capacity.

### Rejection Behavior


question: Retry-After may not be accurate when the rejection comes from a global limit.

1 / refill_rate is the time until the next token is available in a single bucket. But if the rejection was caused by a global bucket being exhausted (not the user's own bucket), the actual retry time depends on how many other users are consuming from that global bucket — the token could be consumed by another request before this client retries.

Consider documenting that Retry-After is a best-effort lower bound, not a guarantee, and that it reflects the most restrictive limit that rejected the request.

When a request is rate-limited, the middleware returns an MCP-appropriate error response. Because token bucket tracks the refill rate, the middleware can compute a precise `Retry-After` value (`1 / refill_rate`) telling the client exactly when the next token will be available.

## Open Questions


blocker: This open question needs to be resolved before implementation begins — the answer directly shapes how the middleware is coded and tested.

Recommendation: fail open with an observable signal (structured log entry + metric/alert).

Rationale: failing closed turns a Redis hiccup into a complete MCP outage, which is a worse availability trade-off than temporarily losing rate limiting. Fail-open with observability means operators can alert on (e.g.) rate_limit_redis_unavailable and address Redis health independently, without taking down all MCP traffic in the process.

If the decision is to fail closed, the RFC should also address whether this affects readiness probes, and whether clients receive a distinct error code that distinguishes "rate limited" from "backend unavailable".
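The fail-open decision itself is a few lines of middleware. In this sketch the limiter interface, the key, and the counter are all made-up names standing in for real ToolHive/Redis plumbing, not actual APIs:

```go
package main

import (
	"errors"
	"fmt"
)

// limiter stands in for the Redis-backed token bucket check.
type limiter interface {
	Allow(key string) (bool, error)
}

// redisUnavailableCount stands in for a rate_limit_redis_unavailable
// metric that operators can alert on.
var redisUnavailableCount int

// shouldServe applies the fail-open policy: if the limiter errors
// (e.g. Redis is down), record the observable signal and let the
// request through rather than turning a Redis hiccup into an outage.
func shouldServe(l limiter, key string) bool {
	allowed, err := l.Allow(key)
	if err != nil {
		redisUnavailableCount++ // availability over enforcement, but visibly
		return true
	}
	return allowed
}

// downLimiter simulates Redis being unreachable.
type downLimiter struct{}

func (downLimiter) Allow(string) (bool, error) {
	return false, errors.New("redis: connection refused")
}

func main() {
	served := shouldServe(downLimiter{}, "thv:rl:demo:global")
	fmt.Println(served, redisUnavailableCount) // true 1
}
```

The same shape accommodates fail-closed if that's the eventual decision: return false on error instead, and the distinct-error-code question becomes about what the caller does with that branch.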
