Add partial failure mode support to vMCP aggregator#3533
Add partial failure mode support to vMCP aggregator#3533yrobla wants to merge 3 commits intoissue-3036-v1from
Conversation
Implement circuit breaker state machine for backend health monitoring in Virtual MCP Server. The circuit breaker prevents cascading failures by tracking backend failures and transitioning through three states: Closed (normal operation), Open (failing), and HalfOpen (recovery testing). Key changes: - Add CircuitBreaker with thread-safe state machine - Integrate circuit breaker with status tracker - Integrate circuit breaker with health monitor - Add comprehensive test coverage The circuit breaker is disabled by default for backwards compatibility. Configuration wiring from YAML/CRD to health monitor is intentionally left for a follow-up PR to keep this change focused on the core implementation. Related-to: #3036
Complete the circuit breaker integration by wiring it through the Virtual
MCP Server stack for automatic backend failure handling.
Configuration wiring (cmd/vmcp/app/commands.go):
- Wire circuit breaker config from YAML to health monitor
- Pass configurable failure threshold and timeout to monitor
- Log circuit breaker enablement on server startup
Capability filtering (pkg/vmcp/aggregator/default_aggregator.go):
- Filter tools, resources, and prompts from unhealthy backends
- Skip capabilities when backend status is Unhealthy, Unknown, or
Unauthenticated
- Keep capabilities from Healthy and Degraded backends (degraded
backends are still operational)
- Add isBackendHealthy helper for status evaluation
Routing protection (pkg/vmcp/router/default_router.go):
- Check backend health before routing requests
- Return ErrBackendUnavailable when backend is unhealthy
- Add isTargetHealthy helper for status evaluation
- Log warnings when routing fails due to backend unavailability
Testing (pkg/vmcp/router/default_router_test.go):
- Add HealthStatus to all test fixtures
- Add test cases for unhealthy backend routing failures
- Add test cases for unauthenticated backend failures
- Add test cases for degraded backend success (still operational)
- Cover all three routing methods: tools, resources, prompts
Related-to: #3036
Consolidate health check logic into BackendHealthStatus method
Add IsHealthyForRouting() method to BackendHealthStatus type to eliminate
duplicate health checking logic in aggregator and router packages.
Changes:
- pkg/vmcp/types.go: Add IsHealthyForRouting() method
- pkg/vmcp/aggregator/default_aggregator.go: Use method, remove duplicate
- pkg/vmcp/router/default_router.go: Use method, remove duplicate
This consolidation:
- Reduces code duplication (removes 29 lines of duplicate code)
- Centralizes health status logic in one place
- Makes future changes to health checking easier to maintain
- Follows DRY principle for better code organization
All tests pass. Linting clean.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## issue-3036-v1 #3533 +/- ##
=================================================
+ Coverage 65.28% 65.35% +0.06%
=================================================
Files 400 401 +1
Lines 39100 39317 +217
=================================================
+ Hits 25526 25695 +169
- Misses 11592 11639 +47
- Partials 1982 1983 +1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
This PR implements configurable failure handling for the vMCP aggregator when backend MCP servers become unavailable. It introduces two operational modes: "fail" (default) which returns an error only when all backends are unavailable, and "best_effort" which continues with available backends regardless of failures. This addresses a key acceptance criterion from issue #3036.
Changes:
- Added
PartialFailureModeFailandPartialFailureModeBestEffortconstants to define backend failure handling behavior - Extended
NewDefaultAggregatorto accept afailureModeparameter and implemented failure mode enforcement logic inMergeCapabilities - Updated all aggregator instantiations across tests and production code to explicitly specify the failure mode
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/vmcp/config/config.go | Adds partial failure mode constants (fail and best_effort) with documentation |
| pkg/vmcp/aggregator/default_aggregator.go | Implements failure mode support in aggregator with health tracking and enforcement logic |
| pkg/vmcp/aggregator/default_aggregator_test.go | Adds comprehensive test coverage for both failure modes with various backend health scenarios |
| cmd/vmcp/app/commands.go | Integrates failure mode configuration from operational config into aggregator initialization |
| pkg/vmcp/server/integration_test.go | Updates all test aggregator instantiations to use explicit fail mode |
| test/integration/vmcp/helpers/vmcp_server.go | Updates test helper to use fail mode by default for test consistency |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Implements configurable failure handling when backend MCP servers
become unavailable. The aggregator now supports two modes:
- fail (default): Returns error if all backends unavailable; logs
warning if some backends are down but continues with healthy ones
- best_effort: Continues with available backends regardless of
failures; logs info about unavailable backends
Related-to: #3036
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Track backend health | ||
| if backend.HealthStatus.IsHealthyForRouting() { | ||
| healthyBackends[backend.ID] = true | ||
| } else { | ||
| unhealthyBackends[backend.ID] = string(backend.HealthStatus) | ||
| } |
There was a problem hiding this comment.
Suggestion: The factoring of the defaultAggregator leaves a bit to be desired and I think this change motivates improving it.
Background
The defaultAggregator implements the Aggregator interface, which has 4 unused public methods: QueryCapabilities, QueryAllCapabilities, ResolveConflicts, and MergeCapabilities.
Suggestion
Rather than inlining more logic into defaultAggregator, lets:
- Remove the unused methods from the interface.
- Implement a
circuitBreakerAggregatorfor your new logic. This implements the new and improvedAggregatorinterface and decorates any arbitrary aggregator:
type circuitBreakerAggregator struct {
// this would really just be the defaultAggregator
inner Aggregator
}
func (c circuitBreakerAggregator) AggregateCapabilities(...) {
// check all the backends for health
err := c.enforceFailureMode(...)
if err != nil {
return nil, err
}
return inner.AggregateCapabilities(...)
}I think this makes your change cleaner, because it doesn't add any complexity to the existing defaultAggregator's construction or runtime behavior. It would also enable pretty straightforward unit testing of the circuitBreakerAggregator, because you don't need the whole defaultAggregator to test it.
What do you think?
|
reimplementing as there were so many changes, makes sense to start from scratch |
Implements configurable failure handling when backend MCP servers become unavailable. The aggregator now supports two modes:
Related-to: #3036