Skip to content

Comments

Add partial failure mode support to vMCP aggregator#3533

Closed
yrobla wants to merge 3 commits intoissue-3036-v1from
issue-3036-v2
Closed

Add partial failure mode support to vMCP aggregator#3533
yrobla wants to merge 3 commits intoissue-3036-v1from
issue-3036-v2

Conversation

@yrobla
Copy link
Contributor

@yrobla yrobla commented Jan 30, 2026

Implements configurable failure handling when backend MCP servers become unavailable. The aggregator now supports two modes:

  • fail (default): Returns error if all backends unavailable; logs warning if some backends are down but continues with healthy ones
  • best_effort: Continues with available backends regardless of failures; logs info about unavailable backends

Related-to: #3036

taskbot and others added 2 commits January 30, 2026 09:13
Implement circuit breaker state machine for backend health monitoring
in Virtual MCP Server. The circuit breaker prevents cascading failures
by tracking backend failures and transitioning through three states:
Closed (normal operation), Open (failing), and HalfOpen (recovery testing).

Key changes:
  - Add CircuitBreaker with thread-safe state machine
  - Integrate circuit breaker with status tracker
  - Integrate circuit breaker with health monitor
  - Add comprehensive test coverage

The circuit breaker is disabled by default for backwards compatibility.
Configuration wiring from YAML/CRD to health monitor is intentionally
left for a follow-up PR to keep this change focused on the core
implementation.

Related-to: #3036
Complete the circuit breaker integration by wiring it through the Virtual
MCP Server stack for automatic backend failure handling.

Configuration wiring (cmd/vmcp/app/commands.go):
  - Wire circuit breaker config from YAML to health monitor
  - Pass configurable failure threshold and timeout to monitor
  - Log circuit breaker enablement on server startup

Capability filtering (pkg/vmcp/aggregator/default_aggregator.go):
  - Filter tools, resources, and prompts from unhealthy backends
  - Skip capabilities when backend status is Unhealthy, Unknown, or
    Unauthenticated
  - Keep capabilities from Healthy and Degraded backends (degraded
    backends are still operational)
  - Add isBackendHealthy helper for status evaluation

Routing protection (pkg/vmcp/router/default_router.go):
  - Check backend health before routing requests
  - Return ErrBackendUnavailable when backend is unhealthy
  - Add isTargetHealthy helper for status evaluation
  - Log warnings when routing fails due to backend unavailability

Testing (pkg/vmcp/router/default_router_test.go):
  - Add HealthStatus to all test fixtures
  - Add test cases for unhealthy backend routing failures
  - Add test cases for unauthenticated backend failures
  - Add test cases for degraded backend success (still operational)
  - Cover all three routing methods: tools, resources, prompts

Related-to: #3036

Consolidate health check logic into BackendHealthStatus method

Add IsHealthyForRouting() method to BackendHealthStatus type to eliminate
duplicate health checking logic in aggregator and router packages.

Changes:
- pkg/vmcp/types.go: Add IsHealthyForRouting() method
- pkg/vmcp/aggregator/default_aggregator.go: Use method, remove duplicate
- pkg/vmcp/router/default_router.go: Use method, remove duplicate

This consolidation:
- Reduces code duplication (removes 29 lines of duplicate code)
- Centralizes health status logic in one place
- Makes future changes to health checking easier to maintain
- Follows DRY principle for better code organization

All tests pass. Linting clean.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@github-actions github-actions bot added the size/S Small PR: 100-299 lines changed label Jan 30, 2026
@codecov
Copy link

codecov bot commented Jan 30, 2026

Codecov Report

❌ Patch coverage is 79.92701% with 55 lines in your changes missing coverage. Please review.
✅ Project coverage is 65.35%. Comparing base (241e9bb) to head (1602ebe).
⚠️ Report is 51 commits behind head on issue-3036-v1.

Files with missing lines Patch % Lines
pkg/vmcp/health/monitor.go 71.18% 16 Missing and 1 partial ⚠️
pkg/vmcp/aggregator/default_aggregator.go 75.38% 13 Missing and 3 partials ⚠️
cmd/vmcp/app/commands.go 0.00% 12 Missing ⚠️
pkg/vmcp/health/circuit_breaker.go 93.93% 4 Missing ⚠️
pkg/vmcp/health/status.go 93.10% 3 Missing and 1 partial ⚠️
pkg/vmcp/types.go 75.00% 2 Missing ⚠️
Additional details and impacted files
@@                Coverage Diff                @@
##           issue-3036-v1    #3533      +/-   ##
=================================================
+ Coverage          65.28%   65.35%   +0.06%     
=================================================
  Files                400      401       +1     
  Lines              39100    39317     +217     
=================================================
+ Hits               25526    25695     +169     
- Misses             11592    11639      +47     
- Partials            1982     1983       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements configurable failure handling for the vMCP aggregator when backend MCP servers become unavailable. It introduces two operational modes: "fail" (default) which returns an error only when all backends are unavailable, and "best_effort" which continues with available backends regardless of failures. This addresses a key acceptance criterion from issue #3036.

Changes:

  • Added PartialFailureModeFail and PartialFailureModeBestEffort constants to define backend failure handling behavior
  • Extended NewDefaultAggregator to accept a failureMode parameter and implemented failure mode enforcement logic in MergeCapabilities
  • Updated all aggregator instantiations across tests and production code to explicitly specify the failure mode

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
pkg/vmcp/config/config.go Adds partial failure mode constants (fail and best_effort) with documentation
pkg/vmcp/aggregator/default_aggregator.go Implements failure mode support in aggregator with health tracking and enforcement logic
pkg/vmcp/aggregator/default_aggregator_test.go Adds comprehensive test coverage for both failure modes with various backend health scenarios
cmd/vmcp/app/commands.go Integrates failure mode configuration from operational config into aggregator initialization
pkg/vmcp/server/integration_test.go Updates all test aggregator instantiations to use explicit fail mode
test/integration/vmcp/helpers/vmcp_server.go Updates test helper to use fail mode by default for test consistency

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Implements configurable failure handling when backend MCP servers
become unavailable. The aggregator now supports two modes:

  - fail (default): Returns error if all backends unavailable; logs
    warning if some backends are down but continues with healthy ones
  - best_effort: Continues with available backends regardless of
    failures; logs info about unavailable backends

Related-to: #3036
@yrobla yrobla requested a review from ChrisJBurns as a code owner January 30, 2026 15:58
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/S Small PR: 100-299 lines changed labels Jan 30, 2026
@yrobla yrobla self-assigned this Jan 30, 2026
@yrobla yrobla requested a review from Copilot January 30, 2026 16:01
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +326 to +331
// Track backend health
if backend.HealthStatus.IsHealthyForRouting() {
healthyBackends[backend.ID] = true
} else {
unhealthyBackends[backend.ID] = string(backend.HealthStatus)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: The factoring of the defaultAggregator leaves a bit to be desired and I think this change motivates improving it.

Background

The defaultAggregator implements the Aggregator interface, which has 4 unused public methods: QueryCapabilities, QueryAllCapabilities, ResolveConflicts, and MergeCapabilities.

Suggestion

Rather than inlining more logic into defaultAggregator, lets:

  1. Remove the unused methods from the interface.
  2. Implement a circuitBreakerAggregator for your new logic. This implements the new and improved Aggregator interface and decorates any arbitrary aggregator:
type circuitBreakerAggregator struct {
   // this would really just be the defaultAggregator
    inner Aggregator
}


func (c circuitBreakerAggregator) AggregateCapabilities(...) {
    // check all the backends for health
    err := c.enforceFailureMode(...)
    if err != nil {
        return nil, err
    }
    return inner.AggregateCapabilities(...)
}

I think this makes your change cleaner, because it doesn't add any complexity to the existing defaultAggregator's construction or runtime behavior. It would also enable pretty straightforward unit testing of the circuitBreakerAggregator, because you don't need the whole defaultAggregator to test it.

What do you think?

@yrobla
Copy link
Contributor Author

yrobla commented Feb 5, 2026

reimplementing as there were so many changes, makes sense to start from scratch

@yrobla yrobla closed this Feb 5, 2026
@dmjb dmjb deleted the issue-3036-v2 branch February 6, 2026 09:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/M Medium PR: 300-599 lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants