Add partial failure mode support to vMCP aggregator by yrobla · Pull Request #3533 · stacklok/toolhive

yrobla · 2026-01-30T15:11:30Z

Implements configurable failure handling when backend MCP servers become unavailable. The aggregator now supports two modes:

fail (default): Returns error if all backends unavailable; logs warning if some backends are down but continues with healthy ones
best_effort: Continues with available backends regardless of failures; logs info about unavailable backends

Related-to: #3036

Implement circuit breaker state machine for backend health monitoring in Virtual MCP Server. The circuit breaker prevents cascading failures by tracking backend failures and transitioning through three states: Closed (normal operation), Open (failing), and HalfOpen (recovery testing). Key changes: - Add CircuitBreaker with thread-safe state machine - Integrate circuit breaker with status tracker - Integrate circuit breaker with health monitor - Add comprehensive test coverage The circuit breaker is disabled by default for backwards compatibility. Configuration wiring from YAML/CRD to health monitor is intentionally left for a follow-up PR to keep this change focused on the core implementation. Related-to: #3036

Complete the circuit breaker integration by wiring it through the Virtual MCP Server stack for automatic backend failure handling. Configuration wiring (cmd/vmcp/app/commands.go): - Wire circuit breaker config from YAML to health monitor - Pass configurable failure threshold and timeout to monitor - Log circuit breaker enablement on server startup Capability filtering (pkg/vmcp/aggregator/default_aggregator.go): - Filter tools, resources, and prompts from unhealthy backends - Skip capabilities when backend status is Unhealthy, Unknown, or Unauthenticated - Keep capabilities from Healthy and Degraded backends (degraded backends are still operational) - Add isBackendHealthy helper for status evaluation Routing protection (pkg/vmcp/router/default_router.go): - Check backend health before routing requests - Return ErrBackendUnavailable when backend is unhealthy - Add isTargetHealthy helper for status evaluation - Log warnings when routing fails due to backend unavailability Testing (pkg/vmcp/router/default_router_test.go): - Add HealthStatus to all test fixtures - Add test cases for unhealthy backend routing failures - Add test cases for unauthenticated backend failures - Add test cases for degraded backend success (still operational) - Cover all three routing methods: tools, resources, prompts Related-to: #3036 Consolidate health check logic into BackendHealthStatus method Add IsHealthyForRouting() method to BackendHealthStatus type to eliminate duplicate health checking logic in aggregator and router packages. Changes: - pkg/vmcp/types.go: Add IsHealthyForRouting() method - pkg/vmcp/aggregator/default_aggregator.go: Use method, remove duplicate - pkg/vmcp/router/default_router.go: Use method, remove duplicate This consolidation: - Reduces code duplication (removes 29 lines of duplicate code) - Centralizes health status logic in one place - Makes future changes to health checking easier to maintain - Follows DRY principle for better code organization All tests pass. Linting clean. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

codecov · 2026-01-30T15:20:09Z

Codecov Report

❌ Patch coverage is 79.92701% with 55 lines in your changes missing coverage. Please review.
✅ Project coverage is 65.35%. Comparing base (241e9bb) to head (1602ebe).
⚠️ Report is 51 commits behind head on issue-3036-v1.

Files with missing lines	Patch %	Lines
pkg/vmcp/health/monitor.go	71.18%	16 Missing and 1 partial ⚠️
pkg/vmcp/aggregator/default_aggregator.go	75.38%	13 Missing and 3 partials ⚠️
cmd/vmcp/app/commands.go	0.00%	12 Missing ⚠️
pkg/vmcp/health/circuit_breaker.go	93.93%	4 Missing ⚠️
pkg/vmcp/health/status.go	93.10%	3 Missing and 1 partial ⚠️
pkg/vmcp/types.go	75.00%	2 Missing ⚠️

Additional details and impacted files

@@                Coverage Diff                @@
##           issue-3036-v1    #3533      +/-   ##
=================================================
+ Coverage          65.28%   65.35%   +0.06%     
=================================================
  Files                400      401       +1     
  Lines              39100    39317     +217     
=================================================
+ Hits               25526    25695     +169     
- Misses             11592    11639      +47     
- Partials            1982     1983       +1

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copilot

Pull request overview

This PR implements configurable failure handling for the vMCP aggregator when backend MCP servers become unavailable. It introduces two operational modes: "fail" (default) which returns an error only when all backends are unavailable, and "best_effort" which continues with available backends regardless of failures. This addresses a key acceptance criterion from issue #3036.

Changes:

Added PartialFailureModeFail and PartialFailureModeBestEffort constants to define backend failure handling behavior
Extended NewDefaultAggregator to accept a failureMode parameter and implemented failure mode enforcement logic in MergeCapabilities
Updated all aggregator instantiations across tests and production code to explicitly specify the failure mode

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
pkg/vmcp/config/config.go	Adds partial failure mode constants (fail and best_effort) with documentation
pkg/vmcp/aggregator/default_aggregator.go	Implements failure mode support in aggregator with health tracking and enforcement logic
pkg/vmcp/aggregator/default_aggregator_test.go	Adds comprehensive test coverage for both failure modes with various backend health scenarios
cmd/vmcp/app/commands.go	Integrates failure mode configuration from operational config into aggregator initialization
pkg/vmcp/server/integration_test.go	Updates all test aggregator instantiations to use explicit fail mode
test/integration/vmcp/helpers/vmcp_server.go	Updates test helper to use fail mode by default for test consistency

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

pkg/vmcp/config/config.go

pkg/vmcp/aggregator/default_aggregator.go

Implements configurable failure handling when backend MCP servers become unavailable. The aggregator now supports two modes: - fail (default): Returns error if all backends unavailable; logs warning if some backends are down but continues with healthy ones - best_effort: Continues with available backends regardless of failures; logs info about unavailable backends Related-to: #3036

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated no new comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

jerm-dro · 2026-01-30T21:59:33Z

pkg/vmcp/aggregator/default_aggregator.go

+		// Track backend health
+		if backend.HealthStatus.IsHealthyForRouting() {
+			healthyBackends[backend.ID] = true
+		} else {
+			unhealthyBackends[backend.ID] = string(backend.HealthStatus)
+		}


Suggestion: The factoring of the defaultAggregator leaves a bit to be desired and I think this change motivates improving it.

Background

The defaultAggregator implements the Aggregator interface, which has 4 unused public methods: QueryCapabilities, QueryAllCapabilities, ResolveConflicts, and MergeCapabilities.

Suggestion

Rather than inlining more logic into defaultAggregator, lets:

Remove the unused methods from the interface.

Implement a circuitBreakerAggregator for your new logic. This implements the new and improved Aggregator interface and decorates any arbitrary aggregator:

type circuitBreakerAggregator struct { // this would really just be the defaultAggregator inner Aggregator } func (c circuitBreakerAggregator) AggregateCapabilities(...) { // check all the backends for health err := c.enforceFailureMode(...) if err != nil { return nil, err } return inner.AggregateCapabilities(...) }

I think this makes your change cleaner, because it doesn't add any complexity to the existing defaultAggregator's construction or runtime behavior. It would also enable pretty straightforward unit testing of the circuitBreakerAggregator, because you don't need the whole defaultAggregator to test it.

What do you think?

yrobla · 2026-02-05T14:27:43Z

reimplementing as there were so many changes, makes sense to start from scratch

taskbot and others added 2 commits January 30, 2026 09:13

yrobla requested review from JAORMX, amirejaz, jerm-dro and jhrozek as code owners January 30, 2026 15:11

github-actions bot added the size/S Small PR: 100-299 lines changed label Jan 30, 2026

yrobla requested a review from Copilot January 30, 2026 15:28

Copilot started reviewing on behalf of yrobla January 30, 2026 15:29 View session

Copilot AI reviewed Jan 30, 2026

View reviewed changes

pkg/vmcp/config/config.go Show resolved Hide resolved

pkg/vmcp/config/config.go Outdated Show resolved Hide resolved

pkg/vmcp/aggregator/default_aggregator.go Outdated Show resolved Hide resolved

yrobla force-pushed the issue-3036-v2 branch from 3748109 to 1602ebe Compare January 30, 2026 15:58

yrobla requested a review from ChrisJBurns as a code owner January 30, 2026 15:58

github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/S Small PR: 100-299 lines changed labels Jan 30, 2026

yrobla self-assigned this Jan 30, 2026

yrobla requested a review from Copilot January 30, 2026 16:01

Copilot started reviewing on behalf of yrobla January 30, 2026 16:01 View session

Copilot AI reviewed Jan 30, 2026

View reviewed changes

jerm-dro reviewed Jan 30, 2026

View reviewed changes

jerm-dro mentioned this pull request Jan 30, 2026

Integrate circuit breaker with vMCP routing and aggregation #3530

Closed

yrobla force-pushed the issue-3036-v1 branch from d5c6c36 to 9cd1767 Compare February 5, 2026 08:29

yrobla closed this Feb 5, 2026

dmjb deleted the issue-3036-v2 branch February 6, 2026 09:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comments

Add partial failure mode support to vMCP aggregator#3533

Add partial failure mode support to vMCP aggregator#3533
yrobla wants to merge 3 commits intoissue-3036-v1from
issue-3036-v2

yrobla commented Jan 30, 2026

Uh oh!

codecov bot commented Jan 30, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

jerm-dro Jan 30, 2026

Uh oh!

yrobla commented Feb 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Comments

Conversation

yrobla commented Jan 30, 2026

Uh oh!

codecov bot commented Jan 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

jerm-dro Jan 30, 2026

Choose a reason for hiding this comment

Background

Suggestion

Uh oh!

yrobla commented Feb 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

codecov bot commented Jan 30, 2026 •

edited

Loading