@yrobla yrobla commented Dec 22, 2025

Add circuit breaker pattern to the vMCP health monitoring system to prevent cascading failures and enable graceful degradation when backends become unhealthy. The circuit breaker fast-fails health checks when backends are down, reducing unnecessary network calls and allowing faster recovery.

Implementation:

  • Add circuit breaker to health monitoring system (pkg/vmcp/health/)
  • Circuit states map to health states (Closed→Healthy, Open→Unhealthy, HalfOpen→Degraded)
  • Per-backend circuit isolation - each backend has independent circuit state
  • Configurable failure threshold and timeout for circuit transitions
  • Fast-fail behavior skips health checks when circuit is open
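The state mapping and per-backend isolation listed above could be sketched roughly as follows. This is only an illustration: `CircuitState`, `HealthStatus`, `HealthFor`, and `Registry` are hypothetical names, not the identifiers used in pkg/vmcp/health/.

```go
package main

import (
	"fmt"
	"sync"
)

// CircuitState and HealthStatus are hypothetical types for this sketch.
type CircuitState int

const (
	CircuitClosed CircuitState = iota
	CircuitOpen
	CircuitHalfOpen
)

type HealthStatus string

const (
	Healthy   HealthStatus = "healthy"
	Unhealthy HealthStatus = "unhealthy"
	Degraded  HealthStatus = "degraded"
)

// HealthFor applies the mapping from the description:
// Closed→Healthy, Open→Unhealthy, HalfOpen→Degraded.
func HealthFor(s CircuitState) HealthStatus {
	switch s {
	case CircuitOpen:
		return Unhealthy
	case CircuitHalfOpen:
		return Degraded
	default:
		return Healthy
	}
}

// Registry keeps one circuit state per backend ID, so a failing backend
// never changes the state recorded for another backend.
type Registry struct {
	mu     sync.Mutex
	states map[string]CircuitState
}

func NewRegistry() *Registry {
	return &Registry{states: make(map[string]CircuitState)}
}

// Set records the circuit state for a single backend.
func (r *Registry) Set(backendID string, s CircuitState) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.states[backendID] = s
}

// Get returns the backend's state; unknown backends read as CircuitClosed.
func (r *Registry) Get(backendID string) CircuitState {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.states[backendID]
}

func main() {
	r := NewRegistry()
	r.Set("backend-a", CircuitOpen)
	// backend-b is untouched by backend-a's open circuit.
	fmt.Println(HealthFor(r.Get("backend-a")), HealthFor(r.Get("backend-b")))
}
```

Treating an unknown backend as closed is an assumption of this sketch; it keeps the zero value of the map consistent with a healthy default.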

Configuration:

  • Disabled by default for backward compatibility
  • Optional circuit_breaker config in VirtualMCPServer operational settings
  • Configurable failure_threshold and timeout parameters

Testing:

  • Unit tests for circuit state machine logic (circuit_breaker_test.go)
  • Integration tests with health monitor (monitor_test.go)
  • End-to-end tests in Kubernetes environment (virtualmcp_circuit_breaker_test.go)
  • All tests run in parallel for faster execution

The circuit breaker opens after failure_threshold consecutive failures, transitions to half-open once the configured timeout has elapsed, and closes on successful recovery. This avoids overwhelming failing backends while keeping healthy backends available.

🤖 Generated with Claude Code

Large PR Justification

This PR implements the circuit breaker pattern for vMCP backend health monitoring as a single atomic feature. The circuit breaker logic is tightly coupled across multiple health system components (status tracker, monitor, config) that must work together - splitting would create intermediate states where the feature is incomplete or broken. The configuration flows through multiple layers (YAML → commands → server → monitor → status), and separating config from implementation would leave the system non-functional. Additionally, the circuit states (Closed/Open/HalfOpen) map directly to health states (Healthy/Unhealthy/Degraded), a contract that must be established atomically.

taskbot and others added 6 commits December 22, 2025 11:19
author taskbot <[email protected]> 1766072123 +0100
committer taskbot <[email protected]> 1766158585 +0100

Integrate health monitoring into vMCP server

Integrates the health monitoring infrastructure (from previous into
the vMCP server, enabling periodic backend health checks with configurable
intervals and thresholds.

Related-to: #3036

changes from review

changes from review

add missing method

Apply suggestion from @Copilot

Co-authored-by: Copilot <[email protected]>

Implement health check system that monitors backend MCP server availability
through periodic ListCapabilities calls. This is the foundation for the
health monitoring and circuit breaker system described in issue #3036.

Addresses first part of #3036

Adds health monitoring integration to the Kubernetes operator controller,
enabling real-time backend health status tracking and reporting in the
VirtualMCPServer CRD status.

@github-actions github-actions bot added the size/XL Extra large PR: 1000+ lines changed label Dec 22, 2025
@github-actions github-actions bot left a comment

Large PR Detected

This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.

How to unblock this PR:

Add a section to your PR description with the following format:

## Large PR Justification

[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformation

Alternative:

Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.

See our Contributing Guidelines for more details.


This review will be automatically dismissed once you add the justification section.

codecov bot commented Dec 22, 2025

Codecov Report

❌ Patch coverage is 88.80597% with 15 lines in your changes missing coverage. Please review.
✅ Project coverage is 57.05%. Comparing base (5d3dc03) to head (24a6b2c).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| cmd/vmcp/app/commands.go | 0.00% | 8 Missing ⚠️ |
| pkg/vmcp/health/status.go | 93.00% | 5 Missing and 2 partials ⚠️ |

Additional details and impacted files
@@                        Coverage Diff                        @@
##           feat/issue-3036-healthcheck-3    #3136      +/-   ##
=================================================================
+ Coverage                          56.93%   57.05%   +0.12%     
=================================================================
  Files                                341      342       +1     
  Lines                              34201    34332     +131     
=================================================================
+ Hits                               19471    19589     +118     
- Misses                             13114    13126      +12     
- Partials                            1616     1617       +1     

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
@yrobla yrobla force-pushed the feat/issue-3036-healthcheck-circuitbreaker branch from cc090f9 to aee49c0 Compare December 22, 2025 12:19
@github-actions github-actions bot dismissed their stale review December 22, 2025 12:20

Large PR justification has been provided. Thank you!

@github-actions github-actions bot removed the size/XL Extra large PR: 1000+ lines changed label Dec 22, 2025
@github-actions

✅ Large PR justification has been provided. The size review has been dismissed and this PR can now proceed with normal review.

@github-actions github-actions bot added the size/XL Extra large PR: 1000+ lines changed label Dec 22, 2025
@yrobla yrobla requested a review from Copilot December 22, 2025 12:41
Copilot AI left a comment

Pull request overview

This PR implements a circuit breaker pattern for vMCP backend health monitoring to prevent cascading failures and enable graceful degradation. The circuit breaker fast-fails health checks when backends are down, reducing unnecessary network calls and allowing faster recovery.

Key changes:

  • New circuit breaker state machine with three states (Closed, Open, HalfOpen) that map to health states (Healthy, Unhealthy, Degraded)
  • Per-backend circuit isolation with independent state tracking for each backend
  • Disabled by default for backward compatibility with configurable threshold and timeout parameters

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| pkg/vmcp/health/config.go | Defines circuit breaker configuration, validation logic, and state constants |
| pkg/vmcp/health/config_test.go | Unit tests for circuit breaker configuration validation and state string conversion |
| pkg/vmcp/health/circuit_breaker_test.go | Comprehensive unit tests for circuit breaker state machine logic and transitions |
| pkg/vmcp/health/status.go | Integrates circuit breaker into status tracker with state management methods |
| pkg/vmcp/health/status_test.go | Updates status tracker tests to pass circuit breaker config parameter |
| pkg/vmcp/health/monitor.go | Implements fast-fail logic by checking circuit state before health checks |
| pkg/vmcp/health/monitor_test.go | Integration tests for circuit breaker with health monitor including full cycle and backward compatibility tests |
| cmd/vmcp/app/commands.go | Maps circuit breaker configuration from YAML to health monitor config |
| cmd/vmcp/README.md | Comprehensive documentation of health monitoring and circuit breaker features |
| examples/operator/virtual-mcps/vmcp_health_monitoring.yaml | Example configuration demonstrating circuit breaker settings |
| test/e2e/thv-operator/virtualmcp/virtualmcp_circuit_breaker_test.go | End-to-end Kubernetes tests verifying circuit breaker behavior in production-like environment |
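
The fast-fail check summarized for pkg/vmcp/health/monitor.go amounts to consulting the circuit state before issuing the probe. A minimal sketch, assuming hypothetical names (`State`, `CheckFunc`, `CheckWithFastFail`, `ErrCircuitOpen`) that are not taken from the PR:

```go
package main

import (
	"errors"
	"fmt"
)

type State int

const (
	Closed State = iota
	Open
	HalfOpen
)

// CheckFunc stands in for the real health probe against a backend.
type CheckFunc func(backendID string) error

// ErrCircuitOpen is a hypothetical sentinel returned when the probe is skipped.
var ErrCircuitOpen = errors.New("circuit open: health check skipped")

// CheckWithFastFail skips the probe entirely while the circuit is open,
// so no network call is made to a backend already known to be failing.
// Closed and half-open circuits run the probe as usual.
func CheckWithFastFail(state State, backendID string, check CheckFunc) error {
	if state == Open {
		return ErrCircuitOpen
	}
	return check(backendID)
}

func main() {
	calls := 0
	probe := func(string) error { calls++; return nil }

	err := CheckWithFastFail(Open, "backend-a", probe)
	fmt.Println(err, calls) // circuit open: health check skipped 0

	err = CheckWithFastFail(Closed, "backend-a", probe)
	fmt.Println(err, calls) // <nil> 1
}
```

Letting the half-open state run the probe is what allows a recovering backend to close its circuit again.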

@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Dec 22, 2025
@yrobla yrobla force-pushed the feat/issue-3036-healthcheck-3 branch from 5d3dc03 to 7c7fc41 Compare January 5, 2026 09:49
yrobla commented Jan 19, 2026

Waiting on the changes for the status reporter abstraction.

@yrobla yrobla marked this pull request as draft January 19, 2026 08:22