
[Feature Request]: Configurable Connection Retries for DB and Redis #179

@crivetimihai

🧭 Epic

Title: Configurable Connection-Retry Back-off for Database & Redis
Goal: Prevent gateway startup failures caused by slow-to-start services by making retry limits and intervals fully configurable.
Why now: In containerised deployments, Postgres (or any other SQL backend) and Redis often take several seconds to accept TCP connections. Today the gateway gives up on the first failure and crashes with psycopg2.OperationalError: connection … refused. Operators need a simple way, via environment variables in the style of DB_POOL_SIZE and CACHE_TYPE, to tune how many times and how often the gateway retries before aborting.


🧭 Type of Feature

  • Enhancement to existing functionality

🙋‍♂️ User Story 1

As a: DevOps engineer running MCP Gateway in Kubernetes
I want: to set DB_MAX_RETRIES and DB_RETRY_INTERVAL_MS for any SQLAlchemy-backed database
So that: the application patiently waits until the database container is ready, instead of crashing and restarting in a loop

✅ Acceptance Criteria

Scenario: Database retries during startup
Given MCP Gateway starts with DB_MAX_RETRIES=10 and DB_RETRY_INTERVAL_MS=5000
And the database is not yet ready
When the first connection attempt fails
Then the gateway should log "DB connection failed (attempt 1/10), retrying in 5 s"
And it should retry every 5 s
And after 10 failed attempts it should exit with a clear error message
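
A minimal sketch of the fixed-interval retry loop this scenario describes. It assumes max_retries counts total attempts (as the "attempt 1/10" log line suggests). The names connect_with_retry and gateway.db are hypothetical, not taken from the gateway codebase.

```python
import logging
import time
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger("gateway.db")  # hypothetical logger name


def connect_with_retry(
    connect: Callable[[], T],
    max_retries: int = 3,
    retry_interval_ms: int = 2000,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call connect() up to max_retries times with a fixed pause between failures."""
    delay_s = retry_interval_ms / 1000
    for attempt in range(1, max_retries + 1):
        try:
            return connect()
        except Exception as exc:  # a real implementation would catch the driver's error type
            if attempt == max_retries:
                # Out of attempts: fail with one clear message, keeping the cause.
                raise RuntimeError(
                    f"DB connection failed after {max_retries} attempts"
                ) from exc
            logger.warning(
                "DB connection failed (attempt %d/%d), retrying in %g s",
                attempt, max_retries, delay_s,
            )
            sleep(delay_s)
    # Only reached when max_retries < 1, so no attempt was made at all.
    raise ValueError("max_retries must be at least 1")
```

The sleep parameter is injectable only so the loop can be tested without waiting. In the gateway it would wrap whatever call first touches the database.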

🙋‍♂️ User Story 2

As a: Site Reliability Engineer using Redis caching
I want: analogous knobs REDIS_MAX_RETRIES and REDIS_RETRY_INTERVAL_MS
So that: the gateway’s cache layer behaves consistently with the database layer

✅ Acceptance Criteria

Scenario: Redis retries respect custom settings
Given REDIS_MAX_RETRIES=15 and REDIS_RETRY_INTERVAL_MS=2000
And Redis is unavailable
When the cache subsystem initialises
Then it must retry up to 15 times at 2 s intervals
And surface a single aggregated error if all attempts fail
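
One way the "single aggregated error" could look, sketched with asyncio. RetriesExhaustedError and connect_async_with_retry are hypothetical names introduced for illustration, and connect stands in for whatever coroutine actually opens the Redis connection.

```python
import asyncio
from typing import Awaitable, Callable, List, TypeVar

T = TypeVar("T")


class RetriesExhaustedError(Exception):
    """Hypothetical error type that aggregates every failed attempt."""

    def __init__(self, service: str, errors: List[Exception]) -> None:
        self.errors = errors
        details = "; ".join(f"attempt {i}: {e!r}" for i, e in enumerate(errors, 1))
        super().__init__(
            f"{service} unavailable after {len(errors)} attempts ({details})"
        )


async def connect_async_with_retry(
    connect: Callable[[], Awaitable[T]],
    service: str = "Redis",
    max_retries: int = 3,
    retry_interval_ms: int = 2000,
) -> T:
    """Await connect() up to max_retries times, collecting each failure."""
    errors: List[Exception] = []
    for attempt in range(1, max_retries + 1):
        try:
            return await connect()
        except Exception as exc:  # narrow this to the client's connection errors in practice
            errors.append(exc)
            if attempt < max_retries:
                # Fixed pause between attempts; no sleep after the last one.
                await asyncio.sleep(retry_interval_ms / 1000)
    # Every attempt failed: raise once, carrying all collected errors.
    raise RetriesExhaustedError(service, errors)
```

Collecting the exceptions rather than re-raising the last one keeps the per-attempt causes available to the caller while still surfacing only one error.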

🙋‍♂️ User Story 3

As a: Platform administrator
I want: sensible defaults (3 retries × 2000 ms) so I don’t have to set anything for typical installs
So that: small deployments gain resilience out-of-the-box

✅ Acceptance Criteria


Scenario: Defaults work when variables are unset
Given none of the four *_MAX_RETRIES or *_RETRY_INTERVAL_MS variables is set
When the gateway cannot reach the database
Then it retries 3 times with 2 s between attempts
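
The default-handling behaviour could be sketched as below. This uses only the standard library for illustration; the gateway's actual settings mechanism in config.py may differ, and RetrySettings and _int_env are hypothetical names.

```python
import os
from dataclasses import dataclass
from typing import Mapping


def _int_env(env: Mapping[str, str], name: str, default: int) -> int:
    """Read a non-negative integer from env, falling back to default when unset or blank."""
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return default
    value = int(raw)  # raises ValueError on non-numeric input
    if value < 0:
        raise ValueError(f"{name} must be non-negative, got {value}")
    return value


@dataclass(frozen=True)
class RetrySettings:
    # Defaults proposed in this issue: 3 retries at 2000 ms.
    db_max_retries: int = 3
    db_retry_interval_ms: int = 2000
    redis_max_retries: int = 3
    redis_retry_interval_ms: int = 2000

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "RetrySettings":
        return cls(
            db_max_retries=_int_env(env, "DB_MAX_RETRIES", 3),
            db_retry_interval_ms=_int_env(env, "DB_RETRY_INTERVAL_MS", 2000),
            redis_max_retries=_int_env(env, "REDIS_MAX_RETRIES", 3),
            redis_retry_interval_ms=_int_env(env, "REDIS_RETRY_INTERVAL_MS", 2000),
        )
```

With no variables set, RetrySettings.from_env({}) yields the 3 × 2000 ms defaults. Setting any one variable overrides only that field.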


📐 Design Sketch (optional)

flowchart TD
subgraph Settings
ENV[".env / ENV vars"]
end
ENV --> Config[config.py load Settings]
Config -->|DB config| DBClient[SQLAlchemy Engine]
Config -->|Cache config| RedisClient[Redis Async Client]

DBClient --> RetryLogic
RedisClient --> RetryLogic

RetryLogic{{"Retry up to N times\nsleep(interval_ms)"}}
RetryLogic --> Logger
| Component | Change | Detail |
| --- | --- | --- |
| config.py | Add four @env_settings fields | db_max_retries:int=3, db_retry_interval_ms:int=2000, redis_max_retries:int=3, redis_retry_interval_ms:int=2000 |
| db.py | Wrap DB connect logic in retry loop | Use asyncio.sleep() for interval |
| Redis cache init | Retry connect with same config | Applies if CACHE_TYPE=redis |
| Logging | One log per retry attempt | Include attempt count and retry delay |
| README | Add env var docs | Match style of existing DB/Redis config |

🔗 MCP Standards Check

  • No protocol-level breaking changes
  • Follows existing env var config conventions
  • Deviations: none

🔄 Alternatives Considered

| Alternative | Pros | Cons |
| --- | --- | --- |
| docker-compose healthcheck + depends_on | Simple for local dev | Doesn’t help bare-metal or k8s |
| Exponential back-off | Faster in some cases | Adds complexity |
| Gunicorn retry preload script | Works for some runners | Not general-purpose |
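
To make the fixed-interval versus exponential trade-off concrete, the following sketch computes the delay schedule for each. The function names and the cap_ms parameter are illustrative assumptions; the proposal itself uses only the fixed schedule.

```python
from typing import List


def fixed_delays(max_retries: int, interval_ms: int) -> List[int]:
    """Delays between attempts for a fixed interval (one fewer than the attempts)."""
    return [interval_ms] * max(max_retries - 1, 0)


def exponential_delays(max_retries: int, base_ms: int, cap_ms: int) -> List[int]:
    """Delays that double from base_ms on each retry, clamped to cap_ms."""
    return [min(base_ms * 2**i, cap_ms) for i in range(max(max_retries - 1, 0))]


if __name__ == "__main__":
    # Worst-case total wait before giving up, in milliseconds.
    print(sum(fixed_delays(10, 5000)))              # 45000
    print(sum(exponential_delays(10, 500, 5000)))   # 32500
```

The exponential schedule retries sooner at first, which helps when the dependency comes up within a second or two. It also needs an extra cap setting, which is the added complexity the table refers to.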

📓 Additional Context

  • Mirrors retry pattern used in max_tool_retries
  • Naming aligned with DB_POOL_SIZE, CACHE_TYPE, etc.
  • Once implemented, update the Helm chart to expose the new variables as well.

Example configuration:

export DB_MAX_RETRIES=10
export DB_RETRY_INTERVAL_MS=5000
export REDIS_MAX_RETRIES=10
export REDIS_RETRY_INTERVAL_MS=5000

Metadata

Labels

  • enhancement: New feature or request
  • good first issue: Good for newcomers
  • python: Python / backend development (FastAPI)
  • triage: Issues / Features awaiting triage
