Description
🧭 Epic
Title: Configurable Connection-Retry Back-off for Database & Redis
Goal: Prevent gateway startup failures caused by slow-to-start services by making retry limits and intervals fully configurable.
Why now: In containerised deployments, Postgres (or any other SQL backend) and Redis often take several seconds to accept TCP connections. Today the gateway gives up on the first failure, crashing with `psycopg2.OperationalError: connection … refused`. Operators need a simple way, via environment variables (just like `DB_POOL_SIZE`, `CACHE_TYPE`, etc.), to tune how many times and how often the gateway retries before aborting.
🧭 Type of Feature
- Enhancement to existing functionality
🙋‍♂️ User Story 1
As a: DevOps engineer running MCP Gateway in Kubernetes
I want: to set `DB_MAX_RETRIES` and `DB_RETRY_INTERVAL_MS` for any SQLAlchemy-backed database
So that: the application patiently waits until the database container is ready, instead of crashing and restarting in a loop
✅ Acceptance Criteria
Scenario: Database retries during startup
Given MCP Gateway starts with DB_MAX_RETRIES=10 and DB_RETRY_INTERVAL_MS=5000
And the database is not yet ready
When the first connection attempt fails
Then the gateway should log "DB connection failed (attempt 1/10), retrying in 5 s"
And it should retry every 5 s
And after 10 failed attempts it should exit with a clear error message
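A minimal sketch of what the retry loop in `db.py` could look like. The helper name `connect_with_retry`, the settings being passed as arguments, and the synchronous style are illustrative assumptions, not the actual implementation; with an async engine the same loop applies with `asyncio.sleep()` in place of `time.sleep()`.

```python
import logging
import time

from sqlalchemy import create_engine
from sqlalchemy.exc import OperationalError

logger = logging.getLogger(__name__)


def connect_with_retry(database_url: str, max_retries: int, retry_interval_ms: int):
    """Return a connected SQLAlchemy engine, retrying until max_retries is exhausted.

    Hypothetical helper; names and placement are illustrative only.
    """
    engine = create_engine(database_url)
    interval_s = retry_interval_ms / 1000
    for attempt in range(1, max_retries + 1):
        try:
            # A short-lived connection is enough to prove the database is up.
            with engine.connect():
                return engine
        except OperationalError:
            if attempt == max_retries:
                logger.error("DB connection failed after %d attempts, giving up", max_retries)
                raise
            logger.warning(
                "DB connection failed (attempt %d/%d), retrying in %g s",
                attempt, max_retries, interval_s,
            )
            time.sleep(interval_s)
```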
🙋‍♂️ User Story 2
As a: Site Reliability Engineer using Redis caching
I want: analogous knobs `REDIS_MAX_RETRIES` and `REDIS_RETRY_INTERVAL_MS`
So that: the gateway’s cache layer behaves consistently with the database layer
✅ Acceptance Criteria
Scenario: Redis retries respect custom settings
Given REDIS_MAX_RETRIES=15 and REDIS_RETRY_INTERVAL_MS=2000
And Redis is unavailable
When the cache subsystem initialises
Then it must retry up to 15 times at 2 s intervals
And surface a single aggregated error if all attempts fail
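A comparable sketch for the cache layer, assuming the client comes from redis-py's `redis.asyncio` module. The helper name and the way failures are aggregated into a single final error are assumptions for illustration, not the gateway's actual API.

```python
import asyncio
import logging

from redis.asyncio import Redis
from redis.exceptions import RedisError

logger = logging.getLogger(__name__)


async def redis_connect_with_retry(redis_url: str, max_retries: int, retry_interval_ms: int) -> Redis:
    """Return a Redis client once PING succeeds, or raise one aggregated error."""
    client = Redis.from_url(redis_url)
    interval_s = retry_interval_ms / 1000
    for attempt in range(1, max_retries + 1):
        try:
            await client.ping()
            return client
        except (RedisError, OSError) as exc:
            if attempt == max_retries:
                # Surface a single error summarising all attempts.
                raise RuntimeError(
                    f"Redis unreachable after {max_retries} attempts (last error: {exc})"
                ) from exc
            logger.warning(
                "Redis connection failed (attempt %d/%d), retrying in %g s",
                attempt, max_retries, interval_s,
            )
            await asyncio.sleep(interval_s)
```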
🙋‍♂️ User Story 3
As a: Platform administrator
I want: sensible defaults (3 retries × 2000 ms interval) without having to set anything for typical installs
So that: small deployments gain resilience out-of-the-box
✅ Acceptance Criteria
Scenario: Defaults work when variables are unset
Given none of the four `*_MAX_RETRIES` or `*_RETRY_INTERVAL_MS` variables are set
When the gateway cannot reach the database
Then it retries 3 times with 2 s between attempts
📐 Design Sketch (optional)
```mermaid
flowchart TD
    subgraph Settings
        ENV[".env / ENV vars"]
    end
    ENV --> Config["config.py loads Settings"]
    Config -->|DB config| DBClient["SQLAlchemy Engine"]
    Config -->|Cache config| RedisClient["Redis Async Client"]
    DBClient --> RetryLogic
    RedisClient --> RetryLogic
    RetryLogic{{"Retry up to N times<br/>sleep(interval_ms)"}}
    RetryLogic --> Logger
```
| Component | Change | Detail |
|---|---|---|
| `config.py` | Add four `@env_settings` fields | `db_max_retries: int = 3`, `db_retry_interval_ms: int = 2000`, `redis_max_retries: int = 3`, `redis_retry_interval_ms: int = 2000` |
| `db.py` | Wrap DB connect logic in a retry loop | Use `asyncio.sleep()` for the interval |
| Redis cache init | Retry connect with the same config | Applies if `CACHE_TYPE=redis` |
| Logging | One log entry per retry attempt | Include attempt count and retry delay |
| README | Add env var docs | Match the style of the existing DB/Redis config |
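A sketch of how the four fields might be declared in `config.py`, assuming the settings class is a Pydantic `BaseSettings` (so the upper-case env vars map onto the lower-case fields automatically). The class name and import are assumptions, not the gateway's actual code.

```python
# Assumes pydantic-settings (Pydantic v2); on Pydantic v1 the import would be
# `from pydantic import BaseSettings` instead.
from pydantic_settings import BaseSettings


class Settings(BaseSettings):
    # Database connection retry knobs (env: DB_MAX_RETRIES / DB_RETRY_INTERVAL_MS)
    db_max_retries: int = 3
    db_retry_interval_ms: int = 2000

    # Redis connection retry knobs (env: REDIS_MAX_RETRIES / REDIS_RETRY_INTERVAL_MS)
    redis_max_retries: int = 3
    redis_retry_interval_ms: int = 2000
```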
🔗 MCP Standards Check
- No protocol-level breaking changes
- Follows existing env var config conventions
- Deviations: none
🔄 Alternatives Considered
| Alternative | Pros | Cons |
|---|---|---|
| docker-compose `healthcheck` + `depends_on` | Simple for local dev | Doesn’t help bare-metal or k8s |
| Exponential back-off | Faster in some cases | Adds complexity |
| Gunicorn retry preload script | Works for some runners | Not general-purpose |
📓 Additional Context
- Mirrors retry pattern used in
max_tool_retries
- Naming aligned with
DB_POOL_SIZE
,CACHE_TYPE
, etc. - Ensure this is then changed in the helm chart.
Example configuration:

```bash
export DB_MAX_RETRIES=10
export DB_RETRY_INTERVAL_MS=5000
export REDIS_MAX_RETRIES=10
export REDIS_RETRY_INTERVAL_MS=5000
```