Description
🧭 Epic
Title: Configurable Connection-Retry Back-off for Database & Redis
Goal: Prevent gateway startup failures caused by slow-to-start services by making retry limits and intervals fully configurable.
Why now: In containerised deployments, Postgres (or any other SQL backend) and Redis often take several seconds to accept TCP connections. Today the gateway gives up on the first failure, crashing with `psycopg2.OperationalError: connection … refused`. Operators need a simple way, via environment variables (just like `DB_POOL_SIZE`, `CACHE_TYPE`, etc.), to tune how many times and how often the gateway retries before aborting.
🧭 Type of Feature
- Enhancement to existing functionality
🙋‍♂️ User Story 1
As a: DevOps engineer running MCP Gateway in Kubernetes
I want: to set `DB_MAX_RETRIES` and `DB_RETRY_INTERVAL_MS` for any SQLAlchemy-backed database
So that: the application patiently waits until the database container is ready, instead of crashing and restarting in a loop
✅ Acceptance Criteria
Scenario: Database retries during startup
Given MCP Gateway starts with DB_MAX_RETRIES=10 and DB_RETRY_INTERVAL_MS=5000
And the database is not yet ready
When the first connection attempt fails
Then the gateway should log "DB connection failed (attempt 1/10), retrying in 5 s"
And it should retry every 5 s
And after 10 failed attempts it should exit with a clear error message
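A minimal sketch of what the retry loop in `db.py` could look like. The helper name `connect_with_retry`, the settings being passed as arguments, and the synchronous style are illustrative assumptions, not the actual implementation; with an async engine the same loop applies with `asyncio.sleep()` in place of `time.sleep()`.

```python
import logging
import time

from sqlalchemy import create_engine
from sqlalchemy.exc import OperationalError

logger = logging.getLogger(__name__)


def connect_with_retry(database_url: str, max_retries: int, retry_interval_ms: int):
    """Return a connected SQLAlchemy engine, retrying until max_retries is exhausted.

    Hypothetical helper; names and placement are illustrative only.
    """
    engine = create_engine(database_url)
    interval_s = retry_interval_ms / 1000
    for attempt in range(1, max_retries + 1):
        try:
            # A short-lived connection is enough to prove the database is up.
            with engine.connect():
                return engine
        except OperationalError:
            if attempt == max_retries:
                logger.error("DB connection failed after %d attempts, giving up", max_retries)
                raise
            logger.warning(
                "DB connection failed (attempt %d/%d), retrying in %g s",
                attempt, max_retries, interval_s,
            )
            time.sleep(interval_s)
```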
🙋‍♂️ User Story 2
As a: Site Reliability Engineer using Redis caching
I want: analogous knobs `REDIS_MAX_RETRIES` and `REDIS_RETRY_INTERVAL_MS`
So that: the gateway’s cache layer behaves consistently with the database layer
✅ Acceptance Criteria
Scenario: Redis retries respect custom settings
Given REDIS_MAX_RETRIES=15 and REDIS_RETRY_INTERVAL_MS=2000
And Redis is unavailable
When the cache subsystem initialises
Then it must retry up to 15 times at 2 s intervals
And surface a single aggregated error if all attempts fail
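A comparable sketch for the cache layer, assuming the client comes from redis-py's `redis.asyncio` module. The helper name and the way failures are aggregated into a single final error are assumptions for illustration, not the gateway's actual API.

```python
import asyncio
import logging

from redis.asyncio import Redis
from redis.exceptions import RedisError

logger = logging.getLogger(__name__)


async def redis_connect_with_retry(redis_url: str, max_retries: int, retry_interval_ms: int) -> Redis:
    """Return a Redis client once PING succeeds, or raise one aggregated error."""
    client = Redis.from_url(redis_url)
    interval_s = retry_interval_ms / 1000
    for attempt in range(1, max_retries + 1):
        try:
            await client.ping()
            return client
        except (RedisError, OSError) as exc:
            if attempt == max_retries:
                # Surface a single error summarising all attempts.
                raise RuntimeError(
                    f"Redis unreachable after {max_retries} attempts (last error: {exc})"
                ) from exc
            logger.warning(
                "Redis connection failed (attempt %d/%d), retrying in %g s",
                attempt, max_retries, interval_s,
            )
            await asyncio.sleep(interval_s)
```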
🙋‍♂️ User Story 3
As a: Platform administrator
I want: sensible defaults (3 retries × 2000 ms interval) without having to set anything for typical installs
So that: small deployments gain resilience out-of-the-box
✅ Acceptance Criteria
Scenario: Defaults work when variables are unset
Given none of the four `*_MAX_RETRIES` or `*_RETRY_INTERVAL_MS` variables are set
When the gateway cannot reach the database
Then it retries 3 times with 2 s between attempts
📐 Design Sketch (optional)
```mermaid
flowchart TD
    subgraph Settings
        ENV[".env / ENV vars"]
    end
    ENV --> Config["config.py loads Settings"]
    Config -->|DB config| DBClient["SQLAlchemy Engine"]
    Config -->|Cache config| RedisClient["Redis Async Client"]
    DBClient --> RetryLogic
    RedisClient --> RetryLogic
    RetryLogic{{"Retry up to N times<br/>sleep(interval_ms)"}}
    RetryLogic --> Logger
```
| Component | Change | Detail |
|---|---|---|
| `config.py` | Add four `@env_settings` fields | `db_max_retries: int = 3`, `db_retry_interval_ms: int = 2000`, `redis_max_retries: int = 3`, `redis_retry_interval_ms: int = 2000` |
| `db.py` | Wrap DB connect logic in a retry loop | Use `asyncio.sleep()` for the interval |
| Redis cache init | Retry connect with the same config | Applies if `CACHE_TYPE=redis` |
| Logging | One log entry per retry attempt | Include attempt count and retry delay |
| README | Add env var docs | Match the style of the existing DB/Redis config |
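A sketch of how the four fields might be declared in `config.py`, assuming the settings class is a Pydantic `BaseSettings` (so the upper-case env vars map onto the lower-case fields automatically). The class name and import are assumptions, not the gateway's actual code.

```python
# Assumes pydantic-settings (Pydantic v2); on Pydantic v1 the import would be
# `from pydantic import BaseSettings` instead.
from pydantic_settings import BaseSettings


class Settings(BaseSettings):
    # Database connection retry knobs (env: DB_MAX_RETRIES / DB_RETRY_INTERVAL_MS)
    db_max_retries: int = 3
    db_retry_interval_ms: int = 2000

    # Redis connection retry knobs (env: REDIS_MAX_RETRIES / REDIS_RETRY_INTERVAL_MS)
    redis_max_retries: int = 3
    redis_retry_interval_ms: int = 2000
```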
🔗 MCP Standards Check
- No protocol-level breaking changes
- Follows existing env var config conventions
- Deviations: none
🔄 Alternatives Considered
| Alternative | Pros | Cons |
|---|---|---|
| docker-compose `healthcheck` + `depends_on` | Simple for local dev | Doesn’t help bare-metal or k8s |
| Exponential back-off | Faster in some cases | Adds complexity |
| Gunicorn retry preload script | Works for some runners | Not general-purpose |
📓 Additional Context
- Mirrors retry pattern used in
max_tool_retries
- Naming aligned with
DB_POOL_SIZE
,CACHE_TYPE
, etc. - Ensure this is then changed in the helm chart.
Example configuration:

```bash
export DB_MAX_RETRIES=10
export DB_RETRY_INTERVAL_MS=5000
export REDIS_MAX_RETRIES=10
export REDIS_RETRY_INTERVAL_MS=5000
```