fix(lapis): health check: consider LAPIS "UP" even when SILO is down #1594

fengelniederhammer wants to merge 1 commit into main
Conversation
Pull request overview
Adjusts LAPIS’ Spring Boot Actuator health reporting so that the service remains reported as UP even if the dependent SILO service is unavailable, while still surfacing SILO-related diagnostics in health details.
Changes:
- Always returns `Health.up()` from `SiloHealthIndicator`.
- Adds SILO diagnostic details (versions on success; `siloStatus`/error info on failure) instead of marking overall health as DOWN.
```kotlin
        .withDetail("siloStatus", "DOWN")
        .withDetail("error", "SILO not reachable")
        .withDetail("message", e.message)
} catch (e: SiloUnavailableException) {
    it
        .withDetail("siloStatus", "DOWN")
        .withDetail("error", "SILO unavailable (HTTP 503)")
        .withDetail("message", e.message)
        .withDetail("retryAfter", e.retryAfter)
} catch (e: Exception) {
    it
        .withDetail("siloStatus", "DOWN")
        .withDetail("error", "Unexpected error checking SILO")
```
"DOWN" is hard-coded multiple times for siloStatus. To avoid typos and keep alignment with Spring Actuator status codes, consider using the Actuator Status constants (e.g., Status.DOWN.code / Status.UP.code) or a single local constant.
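One way to centralize that, sketched in plain Kotlin (the constant `STATUS_DOWN` and the helper `siloDownDetails` are illustrative names, not from this PR; `STATUS_DOWN` mirrors Spring Actuator's `Status.DOWN.code`):

```kotlin
// Illustrative sketch: one local constant instead of repeating the "DOWN" literal.
val STATUS_DOWN = "DOWN" // mirrors Spring Actuator's Status.DOWN.code

// Builds the shared failure details once, so every catch branch stays consistent.
fun siloDownDetails(error: String, message: String?): Map<String, Any> =
    buildMap {
        put("siloStatus", STATUS_DOWN)
        put("error", error)
        message?.let { put("message", it) }
    }
```

Each catch branch would then only contribute its branch-specific details (e.g. `retryAfter`) on top of this shared map.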
```kotlin
.let {
    try {
        val info = cachedSiloClient.callInfo()
        it
```
When SILO is reachable, the response doesn’t include a siloStatus detail, but it is present (as "DOWN") in all failure cases. This makes the health JSON schema inconsistent for consumers; consider always setting siloStatus (e.g., "UP" on success, "DOWN" on failure) so clients don’t have to infer from missing fields.
Suggested change:

```kotlin
it
    .withDetail("siloStatus", "UP")
```
```kotlin
Health
    .up() // LAPIS should always be "up", independent of SILO.
    .let {
        try {
            val info = cachedSiloClient.callInfo()
            it
                .withDetail("dataVersion", info.dataVersion)
                .withDetail("siloVersion", info.siloVersion ?: "unknown")
        } catch (e: SiloNotReachableException) {
            it
                .withDetail("siloStatus", "DOWN")
                .withDetail("error", "SILO not reachable")
                .withDetail("message", e.message)
        } catch (e: SiloUnavailableException) {
            it
                .withDetail("siloStatus", "DOWN")
                .withDetail("error", "SILO unavailable (HTTP 503)")
                .withDetail("message", e.message)
                .withDetail("retryAfter", e.retryAfter)
        } catch (e: Exception) {
            it
                .withDetail("siloStatus", "DOWN")
                .withDetail("error", "Unexpected error checking SILO")
                .withDetail("message", e.message)
        }
    }
    .build()
```
SiloHealthIndicator now always reports Spring Actuator status UP (even when SILO is down) and only encodes SILO failure via details. This is easy to misinterpret because /actuator/health components will still show this indicator as UP; if the intent is to keep overall LAPIS health UP, consider moving SILO checks into a separate health group/contributor (e.g., readiness) or otherwise returning a real DOWN status for SILO while excluding it from the liveness group.
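If the team goes that route, a minimal configuration sketch could look like the following. The indicator name `silo` and the group contents are assumptions (Spring derives the indicator name from the `SiloHealthIndicator` bean name), not something defined in this PR:

```yaml
# Sketch: let the SILO indicator report a real DOWN status, but only include it
# in the readiness group, so the liveness group (and LAPIS itself) stays UP
# while SILO is down.
management:
  endpoint:
    health:
      group:
        liveness:
          include: ping
        readiness:
          include: silo
```

With a setup like this, `/actuator/health/liveness` would not be affected by SILO outages, while `/actuator/health/readiness` would still surface them honestly as DOWN.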
The diff context for this thread shows the new implementation (quoted above) replacing the previous one, which reported a real DOWN status:

```kotlin
try {
    val info = cachedSiloClient.callInfo()
    Health.up()
        .withDetail("dataVersion", info.dataVersion)
        .withDetail("siloVersion", info.siloVersion ?: "unknown")
        .build()
} catch (e: SiloNotReachableException) {
    Health.down()
        .withDetail("siloStatus", "DOWN")
        .withDetail("error", "SILO not reachable")
        .withDetail("message", e.message)
        .build()
} catch (e: SiloUnavailableException) {
    Health.down()
        .withDetail("siloStatus", "DOWN")
        .withDetail("error", "SILO unavailable (HTTP 503)")
        .withDetail("message", e.message)
        .withDetail("retryAfter", e.retryAfter)
        .build()
} catch (e: Exception) {
    Health.down()
        .withDetail("siloStatus", "DOWN")
        .withDetail("error", "Unexpected error checking SILO")
        .withDetail("message", e.message)
        .build()
}
```
This sounds valid; we're in the `SiloHealthIndicator`, so shouldn't there maybe be a separate `LapisHealthIndicator` somewhere?
Edit: This was in response to what ChatGPT said above.
```kotlin
it
    .withDetail("siloStatus", "DOWN")
    .withDetail("error", "SILO not reachable")
    .withDetail("message", e.message)
} catch (e: SiloUnavailableException) {
    it
        .withDetail("siloStatus", "DOWN")
        .withDetail("error", "SILO unavailable (HTTP 503)")
        .withDetail("message", e.message)
        .withDetail("retryAfter", e.retryAfter)
```
The health endpoint is configured to show-details=always and /actuator/** is publicly accessible, so including raw exception messages (e.message) here can leak internal connectivity details (URIs, exception classes, etc.). Consider removing message from the public health response or replacing it with a sanitized/static value and logging the full exception server-side instead.
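A hedged sketch of that approach in plain Kotlin; the exception class below is a local stand-in for the PR's `SiloNotReachableException`, and the stderr call stands in for a real server-side logger:

```kotlin
// Sketch: map exceptions to static, public-safe strings; the raw e.message
// (which may contain internal URIs) goes to the server-side log only.
class SiloNotReachableException(message: String) : RuntimeException(message)

fun publicErrorDetail(e: Exception): String {
    // full detail stays server-side (stand-in for a real logger call)
    System.err.println("SILO health check failed: ${e::class.simpleName}: ${e.message}")
    return when (e) {
        is SiloNotReachableException -> "SILO not reachable"
        else -> "Unexpected error checking SILO"
    }
}
```

The health response would then use `withDetail("error", publicErrorDetail(e))` and drop the `message` detail entirely.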
fhennig left a comment:
What do you think about having UP, DOWN and DEGRADED?
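Spring Actuator does support custom statuses (e.g. `Status("DEGRADED")`), though the aggregation order then needs to be configured via `management.endpoint.health.status.order`. A stand-alone sketch of the three-state mapping (the names here are illustrative, not from this PR):

```kotlin
// Stand-alone sketch of a three-state health model; a real implementation
// would return Spring's Status("DEGRADED") with a configured status order.
enum class ServiceHealth { UP, DEGRADED, DOWN }

fun overallStatus(lapisUp: Boolean, siloUp: Boolean): ServiceHealth = when {
    !lapisUp -> ServiceHealth.DOWN     // LAPIS itself is broken
    !siloUp -> ServiceHealth.DEGRADED  // LAPIS is fine, but SILO is unavailable
    else -> ServiceHealth.UP
}
```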
PR Checklist
- [ ] All necessary documentation has been adapted.
- [ ] All necessary changes are explained in the llms.txt.
- [ ] The implemented feature is covered by an appropriate test.