@hellolittlej hellolittlej commented Sep 25, 2025

Context

Mantis has two layers of heartbeats: one at the container/node level and another at the worker/internal process level. When a TE has lost its heartbeat for more than 30 minutes (a given threshold), we should mark it as permanently failed and terminate both the instance and its internal process. The master reassigns the worker to a different TE after the worker heartbeat times out, so if we don't remove the ghost instance, two processes end up running the same task: one inside the ghost container and another reassigned by the master.

  • Introduce a zombie collection that stores the list of TEs that are non-recoverable and should be killed
  • Add logs to collect data on the number of TEs that have lost their heartbeat for more than 30 minutes
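
To make the idea concrete, here is a minimal sketch of how a zombie collection driven by a heartbeat threshold might look. It is not the code in this PR: the class `ZombieTaskExecutorTracker` and its methods are hypothetical names chosen for illustration, and ignoring late heartbeats from an already-marked TE is an assumption based on the "non-recoverable" wording above.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: class and method names are illustrative, not Mantis APIs.
class ZombieTaskExecutorTracker {
    private final Duration threshold;
    // Last heartbeat time per TE id, for TEs that are still considered alive.
    private final Map<String, Instant> lastHeartbeat = new ConcurrentHashMap<>();
    // The "zombie collection": TEs considered non-recoverable and due to be killed.
    private final Set<String> zombies = ConcurrentHashMap.newKeySet();

    ZombieTaskExecutorTracker(Duration threshold) {
        this.threshold = threshold;
    }

    // Records a heartbeat. Assumption: a TE already marked as a zombie stays
    // marked, so its late heartbeats are ignored.
    void onHeartbeat(String teId, Instant receivedAt) {
        if (!zombies.contains(teId)) {
            lastHeartbeat.put(teId, receivedAt);
        }
    }

    // Moves every TE that has been silent for longer than the threshold into the
    // zombie collection and returns the ids newly marked by this call.
    Set<String> sweep(Instant now) {
        Set<String> newlyMarked = new HashSet<>();
        for (Map.Entry<String, Instant> entry : lastHeartbeat.entrySet()) {
            Duration silence = Duration.between(entry.getValue(), now);
            if (silence.compareTo(threshold) > 0) {
                newlyMarked.add(entry.getKey());
            }
        }
        for (String teId : newlyMarked) {
            lastHeartbeat.remove(teId);
            zombies.add(teId);
        }
        return newlyMarked;
    }

    boolean isZombie(String teId) {
        return zombies.contains(teId);
    }

    // Number of TEs currently in the zombie collection, e.g. for a log line.
    int zombieCount() {
        return zombies.size();
    }
}
```

In this sketch a periodic job would call `sweep(Instant.now())`, log the size of the returned set to get the count of TEs past the 30-minute threshold, and then terminate each returned TE. Keeping the marked ids in a separate set means a ghost instance that resumes heartbeating after the threshold is still treated as a zombie rather than silently rejoining.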

Checklist

  • ./gradlew build compiles code correctly
  • Added new tests where applicable
  • ./gradlew test passes all tests
  • Extended README or added javadocs where applicable

@hellolittlej hellolittlej requested a deployment to Integrate Pull Request September 25, 2025 18:02 — with GitHub Actions Waiting

Test Results

306 tests   - 355   300 ✅  - 350   7m 30s ⏱️ - 1m 37s
 87 suites  -  65     6 💤  -   5 
 87 files    -  65     0 ❌ ±  0 

Results for commit 3a91985. ± Comparison against base commit 8853a81.

This pull request removes 355 tests.
io.mantisrx.master.api.akka.route.JacksonTest ‑ testAckSerialization
io.mantisrx.master.api.akka.route.JacksonTest ‑ testDeser4
io.mantisrx.master.api.akka.route.JacksonTest ‑ testOptionalSerialization
io.mantisrx.master.api.akka.route.LeaderRedirectionFilterTest ‑ testRouteChangesIfNotLeader
io.mantisrx.master.api.akka.route.LeaderRedirectionFilterTest ‑ testRouteUnchangedIfLeader
io.mantisrx.master.api.akka.route.LeaderRedirectionRouteTest ‑ testMasterInfoAPIWhenLeader
io.mantisrx.master.api.akka.route.pagination.ListObjectTests ‑ testEmptyList
io.mantisrx.master.api.akka.route.pagination.ListObjectTests ‑ testPaginationInvalidLimit
io.mantisrx.master.api.akka.route.pagination.ListObjectTests ‑ testPaginationLimit
io.mantisrx.master.api.akka.route.pagination.ListObjectTests ‑ testPaginationLimitAndOffset
…
