ZK Orphan Report Issues #2174

Open
2 of 4 tasks
terrywbrady opened this issue Feb 25, 2025 · 5 comments

@terrywbrady
Contributor

terrywbrady commented Feb 25, 2025

  • Why are these errors occurring?
/jobs/jid0000279468/bid
/jobs/jid0000279471/bid
["/jobs/states/completed/51-jid0000279468", "/jobs/states/failed/51-jid0000279468"]
["/jobs/states/completed/51-jid0000279471", "/jobs/states/failed/51-jid0000279471"]
  • Delete button should not appear for an array

  • Delete button should not appear for /bid

  • Which of these issues will resolve once the batches are cleaned up?

@terrywbrady terrywbrady self-assigned this Feb 25, 2025
@mreyescdl
Contributor

There is a duplicate entry for a Job in the Orphan report.
If you look at the Znode entries, they were created off hours, when I assume no manual Admin interaction took place (ask @elopatin-uc3).

stat /batches/bid0000027151/states/batch-failed/jid0000279471
ctime = Mon Feb 24 18:41:03 PST 2025
mtime = Mon Feb 24 18:41:03 PST 2025

stat /batches/bid0000027151/states/batch-completed/jid0000279471
ctime = Mon Feb 24 18:41:49 PST 2025
mtime = Mon Feb 24 18:41:49 PST 2025
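
For reference, the gap between the two ctime values above can be computed with a short script. This is only a sketch: the helper name `parse_zk_time` is made up for illustration, and it assumes both timestamps are in the same timezone, so the zone token is simply dropped.

```python
from datetime import datetime


def parse_zk_time(value: str) -> datetime:
    """Parse a zkCli `stat` timestamp such as 'Mon Feb 24 18:41:03 PST 2025'."""
    parts = value.split()
    # Drop the timezone token (5th field); assumes both values share one zone.
    del parts[4]
    return datetime.strptime(" ".join(parts), "%a %b %d %H:%M:%S %Y")


# ctime values copied from the two stat outputs above
failed_ctime = parse_zk_time("Mon Feb 24 18:41:03 PST 2025")
completed_ctime = parse_zk_time("Mon Feb 24 18:41:49 PST 2025")

gap_seconds = (completed_ctime - failed_ctime).total_seconds()
print(gap_seconds)  # 46.0
```

So the batch-failed node was created 46 seconds before the batch-completed node for the same job.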

To do: Did reprocessing by the Ingest retry logic cause the job to change state?

@elopatin-uc3
Contributor

@mreyescdl Confirming that I did not intervene with any batches via the Admin tool on Monday, Feb 24 at 6:41PM.

@mreyescdl
Contributor

More analysis of the Zookeeper errors shows that Ingest worker 02, and only worker 02, experienced a network issue connecting to the ZK servers. All sessions from Ingest 02 expired at 18:40 on Feb 24th.
Ingest 03 and Ingest 01 were unaffected.
To do: analyze Librato network graphs. Ingest 02 resides in AZ us-west-2a.

I'm assuming that this was a network issue causing client-side disruption of processing.
The retry/reconnection logic did not fully recover, resulting in orphaned nodes.

Here is an example of Client logs showing the issue:

18:40:39.861 [Thread-6-SendThread(uc3-mrtzk-prd03.cdlib.org:2181)] WARN  org.apache.zookeeper.ClientCnxn - Session 0x3002673e8fbbbd4 for server uc3-mrtzk-prd03.cdlib.org/172.30.42.133:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException or SessionTimeoutException.
org.apache.zookeeper.ClientCnxn$ConnectionTimeoutException: Client connection timed out, have not heard from server in 31435ms for session id 0x3002673e8fbbbd4
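
To check which sessions timed out on each worker, the client logs could be scanned for this message. A rough sketch follows: the regex is based only on the single log line quoted above, and the function name `find_timeouts` is hypothetical, so it may need adjusting for other ZooKeeper client versions.

```python
import re

# Pattern derived from the ConnectionTimeoutException line quoted above.
TIMEOUT_RE = re.compile(
    r"have not heard from server in (?P<ms>\d+)ms "
    r"for session id (?P<sid>0x[0-9a-fA-F]+)"
)


def find_timeouts(lines):
    """Yield (session_id, silence_ms) for each client timeout line found."""
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            yield match.group("sid"), int(match.group("ms"))


sample = [
    "org.apache.zookeeper.ClientCnxn$ConnectionTimeoutException: "
    "Client connection timed out, have not heard from server in 31435ms "
    "for session id 0x3002673e8fbbbd4",
    "some unrelated log line",
]

timeouts = list(find_timeouts(sample))
print(timeouts)
```

Running this over the logs from Ingest 01, 02 and 03 would give a quick per-worker count to compare against the network graphs.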

@mreyescdl
Contributor

Terry made changes to the Zookeeper library and redeployed to Stage.
There are no errors in submissions of large manifests, but we still see transient Znode duplication.

["/jobs/states/downloading/03-jid0000000346", "/jobs/states/processing/03-jid0000000346"]               Duplicate JID   FAIL
["/jobs/states/downloading/03-jid0000000347", "/jobs/states/processing/03-jid0000000347"]               Duplicate JID   FAIL
["/jobs/states/downloading/03-jid0000000348", "/jobs/states/processing/03-jid0000000348"]               Duplicate JID   FAIL
["/jobs/states/estimating/03-jid0000000354", "/jobs/states/provisioning/03-jid0000000354"]              Duplicate JID   FAIL

@terrywbrady will look into this.

@mreyescdl
Contributor

mreyescdl commented Mar 4, 2025

We believe that the orphans reported by the Admin UI during active processing are in fact false positives.
Running the Zookeeper command line tool to look at states shows no orphans during active processing.
The states can be listed by running the following on a ZK worker:

/dpr2/bin/zookeeper/zkCli.sh ls -R /jobs/states

This will give a recursive listing of all active Job states.
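
To cross-check the listing against the "Duplicate JID" rows in the Orphan report, the output can be grouped by job id. This is a minimal sketch, not the Admin tool's logic: it assumes `ls -R` prints one full path per line and that state nodes look like `/jobs/states/<state>/<NN>-<jid>` as in the report rows above. The function name and the sample listing are made up for illustration.

```python
import re
from collections import defaultdict

# Matches e.g. /jobs/states/downloading/03-jid0000000346
STATE_PATH_RE = re.compile(
    r"^/jobs/states/(?P<state>[^/]+)/\d+-(?P<jid>jid\d+)$"
)


def find_duplicate_jids(paths):
    """Return {jid: [states]} for every job listed under more than one state."""
    states_by_jid = defaultdict(list)
    for path in paths:
        match = STATE_PATH_RE.match(path.strip())
        if match:  # other lines (parent nodes, CLI noise) are ignored
            states_by_jid[match.group("jid")].append(match.group("state"))
    return {
        jid: sorted(states)
        for jid, states in states_by_jid.items()
        if len(states) > 1
    }


# Hypothetical listing, shaped like the paths in the report above
listing = """\
/jobs/states
/jobs/states/downloading
/jobs/states/processing
/jobs/states/downloading/03-jid0000000346
/jobs/states/processing/03-jid0000000346
/jobs/states/estimating/03-jid0000000354
""".splitlines()

duplicates = find_duplicate_jids(listing)
print(duplicates)
```

An empty result from a real listing taken during active processing would be consistent with the Admin report rows being false positives.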

@terrywbrady may look into the Admin Orphan report, but we decided that this is not a high priority now that we know the ZK library is working as expected.
