Failures in com.hazelcast.jet.impl.connector.WriteJdbcPTest #3027
Comments
Looks like the system was overloaded, but not by the test process:
This issue seems similar to #3021: the snapshot got stuck when some member concurrently completed execution. Because the snapshot was stuck, we were unable to gracefully restart the job. As a result, all the remaining tests in the class were affected, because they share a cluster.
Phase 2 was initiated. It completed successfully on member 5702, but the execution completed on 5701 while in phase 2, which should not happen. I'm going to review the code around this for a possible race.
The failure scenario:
- Snapshot phase 1 is started in `SnapshotContext`.
- Before P1 does phase 1, it completes. This is normal. It emits DONE_ITEM to snapshotOutbox.
- `StoreSnapshotTasklet` receives the DONE_ITEM and calls `SnapshotContext.storeSnapshotTaskletDone()`. Because it does so before phase 1 was done, it marks phase 1 as done for the processor.
- We respond to the master that phase 1 is done on the member, and the master initiates phase 2.
- Note that P1 hasn't yet called `SnapshotContext.processorTaskletDone()`, so `numPTasklets` isn't decremented.
- Phase 2 is initiated with the non-decremented `numPTasklets`. We expect this number of processors to do phase 2.
- Now the processor thread calls `SnapshotContext.processorTaskletDone()` and decrements `numPTasklets`. Since it didn't do phase 1, it completes and never does phase 2, and the execution is stuck waiting for that.

This scenario is very unlikely because the other threads have a lot more work to do than the processor's thread: send the DONE_ITEM to the snapshot queue, handle it in another thread, complete a future, respond to an operation in yet another thread, handle the response and issue the phase 2 operation, and handle the phase 2 operation, all in the time it takes the processor thread to proceed to the very next line. It's easily reproducible by inserting a sleep after `outbox.offerToEdgesAndSnapshot(DONE_ITEM)` in `ProcessorTasklet`.

The fix is to call `SnapshotContext.processorTaskletDone()` before adding the DONE_ITEM to snapshotQueue, which ensures it's called before `SnapshotContext.storeSnapshotTaskletDone()`. We also add an assert that a tasklet isn't done without doing phase 2, if phase 2 was initiated.

Fixes hazelcast/hazelcast-jet#3027
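The ordering problem described above can be illustrated with a toy, single-threaded model. This is only a sketch under stated assumptions: `ToySnapshotContext`, `initiatePhase2()` and `missingPhase2Participants()` are hypothetical names invented for illustration and are not the real Jet classes or methods. The whole chain "DONE_ITEM emitted, phase 1 reported done, master initiates phase 2" is collapsed into a single `initiatePhase2()` call, and only the relative order of that call and `processorTaskletDone()` is modeled.

```java
/** Toy, single-threaded model of the ordering problem. Not the real Jet classes. */
class SnapshotOrderingSketch {

    /** Hypothetical stand-in for SnapshotContext; only the tasklet counter matters here. */
    static final class ToySnapshotContext {
        private int numPTasklets;        // processor tasklets still running
        private int phase2Participants;  // how many tasklets phase 2 waits for

        ToySnapshotContext(int numPTasklets) {
            this.numPTasklets = numPTasklets;
        }

        /** Models SnapshotContext.processorTaskletDone(): one fewer live tasklet. */
        void processorTaskletDone() {
            numPTasklets--;
        }

        /** Models phase 2 being initiated: it captures the current tasklet count. */
        void initiatePhase2() {
            phase2Participants = numPTasklets;
        }

        /** Tasklets that phase 2 waits for although they have already finished. */
        int missingPhase2Participants() {
            return phase2Participants - numPTasklets;
        }
    }

    /** Runs one scenario with two processor tasklets, one of which completes early. */
    static int missingPhase2Participants(boolean fixedOrder) {
        ToySnapshotContext ctx = new ToySnapshotContext(2);
        if (fixedOrder) {
            // Fixed order: decrement first, then emit DONE_ITEM (which leads to phase 2).
            ctx.processorTaskletDone();
            ctx.initiatePhase2();
        } else {
            // Old order: DONE_ITEM is emitted first and phase 2 starts
            // before the processor thread gets to decrement the counter.
            ctx.initiatePhase2();
            ctx.processorTaskletDone();
        }
        return ctx.missingPhase2Participants();
    }

    public static void main(String[] args) {
        System.out.println("old order, missing participants: " + missingPhase2Participants(false));
        System.out.println("fixed order, missing participants: " + missingPhase2Participants(true));
    }
}
```

In this model the old order leaves one participant that phase 2 waits for forever, while the fixed order leaves none. The real code is multi-threaded, so the old order only fails when the other threads win the race.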
master (commit bed77b2)
Failed on IBM JDK 8: http://jenkins.hazelcast.com/job/jet-oss-master-ibm-jdk8/356/testReport/com.hazelcast.jet.impl.connector/WriteJdbcPTest/
All tests except the one named `test` in `com.hazelcast.jet.impl.connector.WriteJdbcPTest` failed. It seems like a problem with establishing the cluster (there are stack trace messages like `Cluster has not elected a master` and `Master address unknown: instance is not yet initialized or is shut down`). See the files with stack traces and standard outputs for details:
com.hazelcast.jet.impl.connector.WriteJdbcPTest.txt
com.hazelcast.jet.impl.connector.WriteJdbcPTest-output.txt