Failures in com.hazelcast.jet.impl.connector.WriteJdbcPTest #3027

Closed
olukas opened this issue Apr 14, 2021 · 2 comments · Fixed by hazelcast/hazelcast#18530
olukas commented Apr 14, 2021

master (commit bed77b2)

Failed on IBM JDK 8: http://jenkins.hazelcast.com/job/jet-oss-master-ibm-jdk8/356/testReport/com.hazelcast.jet.impl.connector/WriteJdbcPTest/

All tests besides the test named `test` in com.hazelcast.jet.impl.connector.WriteJdbcPTest failed. It looks like a problem with establishing the cluster (there are stacktrace messages like Cluster has not elected a master and Master address unknown: instance is not yet initialized or is shut down). See the files with stacktraces and standard outputs for details:
com.hazelcast.jet.impl.connector.WriteJdbcPTest.txt
com.hazelcast.jet.impl.connector.WriteJdbcPTest-output.txt

olukas added this to the 4.6 milestone Apr 14, 2021

viliam-durina commented Apr 14, 2021

Looks like the system was overloaded, but not by the test process: load.process < 1%, load.system > 99%

2021-04-14 02:39:10,734 [ INFO] [hz.sleepy_cartwright.HealthMonitor] [c.h.i.d.HealthMonitor]: [127.0.0.1]:5702 [jet] [4.6-SNAPSHOT] processors=72, physical.memory.total=377.6G, physical.memory.free=77.6G, swap.space.total=4.0G, swap.space.free=3.4G, heap.memory.used=459.8M, heap.memory.free=564.2M, heap.memory.total=1024.0M, heap.memory.max=1024.0M, heap.memory.used/total=44.90%, heap.memory.used/max=44.90%, minor.gc.count=119, minor.gc.time=14614ms, major.gc.count=0, major.gc.time=0ms, load.process=0.96%, load.system=99.62%, load.systemAverage=4850.39, thread.count=678, thread.peakCount=2362, cluster.timeDiff=-4, event.q.size=0, executor.q.async.size=0, executor.q.client.size=0, executor.q.client.query.size=0, executor.q.client.blocking.size=0, executor.q.query.size=0, executor.q.scheduled.size=0, executor.q.io.size=0, executor.q.system.size=0, executor.q.operations.size=0, executor.q.priorityOperation.size=0, operations.completed.count=11, executor.q.mapLoad.size=0, executor.q.mapLoadAllKeys.size=0, executor.q.cluster.size=0, executor.q.response.size=0, operations.running.count=0, operations.pending.invocations.percentage=0.00%, operations.pending.invocations.count=0, proxy.count=6, clientEndpoint.count=0, connection.active.count=0, client.connection.count=0, connection.count=0
2021-04-14 02:39:29,762 [ INFO] [hz.dreamy_cartwright.HealthMonitor] [c.h.i.d.HealthMonitor]: [127.0.0.1]:5701 [jet] [4.6-SNAPSHOT] processors=72, physical.memory.total=377.6G, physical.memory.free=78.7G, swap.space.total=4.0G, swap.space.free=3.4G, heap.memory.used=370.6M, heap.memory.free=653.4M, heap.memory.total=1024.0M, heap.memory.max=1024.0M, heap.memory.used/total=36.19%, heap.memory.used/max=36.19%, minor.gc.count=122, minor.gc.time=15114ms, major.gc.count=0, major.gc.time=0ms, load.process=1.28%, load.system=99.14%, load.systemAverage=4816.44, thread.count=817, thread.peakCount=2362, cluster.timeDiff=0, event.q.size=0, executor.q.async.size=0, executor.q.client.size=0, executor.q.client.query.size=0, executor.q.client.blocking.size=0, executor.q.query.size=0, executor.q.scheduled.size=0, executor.q.io.size=0, executor.q.system.size=0, executor.q.operations.size=0, executor.q.priorityOperation.size=0, operations.completed.count=11826, executor.q.mapLoad.size=0, executor.q.mapLoadAllKeys.size=0, executor.q.cluster.size=0, executor.q.response.size=0, operations.running.count=0, operations.pending.invocations.percentage=0.00%, operations.pending.invocations.count=0, proxy.count=3, clientEndpoint.count=0, connection.active.count=0, client.connection.count=0, connection.count=0
2021-04-14 02:39:30,741 [ INFO] [hz.sleepy_cartwright.HealthMonitor] [c.h.i.d.HealthMonitor]: [127.0.0.1]:5702 [jet] [4.6-SNAPSHOT] processors=72, physical.memory.total=377.6G, physical.memory.free=80.8G, swap.space.total=4.0G, swap.space.free=3.4G, heap.memory.used=384.7M, heap.memory.free=639.3M, heap.memory.total=1024.0M, heap.memory.max=1024.0M, heap.memory.used/total=37.57%, heap.memory.used/max=37.57%, minor.gc.count=122, minor.gc.time=15114ms, major.gc.count=0, major.gc.time=0ms, load.process=1.34%, load.system=99.65%, load.systemAverage=4816.44, thread.count=817, thread.peakCount=2362, cluster.timeDiff=0, event.q.size=0, executor.q.async.size=0, executor.q.client.size=0, executor.q.client.query.size=0, executor.q.client.blocking.size=0, executor.q.query.size=0, executor.q.scheduled.size=0, executor.q.io.size=0, executor.q.system.size=0, executor.q.operations.size=0, executor.q.priorityOperation.size=0, operations.completed.count=13485, executor.q.mapLoad.size=0, executor.q.mapLoadAllKeys.size=0, executor.q.cluster.size=0, executor.q.response.size=0, operations.running.count=0, operations.pending.invocations.percentage=0.00%, operations.pending.invocations.count=0, proxy.count=3, clientEndpoint.count=0, connection.active.count=0, client.connection.count=0, connection.count=0
2021-04-14 02:39:49,798 [ INFO] [hz.dreamy_cartwright.HealthMonitor] [c.h.i.d.HealthMonitor]: [127.0.0.1]:5701 [jet] [4.6-SNAPSHOT] processors=72, physical.memory.total=377.6G, physical.memory.free=81.8G, swap.space.total=4.0G, swap.space.free=3.4G, heap.memory.used=463.3M, heap.memory.free=560.7M, heap.memory.total=1024.0M, heap.memory.max=1024.0M, heap.memory.used/total=45.25%, heap.memory.used/max=45.25%, minor.gc.count=123, minor.gc.time=15216ms, major.gc.count=0, major.gc.time=0ms, load.process=0.84%, load.system=100.00%, load.systemAverage=4733.54, thread.count=817, thread.peakCount=2362, cluster.timeDiff=0, event.q.size=0, executor.q.async.size=0, executor.q.client.size=0, executor.q.client.query.size=0, executor.q.client.blocking.size=0, executor.q.query.size=0, executor.q.scheduled.size=0, executor.q.io.size=0, executor.q.system.size=0, executor.q.operations.size=0, executor.q.priorityOperation.size=0, operations.completed.count=15208, executor.q.mapLoad.size=0, executor.q.mapLoadAllKeys.size=0, executor.q.cluster.size=0, executor.q.response.size=0, operations.running.count=0, operations.pending.invocations.percentage=0.00%, operations.pending.invocations.count=0, proxy.count=3, clientEndpoint.count=0, connection.active.count=0, client.connection.count=0, connection.count=0
2021-04-14 02:39:50,746 [ INFO] [hz.sleepy_cartwright.HealthMonitor] [c.h.i.d.HealthMonitor]: [127.0.0.1]:5702 [jet] [4.6-SNAPSHOT] processors=72, physical.memory.total=377.6G, physical.memory.free=81.7G, swap.space.total=4.0G, swap.space.free=3.4G, heap.memory.used=475.6M, heap.memory.free=548.4M, heap.memory.total=1024.0M, heap.memory.max=1024.0M, heap.memory.used/total=46.44%, heap.memory.used/max=46.44%, minor.gc.count=123, minor.gc.time=15216ms, major.gc.count=0, major.gc.time=0ms, load.process=0.96%, load.system=99.58%, load.systemAverage=4733.54, thread.count=817, thread.peakCount=2362, cluster.timeDiff=-1, event.q.size=0, executor.q.async.size=0, executor.q.client.size=0, executor.q.client.query.size=0, executor.q.client.blocking.size=0, executor.q.query.size=0, executor.q.scheduled.size=0, executor.q.io.size=0, executor.q.system.size=0, executor.q.operations.size=0, executor.q.priorityOperation.size=0, operations.completed.count=17151, executor.q.mapLoad.size=0, executor.q.mapLoadAllKeys.size=0, executor.q.cluster.size=0, executor.q.response.size=0, operations.running.count=0, operations.pending.invocations.percentage=0.00%, operations.pending.invocations.count=0, proxy.count=3, clientEndpoint.count=0, connection.active.count=0, client.connection.count=0, connection.count=0
2021-04-14 02:40:09,802 [ INFO] [hz.dreamy_cartwright.HealthMonitor] [c.h.i.d.HealthMonitor]: [127.0.0.1]:5701 [jet] [4.6-SNAPSHOT] processors=72, physical.memory.total=377.6G, physical.memory.free=82.4G, swap.space.total=4.0G, swap.space.free=3.4G, heap.memory.used=317.7M, heap.memory.free=706.3M, heap.memory.total=1024.0M, heap.memory.max=1024.0M, heap.memory.used/total=31.02%, heap.memory.used/max=31.02%, minor.gc.count=124, minor.gc.time=15269ms, major.gc.count=0, major.gc.time=0ms, load.process=0.93%, load.system=99.90%, load.systemAverage=4665.33, thread.count=815, thread.peakCount=2362, cluster.timeDiff=0, event.q.size=0, executor.q.async.size=0, executor.q.client.size=0, executor.q.client.query.size=0, executor.q.client.blocking.size=0, executor.q.query.size=0, executor.q.scheduled.size=0, executor.q.io.size=0, executor.q.system.size=0, executor.q.operations.size=0, executor.q.priorityOperation.size=0, operations.completed.count=18455, executor.q.mapLoad.size=0, executor.q.mapLoadAllKeys.size=0, executor.q.cluster.size=0, executor.q.response.size=0, operations.running.count=0, operations.pending.invocations.percentage=0.00%, operations.pending.invocations.count=0, proxy.count=3, clientEndpoint.count=0, connection.active.count=0, client.connection.count=0, connection.count=0
2021-04-14 02:40:10,755 [ INFO] [hz.sleepy_cartwright.HealthMonitor] [c.h.i.d.HealthMonitor]: [127.0.0.1]:5702 [jet] [4.6-SNAPSHOT] processors=72, physical.memory.total=377.6G, physical.memory.free=82.3G, swap.space.total=4.0G, swap.space.free=3.4G, heap.memory.used=335.2M, heap.memory.free=688.8M, heap.memory.total=1024.0M, heap.memory.max=1024.0M, heap.memory.used/total=32.73%, heap.memory.used/max=32.73%, minor.gc.count=125, minor.gc.time=15312ms, major.gc.count=0, major.gc.time=0ms, load.process=0.82%, load.system=99.59%, load.systemAverage=4665.33, thread.count=815, thread.peakCount=2362, cluster.timeDiff=0, event.q.size=0, executor.q.async.size=0, executor.q.client.size=0, executor.q.client.query.size=0, executor.q.client.blocking.size=0, executor.q.query.size=0, executor.q.scheduled.size=0, executor.q.io.size=0, executor.q.system.size=0, executor.q.operations.size=0, executor.q.priorityOperation.size=0, operations.completed.count=20954, executor.q.mapLoad.size=0, executor.q.mapLoadAllKeys.size=0, executor.q.cluster.size=0, executor.q.response.size=0, operations.running.count=0, operations.pending.invocations.percentage=0.00%, operations.pending.invocations.count=0, proxy.count=3, clientEndpoint.count=0, connection.active.count=0, client.connection.count=0, connection.count=0
2021-04-14 02:40:29,806 [ INFO] [hz.dreamy_cartwright.HealthMonitor] [c.h.i.d.HealthMonitor]: [127.0.0.1]:5701 [jet] [4.6-SNAPSHOT] processors=72, physical.memory.total=377.6G, physical.memory.free=83.0G, swap.space.total=4.0G, swap.space.free=3.4G, heap.memory.used=383.6M, heap.memory.free=640.4M, heap.memory.total=1024.0M, heap.memory.max=1024.0M, heap.memory.used/total=37.45%, heap.memory.used/max=37.45%, minor.gc.count=126, minor.gc.time=15336ms, major.gc.count=0, major.gc.time=0ms, load.process=1.13%, load.system=98.49%, load.systemAverage=4570.90, thread.count=692, thread.peakCount=2362, cluster.timeDiff=0, event.q.size=0, executor.q.async.size=0, executor.q.client.size=0, executor.q.client.query.size=0, executor.q.client.blocking.size=0, executor.q.query.size=0, executor.q.scheduled.size=0, executor.q.io.size=0, executor.q.system.size=0, executor.q.operations.size=0, executor.q.priorityOperation.size=0, operations.completed.count=21838, executor.q.mapLoad.size=0, executor.q.mapLoadAllKeys.size=0, executor.q.cluster.size=0, executor.q.response.size=0, operations.running.count=0, operations.pending.invocations.percentage=0.00%, operations.pending.invocations.count=0, proxy.count=3, clientEndpoint.count=0, connection.active.count=0, client.connection.count=0, connection.count=0
2021-04-14 02:40:30,758 [ INFO] [hz.sleepy_cartwright.HealthMonitor] [c.h.i.d.HealthMonitor]: [127.0.0.1]:5702 [jet] [4.6-SNAPSHOT] processors=72, physical.memory.total=377.6G, physical.memory.free=83.0G, swap.space.total=4.0G, swap.space.free=3.4G, heap.memory.used=394.9M, heap.memory.free=629.1M, heap.memory.total=1024.0M, heap.memory.max=1024.0M, heap.memory.used/total=38.56%, heap.memory.used/max=38.56%, minor.gc.count=126, minor.gc.time=15336ms, major.gc.count=0, major.gc.time=0ms, load.process=1.23%, load.system=98.85%, load.systemAverage=4570.90, thread.count=692, thread.peakCount=2362, cluster.timeDiff=0, event.q.size=0, executor.q.async.size=0, executor.q.client.size=0, executor.q.client.query.size=0, executor.q.client.blocking.size=0, executor.q.query.size=0, executor.q.scheduled.size=0, executor.q.io.size=0, executor.q.system.size=0, executor.q.operations.size=0, executor.q.priorityOperation.size=0, operations.completed.count=24755, executor.q.mapLoad.size=0, executor.q.mapLoadAllKeys.size=0, executor.q.cluster.size=0, executor.q.response.size=0, operations.running.count=0, operations.pending.invocations.percentage=0.00%, operations.pending.invocations.count=0, proxy.count=3, clientEndpoint.count=0, connection.active.count=0, client.connection.count=0, connection.count=0

viliam-durina commented

This issue seems similar to #3021: a snapshot got stuck when some member concurrently completed the execution. Because the snapshot was stuck, we were unable to gracefully restart the job, and because of that all the remaining tests in the class were affected, since they share the cluster.

2021-04-14 02:39:15,210 [DEBUG] [hz.dreamy_cartwright.cached.thread-53] [c.h.j.i.MasterSnapshotContext]: [127.0.0.1]:5701 [jet] [4.6-SNAPSHOT] Snapshot 14 phase 1 for job '0607-cea8-5f42-0001', execution 0607-cea8-5f47-0001 completed with status SUCCESS in 100ms, 37 bytes, 1 keys in 1 chunks, stored in '__jet.snapshot.0607-cea8-5f42-0001.0', proceeding to phase 2
2021-04-14 02:39:15,220 [DEBUG] [hz.dreamy_cartwright.cached.thread-53] [c.h.j.i.e.ExecutionContext]: [127.0.0.1]:5701 [jet] [4.6-SNAPSHOT] Starting snapshot 14 phase 2 for job '0607-cea8-5f42-0001', execution 0607-cea8-5f47-0001 on member
2021-04-14 02:39:15,220 [DEBUG] [hz.sleepy_cartwright.generic-operation.thread-13] [c.h.j.i.e.ExecutionContext]: [127.0.0.1]:5702 [jet] [4.6-SNAPSHOT] Starting snapshot 14 phase 2 for job '0607-cea8-5f42-0001', execution 0607-cea8-5f47-0001 on member
2021-04-14 02:39:15,225 [DEBUG] [hz.sleepy_cartwright.jet.blocking.thread-0] [c.h.j.i.o.SnapshotPhase2Operation]: [127.0.0.1]:5702 [jet] [4.6-SNAPSHOT] Snapshot 14 phase 2 for job '0607-cea8-5f42-0001', execution 0607-cea8-5f47-0001 finished successfully on member
2021-04-14 02:39:15,243 [DEBUG] [hz.dreamy_cartwright.jet.cooperative.thread-54] [c.h.j.i.JobExecutionService]: [127.0.0.1]:5701 [jet] [4.6-SNAPSHOT] Execution of job '0607-cea8-5f42-0001', execution 0607-cea8-5f47-0001 completed

Phase 2 was initiated. It completed successfully on member 5702, but the execution completed on 5701 while still in phase 2, which should not happen. I'm going to review the code around this for a possible race.

viliam-durina self-assigned this Apr 14, 2021
viliam-durina added a commit to viliam-durina/hazelcast that referenced this issue Apr 14, 2021
The failure scenario:

- snapshot phase 1 is started in `SnapshotContext`

- Before P1 does phase 1, it completes. This is normal. It emits the
DONE_ITEM to the snapshotOutbox

- `StoreSnapshotTasklet` receives the DONE_ITEM and calls
`SnapshotContext.storeSnapshotTaskletDone()`. Because it does so before
phase 1 was done, it marks phase 1 as done for the processor.

- we respond to the master that phase 1 is done on the member, and the
master initiates phase 2

- Note that P1 hasn't yet called
`SnapshotContext.processorTaskletDone()`, so `numPTasklets` isn't
decremented

- phase 2 is initiated with the non-decremented `numPTasklets`. We expect
this number of processors to do phase 2.

- now the processor thread calls
`SnapshotContext.processorTaskletDone()` and decrements `numPTasklets`.
Since it didn't do phase 1, it completes and never does phase 2, and
the execution is stuck waiting for that.

This scenario is very unlikely because the other threads have a lot more
work to do than the processor's thread: send the DONE_ITEM to the snapshot
queue, handle it in another thread, complete a future, respond to an
operation in yet another thread, handle the response and issue the
phase 2 operation, and handle the phase 2 operation, all in the time it
takes the processor thread to proceed to the very next line. It's easily
reproducible by inserting a sleep after
`outbox.offerToEdgesAndSnapshot(DONE_ITEM)` in `ProcessorTasklet`.

The fix is to call `SnapshotContext.processorTaskletDone()` before
adding the DONE_ITEM to the snapshotQueue, which ensures it's called
before `SnapshotContext.storeSnapshotTaskletDone()`.

We also add an assert that a tasklet isn't done without doing phase 2, if
phase 2 was initiated.

Fixes hazelcast/hazelcast-jet#3027
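
To make the ordering in the fix concrete, here is a minimal, hypothetical Java sketch. The class and method names below (`MiniSnapshotContext`, `startPhase2()`, and so on) are invented for illustration and are not the real Hazelcast Jet API; the sketch only mirrors the roles described above: a counter of running processor tasklets, a phase-1-done signal derived from the DONE_ITEM, and a phase 2 that expects as many acknowledgements as the counter showed when it started.

```java
// Hypothetical illustration of the snapshot phase-2 counting race.
// None of these names are the real Hazelcast Jet API; they only mirror
// the roles described in the commit message above.
public class SnapshotRaceSketch {

    /** Simplified stand-in for the per-execution snapshot bookkeeping. */
    static class MiniSnapshotContext {
        int numPTasklets = 2;        // processor tasklets still registered
        int phase2Expected = -1;     // how many tasklets phase 2 will wait for
        int phase2Done = 0;

        void processorTaskletDone() { numPTasklets--; }

        // Called when the StoreSnapshotTasklet sees the DONE_ITEM; in the
        // scenario above this is what lets the member report "phase 1 done".
        void storeSnapshotTaskletDone() { /* phase 1 considered complete */ }

        // Master starts phase 2 expecting every still-registered tasklet to ack.
        void startPhase2() { phase2Expected = numPTasklets; }

        void phase2DoneOnTasklet() { phase2Done++; }

        boolean phase2Stuck() { return phase2Expected > phase2Done; }
    }

    public static void main(String[] args) {
        // Buggy ordering: the DONE_ITEM is handled (and phase 2 started)
        // before the completing tasklet deregisters itself.
        MiniSnapshotContext buggy = new MiniSnapshotContext();
        buggy.storeSnapshotTaskletDone();   // DONE_ITEM observed first
        buggy.startPhase2();                // phase 2 expects 2 acks
        buggy.processorTaskletDone();       // completing tasklet deregisters too late
        buggy.phase2DoneOnTasklet();        // only the surviving tasklet acks
        System.out.println("buggy ordering stuck?  " + buggy.phase2Stuck());   // true

        // Fixed ordering: deregister before the DONE_ITEM can be observed,
        // so phase 2 only expects the tasklets that will actually ack.
        MiniSnapshotContext fixed = new MiniSnapshotContext();
        fixed.processorTaskletDone();       // deregister first
        fixed.storeSnapshotTaskletDone();   // then the DONE_ITEM is handled
        fixed.startPhase2();                // phase 2 expects 1 ack
        fixed.phase2DoneOnTasklet();
        System.out.println("fixed ordering stuck?  " + fixed.phase2Stuck());   // false
    }
}
```

Run as a plain `main`, the first block prints `true` (phase 2 waits for an acknowledgement that never arrives) and the second prints `false`. The actual fix, per the commit message, achieves the second ordering by calling `SnapshotContext.processorTaskletDone()` before offering the DONE_ITEM to the snapshot queue.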
viliam-durina added a commit to hazelcast/hazelcast that referenced this issue Apr 19, 2021