Failures in com.hazelcast.jet.impl.connector.WriteJdbcPTest #3027

Closed
olukas opened this issue Apr 14, 2021 · 2 comments · Fixed by hazelcast/hazelcast#18530
olukas commented Apr 14, 2021

master (commit bed77b2)

Failed on IBM JDK 8: http://jenkins.hazelcast.com/job/jet-oss-master-ibm-jdk8/356/testReport/com.hazelcast.jet.impl.connector/WriteJdbcPTest/

All tests besides the test named `test` in com.hazelcast.jet.impl.connector.WriteJdbcPTest failed. It looks like a problem with establishing the cluster (there are stacktrace messages like Cluster has not elected a master and Master address unknown: instance is not yet initialized or is shut down). See the files with stacktraces and standard outputs for details:
com.hazelcast.jet.impl.connector.WriteJdbcPTest.txt
com.hazelcast.jet.impl.connector.WriteJdbcPTest-output.txt

olukas added this to the 4.6 milestone Apr 14, 2021

viliam-durina commented Apr 14, 2021

Looks like the system was overloaded, but not by the test process: load.process < 1%, load.system > 99%

2021-04-14 02:39:10,734 [ INFO] [hz.sleepy_cartwright.HealthMonitor] [c.h.i.d.HealthMonitor]: [127.0.0.1]:5702 [jet] [4.6-SNAPSHOT] processors=72, physical.memory.total=377.6G, physical.memory.free=77.6G, swap.space.total=4.0G, swap.space.free=3.4G, heap.memory.used=459.8M, heap.memory.free=564.2M, heap.memory.total=1024.0M, heap.memory.max=1024.0M, heap.memory.used/total=44.90%, heap.memory.used/max=44.90%, minor.gc.count=119, minor.gc.time=14614ms, major.gc.count=0, major.gc.time=0ms, load.process=0.96%, load.system=99.62%, load.systemAverage=4850.39, thread.count=678, thread.peakCount=2362, cluster.timeDiff=-4, event.q.size=0, executor.q.async.size=0, executor.q.client.size=0, executor.q.client.query.size=0, executor.q.client.blocking.size=0, executor.q.query.size=0, executor.q.scheduled.size=0, executor.q.io.size=0, executor.q.system.size=0, executor.q.operations.size=0, executor.q.priorityOperation.size=0, operations.completed.count=11, executor.q.mapLoad.size=0, executor.q.mapLoadAllKeys.size=0, executor.q.cluster.size=0, executor.q.response.size=0, operations.running.count=0, operations.pending.invocations.percentage=0.00%, operations.pending.invocations.count=0, proxy.count=6, clientEndpoint.count=0, connection.active.count=0, client.connection.count=0, connection.count=0
2021-04-14 02:39:29,762 [ INFO] [hz.dreamy_cartwright.HealthMonitor] [c.h.i.d.HealthMonitor]: [127.0.0.1]:5701 [jet] [4.6-SNAPSHOT] processors=72, physical.memory.total=377.6G, physical.memory.free=78.7G, swap.space.total=4.0G, swap.space.free=3.4G, heap.memory.used=370.6M, heap.memory.free=653.4M, heap.memory.total=1024.0M, heap.memory.max=1024.0M, heap.memory.used/total=36.19%, heap.memory.used/max=36.19%, minor.gc.count=122, minor.gc.time=15114ms, major.gc.count=0, major.gc.time=0ms, load.process=1.28%, load.system=99.14%, load.systemAverage=4816.44, thread.count=817, thread.peakCount=2362, cluster.timeDiff=0, event.q.size=0, executor.q.async.size=0, executor.q.client.size=0, executor.q.client.query.size=0, executor.q.client.blocking.size=0, executor.q.query.size=0, executor.q.scheduled.size=0, executor.q.io.size=0, executor.q.system.size=0, executor.q.operations.size=0, executor.q.priorityOperation.size=0, operations.completed.count=11826, executor.q.mapLoad.size=0, executor.q.mapLoadAllKeys.size=0, executor.q.cluster.size=0, executor.q.response.size=0, operations.running.count=0, operations.pending.invocations.percentage=0.00%, operations.pending.invocations.count=0, proxy.count=3, clientEndpoint.count=0, connection.active.count=0, client.connection.count=0, connection.count=0
2021-04-14 02:39:30,741 [ INFO] [hz.sleepy_cartwright.HealthMonitor] [c.h.i.d.HealthMonitor]: [127.0.0.1]:5702 [jet] [4.6-SNAPSHOT] processors=72, physical.memory.total=377.6G, physical.memory.free=80.8G, swap.space.total=4.0G, swap.space.free=3.4G, heap.memory.used=384.7M, heap.memory.free=639.3M, heap.memory.total=1024.0M, heap.memory.max=1024.0M, heap.memory.used/total=37.57%, heap.memory.used/max=37.57%, minor.gc.count=122, minor.gc.time=15114ms, major.gc.count=0, major.gc.time=0ms, load.process=1.34%, load.system=99.65%, load.systemAverage=4816.44, thread.count=817, thread.peakCount=2362, cluster.timeDiff=0, event.q.size=0, executor.q.async.size=0, executor.q.client.size=0, executor.q.client.query.size=0, executor.q.client.blocking.size=0, executor.q.query.size=0, executor.q.scheduled.size=0, executor.q.io.size=0, executor.q.system.size=0, executor.q.operations.size=0, executor.q.priorityOperation.size=0, operations.completed.count=13485, executor.q.mapLoad.size=0, executor.q.mapLoadAllKeys.size=0, executor.q.cluster.size=0, executor.q.response.size=0, operations.running.count=0, operations.pending.invocations.percentage=0.00%, operations.pending.invocations.count=0, proxy.count=3, clientEndpoint.count=0, connection.active.count=0, client.connection.count=0, connection.count=0
2021-04-14 02:39:49,798 [ INFO] [hz.dreamy_cartwright.HealthMonitor] [c.h.i.d.HealthMonitor]: [127.0.0.1]:5701 [jet] [4.6-SNAPSHOT] processors=72, physical.memory.total=377.6G, physical.memory.free=81.8G, swap.space.total=4.0G, swap.space.free=3.4G, heap.memory.used=463.3M, heap.memory.free=560.7M, heap.memory.total=1024.0M, heap.memory.max=1024.0M, heap.memory.used/total=45.25%, heap.memory.used/max=45.25%, minor.gc.count=123, minor.gc.time=15216ms, major.gc.count=0, major.gc.time=0ms, load.process=0.84%, load.system=100.00%, load.systemAverage=4733.54, thread.count=817, thread.peakCount=2362, cluster.timeDiff=0, event.q.size=0, executor.q.async.size=0, executor.q.client.size=0, executor.q.client.query.size=0, executor.q.client.blocking.size=0, executor.q.query.size=0, executor.q.scheduled.size=0, executor.q.io.size=0, executor.q.system.size=0, executor.q.operations.size=0, executor.q.priorityOperation.size=0, operations.completed.count=15208, executor.q.mapLoad.size=0, executor.q.mapLoadAllKeys.size=0, executor.q.cluster.size=0, executor.q.response.size=0, operations.running.count=0, operations.pending.invocations.percentage=0.00%, operations.pending.invocations.count=0, proxy.count=3, clientEndpoint.count=0, connection.active.count=0, client.connection.count=0, connection.count=0
2021-04-14 02:39:50,746 [ INFO] [hz.sleepy_cartwright.HealthMonitor] [c.h.i.d.HealthMonitor]: [127.0.0.1]:5702 [jet] [4.6-SNAPSHOT] processors=72, physical.memory.total=377.6G, physical.memory.free=81.7G, swap.space.total=4.0G, swap.space.free=3.4G, heap.memory.used=475.6M, heap.memory.free=548.4M, heap.memory.total=1024.0M, heap.memory.max=1024.0M, heap.memory.used/total=46.44%, heap.memory.used/max=46.44%, minor.gc.count=123, minor.gc.time=15216ms, major.gc.count=0, major.gc.time=0ms, load.process=0.96%, load.system=99.58%, load.systemAverage=4733.54, thread.count=817, thread.peakCount=2362, cluster.timeDiff=-1, event.q.size=0, executor.q.async.size=0, executor.q.client.size=0, executor.q.client.query.size=0, executor.q.client.blocking.size=0, executor.q.query.size=0, executor.q.scheduled.size=0, executor.q.io.size=0, executor.q.system.size=0, executor.q.operations.size=0, executor.q.priorityOperation.size=0, operations.completed.count=17151, executor.q.mapLoad.size=0, executor.q.mapLoadAllKeys.size=0, executor.q.cluster.size=0, executor.q.response.size=0, operations.running.count=0, operations.pending.invocations.percentage=0.00%, operations.pending.invocations.count=0, proxy.count=3, clientEndpoint.count=0, connection.active.count=0, client.connection.count=0, connection.count=0
2021-04-14 02:40:09,802 [ INFO] [hz.dreamy_cartwright.HealthMonitor] [c.h.i.d.HealthMonitor]: [127.0.0.1]:5701 [jet] [4.6-SNAPSHOT] processors=72, physical.memory.total=377.6G, physical.memory.free=82.4G, swap.space.total=4.0G, swap.space.free=3.4G, heap.memory.used=317.7M, heap.memory.free=706.3M, heap.memory.total=1024.0M, heap.memory.max=1024.0M, heap.memory.used/total=31.02%, heap.memory.used/max=31.02%, minor.gc.count=124, minor.gc.time=15269ms, major.gc.count=0, major.gc.time=0ms, load.process=0.93%, load.system=99.90%, load.systemAverage=4665.33, thread.count=815, thread.peakCount=2362, cluster.timeDiff=0, event.q.size=0, executor.q.async.size=0, executor.q.client.size=0, executor.q.client.query.size=0, executor.q.client.blocking.size=0, executor.q.query.size=0, executor.q.scheduled.size=0, executor.q.io.size=0, executor.q.system.size=0, executor.q.operations.size=0, executor.q.priorityOperation.size=0, operations.completed.count=18455, executor.q.mapLoad.size=0, executor.q.mapLoadAllKeys.size=0, executor.q.cluster.size=0, executor.q.response.size=0, operations.running.count=0, operations.pending.invocations.percentage=0.00%, operations.pending.invocations.count=0, proxy.count=3, clientEndpoint.count=0, connection.active.count=0, client.connection.count=0, connection.count=0
2021-04-14 02:40:10,755 [ INFO] [hz.sleepy_cartwright.HealthMonitor] [c.h.i.d.HealthMonitor]: [127.0.0.1]:5702 [jet] [4.6-SNAPSHOT] processors=72, physical.memory.total=377.6G, physical.memory.free=82.3G, swap.space.total=4.0G, swap.space.free=3.4G, heap.memory.used=335.2M, heap.memory.free=688.8M, heap.memory.total=1024.0M, heap.memory.max=1024.0M, heap.memory.used/total=32.73%, heap.memory.used/max=32.73%, minor.gc.count=125, minor.gc.time=15312ms, major.gc.count=0, major.gc.time=0ms, load.process=0.82%, load.system=99.59%, load.systemAverage=4665.33, thread.count=815, thread.peakCount=2362, cluster.timeDiff=0, event.q.size=0, executor.q.async.size=0, executor.q.client.size=0, executor.q.client.query.size=0, executor.q.client.blocking.size=0, executor.q.query.size=0, executor.q.scheduled.size=0, executor.q.io.size=0, executor.q.system.size=0, executor.q.operations.size=0, executor.q.priorityOperation.size=0, operations.completed.count=20954, executor.q.mapLoad.size=0, executor.q.mapLoadAllKeys.size=0, executor.q.cluster.size=0, executor.q.response.size=0, operations.running.count=0, operations.pending.invocations.percentage=0.00%, operations.pending.invocations.count=0, proxy.count=3, clientEndpoint.count=0, connection.active.count=0, client.connection.count=0, connection.count=0
2021-04-14 02:40:29,806 [ INFO] [hz.dreamy_cartwright.HealthMonitor] [c.h.i.d.HealthMonitor]: [127.0.0.1]:5701 [jet] [4.6-SNAPSHOT] processors=72, physical.memory.total=377.6G, physical.memory.free=83.0G, swap.space.total=4.0G, swap.space.free=3.4G, heap.memory.used=383.6M, heap.memory.free=640.4M, heap.memory.total=1024.0M, heap.memory.max=1024.0M, heap.memory.used/total=37.45%, heap.memory.used/max=37.45%, minor.gc.count=126, minor.gc.time=15336ms, major.gc.count=0, major.gc.time=0ms, load.process=1.13%, load.system=98.49%, load.systemAverage=4570.90, thread.count=692, thread.peakCount=2362, cluster.timeDiff=0, event.q.size=0, executor.q.async.size=0, executor.q.client.size=0, executor.q.client.query.size=0, executor.q.client.blocking.size=0, executor.q.query.size=0, executor.q.scheduled.size=0, executor.q.io.size=0, executor.q.system.size=0, executor.q.operations.size=0, executor.q.priorityOperation.size=0, operations.completed.count=21838, executor.q.mapLoad.size=0, executor.q.mapLoadAllKeys.size=0, executor.q.cluster.size=0, executor.q.response.size=0, operations.running.count=0, operations.pending.invocations.percentage=0.00%, operations.pending.invocations.count=0, proxy.count=3, clientEndpoint.count=0, connection.active.count=0, client.connection.count=0, connection.count=0
2021-04-14 02:40:30,758 [ INFO] [hz.sleepy_cartwright.HealthMonitor] [c.h.i.d.HealthMonitor]: [127.0.0.1]:5702 [jet] [4.6-SNAPSHOT] processors=72, physical.memory.total=377.6G, physical.memory.free=83.0G, swap.space.total=4.0G, swap.space.free=3.4G, heap.memory.used=394.9M, heap.memory.free=629.1M, heap.memory.total=1024.0M, heap.memory.max=1024.0M, heap.memory.used/total=38.56%, heap.memory.used/max=38.56%, minor.gc.count=126, minor.gc.time=15336ms, major.gc.count=0, major.gc.time=0ms, load.process=1.23%, load.system=98.85%, load.systemAverage=4570.90, thread.count=692, thread.peakCount=2362, cluster.timeDiff=0, event.q.size=0, executor.q.async.size=0, executor.q.client.size=0, executor.q.client.query.size=0, executor.q.client.blocking.size=0, executor.q.query.size=0, executor.q.scheduled.size=0, executor.q.io.size=0, executor.q.system.size=0, executor.q.operations.size=0, executor.q.priorityOperation.size=0, operations.completed.count=24755, executor.q.mapLoad.size=0, executor.q.mapLoadAllKeys.size=0, executor.q.cluster.size=0, executor.q.response.size=0, operations.running.count=0, operations.pending.invocations.percentage=0.00%, operations.pending.invocations.count=0, proxy.count=3, clientEndpoint.count=0, connection.active.count=0, client.connection.count=0, connection.count=0

viliam-durina commented

This issue seems similar to #3021: a snapshot got stuck when some member concurrently completed the execution. Because the snapshot was stuck, we were unable to gracefully restart the job, and because of that all the remaining tests in the class were affected, since they share the cluster.

2021-04-14 02:39:15,210 [DEBUG] [hz.dreamy_cartwright.cached.thread-53] [c.h.j.i.MasterSnapshotContext]: [127.0.0.1]:5701 [jet] [4.6-SNAPSHOT] Snapshot 14 phase 1 for job '0607-cea8-5f42-0001', execution 0607-cea8-5f47-0001 completed with status SUCCESS in 100ms, 37 bytes, 1 keys in 1 chunks, stored in '__jet.snapshot.0607-cea8-5f42-0001.0', proceeding to phase 2
2021-04-14 02:39:15,220 [DEBUG] [hz.dreamy_cartwright.cached.thread-53] [c.h.j.i.e.ExecutionContext]: [127.0.0.1]:5701 [jet] [4.6-SNAPSHOT] Starting snapshot 14 phase 2 for job '0607-cea8-5f42-0001', execution 0607-cea8-5f47-0001 on member
2021-04-14 02:39:15,220 [DEBUG] [hz.sleepy_cartwright.generic-operation.thread-13] [c.h.j.i.e.ExecutionContext]: [127.0.0.1]:5702 [jet] [4.6-SNAPSHOT] Starting snapshot 14 phase 2 for job '0607-cea8-5f42-0001', execution 0607-cea8-5f47-0001 on member
2021-04-14 02:39:15,225 [DEBUG] [hz.sleepy_cartwright.jet.blocking.thread-0] [c.h.j.i.o.SnapshotPhase2Operation]: [127.0.0.1]:5702 [jet] [4.6-SNAPSHOT] Snapshot 14 phase 2 for job '0607-cea8-5f42-0001', execution 0607-cea8-5f47-0001 finished successfully on member
2021-04-14 02:39:15,243 [DEBUG] [hz.dreamy_cartwright.jet.cooperative.thread-54] [c.h.j.i.JobExecutionService]: [127.0.0.1]:5701 [jet] [4.6-SNAPSHOT] Execution of job '0607-cea8-5f42-0001', execution 0607-cea8-5f47-0001 completed

Phase 2 was initiated. It completed successfully on member 5702, but the execution completed on 5701 while still in phase 2, which should not happen. I'm going to review the code around this for a possible race.

viliam-durina self-assigned this Apr 14, 2021
viliam-durina added a commit to viliam-durina/hazelcast that referenced this issue Apr 14, 2021
The failure scenario:

- snapshot phase 1 is started in `SnapshotContext`

- Before P1 does phase 1, it completes. This is normal. It emits the
DONE_ITEM to the snapshotOutbox

- `StoreSnapshotTasklet` receives the DONE_ITEM and calls
`SnapshotContext.storeSnapshotTaskletDone()`. Because it does so before
phase 1 was done, it marks phase 1 as done for the processor.

- we respond to the master that phase 1 is done on the member, and the
master initiates phase 2

- Note that P1 hasn't yet called
`SnapshotContext.processorTaskletDone()`, so `numPTasklets` isn't
decremented

- phase 2 is initiated with the non-decremented `numPTasklets`. We expect
this number of processors to do phase 2.

- now the processor thread calls
`SnapshotContext.processorTaskletDone()` and decrements `numPTasklets`.
Since it didn't do phase 1, it completes and never does phase 2, and
the execution is stuck waiting for that.

This scenario is very unlikely because the other threads have a lot more
work to do than the processor's thread: send the DONE_ITEM to the snapshot
queue, handle it in another thread, complete a future, respond to an
operation in yet another thread, handle the response and issue the
phase 2 operation, and handle the phase 2 operation, all in the time it
takes the processor thread to proceed to the very next line. It's easily
reproducible by inserting a sleep after
`outbox.offerToEdgesAndSnapshot(DONE_ITEM)` in `ProcessorTasklet`.

The fix is to call `SnapshotContext.processorTaskletDone()` before
adding the DONE_ITEM to the snapshotQueue, which ensures it's called
before `SnapshotContext.storeSnapshotTaskletDone()`.

We also add an assert that a tasklet isn't done without doing phase 2, if
phase 2 was initiated.

Fixes hazelcast/hazelcast-jet#3027
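
To make the ordering in the fix concrete, here is a minimal, hypothetical Java sketch. The class and method names below (`MiniSnapshotContext`, `startPhase2()`, and so on) are invented for illustration and are not the real Hazelcast Jet API; the sketch only mirrors the roles described above: a counter of running processor tasklets, a phase-1-done signal derived from the DONE_ITEM, and a phase 2 that expects as many acknowledgements as the counter showed when it started.

```java
// Hypothetical illustration of the snapshot phase-2 counting race.
// None of these names are the real Hazelcast Jet API; they only mirror
// the roles described in the commit message above.
public class SnapshotRaceSketch {

    /** Simplified stand-in for the per-execution snapshot bookkeeping. */
    static class MiniSnapshotContext {
        int numPTasklets = 2;        // processor tasklets still registered
        int phase2Expected = -1;     // how many tasklets phase 2 will wait for
        int phase2Done = 0;

        void processorTaskletDone() { numPTasklets--; }

        // Called when the StoreSnapshotTasklet sees the DONE_ITEM; in the
        // scenario above this is what lets the member report "phase 1 done".
        void storeSnapshotTaskletDone() { /* phase 1 considered complete */ }

        // Master starts phase 2 expecting every still-registered tasklet to ack.
        void startPhase2() { phase2Expected = numPTasklets; }

        void phase2DoneOnTasklet() { phase2Done++; }

        boolean phase2Stuck() { return phase2Expected > phase2Done; }
    }

    public static void main(String[] args) {
        // Buggy ordering: the DONE_ITEM is handled (and phase 2 started)
        // before the completing tasklet deregisters itself.
        MiniSnapshotContext buggy = new MiniSnapshotContext();
        buggy.storeSnapshotTaskletDone();   // DONE_ITEM observed first
        buggy.startPhase2();                // phase 2 expects 2 acks
        buggy.processorTaskletDone();       // completing tasklet deregisters too late
        buggy.phase2DoneOnTasklet();        // only the surviving tasklet acks
        System.out.println("buggy ordering stuck?  " + buggy.phase2Stuck());   // true

        // Fixed ordering: deregister before the DONE_ITEM can be observed,
        // so phase 2 only expects the tasklets that will actually ack.
        MiniSnapshotContext fixed = new MiniSnapshotContext();
        fixed.processorTaskletDone();       // deregister first
        fixed.storeSnapshotTaskletDone();   // then the DONE_ITEM is handled
        fixed.startPhase2();                // phase 2 expects 1 ack
        fixed.phase2DoneOnTasklet();
        System.out.println("fixed ordering stuck?  " + fixed.phase2Stuck());   // false
    }
}
```

Run as a plain `main`, the first block prints `true` (phase 2 waits for an acknowledgement that never arrives) and the second prints `false`. The actual fix, per the commit message, achieves the second ordering by calling `SnapshotContext.processorTaskletDone()` before offering the DONE_ITEM to the snapshot queue.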
viliam-durina added a commit to hazelcast/hazelcast that referenced this issue Apr 19, 2021