[Bug] Task hangs in submitted status #17732

@johnny2002

Search before asking

  • I have searched the issues and found no similar issues.

What happened

Tasks hang in the submitted status, with no further information.
Here is the master log:

[WI-0][TI-0] - 2025-11-26 02:26:44.426 WARN  [Curator-TreeCache-0] o.a.d.s.m.c.AbstractClusterSubscribeListener:[45] - Server MasterServerMetadata(super=BaseServerMetadata(processId=1629687, serverStartupTime=1764123193415, address=10.16.10.119:5678, cpuUsage=0.0012515644555694619, memoryUsage=0.09452163781361474, serverStatus=NORMAL)) removed
[WI-0][TI-0] - 2025-11-26 02:26:44.426 WARN  [Curator-TreeCache-0] o.a.d.s.m.c.MasterSlotManager:[75] - Do rebalance failed, cannot found the current master: 10.16.10.119:5678 in the normal master clusters: []. Please check the current master server status
[WI-0][TI-0] - 2025-11-26 02:26:44.426 INFO  [Curator-TreeCache-0] o.a.d.s.m.e.s.SystemEventBus:[40] - Published SystemEvent: MasterFailoverEvent{masterServerMetadata='MasterServerMetadata(super=BaseServerMetadata(processId=1629687, serverStartupTime=1764123193415, address=10.16.10.119:5678, cpuUsage=0.0012515644555694619, memoryUsage=0.09452163781361474, serverStatus=NORMAL))', eventTime=Wed Nov 26 02:26:44 UTC 2025, delayTime=30000}
[WI-0][TI-0] - 2025-11-26 02:26:44.427 WARN  [Curator-TreeCache-0] o.a.d.s.m.c.AbstractClusterSubscribeListener:[45] - Server WorkerServerMetadata(workerGroup=default, workerWeight=100.0, taskThreadPoolUsage=0.0) removed
[WI-0][TI-0] - 2025-11-26 02:26:44.427 INFO  [Curator-TreeCache-0] o.a.d.s.m.e.s.SystemEventBus:[40] - Published SystemEvent: WorkerFailoverEvent{workerServerMetadata='WorkerServerMetadata(workerGroup=default, workerWeight=100.0, taskThreadPoolUsage=0.0)', eventTime=Wed Nov 26 02:26:44 UTC 2025, delayTime=30000}
[WI-0][TI-0] - 2025-11-26 02:26:44.434 INFO  [Curator-TreeCache-0] o.a.d.r.a.h.DefaultServerStatusChangeListener:[32] - The status is standby now.
[WI-0][TI-0] - 2025-11-26 02:26:44.434 INFO  [Curator-TreeCache-0] o.a.d.s.m.e.TaskGroupCoordinator:[463] - TaskGroupCoordinator closed
[WI-0][TI-0] - 2025-11-26 02:26:44.435 ERROR [Thread-20] o.a.d.c.t.ThreadUtils:[80] - Current thread sleep error
java.lang.InterruptedException: sleep interrupted
        at java.lang.Thread.sleep(Native Method)
        at org.apache.dolphinscheduler.common.thread.ThreadUtils.sleep(ThreadUtils.java:77)
        at org.apache.dolphinscheduler.server.master.engine.TaskGroupCoordinator.doStart(TaskGroupCoordinator.java:121)
        at java.lang.Thread.run(Thread.java:750)
[WI-0][TI-0] - 2025-11-26 02:26:44.879 INFO  [Curator-TreeCache-0] o.a.d.s.m.c.AbstractClusterSubscribeListener:[41] - Server WorkerServerMetadata(workerGroup=default, workerWeight=100.0, taskThreadPoolUsage=0.0) added
[WI-0][TI-0] - 2025-11-26 02:26:45.260 WARN  [MasterCommandLoopThread] o.a.d.s.m.e.c.IdSlotBasedCommandFetcher:[60] - MasterSlotManager check slot (-1 -> 1)is invalidated.
[WI-0][TI-0] - 2025-11-26 02:26:45.917 INFO  [Curator-TreeCache-0] o.a.d.s.m.c.AbstractClusterSubscribeListener:[41] - Server MasterServerMetadata(super=BaseServerMetadata(processId=1629687, serverStartupTime=1764123193415, address=10.16.10.119:5678, cpuUsage=0.003740648379052369, memoryUsage=0.09459177738163037, serverStatus=NORMAL)) added
[WI-0][TI-0] - 2025-11-26 02:26:45.917 INFO  [Curator-TreeCache-0] o.a.d.s.m.c.MasterSlotManager:[89] - Do rebalance success, current master slot: 0, total master slots: 1
[WI-0][TI-0] - 2025-11-26 02:27:14.427 INFO  [SystemEventBusFireWorker] o.a.d.s.m.f.FailoverCoordinator:[105] - Master[MasterServerMetadata(super=BaseServerMetadata(processId=1629687, serverStartupTime=1764123193415, address=10.16.10.119:5678, cpuUsage=0.0012515644555694619, memoryUsage=0.09452163781361474, serverStatus=NORMAL))] failover starting
[WI-0][TI-0] - 2025-11-26 02:27:14.427 INFO  [SystemEventBusFireWorker] o.a.d.s.m.f.FailoverCoordinator:[113] - The master[MasterServerMetadata(super=BaseServerMetadata(processId=1629687, serverStartupTime=1764123193415, address=10.16.10.119:5678, cpuUsage=0.0012515644555694619, memoryUsage=0.09452163781361474, serverStatus=NORMAL))] is alive, maybe it reconnect to registry skip failover
[WI-0][TI-0] - 2025-11-26 02:27:14.427 INFO  [SystemEventBusFireWorker] o.a.d.s.m.e.s.SystemEventBusFireWorker:[103] - Fire SystemEvent: MasterFailoverEvent{masterServerMetadata='MasterServerMetadata(super=BaseServerMetadata(processId=1629687, serverStartupTime=1764123193415, address=10.16.10.119:5678, cpuUsage=0.0012515644555694619, memoryUsage=0.09452163781361474, serverStatus=NORMAL))', eventTime=Wed Nov 26 02:26:44 UTC 2025, delayTime=30000} cost: 0 ms
[WI-0][TI-0] - 2025-11-26 02:27:14.427 INFO  [SystemEventBusFireWorker] o.a.d.s.m.f.FailoverCoordinator:[191] - Worker[WorkerServerMetadata(workerGroup=default, workerWeight=100.0, taskThreadPoolUsage=0.0)] failover starting
[WI-0][TI-0] - 2025-11-26 02:27:14.427 INFO  [SystemEventBusFireWorker] o.a.d.s.m.f.FailoverCoordinator:[198] - The worker[WorkerServerMetadata(workerGroup=default, workerWeight=100.0, taskThreadPoolUsage=0.0)] is alive, maybe it reconnect to registry skip failover
[WI-0][TI-0] - 2025-11-26 02:27:14.427 INFO  [SystemEventBusFireWorker] o.a.d.s.m.e.s.SystemEventBusFireWorker:[103] - Fire SystemEvent: WorkerFailoverEvent{workerServerMetadata='WorkerServerMetadata(workerGroup=default, workerWeight=100.0, taskThreadPoolUsage=0.0)', eventTime=Wed Nov 26 02:26:44 UTC 2025, delayTime=30000} cost: 0 ms
[WI-0][TI-0] - 2025-11-26 02:28:06.455 INFO  [MasterCommandHandleThreadPool] o.a.d.s.m.e.WorkflowEventBus:[41] - Publish event: WorkflowStartLifecycleEvent{workflow=03.CUST_加载(dim)_fr_project_code_dealer-v2-20251126022806126}
[WI-0][TI-0] - 2025-11-26 02:28:06.456 INFO  [MasterCommandHandleThreadPool] o.a.d.s.m.e.c.CommandEngine:[174] - Success bootstrap command {
  "id" : 8928,
  "commandType" : "START_PROCESS",
  "workflowDefinitionCode" : 18819871298950,
  "workflowDefinitionVersion" : 19,
  "workflowInstanceId" : 14481,
  "commandParam" : "{\"commandType\":\"START_PROCESS\",\"subWorkflowInstance\":false,\"startNodes\":[],\"commandParams\":[{\"prop\":\"bizDate\",\"direct\":\"IN\",\"type\":\"VARCHAR\",\"value\":\"$[yyyy-MM-dd-1]\"},{\"prop\":\"tableName\",\"direct\":\"IN\",\"type\":\"VARCHAR\",\"value\":\"fr_project_code_dealer\"},{\"prop\":\"srcSystem\",\"direct\":\"IN\",\"type\":\"VARCHAR\",\"value\":\"yecai\"},{\"prop\":\"DMP_DB\",\"direct\":\"IN\",\"type\":\"VARCHAR\",\"value\":\"cust\"},{\"prop\":\"SRC_DB\",\"direct\":\"IN\",\"type\":\"VARCHAR\",\"value\":\"loan\"},{\"prop\":\"slctColums\",\"direct\":\"IN\",\"type\":\"VARCHAR\",\"value\":\"t.project_code,t.dealer_name,t.create_time,t.yewuyuan\"},{\"prop\":\"dof\",\"direct\":\"IN\",\"type\":\"VARCHAR\",\"value\":\"dim\"},{\"prop\":\"tableIdCol\",\"direct\":\"IN\",\"type\":\"VARCHAR\",\"value\":\"project_code\"}],\"timeZone\":\"UTC\"}",
  "workflowInstancePriority" : "MEDIUM",
  "executorId" : 0,
  "taskDependType" : "TASK_POST",
  "failureStrategy" : "CONTINUE",
  "warningType" : "NONE",
  "warningGroupId" : null,
  "scheduleTime" : null,
  "startTime" : null,
  "updateTime" : "2025-11-26 02:28:06",
  "workerGroup" : null,
  "tenantCode" : "default",
  "environmentCode" : -1,
  "dryRun" : 0
}
[WI-14481][TI-0] - 2025-11-26 02:28:06.471 INFO  [ds-workflow-eventbus-worker-3] o.a.d.s.m.e.w.l.h.AbstractWorkflowLifecycleEventHandler:[47] - Begin fire workflow 03.CUST_加载(dim)_fr_project_code_dealer-v2-20251126022806126 LifecycleEvent[WorkflowStartLifecycleEvent{workflow=03.CUST_加载(dim)_fr_project_code_dealer-v2-20251126022806126}] with state: RUNNING_EXECUTION
[WI-14481][TI-0] - 2025-11-26 02:28:06.471 INFO  [ds-workflow-eventbus-worker-3] o.a.d.s.m.e.WorkflowEventBus:[41] - Publish event: TaskStartLifecycleEvent{task=mysql->hdfs}
[WI-14481][TI-0] - 2025-11-26 02:28:06.471 INFO  [ds-workflow-eventbus-worker-3] o.a.d.s.m.e.w.l.h.AbstractWorkflowLifecycleEventHandler:[52] - Fired workflow 03.CUST_加载(dim)_fr_project_code_dealer-v2-20251126022806126 LifecycleEvent[WorkflowStartLifecycleEvent{workflow=03.CUST_加载(dim)_fr_project_code_dealer-v2-20251126022806126}] with state: RUNNING_EXECUTION
[WI-14481][TI-0] - 2025-11-26 02:28:06.482 INFO  [ds-workflow-eventbus-worker-3] o.a.d.s.m.e.WorkflowEventBus:[41] - Publish event: TaskDispatchLifecycleEvent{task=mysql->hdfs}
[WI-14481][TI-0] - 2025-11-26 02:28:06.482 INFO  [ds-workflow-eventbus-worker-3] o.a.d.s.m.e.t.l.h.AbstractTaskLifecycleEventHandler:[47] - Fired task mysql->hdfs TaskStartLifecycleEvent{task=mysql->hdfs} with state SUBMITTED_SUCCESS
[WI-14481][TI-0] - 2025-11-26 02:28:06.486 INFO  [ds-workflow-eventbus-worker-3] o.a.d.s.m.e.t.d.WorkerGroupDispatcher:[56] - Initialize WorkerGroupDispatcher: WorkerGroupTaskDispatcher-default
[WI-14481][TI-0] - 2025-11-26 02:28:06.486 INFO  [ds-workflow-eventbus-worker-3] o.a.d.s.m.e.t.d.WorkerGroupDispatcher:[62] - The WorkerGroupTaskDispatcher-default starting...
[WI-14481][TI-0] - 2025-11-26 02:28:06.486 INFO  [ds-workflow-eventbus-worker-3] o.a.d.s.m.e.t.d.WorkerGroupDispatcher:[64] - The WorkerGroupTaskDispatcher-default  started
[WI-14481][TI-0] - 2025-11-26 02:28:06.486 INFO  [ds-workflow-eventbus-worker-3] o.a.d.s.m.e.t.d.WorkerGroupDispatcherCoordinator:[59] - Success add Task[id=55958] to WorkerGroupDispatcher[name=default]
[WI-14481][TI-0] - 2025-11-26 02:28:06.486 INFO  [ds-workflow-eventbus-worker-3] o.a.d.s.m.e.t.l.h.AbstractTaskLifecycleEventHandler:[47] - Fired task mysql->hdfs TaskDispatchLifecycleEvent{task=mysql->hdfs} with state SUBMITTED_SUCCESS
[WI-14481][TI-55958] - 2025-11-26 02:28:06.522 INFO  [WorkerGroupTaskDispatcher-default] o.a.d.e.b.c.JdkDynamicRpcClientProxyFactory:[56] - Create DynamicRpcClientProxy cache for host: 10.16.10.117:1234
[WI-0][TI-0] - 2025-11-26 02:28:06.576 INFO  [MasterRpcServer-methodInvoker-1] o.a.d.s.m.e.WorkflowEventBus:[41] - Publish event: TaskDispatchedLifecycleEvent{task=mysql->hdfs, executorHost='10.16.10.117:1234'}
[WI-14481][TI-0] - 2025-11-26 02:28:06.599 INFO  [ds-workflow-eventbus-worker-2] o.a.d.s.m.e.t.l.h.AbstractTaskLifecycleEventHandler:[47] - Fired task mysql->hdfs TaskDispatchedLifecycleEvent{task=mysql->hdfs, executorHost='10.16.10.117:1234'} with state SUBMITTED_SUCCESS
[WI-0][TI-0] - 2025-11-26 02:28:06.622 INFO  [MasterRpcServer-methodInvoker-2] o.a.d.s.m.e.WorkflowEventBus:[41] - Publish event: TaskRunningLifecycleEvent{task=mysql->hdfs, logPath='/opt/datasophon/dolphinscheduler-3.3.2/worker-server/logs/20251126/18819871298950/19/14481/55958.log', startTime=Wed Nov 26 02:28:06 UTC 2025}
[WI-14481][TI-0] - 2025-11-26 02:28:06.724 INFO  [ds-workflow-eventbus-worker-16] o.a.d.s.m.e.t.l.h.AbstractTaskLifecycleEventHandler:[47] - Fired task mysql->hdfs TaskRunningLifecycleEvent{task=mysql->hdfs, logPath='/opt/datasophon/dolphinscheduler-3.3.2/worker-server/logs/20251126/18819871298950/19/14481/55958.log', startTime=Wed Nov 26 02:28:06 UTC 2025} with state DISPATCH
[WI-0][TI-0] - 2025-11-26 02:28:06.727 INFO  [MasterRpcServer-methodInvoker-3] o.a.d.s.m.e.WorkflowEventBus:[41] - Publish event: TaskRunningLifecycleEvent{task=mysql->hdfs, runtimeContext=null}
[WI-14481][TI-0] - 2025-11-26 02:28:06.830 INFO  [ds-workflow-eventbus-worker-14] o.a.d.s.m.e.t.l.h.AbstractTaskLifecycleEventHandler:[47] - Fired task mysql->hdfs TaskRunningLifecycleEvent{task=mysql->hdfs, runtimeContext=null} with state RUNNING_EXECUTION
[WI-0][TI-0] - 2025-11-26 02:28:07.574 INFO  [MasterRpcServer-methodInvoker-4] o.a.d.s.m.e.WorkflowEventBus:[41] - Publish event: TaskSuccessLifecycleEvent{task=mysql->hdfs, endTime=Wed Nov 26 02:28:07 UTC 2025, varPool='[Property(prop=fLines, direct=OUT, type=VARCHAR, value=${sCount}), Property(prop=newLineColNums, direct=OUT, type=VARCHAR, value=)]'}
[WI-14481][TI-0] - 2025-11-26 02:28:07.641 INFO  [ds-workflow-eventbus-worker-15] o.a.d.s.m.e.WorkflowEventBus:[41] - Publish event: WorkflowTopologyLogicalTransitionWithTaskFinishLifecycleEvent{task=mysql->hdfstaskState=SUCCESS}
[WI-14481][TI-0] - 2025-11-26 02:28:07.644 INFO  [ds-workflow-eventbus-worker-15] o.a.d.s.m.e.t.l.h.AbstractTaskLifecycleEventHandler:[47] - Fired task mysql->hdfs TaskSuccessLifecycleEvent{task=mysql->hdfs, endTime=Wed Nov 26 02:28:07 UTC 2025, varPool='[Property(prop=fLines, direct=OUT, type=VARCHAR, value=${sCount}), Property(prop=newLineColNums, direct=OUT, type=VARCHAR, value=)]'} with state RUNNING_EXECUTION
[WI-14481][TI-0] - 2025-11-26 02:28:07.644 INFO  [ds-workflow-eventbus-worker-15] o.a.d.s.m.e.w.l.h.AbstractWorkflowLifecycleEventHandler:[47] - Begin fire workflow 03.CUST_加载(dim)_fr_project_code_dealer-v2-20251126022806126 LifecycleEvent[WorkflowTopologyLogicalTransitionWithTaskFinishLifecycleEvent{task=mysql->hdfstaskState=SUCCESS}] with state: RUNNING_EXECUTION
[WI-14481][TI-0] - 2025-11-26 02:28:07.644 INFO  [ds-workflow-eventbus-worker-15] o.a.d.s.m.e.WorkflowEventBus:[41] - Publish event: TaskStartLifecycleEvent{task=加载到临时表hive}
[WI-14481][TI-0] - 2025-11-26 02:28:07.644 INFO  [ds-workflow-eventbus-worker-15] o.a.d.s.m.e.w.l.h.AbstractWorkflowLifecycleEventHandler:[52] - Fired workflow 03.CUST_加载(dim)_fr_project_code_dealer-v2-20251126022806126 LifecycleEvent[WorkflowTopologyLogicalTransitionWithTaskFinishLifecycleEvent{task=mysql->hdfstaskState=SUCCESS}] with state: RUNNING_EXECUTION
[WI-14481][TI-0] - 2025-11-26 02:28:07.653 INFO  [ds-workflow-eventbus-worker-15] o.a.d.s.m.e.TaskGroupCoordinator:[363] - Success insert TaskGroupQueue: TaskGroupQueue(id=null, taskId=55959, taskName=加载到临时表hive, projectName=null, projectCode=null, workflowInstanceName=null, groupId=1, workflowInstanceId=14481, priority=0, forceStart=0, inQueue=1, status=WAIT_QUEUE, createTime=Wed Nov 26 02:28:07 UTC 2025, updateTime=Wed Nov 26 02:28:07 UTC 2025) for TaskInstance: 加载到临时表hive
[WI-14481][TI-0] - 2025-11-26 02:28:07.662 INFO  [ds-workflow-eventbus-worker-15] o.a.d.s.m.e.t.s.AbstractTaskStateAction:[238] - Task[name=加载到临时表hive] using taskGroup, success acquire taskGroup slot
[WI-14481][TI-0] - 2025-11-26 02:28:07.662 INFO  [ds-workflow-eventbus-worker-15] o.a.d.s.m.e.t.l.h.AbstractTaskLifecycleEventHandler:[47] - Fired task 加载到临时表hive TaskStartLifecycleEvent{task=加载到临时表hive} with state SUBMITTED_SUCCESS
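
To narrow down which task is stuck, one rough approach is to scan the master log for the last `Publish event: Task...LifecycleEvent` line per task name. In the excerpt above, `mysql->hdfs` goes on to a `TaskDispatchLifecycleEvent`, while `加载到临时表hive` has only a `TaskStartLifecycleEvent`. The Python sketch below flags such tasks. The regular expression is inferred only from the log lines shown here, and the function name `find_stalled_tasks` is hypothetical, not part of DolphinScheduler.

```python
import re

# Assumption: the pattern is derived from the master log lines in this issue,
# e.g. "Publish event: TaskStartLifecycleEvent{task=mysql->hdfs}".
# The task name is read up to the first ',' or '}'.
PUBLISH_RE = re.compile(r"Publish event: (Task\w+LifecycleEvent)\{task=([^,}]+)")


def find_stalled_tasks(lines):
    """Return task names whose last published task event is TaskStartLifecycleEvent.

    This is a heuristic: a task that has only just been started is flagged too.
    """
    last_event = {}  # task name -> most recent published task event type
    for line in lines:
        match = PUBLISH_RE.search(line)
        if match:
            event, task = match.groups()
            last_event[task] = event
    return [task for task, event in last_event.items()
            if event == "TaskStartLifecycleEvent"]
```

For example, `find_stalled_tasks(open(path, encoding="utf-8"))`, with `path` pointing at the master log file, would list candidate tasks. The result is a starting point for investigation rather than a diagnosis.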

What you expected to happen

The workflow should proceed to the next task.

How to reproduce

Start a workflow.

Anything else

Workflows frequently hang at different nodes, and they cannot be stopped either.

Version

3.3.2

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
