Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: Mismatched status recorded in statement_info for interrupted SQL. #21333

Open
1 task done
aptend opened this issue Jan 23, 2025 · 5 comments
Open
1 task done
Assignees
Labels
kind/bug Something isn't working severity/s0 Extreme impact: Cause the application to break down and seriously affect the use
Milestone

Comments

@aptend
Copy link
Contributor

aptend commented Jan 23, 2025

Is there an existing issue for the same bug?

  • I have checked the existing issues.

Branch Name

2.0-dev & prod

Commit ID

1d2305a

Other Environment Information

- Hardware parameters:
- OS type:
- Others:

Actual Behavior

mo 执行一个query,客户端超时了(60s),断开的连接。
但是 statemnt_info拿到的是query 是success的,也没有 query trace status 的日志。
看起来有 2异常:

  1. query 立即就结束了
  2. query 的执行结果是 success,但实际上是 failed的。
Image

link

These time-out queries are supposed to be marked as failed.

Expected Behavior

No response

Steps to Reproduce

background: https://github.com/matrixorigin/MO-Cloud/issues/4793

Additional information

No response

@aptend aptend added kind/bug Something isn't working needs-triage labels Jan 23, 2025
@aptend aptend assigned matrix-meow and xzxiong and unassigned matrix-meow Jan 23, 2025
@aptend aptend added needs-triage severity/s1 High impact: Logical errors or data errors that must occur labels Jan 23, 2025
@xzxiong
Copy link
Contributor

xzxiong commented Jan 23, 2025

在 dev 环境下

抓取 tcpdump 确认了给 连接发送了 FIN 包 , 对应的请求按道理 应该是 失败的。

连接信息, 来自 proxy: 172.20.191.134:39898

mysql> select node_id, session_id, client_host, proxy_host, host, query_start, info from processlist() p ;
+--------------------------------------+--------------------------------------+----------------------+----------------------+---------------------+----------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+
| node_id                              | session_id                           | client_host          | proxy_host           | host                | query_start                | info                                                                                                                                                    |
+--------------------------------------+--------------------------------------+----------------------+----------------------+---------------------+----------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+
...
| 30393532-3464-3034-3666-636263376138 | 01949324-094e-7610-9b8b-8a72a4e25c32 | 172.20.185.110:44592 | 172.20.191.134:39898 | 172.20.154.140:6001 | 2025-01-23 12:30:00.025337 | execute __mo_stmt_id_1 // UPDATE `task` SET `db_status`=?,`end_time`=?,`update_time`=? WHERE id = ? ; 6 ; 1737635400 ; 1737635400 ; 1882320705065291776 |
+--------------------------------------+--------------------------------------+----------------------+----------------------+---------------------+----------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+
9 rows in set (0.06 sec)

Image

tcpdump tcp.srcport == 39898 or tcp.dstport == 39898target.cap.0123.issue21333.tgz

Image

异常: 对应的请求记录结果为成功

mysql> select * from system.statement_info where session_id = '01949324-094e-7610-9b8b-8a72a4e25c32' and account = '019491d4-e376-7d5c-baa1-7d80cd4ae745' and request_at between '2025-01-23 12:30:00' and '2025-01-23 12:30:01' \G
...
*************************** 4. row ***************************
         statement_id: 01949324-0959-74db-9e34-e8e0069bcec3
       transaction_id: d85eb473e7401d3b181d209bc674a826
           session_id: 01949324-094e-7610-9b8b-8a72a4e25c32
              account: 019491d4-e376-7d5c-baa1-7d80cd4ae745
                 user: admin
                 host: 172.20.191.134:39898
             database: luke
            statement: execute __mo_stmt_id_1 // UPDATE `task` SET `db_status`=?,`end_time`=?,`update_time`=? WHERE id = ? ; 6 ; 1737635400 ; 1737635400 ; 1882320705065291776
        statement_tag:
statement_fingerprint:
            node_uuid: 30393532-3464-3034-3666-636263376138
            node_type: CN
           request_at: 2025-01-23 12:30:00.025309
          response_at: 2025-01-23 12:31:00.002420
             duration: 59977110901
               status: Success
             err_code: 0
                error:
            exec_plan: {"steps":[{"graphData":{"nodes":[{"id":"3","name":"Multi Update","title":"multi_update","labels":[{"name":"Apply","value":[]}],"statistics":{"Time":[{"name":"Time Consumed","value":276132,"unit":"ns"},{"name":"Wait Time","value":59976201928,"unit":"ns"},{"name":"Scan Time","value":0,"unit":"ns"},{"name":"Insert Time","value":59976330808,"unit":"ns"},{"name":"Wait Lock Time","value":0,"unit":"ns"}],"Memory":[{"name":"Memory Size","value":0,"unit":"byte"}],"Throughput":[{"name":"Input Rows","value":1,"unit":"count"},{"name":"Output Rows","value":1,"unit":"count"},{"name":"Input Size","value":296,"unit":"byte"},{"name":"Output Size","value":296,"unit":"byte"}],"IO":[{"name":"Disk IO","value":0,"unit":"byte"},{"name":"Scan Bytes","value":0,"unit":"byte"},{"name":"S3 List Count","value":0,"unit":"count"},{"name":"S3 Head Count","value":0,"unit":"count"},{"name":"S3 Put Count","value":0,"unit":"count"},{"name":"S3 Get Count","value":0,"unit":"count"},{"name":"S3 Delete Count","value":0,"unit":"count"},{"name":"S3 DeleteMul Count","value":0,"unit":"count"},{"name":"FileService Cache Read","value":0,"unit":"count"},{"name":"FileService Cache Hit","value":0,"unit":"count"},{"name":"FileService Cache Memory Read","value":0,"unit":"count"},{"name":"FileService Cache Memory Hit","value":0,"unit":"count"},{"name":"FileService Cache Disk Read","value":0,"unit":"count"},{"name":"FileService Cache Disk Hit","value":0,"unit":"count"},{"name":"FileService Cache Remote Read","value":0,"unit":"count"},{"name":"FileService Cache Remote Hit","value":0,"unit":"count"}],"Network":[{"name":"Network","value":0,"unit":"byte"}]},"stats":{"blocknum":1,"outcnt":2.5,"cost":2.5,"hashmapsize":0,"rowsize":100},"totalStats":{"name":"Time spent","value":276132,"unit":"ns"}},{"id":"2","name":"Lock Operator","title":"lock_op","labels":[{"name":"Lock op","value":[]}],"statistics":{"Time":[{"name":"Time Consumed","value":177530,"unit":"ns"},{"name":"Wait Time","value":59975942515,"unit":"ns"},{"name":"Scan Time","value":0,"unit":"ns"},{"name":"Insert Time","value":0,"unit":"ns"},{"name":"Wait Lock Time","value":59975704775,"unit":"ns"}],"Memory":[{"name":"Memory Size","value":296,"unit":"byte"}],"Throughput":[{"name":"Input Rows","value":1,"unit":"count"},{"name":"Output Rows","value":1,"unit":"count"},{"name":"Input Size","value":296,"unit":"byte"},{"name":"Output Size","value":296,"unit":"byte"}],"IO":[{"name":"Disk IO","value":0,"unit":"byte"},{"name":"Scan Bytes","value":0,"unit":"byte"},{"name":"S3 List Count","value":0,"unit":"count"},{"name":"S3 Head Count","value":0,"unit":"count"},{"name":"S3 Put Count","value":0,"unit":"count"},{"name":"S3 Get Count","value":0,"unit":"count"},{"name":"S3 Delete Count","value":0,"unit":"count"},{"name":"S3 DeleteMul Count","value":0,"unit":"count"},{"name":"FileService Cache Read","value":0,"unit":"count"},{"name":"FileService Cache Hit","value":0,"unit":"count"},{"name":"FileService Cache Memory Read","value":0,"unit":"count"},{"name":"FileService Cache Memory Hit","value":0,"unit":"count"},{"name":"FileService Cache Disk Read","value":0,"unit":"count"},{"name":"FileService Cache Disk Hit","value":0,"unit":"count"},{"name":"FileService Cache Remote Read","value":0,"unit":"count"},{"name":"FileService Cache Remote Hit","value":0,"unit":"count"}],"Network":[{"name":"Network","value":0,"unit":"byte"}]},"stats":{"blocknum":1,"outcnt":2.5,"cost":2.5,"hashmapsize":0,"rowsize":100},"totalStats":{"name":"Time spent","value":177530,"unit":"ns"}},{"id":"1","name":"Project","title":"task.id, task.source_connector_id, task.task_log_id, task.name, task.creator, task.uid, cast(? AS TINYINT), task.interval, task.load_interval_type, task.task_filter_mode, task.task_meta, task.folder_id, task.folder_name, task.start_time, cast(? AS BIGINT), task.create_time, cast(? AS BIGINT), task.__mo_rowid","labels":[{"name":"List of expressions","value":["task.id","task.source_connector_id","task.task_log_id","task.name","task.creator","task.uid","cast(? AS TINYINT)","task.interval","task.load_interval_type","task.task_filter_mode","task.task_meta","task.folder_id","task.folder_name","task.start_time","cast(? AS BIGINT)","task.create_time","cast(? AS BIGINT)","task.__mo_rowid"]}],"statistics":{"Time":[{"name":"Time Consumed","value":14234,"unit":"ns"},{"name":"Wait Time","value":0,"unit":"ns"},{"name":"Scan Time","value":0,"unit":"ns"},{"name":"Insert Time","value":0,"unit":"ns"},{"name":"Wait Lock Time","value":0,"unit":"ns"}],"Memory":[{"name":"Memory Size","value":296,"unit":"byte"}],"Throughput":[{"name":"Input Rows","value":1,"unit":"count"},{"name":"Output Rows","value":1,"unit":"count"},{"name":"Input Size","value":279,"unit":"byte"},{"name":"Output Size","value":296,"unit":"byte"}],"IO":[{"name":"Disk IO","value":0,"unit":"byte"},{"name":"Scan Bytes","value":0,"unit":"byte"},{"name":"S3 List Count","value":0,"unit":"count"},{"name":"S3 Head Count","value":0,"unit":"count"},{"name":"S3 Put Count","value":0,"unit":"count"},{"name":"S3 Get Count","value":0,"unit":"count"},{"name":"S3 Delete Count","value":0,"unit":"count"},{"name":"S3 DeleteMul Count","value":0,"unit":"count"},{"name":"FileService Cache Read","value":0,"unit":"count"},{"name":"FileService Cache Hit","value":0,"unit":"count"},{"name":"FileService Cache Memory Read","value":0,"unit":"count"},{"name":"FileService Cache Memory Hit","value":0,"unit":"count"},{"name":"FileService Cache Disk Read","value":0,"unit":"count"},{"name":"FileService Cache Disk Hit","value":0,"unit":"count"},{"name":"FileService Cache Remote Read","value":0,"unit":"count"},{"name":"FileService Cache Remote Hit","value":0,"unit":"count"}],"Network":[{"name":"Network","value":0,"unit":"byte"}]},"stats":{"blocknum":1,"outcnt":2.5,"cost":2.5,"hashmapsize":0,"rowsize":100},"totalStats":{"name":"Time spent","value":14234,"unit":"ns"}},{"id":"0","name":"Table Scan","title":"luke.task","labels":[{"name":"Full table name","value":"luke.task"},{"name":"Columns","value":["id","source_connector_id","task_log_id","name","creator","uid","interval","load_interval_type","task_filter_mode","task_meta","folder_id","folder_name","start_time","create_time","__mo_rowid"]},{"name":"Total columns","value":18},{"name":"Scan columns","value":15},{"name":"Block Filter conditions","value":["(task.id = cast(? AS BIGINT))"]},{"name":"Filter conditions","value":["(task.id = cast(? AS BIGINT))"]}],"statistics":{"Time":[{"name":"Time Consumed","value":75742,"unit":"ns"},{"name":"Wait Time","value":0,"unit":"ns"},{"name":"Scan Time","value":0,"unit":"ns"},{"name":"Insert Time","value":0,"unit":"ns"},{"name":"Wait Lock Time","value":0,"unit":"ns"}],"Memory":[{"name":"Memory Size","value":559,"unit":"byte"}],"Throughput":[{"name":"Input Rows","value":1,"unit":"count"},{"name":"Output Rows","value":1,"unit":"count"},{"name":"Input Size","value":279,"unit":"byte"},{"name":"Output Size","value":279,"unit":"byte"}],"IO":[{"name":"Disk IO","value":0,"unit":"byte"},{"name":"Scan Bytes","value":279,"unit":"byte"},{"name":"S3 List Count","value":0,"unit":"count"},{"name":"S3 Head Count","value":0,"unit":"count"},{"name":"S3 Put Count","value":0,"unit":"count"},{"name":"S3 Get Count","value":0,"unit":"count"},{"name":"S3 Delete Count","value":0,"unit":"count"},{"name":"S3 DeleteMul Count","value":0,"unit":"count"},{"name":"FileService Cache Read","value":15,"unit":"count"},{"name":"FileService Cache Hit","value":15,"unit":"count"},{"name":"FileService Cache Memory Read","value":15,"unit":"count"},{"name":"FileService Cache Memory Hit","value":15,"unit":"count"},{"name":"FileService Cache Disk Read","value":0,"unit":"count"},{"name":"FileService Cache Disk Hit","value":0,"unit":"count"},{"name":"FileService Cache Remote Read","value":0,"unit":"count"},{"name":"FileService Cache Remote Hit","value":0,"unit":"count"}],"Network":[{"name":"Network","value":0,"unit":"byte"}]},"stats":{"blocknum":1,"outcnt":2.5,"cost":1,"hashmapsize":1,"rowsize":0},"totalStats":{"name":"Time spent","value":75742,"unit":"ns"}}],"edges":[{"id":"E2","src":"2","dst":"3","output":1,"unit":"count"},{"id":"E1","src":"1","dst":"2","output":1,"unit":"count"},{"id":"E0","src":"0","dst":"1","output":1,"unit":"count"}],"labels":[],"global":{"statistics":{"Time":[{"name":"Time Consumed","value":543638,"unit":"ns"},{"name":"Wait Time","value":119952144443,"unit":"ns"}],"Memory":[{"name":"Memory Size","value":1151,"unit":"byte"}],"Throughput":[{"name":"Input Rows","value":4,"unit":"count"},{"name":"Output Rows","value":4,"unit":"count"},{"name":"Input Size","value":1150,"unit":"byte"},{"name":"Output Size","value":1167,"unit":"byte"}],"IO":[{"name":"Disk IO","value":0,"unit":"byte"},{"name":"Scan Bytes","value":279,"unit":"byte"},{"name":"S3 List Count","value":0,"unit":"count"},{"name":"S3 Head Count","value":0,"unit":"count"},{"name":"S3 Put Count","value":0,"unit":"count"},{"name":"S3 Get Count","value":0,"unit":"count"},{"name":"S3 Delete Count","value":0,"unit":"count"},{"name":"S3 DeleteMul Count","value":0,"unit":"count"}],"Network":[{"name":"Network","value":0,"unit":"byte"}]},"totalStats":{"name":"Time spent","value":543638,"unit":"ns"}}},"step":0,"description":"","state":"success","stats":{}}],"code":0,"message":"","uuid":"01949324-0959-74db-9e34-e8e0069bcec3","PhyPlan":{"version":"1.0","scope":[{"Magic":"Merge","Mcpu":1,"Receiver":[{"Idx":1}],"PreScopes":[{"Magic":"Merge","Mcpu":1,"Receiver":[{"Idx":0}],"PreScopes":[{"Magic":"Remote","Mcpu":1,"DataSource":{"SchemaName":"luke","TableName":"task","Columns":["id","source_connector_id","task_log_id","name","creator","uid","interval","load_interval_type","task_filter_mode","task_meta","folder_id","folder_name","start_time","create_time","__mo_rowid"]},"RootOperator":{"OpName":"Connector","NodeIdx":2,"Status":0,"toMergeReceiver":[{"Idx":0}],"OpStats":{"CallCount":2,"TimeConsumed":83293},"Children":[{"OpName":"Projection","NodeIdx":1,"Status":3,"OpStats":{"CallCount":2,"TimeConsumed":14234,"MemorySize":296,"InputRows":1,"InputSize":279,"OutputRows":1,"OutputSize":296},"Children":[{"OpName":"Projection","NodeIdx":0,"Status":2,"OpStats":{"CallCount":2,"TimeConsumed":1141,"MemorySize":279,"OutputRows":1,"OutputSize":279},"Children":[{"OpName":"Filter","NodeIdx":0,"Status":0,"OpStats":{"CallCount":2,"TimeConsumed":11200,"MemorySize":1},"Children":[{"OpName":"TableScan","NodeIdx":0,"Status":1,"OpStats":{"CallCount":2,"TimeConsumed":63401,"MemorySize":279,"InputRows":1,"InputSize":279,"CacheRead":15,"CacheHit":15,"CacheMemoryRead":15,"CacheMemoryHit":15}}]}]}]}]},"PrepareTimeConsumed":29342}],"RootOperator":{"OpName":"Connector","NodeIdx":3,"Status":0,"toMergeReceiver":[{"Idx":1}],"OpStats":{"CallCount":2,"TimeConsumed":144904},"Children":[{"OpName":"Projection","NodeIdx":2,"Status":2,"OpStats":{"CallCount":2,"TimeConsumed":11209,"MemorySize":296,"OutputRows":1,"OutputSize":296},"Children":[{"OpName":"LockOp","NodeIdx":2,"Status":1,"OpStats":{"CallCount":2,"TimeConsumed":20849,"WaitTimeConsumed":59975704394,"InputRows":1,"InputSize":296,"OperatorMetrics":{"3":59975704775}},"Children":[{"OpName":"Merge","NodeIdx":2,"Status":0,"OpStats":{"CallCount":2,"TimeConsumed":62179,"WaitTimeConsumed":238121}}]}]}]},"PrepareTimeConsumed":2147}],"RootOperator":{"OpName":"MultiUpdate","NodeIdx":3,"Status":3,"OpStats":{"CallCount":2,"TimeConsumed":81973,"InputRows":1,"InputSize":296,"OutputRows":1,"OutputSize":296,"WrittenRows":1,"DeletedRows":1,"OperatorMetrics":{"1":59976330808}},"Children":[{"OpName":"Merge","NodeIdx":3,"Status":0,"OpStats":{"CallCount":2,"TimeConsumed":49255,"WaitTimeConsumed":59976201928}}]},"PrepareTimeConsumed":2797}]},"NewPlanStats":{"ParseStage":{"ParseDuration":9596,"ParseStartTime":"2025-01-23T12:30:00.025309286Z"},"PlanStage":{"PlanDuration":0,"PlanStartTime":"0001-01-01T00:00:00Z","BuildPlanS3Request":{},"BuildPlanStatsIOConsumption":0,"BuildPlanStatsS3":{},"BuildPlanStatsDuration":0,"BuildPlanStatsInCacheDuration":0,"BuildPlanResolveVarDuration":0},"CompileStage":{"CompileDuration":0,"CompileStartTime":"0001-01-01T00:00:00Z","CompileS3Request":{},"CompileExpandRangesS3":{},"CompileTableScanDuration":0},"PrepareRunStage":{"CompilePreRunOnceDuration":504505,"CompilePreRunOnceWaitLock":326002,"ScopePrepareDuration":34286,"ScopePrepareS3Request":{},"BuildReaderDuration":27385},"ExecuteStage":{"ExecutionDuration":59976958406,"ExecutionStartTime":"2025-01-23T12:30:00.025391551Z","ExecutionEndTime":"2025-01-23T12:31:00.00234995Z","OutputDuration":0},"OtherStage":{"TxnIncrStatementS3":{}},"IOAccessTimeConsumption":25145,"S3FSPrefetchFileIOMergerTimeConsumption":0,"LocalFSReadIOMergerTimeConsumption":0,"S3FSReadIOMergerTimeConsumption":0,"WaitActive":0}}

            rows_read: 1
           bytes_scan: 279
                stats: [5,740878,1151.000,0.000244,0,78,1,1,0.0570,0,0]
       statement_type: Update
           query_type: DML
              role_id: 2
      sql_source_type: external_sql
           aggr_count: 0
         result_count: 0
        connection_id: 2146166980
                   cu: 0.057
4 rows in set (13.22 sec)

@xzxiong
Copy link
Contributor

xzxiong commented Jan 23, 2025

目前能构造出 FIN 的操作,但是未能复现 构造出 Success 的情况。

  • step 1, prepare
create database test;
use test;
create table user_info (id int auto_increment, name varchar(1024));
insert into test.user_info (name) values ('jacksonxie');
begin;
-- lock the target row.
select * from test.user_info for update;
  • step 2, run
git clone [email protected]:xzxiong/gorm_demo.git
cd gorm_demo
git checkout issue_21333
go run .
  • key code.
func testContextTimeout() {

	ctx := context.Background()

	dbCfg := Config{
		Host:        "127.0.0.1",
		Port:        6001,
		Username:    "dump",
		Password:    "111",
		Database:    "mysql",
		PPV2Enabled: false,
		ClientIP:    "",
	}

	logger = NewLogger(NewExampleZapLogger())

	// conn db, got GORM conn
	conn, err := connDBForUser(ctx, dbCfg, logger)
	if err != nil {
		logger.Error(ctx, "Create db connection failed for %s: %v", dbCfg.Username, err)
		return
	}

	timeoCtx, timeoCancel := context.WithTimeoutCause(ctx, 3*time.Second, fmt.Errorf("client-timeout"))
	defer timeoCancel()
	logger.Info(ctx, "startup")
	conn.WithContext(timeoCtx).Transaction(func(tx *gorm.DB) error {
		// 通过外部 lock test.user_info 可构造 timeout 的情况
		if err := tx.Exec("update test.user_info set name ='xzxiong' where id = 1;").Error; err != nil {
			logger.Error(ctx, "exec sql: %v", err)
			return err
		}

		return nil
	})

	logger.Info(ctx, "Done.")
}

@xzxiong
Copy link
Contributor

xzxiong commented Jan 23, 2025

@daviszhen 目前 CN 接受到了 FIN,但是 query 最后是success ,按道理应该是 failed.

@xzxiong xzxiong assigned daviszhen and unassigned xzxiong Jan 23, 2025
@xzxiong
Copy link
Contributor

xzxiong commented Jan 23, 2025

以下是 执行 update task 时,抓取的pprof

goroutine.1.txt
goroutine.txt

goroutine.pprof.pb.gz

heap.pprof.pb.gz

@xzxiong xzxiong modified the milestones: 2.0.2, 2.1.0 Jan 23, 2025
@xzxiong xzxiong added severity/s0 Extreme impact: Cause the application to break down and seriously affect the use and removed severity/s1 High impact: Logical errors or data errors that must occur labels Jan 23, 2025
@qingxinhome qingxinhome self-assigned this Jan 23, 2025
@xzxiong
Copy link
Contributor

xzxiong commented Jan 24, 2025

在 dev 环境下

抓取 tcpdump 确认了给 连接发送了 FIN 包 , 对应的请求按道理 应该是 失败的。

....

tcpdump tcp.srcport == 39898 or tcp.dstport == 39898target.cap.0123.issue21333.tgz

Image

...

我捋了一下,这里情况如下。
客户端遇到 context timeout 然后 close 了tcp连接,mo 在 tcp- { FIN,ACK } 前完成了 Response 发送,所以 query 确实是成功的。

所以目前的问题时:为什么 update task ... query 都能在 tcp断开前完成?

# Client Server
L1 begin; ->
L2 update task ... ->
L3 run query
L4 tcp FIN ->
L5 <- tcp ACK
L6 <- query Response
L7 <- tcp FIN, ACK
L8 tcp RST ->

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Something isn't working severity/s0 Extreme impact: Cause the application to break down and seriously affect the use
Projects
None yet
Development

No branches or pull requests

5 participants