
fatal error: sync: unlock of unlocked mutex (v8.5.0 stmtctx.go:803) #58600

Open
chibiegg opened this issue Dec 29, 2024 · 22 comments · May be fixed by #58629
Labels
impact/panic · severity/critical · sig/sql-infra (SIG: SQL Infra) · type/bug (The issue is confirmed as a bug.)

Comments

@chibiegg commented Dec 29, 2024

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

I'm sorry, but I'm not sure how to reproduce the issue. It occurs sporadically on multiple running TiDB processes.

2. What did you expect to see? (Required)

3. What did you see instead (Required)

A fatal error occurs in:

func (sc *StatementContext) AffectedRows() uint64 {
	sc.mu.Lock()
	defer sc.mu.Unlock()
	return sc.mu.affectedRows
}

These errors always occur on the same line.

fatal error: sync: unlock of unlocked mutex

goroutine 322590843 [running]:
sync.fatal({0x67ac522?, 0x0?})
	/usr/local/go/src/runtime/panic.go:1031 +0x18
sync.(*Mutex).unlockSlow(0xc0414d09c8, 0xffffffff)
	/usr/local/go/src/sync/mutex.go:231 +0x35
sync.(*Mutex).Unlock(...)
	/usr/local/go/src/sync/mutex.go:225
github.com/pingcap/tidb/pkg/sessionctx/stmtctx.(*StatementContext).AffectedRows(0xa6f0580?)
	/workspace/source/tidb/pkg/sessionctx/stmtctx/stmtctx.go:803 +0x9e
github.com/pingcap/tidb/pkg/util.(*ProcessInfo).ToRow(0xc040ae1860, 0xc003726ee0)
	/workspace/source/tidb/pkg/util/processinfo.go:152 +0x87
github.com/pingcap/tidb/pkg/executor.(*memtableRetriever).setDataForProcessList(0xc084d3a8c0, {0x7114e40, 0xc06e015208})
	/workspace/source/tidb/pkg/executor/infoschema_reader.go:1843 +0x25a
github.com/pingcap/tidb/pkg/executor.(*memtableRetriever).setDataForClusterProcessList(0xc084d3a8c0, {0x7114e40, 0xc06e015208})
	/workspace/source/tidb/pkg/executor/infoschema_reader.go:1816 +0x26
github.com/pingcap/tidb/pkg/executor.(*memtableRetriever).retrieve(0xc084d3a8c0, {0x707dc08, 0xc07adb7c20}, {0x7114e40, 0xc06e015208})
	/workspace/source/tidb/pkg/executor/infoschema_reader.go:179 +0xef3
github.com/pingcap/tidb/pkg/executor.(*MemTableReaderExec).Next(0xc0c5628300, {0x707dc08, 0xc07adb7c20}, 0xc040a6d090)
	/workspace/source/tidb/pkg/executor/memtable_reader.go:136 +0x29c
github.com/pingcap/tidb/pkg/executor/internal/exec.Next({0x707dc08, 0xc07adb7c20}, {0x70b4620, 0xc0c5628300}, 0xc040a6d090)
	/workspace/source/tidb/pkg/executor/internal/exec/executor.go:456 +0x29f
github.com/pingcap/tidb/pkg/executor.(*SelectionExec).Next(0xc040a55080, {0x707dc08, 0xc07adb7c20}, 0xc040a6d0e0)
	/workspace/source/tidb/pkg/executor/select.go:721 +0xeb
github.com/pingcap/tidb/pkg/executor/internal/exec.Next({0x707dc08, 0xc07adb7c20}, {0x70b47d0, 0xc040a55080}, 0xc040a6d0e0)
	/workspace/source/tidb/pkg/executor/internal/exec/executor.go:456 +0x29f
github.com/pingcap/tidb/pkg/executor.(*CoprocessorDAGHandler).HandleRequest(0xc019f318b8, {0x707dc08?, 0xc07adb7770?}, 0xc040a56480)
	/workspace/source/tidb/pkg/executor/coprocessor.go:90 +0x356
github.com/pingcap/tidb/pkg/server.(*rpcServer).handleCopRequest(0xc003655b90, {0x707dc08, 0xc07adb7770}, 0xc040a56480)
	/workspace/source/tidb/pkg/server/rpc_server.go:216 +0x215
github.com/pingcap/tidb/pkg/server.(*rpcServer).Coprocessor(0xc003655b90, {0x707dc08, 0xc07adb7770}, 0xc040a56480)
	/workspace/source/tidb/pkg/server/rpc_server.go:105 +0x96
github.com/pingcap/kvproto/pkg/tikvpb._Tikv_Coprocessor_Handler({0x6700b60, 0xc003655b90}, {0x707dc08, 0xc07adb7770}, 0xc057e61400, 0x0)
	/root/go/pkg/mod/github.com/pingcap/[email protected]/pkg/tikvpb/tikvpb.pb.go:3526 +0x1a6
google.golang.org/grpc.(*Server).processUnaryRPC(0xc003eff000, {0x707dc08, 0xc07adb76e0}, {0x70a1f40, 0xc018578180}, 0xc040a5f9e0, 0xc003655d10, 0x9f0d6d0, 0x0)
	/root/go/pkg/mod/google.golang.org/[email protected]/server.go:1369 +0xdf8
google.golang.org/grpc.(*Server).handleStream(0xc003eff000, {0x70a1f40, 0xc018578180}, 0xc040a5f9e0)
	/root/go/pkg/mod/google.golang.org/[email protected]/server.go:1780 +0xe8b
google.golang.org/grpc.(*Server).serveStreams.func2.1()
	/root/go/pkg/mod/google.golang.org/[email protected]/server.go:1019 +0x7f
created by google.golang.org/grpc.(*Server).serveStreams.func2 in goroutine 1510
	/root/go/pkg/mod/google.golang.org/[email protected]/server.go:1030 +0x125

4. What is your TiDB version? (Required)

Release Version: v8.5.0
Edition: Community
Git Commit Hash: d13e52ed6e22cc5789bed7c64c861578cd2ed55b
Git Branch: HEAD
UTC Build Time: 2024-12-18 02:26:06
GoVersion: go1.23.3
Race Enabled: false
Check Table Before Drop: false
Store: tikv
@tiancaiamao (Contributor)

This is probably triggered by a query on the information_schema.xxx tables that leads to a TiDB coprocessor request. The trigger condition may be similar to this one: #57798 (comment)
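For example, judging from the stack trace above (which passes through setDataForClusterProcessList and setDataForProcessList), a statement of the shape SELECT * FROM information_schema.cluster_processlist would take this code path. That is an inference from the trace, not a confirmed reproduction step.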

@Defined2014 (Contributor) commented Dec 30, 2024

It seems we need to add a RefCountOfStmtCtx check before using StmtCtx; ref #39368

@chibiegg (Author) commented Dec 31, 2024

Thank you for your comments.

In StatementContext, mu is not a pointer but an embedded value, and it is not exported to other packages.
There's also no point in the code where mu is replaced with a new Mutex.
Furthermore, everywhere it's used, Lock and Unlock are both called within the same function.

Despite that, I can't figure out why the Unlock call isn't matching. What possible reasons could there be for this?
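For illustration (this is a standalone, deliberately racy sketch, not TiDB code): one mechanism that produces exactly this panic with no explicit reassignment of mu is a whole-struct overwrite such as *sc = StatementContext{...}, which zeroes the embedded mutex while another goroutine still holds it. Later comments confirm that StatementContext.Reset() performs exactly this kind of whole-struct assignment.

package main

import (
	"sync"
	"time"
)

// statementContext mimics the relevant shape: an embedded, non-pointer
// mutex protecting a field.
type statementContext struct {
	mu struct {
		sync.Mutex
		affectedRows uint64
	}
}

func main() {
	sc := &statementContext{}
	go func() {
		sc.mu.Lock()
		time.Sleep(10 * time.Millisecond) // still holding the lock...
		sc.mu.Unlock()                    // fatal error: sync: unlock of unlocked mutex
	}()
	time.Sleep(5 * time.Millisecond)
	*sc = statementContext{} // ...while the whole struct, mutex included, is zeroed
	time.Sleep(20 * time.Millisecond)
}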

@Defined2014 (Contributor)

I think it's the same issue as #57799 mentioned. We don't check whether the StatementContext is still in use before resetting it.

@chibiegg (Author)

I think I'm starting to understand this a bit.

Are you suggesting that StatementContext is being duplicated outside the package while still containing mu?

If that's the case, would it be enough to modify the code (similarly to #57986) so that, at the beginning of the function, we store a pointer to the mutex in a local variable and then use that?
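For illustration, the proposed change would look roughly like this (a sketch only, modeled on the pointer-capture idea; whether this workaround is sufficient is discussed in the comments that follow):

func (sc *StatementContext) AffectedRows() uint64 {
	// Capture the mutex address in a local variable so that Lock and the
	// deferred Unlock are issued through the same pointer value.
	mu := &sc.mu
	mu.Lock()
	defer mu.Unlock()
	return sc.mu.affectedRows
}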

@Defined2014 (Contributor)

You mean #57799? I think that's a workaround. But we would need to modify a lot of functions, such as SetAffectedRows, FoundRows, and so on.

@chibiegg (Author)

Sorry, that was a typo; "(similarly to #57799)" is correct.

I will try to modify all the functions in stmtctx.go.

@Defined2014 (Contributor)

> Sorry, that was a typo; "(similarly to #57799)" is correct.
>
> I will try to modify all the functions in stmtctx.go.

/cc @tiancaiamao

@chibiegg (Author)

I’ve committed my changes and opened PR #58629.
Does this match what you had in mind?

This is my first contribution, so please let me know if there are any issues or anything I need to fix.
I’ve already taken care of tasks like signing the CLA.

@chibiegg (Author)

I’d like to gain a deeper understanding of this issue.

As a workaround, I’ve stored a pointer to the Mutex to ensure that the target mutex cannot be swapped out between Lock and Unlock.

However, I believe the real cause is that StatementContext itself is being overwritten by some operation, such as duplication. Am I correct in thinking this?
If so, it might be better to fix the usage of StatementContext that occurs outside of the stmtctx package.

That said, I don’t currently have enough knowledge to make that judgment on my own 😢 .

@Defined2014 (Contributor)

> I’d like to gain a deeper understanding of this issue.
>
> As a workaround, I’ve stored a pointer to the Mutex to ensure that the target mutex cannot be swapped out between Lock and Unlock.
>
> However, I believe the real cause is that StatementContext itself is being overwritten by some operation, such as duplication. Am I correct in thinking this? If so, it might be better to fix the usage of StatementContext that occurs outside of the stmtctx package.
>
> That said, I don’t currently have enough knowledge to make that judgment on my own 😢.

I think you understand it correctly. So I think the better way to fix this issue is to add the RefCountOfStmtCtx check before using StmtCtx; the ref count will block the reset.
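For illustration, here is a minimal sketch of the reference-count idea, modeled on the TryIncrease/Decrease/TryFreeze/UnFreeze calls that appear later in this thread; the real stmtctx.ReferenceCount may differ in detail:

import "sync/atomic"

// ReferenceCount guards a shared StatementContext: readers call
// TryIncrease before touching the context, and the reset path calls
// TryFreeze so the context cannot be recycled while readers are active.
type ReferenceCount int32

const refFrozen int32 = -1

// TryIncrease registers a reader; it fails while the count is frozen.
func (rc *ReferenceCount) TryIncrease() bool {
	for {
		v := atomic.LoadInt32((*int32)(rc))
		if v == refFrozen {
			return false // a reset is in progress
		}
		if atomic.CompareAndSwapInt32((*int32)(rc), v, v+1) {
			return true
		}
	}
}

// Decrease unregisters a reader registered by a successful TryIncrease.
func (rc *ReferenceCount) Decrease() {
	atomic.AddInt32((*int32)(rc), -1)
}

// TryFreeze succeeds only when no reader is active; while frozen,
// TryIncrease fails and the owner may safely Reset the context.
func (rc *ReferenceCount) TryFreeze() bool {
	return atomic.CompareAndSwapInt32((*int32)(rc), 0, refFrozen)
}

// UnFreeze reopens the context to readers after the reset completes.
func (rc *ReferenceCount) UnFreeze() {
	atomic.StoreInt32((*int32)(rc), 0)
}

With this shape, the reader side is TryIncrease / use / Decrease, and the reset side is TryFreeze / Reset / UnFreeze, which matches the InitStatementContext code quoted further down.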

@chibiegg (Author)

I see.
Thank you for your explanation and responding to the pull request.

@chibiegg (Author)

I see where mu is being replaced.
In StatementContext.Reset() (lines 469–479), a new Mutex object is assigned.

What if we lock the old Mutex before it gets replaced?

mu := &sc.mu
mu.Lock()
cacheMu := &sc.stmtCache.mu
cacheMu.Lock()
*sc = StatementContext{
	ctxID:               contextutil.GenContextID(),
	CTEStorageMap:       sc.CTEStorageMap,
	LockTableIDs:        sc.LockTableIDs,
	TableStats:          sc.TableStats,
	MDLRelatedTableIDs:  sc.MDLRelatedTableIDs,
	TblInfo2UnionScan:   sc.TblInfo2UnionScan,
	WarnHandler:         sc.WarnHandler,
	ExtraWarnHandler:    sc.ExtraWarnHandler,
	IndexUsageCollector: sc.IndexUsageCollector,
}
mu.Unlock()
cacheMu.Unlock()

@Defined2014 (Contributor)

> I see where mu is being replaced. In StatementContext.Reset() (lines 469–479), a new Mutex object is assigned.
>
> What if we lock the old Mutex before it gets replaced?
>
> [snippet quoted from the previous comment]

I think it would block the goroutine for a while and affect SQL execution.

@chibiegg (Author)

I’m also concerned about the performance impact. Should we go ahead with the workaround in the commit, or is there a more appropriate method? What would be the best approach here?

@Defined2014 (Contributor)

I have a patch to fix this issue, but I haven't submitted a PR because I haven't found a way to reproduce the panic and can't 100% guarantee the effectiveness of the patch.

diff --git a/pkg/executor/infoschema_reader.go b/pkg/executor/infoschema_reader.go
index 206ebb52fb..e203886e58 100644
--- a/pkg/executor/infoschema_reader.go
+++ b/pkg/executor/infoschema_reader.go
@@ -1839,7 +1839,11 @@ func (e *memtableRetriever) setDataForProcessList(ctx sessionctx.Context) {
                        continue
                }

+               if pi.StmtCtx != nil && pi.RefCountOfStmtCtx != nil && !pi.RefCountOfStmtCtx.TryIncrease() {
+                       continue
+               }
                rows := pi.ToRow(ctx.GetSessionVars().StmtCtx.TimeZone())
+               pi.RefCountOfStmtCtx.Decrease()
                record := types.MakeDatums(rows...)
                records = append(records, record)
        }
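
In other words, when a process has both a StmtCtx and a ref counter, the reader takes a reference with TryIncrease before calling ToRow, skips any session whose StatementContext is currently being reset, and releases the reference with Decrease once the row has been built.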

@chibiegg (Author)

I see! So the StatementContext is only reset in InitStatementContext(), and that reset is also guarded by RefCountOfStmtCtx.

func (s *SessionVars) InitStatementContext() *stmtctx.StatementContext {
	sc := &s.cachedStmtCtx[0]
	if sc == s.StmtCtx {
		sc = &s.cachedStmtCtx[1]
	}
	if s.RefCountOfStmtCtx.TryFreeze() {
		sc.Reset()
		s.RefCountOfStmtCtx.UnFreeze()
	} else {
		sc = stmtctx.NewStmtCtx()
	}
	return sc
}

In other words, rather than acquiring the mutex lock within Reset(), we can properly handle this by appropriately managing RefCountOfStmtCtx in the code that calls functions like AffectedRows(). Is that correct?

@Defined2014 (Contributor)

I think so. But we can't access RefCountOfStmtCtx from within AffectedRows().

@chibiegg (Author)

Based on what I’ve seen in my environment, every occurrence of the panic happened during calls originating from setDataForProcessList (infoschema_reader.go:1843).

I’ll try running the patch you provided on my local environment.

If it works, should we skip the previously committed workaround and only introduce the handling of RefCountOfStmtCtx?
Or is it better to keep both changes?

@Defined2014 (Contributor)

I think it is enough to keep the RefCountOfStmtCtx handling. But I will let other colleagues review this PR and ask for their opinions.

@chibiegg (Author) commented Jan 1, 2025

I wish you a happy New Year.
Thank you for providing guidance despite it being year-end.

Since last night, I’ve deployed a binary that includes the patch referencing RefCountOfStmtCtx and have started monitoring it.

I’ll keep observing it for a while.

@Defined2014 (Contributor)

> I wish you a happy New Year. Thank you for providing guidance despite it being year-end.
>
> Since last night, I’ve deployed a binary that includes the patch referencing RefCountOfStmtCtx and have started monitoring it.
>
> I’ll keep observing it for a while.

Does the panic still occur? If not, you could update your PR and I will ask my colleagues to review it.
