Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

checking tcp connection status #20996

Open
wants to merge 41 commits into
base: main
Choose a base branch
from
Open

checking tcp connection status #20996

wants to merge 41 commits into from

Conversation

QCSnAn
Copy link

@QCSnAn QCSnAn commented Dec 27, 2024

What type of PR is this?

  • API-change
  • BUG
  • Improvement
  • Documentation
  • Feature
  • Test and CI
  • Code Refactoring

Which issue(s) this PR fixes:

issue #20609


What this PR does / why we need it:

通过异步的方式检测TCP连接的状态并提前中止已断开连接的任务
https://github.com/matrixorigin/docs/blob/main/design/mo/sql/20250108-QCSnAn-tcpDetection.md


Copy link

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 No relevant tests
🔒 Security concerns

File Descriptor Exposure:
The code obtains raw file descriptors from TCP connections using File() method. If not properly managed, these file descriptors could potentially be leaked or accessed by unauthorized code. The code should ensure proper cleanup and validation of file descriptor access permissions.

⚡ Recommended focus areas for review

Resource Leak

The GetsockoptTCPInfo function defers file.Close() but continues execution if there's an error closing the file. This could lead to file descriptor leaks.

defer func() {
	err = file.Close()
	if err != nil {
		logutil.Error("TCP info file close error", zap.Error(err))
		return
	}
}()
Potential Deadlock

The isConnected function holds locks on the connMap while iterating and then tries to load/delete from it again, which could cause deadlock issues in concurrent scenarios.

connMap.Range(func(key, value any) bool {
	tcpConn := key.(*net.TCPConn)
	tcpInfo, err := GetsockoptTCPInfo(tcpConn)
	if err != nil {
		logutil.Error("Failed to get TCP info", zap.Error(err))
		tcpConnStatus[tcpConn] = 0
		return true
	}
	switch tcpInfo.State {
	case TCP_LAST_ACK, TCP_CLOSE, TCP_FIN_WAIT1, TCP_FIN_WAIT2, TCP_TIME_WAIT:
		tcpConnStatus[tcpConn] = tcpInfo.State
		return true
	default:
		return true
	}
})

for key, value := range tcpConnStatus {
	cancel, ok := connMap.Load(key)
	if ok {
		TCPAddr := key.RemoteAddr().String()
		cancel.(context.CancelFunc)()
		connMap.Delete(key)
		switch value {
		case TCP_LAST_ACK:
			logutil.Infof("Connection %s is terminated by TCP_LAST_ACK", TCPAddr)
		case TCP_CLOSE:
			logutil.Infof("Connection %s is terminated by TCP_CLOSE", TCPAddr)
		case TCP_FIN_WAIT1:
			logutil.Infof("Connection %s is terminated by TCP_FIN_WAIT1", TCPAddr)
		case TCP_FIN_WAIT2:
			logutil.Infof("Connection %s is terminated by TCP_FIN_WAIT2", TCPAddr)
		case TCP_TIME_WAIT:
			logutil.Infof("Connection %s is terminated by TCP_TIME_WAIT", TCPAddr)
		default:
			logutil.Infof("Connection %s is terminated", TCPAddr)
		}
	}
}
Infinite Loop

The checkConnected function runs in an infinite loop with a sleep, but has no way to be gracefully shutdown. This could cause issues during server shutdown.

func checkConnected() {

	ticker := time.Tick(time.Minute)

	for {
		select {
		case <-ticker:
			logutil.Debugf("Goruntine %d is checking TCP status", GetRoutineId())
		default:
		}
		isConnected()
		time.Sleep(checkInterval * time.Second)
	}
}

@matrix-meow matrix-meow added the size/M Denotes a PR that changes [100,499] lines label Dec 27, 2024
@CLAassistant
Copy link

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Copy link

qodo-merge-pro bot commented Dec 27, 2024

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Score
Possible issue
✅ Fix resource leak by properly managing ticker lifecycle and adding graceful shutdown capability
Suggestion Impact:The commit implements context-based cancellation and moves ticker functionality into MOServer struct methods, though with a slightly different implementation than suggested

code diff:

-func checkConnected() {
+func (mo *MOServer) checkConnected(ctx context.Context) {
 
 	ticker := time.Tick(time.Minute)
 
 	for {
 		select {
-		case <-ticker:
-			logutil.Debugf("Goruntine %d is checking TCP status", GetRoutineId())
+		case <-ctx.Done():
+			break
 		default:
-		}
-		isConnected()
-		time.Sleep(checkInterval * time.Second)
-	}
+			select {
+			case <-ticker:
+				logutil.Debugf("Goruntine %d is checking TCP status", GetRoutineId())
+			default:
+			}
+			mo.isConnected()
+			time.Sleep(checkInterval * time.Second)
+		}
+	}
+}

The checkConnected() function creates an infinite number of ticker goroutines that
are never cleaned up. Use a single ticker and add a way to stop the goroutine when
the server shuts down.

pkg/frontend/server.go [180-193]

-func checkConnected() {
-    ticker := time.Tick(time.Minute)
+func checkConnected(ctx context.Context) {
+    ticker := time.NewTicker(time.Minute)
+    defer ticker.Stop()
     for {
         select {
-        case <-ticker:
+        case <-ticker.C:
             logutil.Debugf("Goruntine %d is checking TCP status", GetRoutineId())
-        default:
+            isConnected()
+        case <-ctx.Done():
+            return
         }
-        isConnected()
-        time.Sleep(checkInterval * time.Second)
     }
 }
  • Apply this suggestion
Suggestion importance[1-10]: 9

Why: The suggestion addresses a significant resource leak issue by properly managing the ticker lifecycle and adding context-based cancellation. This is crucial for preventing memory leaks and enabling proper server shutdown.

9
✅ Replace unsafe pointer operations with safe system calls to prevent potential crashes
Suggestion Impact:The commit replaces unsafe pointer operations with reflect package operations, which provides a safer way to handle memory operations compared to direct unsafe pointer usage

code diff:

-	tcpInfo := syscall.TCPInfo{}
-	size := unsafe.Sizeof(tcpInfo)
+	size := reflect.TypeOf(*tcpInfo).Size()
 	_, _, errno := syscall.Syscall6(syscall.SYS_GETSOCKOPT, fd, syscall.SOL_TCP, syscall.TCP_INFO,
-		uintptr(unsafe.Pointer(&tcpInfo)), uintptr(unsafe.Pointer(&size)), 0)
+		reflect.ValueOf(tcpInfo).Pointer(), reflect.ValueOf(&size).Pointer(), 0)

The isConnected function uses unsafe pointer operations without proper memory
alignment checks, which could cause crashes on certain architectures. Add alignment
checks or use a safer alternative.

pkg/frontend/server.go [121-122]

-_, _, errno := syscall.Syscall6(syscall.SYS_GETSOCKOPT, fd, syscall.SOL_TCP, syscall.TCP_INFO,
-    uintptr(unsafe.Pointer(&tcpInfo)), uintptr(unsafe.Pointer(&size)), 0)
+info, err := syscall.GetsockoptTCPInfo(int(fd), syscall.IPPROTO_TCP, syscall.TCP_INFO)
+if err != nil {
+    return nil, fmt.Errorf("failed to get TCP info: %v", err)
+}
+return info, nil
  • Apply this suggestion
Suggestion importance[1-10]: 8

Why: The suggestion addresses a potential stability and security issue by replacing unsafe pointer operations with safe system calls, which could prevent crashes and undefined behavior across different architectures.

8
General
Improve error handling by properly propagating file closure errors

The GetsockoptTCPInfo function defers file closure but ignores potential errors
during the main execution path. Move the error check into the deferred function.

pkg/frontend/server.go [110-116]

 defer func() {
-    err = file.Close()
-    if err != nil {
-        logutil.Error("TCP info file close error", zap.Error(err))
-        return
+    if closeErr := file.Close(); closeErr != nil {
+        if err == nil {
+            err = closeErr
+        } else {
+            logutil.Error("TCP info file close error", zap.Error(closeErr))
+        }
     }
 }()
  • Apply this suggestion
Suggestion importance[1-10]: 6

Why: The suggestion improves error handling by properly propagating file closure errors, which could be important for debugging and system stability, though not as critical as the resource leak issue.

6

@QCSnAn QCSnAn changed the title Checking TCP connection status checking tcp connection status Jan 2, 2025
Copy link
Contributor

mergify bot commented Jan 20, 2025

This pull request has been removed from the queue for the following reason: pull request dequeued.

Pull request #20996 has been dequeued. Invalid commit message. section not found

You should look at the reason for the failure and decide if the pull request needs to be fixed or if you want to requeue it.

If you want to requeue this pull request, you need to post a comment with the text: @mergifyio requeue

2 similar comments
Copy link
Contributor

mergify bot commented Jan 20, 2025

This pull request has been removed from the queue for the following reason: pull request dequeued.

Pull request #20996 has been dequeued. Invalid commit message. section not found

You should look at the reason for the failure and decide if the pull request needs to be fixed or if you want to requeue it.

If you want to requeue this pull request, you need to post a comment with the text: @mergifyio requeue

Copy link
Contributor

mergify bot commented Jan 20, 2025

This pull request has been removed from the queue for the following reason: pull request dequeued.

Pull request #20996 has been dequeued. Invalid commit message. section not found

You should look at the reason for the failure and decide if the pull request needs to be fixed or if you want to requeue it.

If you want to requeue this pull request, you need to post a comment with the text: @mergifyio requeue

Copy link
Contributor

mergify bot commented Jan 21, 2025

This pull request has been removed from the queue for the following reason: pull request dequeued.

Pull request #20996 has been dequeued. Invalid commit message. section not found

You should look at the reason for the failure and decide if the pull request needs to be fixed or if you want to requeue it.

If you want to requeue this pull request, you need to post a comment with the text: @mergifyio requeue

Copy link
Contributor

mergify bot commented Jan 21, 2025

This pull request has been removed from the queue for the following reason: pull request dequeued.

Pull request #20996 has been dequeued. Invalid commit message. section not found

You should look at the reason for the failure and decide if the pull request needs to be fixed or if you want to requeue it.

If you want to requeue this pull request, you need to post a comment with the text: @mergifyio requeue

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/enhancement Review effort [1-5]: 4 size/M Denotes a PR that changes [100,499] lines
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants