Skip to content

Conversation

@imeoer
Copy link
Collaborator

@imeoer imeoer commented Nov 10, 2025

We found that when a pod container is recreated, there's a chance the dynamically mounted CSI directory gets unmounted. Since this directory is shared among multiple pod containers via mount propagation from /var/run/model-csi (a tmpfs), it introduces instability and security isolation risks, for example, if one pod learns another pod's volume name, it could maliciously call APIs to mount or unmount that submount of volume.

This pull request refactors the management of dynamic HTTP servers for CSI mounts, improving security and stability by introducing the DynamicServerManager and deprecating the previous shared dynamic CSI endpoint approach. It also restructures the codebase for better maintainability by moving HTTP server logic from pkg/server/http.go to new service-layer files and updating handlers accordingly. Additionally, it enhances resource cleanup and recovery logic for dynamic servers, and updates mount/unmount workflows to leverage the new manager.

Dynamic Server Management and API Refactor:

  • Introduced DynamicServerManager in pkg/service/dynamic_server.go to handle lifecycle (creation, closing, recovery) of per-volume dynamic HTTP servers, replacing the old shared socket approach. This improves isolation and avoids mount propagation issues.
  • Deprecated the use of a single dynamic_csi_endpoint in favor of individual csi.sock sockets managed per dynamic mount directory, as documented in pkg/config/config.go.
  • Removed the old HTTP server implementation from pkg/server/http.go and replaced it with dynamic server logic in the service layer. [1] [2]

Handler and API Changes:

  • Renamed and refactored HTTP handler code from HttpHandler in pkg/server/http_handler.go to DynamicServerHandler in pkg/service/dynamic_server_handler.go, updating method names and request types for clarity and consistency. [1] [2] [3] [4] [5] [6] [7]

Mount/Unmount Workflow Updates:

  • Updated dynamic volume mount/unmount logic in pkg/service/node_dynamic.go to use DynamicServerManager for server lifecycle management, and added device checks to determine the correct cleanup path for backward compatibility. [1] [2]

Resource Management and Utilities:

  • Moved socket file existence checks and cleanup logic to a shared utility (EnsureSockNotExists), replacing the previous inline implementation for better reuse. [1] [2]
  • Added utility function for generating dynamic CSI socket paths in pkg/config/config.go.

Test and Import Cleanup:

  • Removed obsolete HTTP server and handler tests from pkg/server/server_test.go and cleaned up unused imports related to the old HTTP server code. [1] [2]

This refactor lays the groundwork for more robust and secure dynamic mount management, while maintaining backward compatibility for legacy setups.

@imeoer imeoer force-pushed the improve-dynamic-mounts branch from 312876a to 19509c8 Compare November 11, 2025 02:53
@imeoer imeoer changed the title feat: improve dynamic model mounts refactor(dynamic mount): introduce dynamic server manager Nov 11, 2025
@imeoer imeoer force-pushed the improve-dynamic-mounts branch from 19509c8 to 6823740 Compare November 11, 2025 03:51
@imeoer imeoer marked this pull request as ready for review November 11, 2025 03:52
@imeoer
Copy link
Collaborator Author

imeoer commented Nov 11, 2025

@gemini-code-assist review

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a significant and well-motivated refactoring to improve the management of dynamic CSI mounts. The introduction of DynamicServerManager to handle per-volume HTTP servers is a great step towards better security and stability. The code is well-structured, and backward compatibility for the old shared socket approach is thoughtfully maintained. My review includes a few suggestions to enhance robustness, particularly around context propagation for graceful shutdowns, ensuring proper logging severity for error conditions, and maintaining test coverage for the refactored code.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request is a significant refactoring to improve the management of dynamic CSI mounts. It introduces a DynamicServerManager to create per-volume HTTP servers, which enhances security and stability by isolating volumes from each other and resolving issues with mount propagation. The changes are well-structured, moving logic into a new service layer and cleaning up the old implementation.

My review focuses on a few critical areas:

  • Concurrency safety: I've identified a critical issue with the use of os.Chdir in a concurrent environment, which can lead to race conditions.
  • Context propagation: There are places where context.Background() is used instead of propagating the request-scoped context, which can affect graceful shutdown.
  • Error handling: I've pointed out a couple of spots where error handling could be more robust to prevent unexpected behavior.

Overall, this is a great improvement. Addressing these points will make the implementation even more robust.

We found that when a pod container is recreated, there's a chance the
dynamically mounted CSI directory gets unmounted. Since this directory
is shared among multiple pod containers via mount propagation from
/var/run/model-csi (a tmpfs), it introduces instability and security
isolation risks, for example, if one pod learns another pod's volume
name, it could maliciously call APIs to mount or unmount that submount
of volume.

To ensure secure isolation for each dynamic mount and avoid unstable
mount propagation, an independent csi.sock is currently created under
each dynamic mount directory instead of using a shared csi.sock, these
individual csi.sock servers are managed by the DynamicServerManager.

Signed-off-by: imeoer <[email protected]>
@imeoer imeoer force-pushed the improve-dynamic-mounts branch from 6823740 to de42dbc Compare November 12, 2025 08:48
Copy link

@aftersnow aftersnow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Copy link
Member

@chlins chlins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@imeoer imeoer merged commit 4a710cf into modelpack:main Nov 12, 2025
6 of 9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants