Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>
JustinCappos
left a comment
This looks good overall. I'm submitting a few basic improvement asks.
Let me know if you want to use this to help proceed to a Joint Assessment and I'll help to get that process started. This will be a moderate amount of work on top of what you've done here.
| **Kubernetes API Server**: While HAMi operates within Kubernetes, it uses the Kubernetes API server to handle requests and responses between users and the HAMi scheduler. It’s a critical interface and thus a separate actor due to its role in processing and validating job submissions. | ||
| **Scheduler Extender (hami-scheduler)** | ||
| Runs in the cluster and is called by the default Kubernetes scheduler for Filter and Score. It holds a view of node devices and pod allocations, and decides which nodes can fit device requests. It is isolated as a separate service; compromise could affect scheduling decisions but never in-container isolation. |
You might explain that these are the traditional k8s scheduler operations of Filter and Score? Are you just saying the same thing as the next sentence?
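For readers unfamiliar with the two phases, here is a minimal sketch of what Filter and Score mean in general, assuming a single resource (free device memory) per node. The `NodeDevices` type, both function names, and the tight-fit scoring rule are hypothetical illustrations, not HAMi's actual logic.

```python
from dataclasses import dataclass


@dataclass
class NodeDevices:
    name: str
    free_memory_mib: int  # hypothetical: unallocated device memory on the node


def filter_nodes(nodes, requested_mib):
    """Filter phase: keep only nodes that can fit the device request."""
    return [n for n in nodes if n.free_memory_mib >= requested_mib]


def score_nodes(nodes, requested_mib):
    """Score phase: rank feasible nodes from 0 to 100.

    Assumed policy: a tighter fit scores higher (bin-packing style).
    """
    return {
        n.name: (100 * requested_mib // n.free_memory_mib) if n.free_memory_mib else 0
        for n in nodes
    }


nodes = [NodeDevices("a", 8192), NodeDevices("b", 4096), NodeDevices("c", 1024)]
feasible = filter_nodes(nodes, 4096)   # node "c" is filtered out
scores = score_nodes(feasible, 4096)   # the scheduler would prefer the top score
```

Filter removes infeasible nodes and Score only orders the ones that remain, so the two sentences in the row above describe separate steps.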
| **Mutating Webhook** | ||
| Runs as part of the scheduler deployment and mutates pods before admission to add 'schedulerName: hami-scheduler' to each pod it manages. It must be configured with TLS (self-signed or cert-manager). Compromise could lead to job failure. | ||
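As an illustration of the mutation described in this row, here is a minimal sketch of an `admission.k8s.io/v1` AdmissionReview response carrying a JSONPatch that sets `spec.schedulerName`. The function name `build_admission_response` is hypothetical, and the logic that decides which pods HAMi manages is omitted; this is not HAMi's actual webhook code.

```python
import base64
import json

SCHEDULER_NAME = "hami-scheduler"  # value quoted in the assessment text


def build_admission_response(review: dict) -> dict:
    """Build an AdmissionReview response that sets spec.schedulerName.

    Hypothetical helper: a real webhook would first decide whether the pod
    requests a managed device, and would serve this over TLS.
    """
    request = review["request"]
    # In JSON Patch (RFC 6902), "add" on an existing object member replaces it.
    patch = [{"op": "add", "path": "/spec/schedulerName", "value": SCHEDULER_NAME}]
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": {
            "uid": request["uid"],  # must echo the request uid
            "allowed": True,
            "patchType": "JSONPatch",
            "patch": base64.b64encode(json.dumps(patch).encode()).decode(),
        },
    }
```

Because the patch rewrites the pod's scheduler, a compromised webhook could route pods to a different scheduler or leave them unscheduled, which is the job-failure risk the row names.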
| **Device Plugins (e.g., hami-device-plugin)** | ||
| Run on each node, register device resources with the kubelet, and respond to Allocate requests. They interact with vendor drivers and host paths. Compromise could affect which devices are exposed to which pods or allow escape from resource limits if combined with a bug in in-container enforcement. |
Explaining which operations here come from other, external components would be helpful.
| ### Goals | ||
| - **Resource isolation:** Workloads cannot use more device memory or compute cores than requested and allocated; hard limits are enforced where the vendor stack supports them. | ||
| - **Multi-vendor support:** Unified interface for NVIDIA, Ascend, Cambricon, Hygon, Iluvatar, Moore Threads, Enflame, MetaX, etc., with consistent scheduling, allocation, and monitoring semantics. |
I see this less as a security goal and more as a project goal. Maybe making some security goals more explicit? I assume you don't want unauthorized folks to be able to influence scheduling / filtering, etc., right?
| - **Heterogeneous compute adaptation:** HAMi does not handle task-level heterogeneous compute adaptation; users are responsible for resolving image and device compatibility themselves. | ||
| - **Multi-node scheduling optimization:** HAMi provides pod-based scheduling optimization only and does not support multi-node jobs. | ||
Presumably, you're subject to all of the good and bad security things about k8s so many of these are out of scope either as unaddressable or already addressed.
| - **Testing:** Unit tests for device logic, scheduler, and webhook; E2E for integration. Contributors are asked to add tests for new code. | ||
| - **Code review:** HAMi employs a rigorous code review process, with multiple maintainers from different organizations and automated checks, as well as AI assistance. This ensures high standards of code quality and security. | ||
| - **AI assistance:** Contributors must disclose AI assistance in PRs; undisclosed use is not acceptable. | ||
| - **Contributor sign-off:** The project requires DCO or signed commits; governance and the maintainer list are documented in [community](https://github.com/Project-HAMi/community). |
I'd recommend generating SBOMs and/or attestations. GitHub / gittuf, etc. code signing and branch protection best practices would be welcome as well.
Security self-assessment for the HAMi project
HAMi is a CNCF sandbox project and is applying for incubation; the related issue is #1775