
Add HAMi security self-assessment #2083

Open
archlitchi wants to merge 1 commit into cncf:main from archlitchi:main

Conversation

@archlitchi

Security self-assessment for the HAMi project.

HAMi is a CNCF sandbox project applying for incubation; the related issue is #1775.

Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>
@archlitchi archlitchi requested a review from a team as a code owner March 11, 2026 09:45
Contributor

@JustinCappos JustinCappos left a comment

This looks good overall. I'm submitting a few basic improvement asks.

Let me know if you want to use this to help proceed to a Joint Assessment and I'll help to get that process started. This will be a moderate amount of work on top of what you've done here.

**Kubernetes API Server**: While HAMi operates within Kubernetes, it uses the Kubernetes API server to handle requests and responses between users and the HAMi scheduler. It is a critical interface and thus a separate actor due to its role in processing and validating job submissions.

**Scheduler Extender (hami-scheduler)**
Runs in the cluster and is called by the default Kubernetes scheduler for Filter and Score. It holds a view of node devices and pod allocations, and decides which nodes can fit device requests. It is isolated as a separate service; compromise could affect scheduling decisions but not in-container isolation.
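For context, the extender's Filter step boils down to a function over candidate nodes. A minimal sketch, using simplified stand-ins for the Kubernetes extender API types (the real ones live in k8s.io/kube-scheduler/extender/v1; the field names, `filter` function, and memory-only criterion here are illustrative, not HAMi's actual logic):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Simplified stand-ins for the scheduler-extender request/response types.
type ExtenderArgs struct {
	PodName   string   `json:"podName"`
	NodeNames []string `json:"nodeNames"`
}

type ExtenderFilterResult struct {
	NodeNames   []string          `json:"nodeNames"`
	FailedNodes map[string]string `json:"failedNodes"`
}

// filter keeps only nodes whose free device memory covers the request,
// recording a reason for each node that is filtered out.
func filter(args ExtenderArgs, freeMemMB map[string]int, requestMB int) ExtenderFilterResult {
	res := ExtenderFilterResult{FailedNodes: map[string]string{}}
	for _, n := range args.NodeNames {
		if freeMemMB[n] >= requestMB {
			res.NodeNames = append(res.NodeNames, n)
		} else {
			res.FailedNodes[n] = "insufficient device memory"
		}
	}
	return res
}

func main() {
	args := ExtenderArgs{PodName: "demo", NodeNames: []string{"node-a", "node-b"}}
	free := map[string]int{"node-a": 4096, "node-b": 1024}
	out, _ := json.Marshal(filter(args, free, 2048))
	fmt.Println(string(out))
}
```

The real hami-scheduler exposes this over HTTP as an extender webhook that the default scheduler calls; the sketch only shows the filtering decision itself.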
Contributor

You might explain that these are the traditional k8s scheduler operations of Filter and Score? Are you just saying the same thing as the next sentence?

> Runs in the cluster and is called by the default Kubernetes scheduler for Filter and Score. It holds a view of node devices and pod allocations, and decides which nodes can fit device requests. It is isolated as a separate service; compromise could affect scheduling decisions but never in-container isolation.

**Mutating Webhook**
Runs as part of the scheduler deployment and mutates pods before admission, adding 'schedulerName: hami-scheduler' to each pod it manages. It must be configured with TLS (self-signed or via cert-manager). Compromise could lead to job failure.
Contributor

Or SPIFFE / SPIRE?

> Running as part of the scheduler deployment and mutates pods before admission to add 'schedulerName: hami-scheduler' to each pod it manages. It must be configured with TLS (self-signed or cert-manager). Compromise could lead to job failure.

**Device Plugins (e.g., hami-device-plugin)**
Run on each node, register device resources with the kubelet, and respond to Allocate requests. They interact with vendor drivers and host paths. Compromise could affect which devices are exposed to which pods or allow escape from resource limits if combined with a bug in in-container enforcement.
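The Allocate step can be pictured as mapping a device request to the environment variables and mounts injected into the container. A minimal sketch, assuming a HAMi-style in-container enforcement layer; the struct, env-var names, and shim path below are illustrative, not HAMi's confirmed interface (real plugins implement the kubelet's gRPC DevicePlugin service from k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1):

```go
package main

import (
	"fmt"
	"strings"
)

// Simplified stand-in for a device-plugin Allocate response.
type AllocateResponse struct {
	Envs   map[string]string // injected into the container
	Mounts map[string]string // hostPath -> containerPath
}

// allocate exposes the granted devices and records per-container limits
// for the in-container enforcement layer to apply.
func allocate(deviceIDs []string, memLimitMB, corePercent int) AllocateResponse {
	return AllocateResponse{
		Envs: map[string]string{
			// Which physical devices the container may see.
			"NVIDIA_VISIBLE_DEVICES": strings.Join(deviceIDs, ","),
			// Hypothetical limit variables read by the enforcement shim.
			"GPU_MEMORY_LIMIT_MB": fmt.Sprint(memLimitMB),
			"GPU_CORE_LIMIT_PCT":  fmt.Sprint(corePercent),
		},
		Mounts: map[string]string{
			// Hypothetical path for the shim library that intercepts driver calls.
			"/usr/local/hami/libvgpu.so": "/usr/local/hami/libvgpu.so",
		},
	}
}

func main() {
	resp := allocate([]string{"GPU-0"}, 2048, 50)
	fmt.Println(resp.Envs["NVIDIA_VISIBLE_DEVICES"], resp.Envs["GPU_MEMORY_LIMIT_MB"])
}
```

This is also where the threat surface noted above lives: the response controls exactly which host devices and paths each pod receives.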
Contributor

Explaining which operations here come from other, external components would be helpful

### Goals

- **Resource isolation:** Workloads cannot use more device memory or compute cores than requested and allocated; enforcement is hard where supported by the vendor stack.
- **Multi-vendor support:** Unified interface for NVIDIA, Ascend, Cambricon, Hygon, Iluvatar, Moore Threads, Enflame, MetaX, etc., with consistent scheduling, allocation, and monitoring semantics.
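For a concrete picture of the resource-isolation goal, a pod requests a slice of a device and the limits are enforced per container. A sketch of such a spec, using resource names patterned on HAMi's documented NVIDIA examples (`nvidia.com/gpumem`, `nvidia.com/gpucores`); the exact names are configuration-dependent and the image is hypothetical:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-demo            # hypothetical example pod
spec:
  containers:
  - name: cuda-app
    image: cuda-app:latest  # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1        # one virtual GPU slice
        nvidia.com/gpumem: 3000  # hard cap: 3000 MiB of device memory
        nvidia.com/gpucores: 30  # hard cap: 30% of device compute
```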
Contributor

I see this less as a security goal and more as a project goal. Maybe making some security goals more explicit? I assume you don't want unauthorized folks to be able to influence scheduling / filtering, etc., right?


### Non-goals

- **Heterogeneous compute adaptation:** HAMi does not handle task-level heterogeneous compute adaptation; users are responsible for resolving image and device compatibility themselves.
- **Multi-node scheduling optimization:** HAMi provides pod-based scheduling optimization only and does not support multi-node jobs.

Contributor

Presumably, you're subject to all of the good and bad security things about k8s so many of these are out of scope either as unaddressable or already addressed.

- **Testing:** Unit tests for device logic, scheduler, and webhook; E2E for integration. Contributors are asked to add tests for new code.
- **Code review:** HAMi employs a rigorous code review process, with multiple maintainers from different organizations and automated checks, as well as AI assistance. This ensures high standards of code quality and security.
- **AI assistance:** Contributors must disclose AI assistance in PRs; undisclosed use is not acceptable.
- **Contributor sign-off:** The project requires DCO sign-off or signed commits; governance and the maintainer list are documented in [community](https://github.com/Project-HAMi/community).
Contributor

I'd recommend generating SBOMs and/or attestations. GitHub / gittuf, etc. code-signing / protection best practices would be welcome as well.
