
Add HAMi security self-assessment #2083

Open
archlitchi wants to merge 1 commit into cncf:main from archlitchi:main

Conversation

@archlitchi

Security self-assessment for the HAMi project.

HAMi is a CNCF sandbox project applying for incubation; the related issue is #1775.

Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>
@archlitchi archlitchi requested a review from a team as a code owner March 11, 2026 09:45
Contributor

@JustinCappos JustinCappos left a comment

This looks good overall. I'm submitting a few basic improvement asks.

Let me know if you want to use this to help proceed to a Joint Assessment and I'll help to get that process started. This will be a moderate amount of work on top of what you've done here.

**Kubernetes API Server**: While HAMi operates within Kubernetes, it uses the Kubernetes API server to handle requests and responses between users and the HAMi scheduler. It is a critical interface and thus a separate actor due to its role in processing and validating job submissions.

**Scheduler Extender (hami-scheduler)**
Runs in the cluster and is called by the default Kubernetes scheduler for Filter and Score. It holds a view of node devices and pod allocations, and decides which nodes can fit device requests. It is isolated as a separate service; compromise could affect scheduling decisions but not in-container isolation.
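For context, the extender's Filter step boils down to a function over candidate nodes. A minimal sketch, using simplified stand-ins for the Kubernetes extender API types (the real ones live in k8s.io/kube-scheduler/extender/v1; the field names, `filter` function, and memory-only criterion here are illustrative, not HAMi's actual logic):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Simplified stand-ins for the scheduler-extender request/response types.
type ExtenderArgs struct {
	PodName   string   `json:"podName"`
	NodeNames []string `json:"nodeNames"`
}

type ExtenderFilterResult struct {
	NodeNames   []string          `json:"nodeNames"`
	FailedNodes map[string]string `json:"failedNodes"`
}

// filter keeps only nodes whose free device memory covers the request,
// recording a reason for each node that is filtered out.
func filter(args ExtenderArgs, freeMemMB map[string]int, requestMB int) ExtenderFilterResult {
	res := ExtenderFilterResult{FailedNodes: map[string]string{}}
	for _, n := range args.NodeNames {
		if freeMemMB[n] >= requestMB {
			res.NodeNames = append(res.NodeNames, n)
		} else {
			res.FailedNodes[n] = "insufficient device memory"
		}
	}
	return res
}

func main() {
	args := ExtenderArgs{PodName: "demo", NodeNames: []string{"node-a", "node-b"}}
	free := map[string]int{"node-a": 4096, "node-b": 1024}
	out, _ := json.Marshal(filter(args, free, 2048))
	fmt.Println(string(out))
}
```

The real hami-scheduler exposes this over HTTP as an extender webhook that the default scheduler calls; the sketch only shows the filtering decision itself.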
Contributor

You might explain that these are the traditional k8s scheduler operations of Filter and Score? Are you just saying the same thing as the next sentence?

> Runs in the cluster and is called by the default Kubernetes scheduler for Filter and Score. It holds a view of node devices and pod allocations, and decides which nodes can fit device requests. It is isolated as a separate service; compromise could affect scheduling decisions but never in-container isolation.

**Mutating Webhook**
Runs as part of the scheduler deployment and mutates pods before admission, adding 'schedulerName: hami-scheduler' to each pod it manages. It must be configured with TLS (self-signed or via cert-manager). Compromise could lead to job failure.
Contributor

Or SPIFFE / SPIRE?

> Running as part of the scheduler deployment and mutates pods before admission to add 'schedulerName: hami-scheduler' to each pod it manages. It must be configured with TLS (self-signed or cert-manager). Compromise could lead to job failure.

**Device Plugins (e.g., hami-device-plugin)**
Run on each node, register device resources with the kubelet, and respond to Allocate requests. They interact with vendor drivers and host paths. Compromise could affect which devices are exposed to which pods or allow escape from resource limits if combined with a bug in in-container enforcement.
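The Allocate step can be pictured as mapping a device request to the environment variables and mounts injected into the container. A minimal sketch, assuming a HAMi-style in-container enforcement layer; the struct, env-var names, and shim path below are illustrative, not HAMi's confirmed interface (real plugins implement the kubelet's gRPC DevicePlugin service from k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1):

```go
package main

import (
	"fmt"
	"strings"
)

// Simplified stand-in for a device-plugin Allocate response.
type AllocateResponse struct {
	Envs   map[string]string // injected into the container
	Mounts map[string]string // hostPath -> containerPath
}

// allocate exposes the granted devices and records per-container limits
// for the in-container enforcement layer to apply.
func allocate(deviceIDs []string, memLimitMB, corePercent int) AllocateResponse {
	return AllocateResponse{
		Envs: map[string]string{
			// Which physical devices the container may see.
			"NVIDIA_VISIBLE_DEVICES": strings.Join(deviceIDs, ","),
			// Hypothetical limit variables read by the enforcement shim.
			"GPU_MEMORY_LIMIT_MB": fmt.Sprint(memLimitMB),
			"GPU_CORE_LIMIT_PCT":  fmt.Sprint(corePercent),
		},
		Mounts: map[string]string{
			// Hypothetical path for the shim library that intercepts driver calls.
			"/usr/local/hami/libvgpu.so": "/usr/local/hami/libvgpu.so",
		},
	}
}

func main() {
	resp := allocate([]string{"GPU-0"}, 2048, 50)
	fmt.Println(resp.Envs["NVIDIA_VISIBLE_DEVICES"], resp.Envs["GPU_MEMORY_LIMIT_MB"])
}
```

This is also where the threat surface noted above lives: the response controls exactly which host devices and paths each pod receives.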
Contributor

Explaining which operations here come from other, external components would be helpful

### Goals

- **Resource isolation:** Workloads cannot use more device memory or compute cores than requested and allocated; enforcement is hard where supported by the vendor stack.
- **Multi-vendor support:** Unified interface for NVIDIA, Ascend, Cambricon, Hygon, Iluvatar, Moore Threads, Enflame, MetaX, etc., with consistent scheduling, allocation, and monitoring semantics.
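For a concrete picture of the resource-isolation goal, a pod requests a slice of a device and the limits are enforced per container. A sketch of such a spec, using resource names patterned on HAMi's documented NVIDIA examples (`nvidia.com/gpumem`, `nvidia.com/gpucores`); the exact names are configuration-dependent and the image is hypothetical:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-demo            # hypothetical example pod
spec:
  containers:
  - name: cuda-app
    image: cuda-app:latest  # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1        # one virtual GPU slice
        nvidia.com/gpumem: 3000  # hard cap: 3000 MiB of device memory
        nvidia.com/gpucores: 30  # hard cap: 30% of device compute
```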
Contributor

I see this less as a security goal and more as a project goal. Maybe making some security goals more explicit? I assume you don't want unauthorized folks to be able to influence scheduling / filtering, etc., right?


### Non-goals

- **Heterogeneous compute adaptation:** HAMi does not handle task-level heterogeneous compute adaptation; users are responsible for resolving image and device compatibility themselves.
- **Multi-node scheduling optimization:** HAMi provides pod-based scheduling optimization only and does not support multi-node jobs.

Contributor

Presumably, you're subject to all of the good and bad security things about k8s so many of these are out of scope either as unaddressable or already addressed.

- **Testing:** Unit tests for device logic, scheduler, and webhook; E2E for integration. Contributors are asked to add tests for new code.
- **Code review:** HAMi employs a rigorous code review process, with multiple maintainers from different organizations and automated checks, as well as AI assistance. This ensures high standards of code quality and security.
- **AI assistance:** Contributors must disclose AI assistance in PRs; undisclosed use is not acceptable.
- **Contributor sign-off:** The project requires DCO sign-off or signed commits; governance and the maintainer list are documented in [community](https://github.com/Project-HAMi/community).
Contributor

I'd recommend generating SBOMs and/or attestations. GitHub / gittuf, etc. code-signing / protection best practices would be welcome as well.
