[RFE] Network mode equivalent to libvirt's hostdev. #25511
It is not clear to me how these interfaces should be managed by us; the docs you link show how a user would manage them. But how would podman know which interface to use? And how would this look in actual netlink calls? Does the VF create a new interface in the host network namespace that we then just have to move? Or how is the interface actually being created?
Design-wise this will not make sense for us.
And for that you can already write your own plugin: https://github.com/containers/netavark/blob/main/plugin-API.md If this is a common use case we can consider adding it to the main netavark, but right now I don't think this is a common use case, so I'd rather not, especially because this needs special hardware to test, which makes it likely impossible to test in CI and hard to get even for us maintainers.
Yes, the virtual function creates a new interface on the host network which can be treated the same as a physical interface. The difference between a virtual function and a physical function is that the virtual functions can be dynamically created and destroyed via the device driver. e.g. (Irrelevant network devices removed for brevity.)
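For illustration, a minimal sketch (Python, stdlib only) of how VFs are typically created and destroyed through the kernel's sysfs interface; the PF name enp12s0f3 is only a placeholder and the NIC driver must support SR-IOV:

```python
# Minimal sketch: creating SR-IOV virtual functions through sysfs.
# Assumes a physical function named "enp12s0f3" (placeholder) whose driver
# supports SR-IOV; must run as root.
from pathlib import Path

pf = "enp12s0f3"
dev = Path(f"/sys/class/net/{pf}/device")

total = int((dev / "sriov_totalvfs").read_text())  # maximum VFs the hardware offers
print(f"{pf} supports up to {total} VFs")

# Writing a count creates that many VFs; the driver then registers new host
# interfaces (e.g. enp12s0f3v0, enp12s0f3v1, ...) that behave like normal NICs.
(dev / "sriov_numvfs").write_text("0")  # the count must be reset to 0 before changing it
(dev / "sriov_numvfs").write_text("4")
```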
Thank you for linking this, that plugin looks useful for testing. Automatic allocation of virtual functions is a nice-to-have, but host device networking is the key component. I suppose netavark is being used to provide the DHCP and other services required to configure the network link?
I believe this functionality is used frequently in enterprise VM deployments, although DPUs are replacing some use cases. It's also useful for certain network types which don't have, or historically haven't had, paravirtual drivers, e.g. InfiniBand. Most enterprise NICs I have seen support this functionality. (e.g. Looking at a random Dell server, all the NICs except the Broadcom 5720 support SR-IOV: https://www.dell.com/en-us/shop/servers-storage-and-networking/poweredge-r660xs-rack-server/spd/poweredge-r660xs/pe_r660xs_tm_vi_vp_sb) It is also being used for GPU/TPU virtualization now, which could be relevant to graphical or AI-accelerated containers. Unfortunately, getting access to this functionality on a reasonable budget is a lot harder in the current GPU market. (The Intel i350-T2 supports SR-IOV and is currently $70 on Amazon.) I can still see availability for maintainers being an issue, since I believe it would require something like a Threadripper Pro or other workstation CPU/motherboard for SR-IOV networking to be standard. ASRock tends to unofficially enable SR-IOV on their AMD Ryzen boards (it's how I am able to use it at home), but this is not a reliable solution.
EDIT: I didn't mean to close this, but I do not object if it is closed as out of scope.
I would appreciate it if someone with the suitable permissions would reopen this; I closed it by mistake. I understand if it is not a priority, but I think this feature would be valuable for containerizing any software which requires high-throughput or low-latency networking but which shouldn't have full host network access. e.g. It would potentially allow migrating certain firewall or virtual switch VMs to containers. (I have a particular interest in running openvswitch inside a container without sacrificing performance or granting access to the host network.)
We can keep this open in case other users would like to voice interest in this. To be clear, I am not strictly against having this in the main netavark, but I would like to see what such code would look like first before making a final decision (a working plugin could show that). In general, if this is just about moving an interface into the namespace as I linked, then I think that is something that can be supported easily, but I am not sure how we would manage PF/VF functionality. If there is a simple design to do that, sure, I am happy to consider it. Also note that netavark's focus is obviously networking only. For GPU and other hardware modules I would think CDI specs https://github.com/cncf-tags/container-device-interface are used.
I think the best way to handle PF/VF functionality would be to allow assigning multiple functions when creating a network, then treating the resulting pool of network devices similarly to a macvlan network, but without the need for a gateway or the creation of virtual devices (since we're just reusing the virtual devices created by the driver). Containers could be assigned to a specific virtual function using a syntax similar to static IP assignment, e.g. --network net1:ip=10.89.1.5 --network net2:ip=10.89.10.10 becomes --network net1:vf=2 --network net2:vf=3. It probably should not be allowed to change the MAC address, since SR-IOV devices are usually configured to prevent guests from spoofing the MAC. (Spoofing the MAC would allow the container to impersonate the host or other containers, because all virtual functions associated with a physical function share a single port on the switch.)
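As a rough, hypothetical sketch of how such a vf= option could be resolved on the host side (the PF name and index are placeholders; the spoof-check step just shells out to iproute2):

```python
# Rough sketch: resolving a requested VF index (e.g. --network net1:vf=2) to
# the host netdev backing it, and enforcing MAC spoof protection on that VF.
# PF name and index are placeholders; must run as root.
import subprocess
from pathlib import Path

pf = "enp12s0f3"  # physical function backing the hypothetical network
vf_index = 2      # the "vf=2" requested by the user

# Each VF appears as a virtfn<N> symlink under the PF's PCI device; its net/
# subdirectory names the VF netdev while it is still in the host namespace.
netdir = Path(f"/sys/class/net/{pf}/device/virtfn{vf_index}/net")
names = [p.name for p in netdir.iterdir()] if netdir.is_dir() else []
if not names:
    raise RuntimeError(f"VF {vf_index} of {pf} has no host netdev (already assigned?)")
vf_iface = names[0]
print(f"vf={vf_index} -> {vf_iface}")

# Keep the container from impersonating other functions on the shared port.
subprocess.run(["ip", "link", "set", pf, "vf", str(vf_index), "spoofchk", "on"], check=True)
```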
Feature request description
Hardware features such as SR-IOV allow passing separate PCIe devices through to VMs for hardware-accelerated network virtualization. The same could potentially be done inside a privileged container using --network none and --device. It would be preferable for podman to support this directly in --network, both to simplify configuration and to avoid the need for elevated privileges within the container.
See #8919 and https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/9/html/configuring_and_managing_virtualization/managing-virtual-devices_configuring-and-managing-virtualization#attaching-sr-iov-networking-devices-to-virtual-machines_managing-sr-iov-devices. Macvlan is not a replacement for SR-IOV pass-through because it adds overhead and is not equivalent to a simple hostdev passthrough.
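For concreteness, a hedged sketch of what the manual workaround looks like today when driven from the host instead of from inside a privileged container; the interface and container names are placeholders:

```python
# Sketch of the manual workaround: hand an existing VF netdev to a container
# started with --network none. Interface and container names are placeholders;
# must run as root on the host.
import subprocess

vf_iface = "enp12s0f3v4"   # VF netdev created earlier by the driver
container = "mycontainer"  # hypothetical container name

# Look up the container's init PID so we can address its network namespace.
pid = subprocess.run(
    ["podman", "inspect", "--format", "{{.State.Pid}}", container],
    check=True, capture_output=True, text=True,
).stdout.strip()

# Move the VF into the container's netns, then bring it up and address it there.
subprocess.run(["ip", "link", "set", "dev", vf_iface, "netns", pid], check=True)
subprocess.run(["nsenter", "-t", pid, "-n", "ip", "link", "set", vf_iface, "up"], check=True)
subprocess.run(["nsenter", "-t", pid, "-n",
                "ip", "addr", "add", "192.0.2.10/24", "dev", vf_iface], check=True)
```

A native hostdev network mode would fold these steps into podman's normal network setup, so neither elevated privileges inside the container nor manual host-side commands would be needed.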
Suggest potential solution
Support --network hostdev:name=<adapter name, e.g. enpXXsY>,pf=<index>,vf=<index>. Potentially start by simply allowing adapter pass-through without any "intelligence" regarding physical or virtual functions, e.g. --network hostdev:name=enp12s0f3v4 to manually pass through a specific virtual function without podman needing to treat it any differently from a physical network adapter.
As a "nice to have", allow automatic allocation of physical and virtual functions if the pf and vf flags are unset. This would effectively allow dynamic allocation of virtual functions to pods. Even some decade-old NICs like the Intel X550-T2 support 126 total virtual functions (128 if none are reserved for the host), which is sufficient for many deployments to offload pod networking entirely. This also allows offloading the pod DHCP to the local router, which is desirable in some deployments.
Have you considered any alternatives?
Additional context