IOMMU, SMMU, DMA and performance
Set cma=0, iommu.passthrough=1 and iommu.strict=0 when testing device performance, unless the device cannot work with these settings or security is a concern.
- Disable CMA: set cma=0 on the kernel command line. Before 5.10 (see the per-numa-cma patch), there is only one default CMA reserve. This can hurt performance on NUMA systems if DMA buffers are mapped from CMA on a remote node.
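Whether CMA is actually disabled can be verified from /proc; a quick sketch (the Cma* fields appear in /proc/meminfo on kernels built with CMA support):

```shell
# Confirm cma=0 took effect: CmaTotal should read 0 kB
# (the Cma* lines are absent entirely if CMA is compiled out).
grep -i '^Cma' /proc/meminfo
cat /proc/cmdline        # verify cma=0 is present on the command line
```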
- Bypass the SMMU for DMA translation (set iommu.passthrough=1) when performance matters more than security. Note that iommu.passthrough=1 only affects the DMA mappings of kernel drivers. For user-space drivers based on VFIO, DMA mapping still goes through the SMMU even with iommu.passthrough=1 (explained below). VFIO is used by DPDK, SPDK and KVM device assignment.
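The translation mode a device actually received can be read back from sysfs; a sketch, assuming a kernel recent enough to expose a per-group type file under /sys/kernel/iommu_groups:

```shell
# List the domain type of every IOMMU group. With iommu.passthrough=1,
# kernel-driver devices report "identity" (bypass), while a group bound
# to vfio-pci still reports a translating "DMA" / "DMA-FQ" domain.
for g in /sys/kernel/iommu_groups/*/; do
    printf 'group %s: %s\n' "$(basename "$g")" "$(cat "${g}type" 2>/dev/null)"
done
```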
- Enable IOMMU lazy mode by setting iommu.strict=0, if security is not a major concern. This improves IO performance by reducing SMMU TLB-invalidation (TLBI) overhead.
To summarize:
- When testing kernel driver performance, set cma=0 and iommu.passthrough=1
- When testing the performance of a user-space driver based on VFIO (DPDK, SPDK, SR-IOV), set iommu.strict=0
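On a yum-based system like the one used in the tests below, these options can be made persistent with grubby (a sketch; other distros edit GRUB_CMDLINE_LINUX in /etc/default/grub instead):

```shell
# Append the recommended options to every installed kernel entry,
# then reboot and confirm they took effect.
grubby --update-kernel=ALL --args="cma=0 iommu.passthrough=1 iommu.strict=0"
reboot
# after reboot:
cat /proc/cmdline
```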
- Boot the kernel with option iommu.passthrough=1
- Bind an NVMe disk to vfio-pci:
# modprobe vfio-pci
# echo "0003:04:00.0" > /sys/bus/pci/devices/0003\:04\:00.0/driver/unbind
# echo 144d a808 > /sys/bus/pci/drivers/vfio-pci/new_id
# ls /dev/vfio/
47 vfio
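The bind steps above can be generalized so the vendor/device IDs are read from sysfs instead of hard-coded (a sketch; the BDF is the same example device):

```shell
#!/bin/sh
# Detach a PCI device from its kernel driver and hand it to vfio-pci.
BDF=0003:04:00.0                                  # example NVMe device
modprobe vfio-pci
echo "$BDF" > "/sys/bus/pci/devices/$BDF/driver/unbind"
VEN=$(cat "/sys/bus/pci/devices/$BDF/vendor")     # e.g. 0x144d
DEV=$(cat "/sys/bus/pci/devices/$BDF/device")     # e.g. 0xa808
echo "${VEN#0x} ${DEV#0x}" > /sys/bus/pci/drivers/vfio-pci/new_id
ls /dev/vfio/                                     # the group char device appears here
```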
- Perform IO on the NVMe disk using the QEMU user-space NVMe driver (refer to: https://events.static.linuxfound.org/sites/events/files/slides/Userspace%20NVMe%20driver%20in%20QEMU%20-%20Fam%20Zheng_0.pdf)
# qemu-io -c 'read 0 1G' nvme://0003:04:00.0/1
WARNING: Image format was not specified for 'nvme://0003:04:00.0/1' and probing guessed raw.
Automatically detecting the format is dangerous for raw images, write operations on block 0 will be restricted.
Specify the 'raw' format explicitly to remove the restrictions.
read 1073741824/1073741824 bytes at offset 0
1 GiB, 1 ops; 0:00:06.81 (150.337 MiB/sec and 0.1468 ops/sec)
- While the IO from the previous step is running, use an eBPF tool to inspect kernel stacks. We can see that QEMU's user-space NVMe driver triggers arm_smmu_* functions:
# yum install bcc
# /usr/share/bcc/tools/stackcount 'arm_smmu_*'
b'arm_smmu_tlb_inv_page_nosync'
b'__arm_lpae_unmap'
b'__arm_lpae_unmap'
b'__arm_lpae_unmap'
b'arm_lpae_unmap'
b'arm_smmu_unmap'
b'__iommu_unmap'
b'iommu_unmap_fast'
b'vfio_unmap_unpin'
b'vfio_remove_dma'
b'vfio_iommu_unmap_unpin_all'
b'vfio_iommu_type1_detach_group'
b'__vfio_group_unset_container'
b'vfio_group_try_dissolve_container'
b'vfio_group_fops_release'
b'__fput'
b'____fput'
b'task_work_run'
b'do_notify_resume'
b'work_pending'
b'[unknown]'
b'[unknown]'
b'[unknown]'
b'[unknown]'
b'[unknown]'
b'[unknown]'
b'[unknown]'
b'[unknown]'
b'[unknown]'
b'[unknown]'
262276
- Use SMMU perf events to verify the SMMU is performing translations:
# perf stat -e smmuv3_pmcg_27ffe0202/transaction/ -a sleep 1
Performance counter stats for 'system wide':
622224 smmuv3_pmcg_27ffe0202/transaction/
1.002288245 seconds time elapsed
# perf stat -e smmuv3_pmcg_27ffe0202/tlb_miss/ -a sleep 1
Performance counter stats for 'system wide':
38648 smmuv3_pmcg_27ffe0202/tlb_miss/
1.002244245 seconds time elapsed
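The two counters above give a TLB miss rate of roughly 6%; the arithmetic:

```shell
# tlb_miss / transaction, from the perf stat samples above
awk 'BEGIN { printf "TLB miss rate: %.1f%%\n", 38648 / 622224 * 100 }'
# prints: TLB miss rate: 6.2%
```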
- Boot the kernel with options iommu.passthrough=1 and iommu.strict=0
- Repeat the test above; IO performance improves noticeably:
# qemu-io -c 'read 0 1G' nvme://0003:04:00.0/1
WARNING: Image format was not specified for 'nvme://0003:04:00.0/1' and probing guessed raw.
Automatically detecting the format is dangerous for raw images, write operations on block 0 will be restricted.
Specify the 'raw' format explicitly to remove the restrictions.
read 1073741824/1073741824 bytes at offset 0
1 GiB, 1 ops; 00.39 sec (2.539 GiB/sec and 2.5392 ops/sec)
- There are more SMMU transactions per second:
# perf stat -e smmuv3_pmcg_27ffe0202/transaction/ -a sleep 1
Performance counter stats for 'system wide':
4233791 smmuv3_pmcg_27ffe0202/transaction/ <--- compared with '622224' if iommu.strict=1
1.002524565 seconds time elapsed
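Comparing the two runs numerically (strict vs lazy mode), from the figures above:

```shell
# ratios computed from the measured values on this page
awk 'BEGIN {
    printf "SMMU transactions/s: %.1fx\n", 4233791 / 622224
    printf "read throughput:     %.1fx\n", (2.539 * 1024) / 150.337
}'
# prints:
# SMMU transactions/s: 6.8x
# read throughput:     17.3x
```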
Relevant kernel parameters, from https://www.kernel.org/doc/html/latest/admin-guide/kernel-parameters.html:
iommu.passthrough=
[ARM64, X86] Configure DMA to bypass the IOMMU by default.
Format: { "0" | "1" }
0 - Use IOMMU translation for DMA.
1 - Bypass the IOMMU for DMA.
unset - Use value of CONFIG_IOMMU_DEFAULT_PASSTHROUGH.
iommu.strict= [ARM64] Configure TLB invalidation behaviour
Format: { "0" | "1" }
0 - Lazy mode.
Request that DMA unmap operations use deferred
invalidation of hardware TLBs, for increased
throughput at the cost of reduced device isolation.
Will fall back to strict mode if not supported by
the relevant IOMMU driver.
1 - Strict mode (default).
DMA unmap operations invalidate IOMMU hardware TLBs
synchronously.
cma=nn[MG]@[start[MG][-end[MG]]]
[KNL,CMA]
Sets the size of kernel global memory area for
contiguous memory allocations and optionally the
placement constraint by the physical address range of
memory allocations. A value of 0 disables CMA
altogether. For more information, see
kernel/dma/contiguous.c
cma_pernuma=nn[MG]
[ARM64,KNL,CMA]
Sets the size of kernel per-numa memory area for
contiguous memory allocations. A value of 0 disables
per-numa CMA altogether. If this option is not
specified, the default value is 0.
With per-numa CMA enabled, DMA users on node nid will
first try to allocate buffers from the per-numa area
located on node nid; if that allocation fails, they
fall back to the global default memory area.