iommu smmu dma and performance
When testing device performance, set cma=0, iommu.passthrough=1 and iommu.strict=0, unless the device cannot work with these settings or there is a security concern.
- Disable CMA: set cma=0 on the kernel command line. Before 5.10 (see the per-numa-cma patch), only one default CMA area is reserved. On a NUMA system this may hurt performance if DMA buffers are allocated from CMA.
- Bypass the SMMU for DMA translation (set iommu.passthrough=1) when performance is more important than security. Note that iommu.passthrough=1 only affects DMA mapping done by kernel drivers. For user space drivers based on VFIO, DMA mapping still goes through the SMMU even with iommu.passthrough=1 (explained below). VFIO is used by DPDK, SPDK and KVM device assignment.
- Enable IOMMU lazy mode by setting iommu.strict=0 if security is not a major concern. This can improve IO performance by reducing SMMU TLB invalidation (TLBI) overhead.
To summarize:
- When testing kernel driver performance, set cma=0 and iommu.passthrough=1
- When testing performance of user space driver based on VFIO (DPDK, SPDK, SR-IOV), set iommu.strict=0
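To confirm which of these options the running kernel was actually booted with, the kernel command line can be inspected. The following is a minimal sketch: the function name show_dma_params is hypothetical, and it only reports what is present on the command line. An option reported as unset falls back to the kernel's built-in default, which this sketch does not try to determine.

```shell
#!/bin/sh
# Report cma, iommu.passthrough and iommu.strict as found on a kernel
# command line. Pass the command line as $1 (e.g. the content of /proc/cmdline).
# show_dma_params is a hypothetical helper name, not an existing tool.
show_dma_params() {
    cmdline=$1
    for key in cma iommu.passthrough iommu.strict; do
        value=unset
        # Walk the space-separated words; keep the last occurrence if repeated.
        for word in $cmdline; do
            case $word in
                "$key="*) value=${word#"$key="} ;;
            esac
        done
        echo "$key=$value"
    done
}
```

On a running system it could be called as: show_dma_params "$(cat /proc/cmdline)"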
The following test shows that VFIO DMA still goes through the SMMU when iommu.passthrough=1:
- Boot the kernel with option: iommu.passthrough=1
- Bind an NVMe disk to VFIO
# modprobe vfio-pci
# echo "0003:04:00.0" > /sys/bus/pci/devices/0003\:04\:00.0/driver/unbind
# echo 144d a808 > /sys/bus/pci/drivers/vfio-pci/new_id
# ls /dev/vfio/
47 vfio
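The 47 listed under /dev/vfio/ is the IOMMU group number of the device. A small sketch for looking it up from the PCI address follows. The function name iommu_group_of and the SYSFS_ROOT override are hypothetical, introduced here only so the function can be tried against a test directory.

```shell
#!/bin/sh
# Print the IOMMU group number of a PCI device given its address (BDF).
# SYSFS_ROOT defaults to /sys; the override exists only for testing.
# iommu_group_of is a hypothetical helper name.
iommu_group_of() {
    bdf=$1
    link="${SYSFS_ROOT:-/sys}/bus/pci/devices/$bdf/iommu_group"
    if [ ! -L "$link" ]; then
        echo "no IOMMU group for $bdf" >&2
        return 1
    fi
    # iommu_group is a symlink ending in .../kernel/iommu_groups/<N>
    basename "$(readlink "$link")"
}
```

For the disk used here, iommu_group_of 0003:04:00.0 would be expected to print the same number that appears under /dev/vfio/.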
- Perform IO on the NVMe disk using the QEMU user space NVMe driver (refer to: https://events.static.linuxfound.org/sites/events/files/slides/Userspace%20NVMe%20driver%20in%20QEMU%20-%20Fam%20Zheng_0.pdf)
# qemu-io -c 'read 0 1G' nvme://0003:04:00.0/1
WARNING: Image format was not specified for 'nvme://0003:04:00.0/1' and probing guessed raw.
Automatically detecting the format is dangerous for raw images, write operations on block 0 will be restricted.
Specify the 'raw' format explicitly to remove the restrictions.
read 1073741824/1073741824 bytes at offset 0
1 GiB, 1 ops; 0:00:06.81 (150.337 MiB/sec and 0.1468 ops/sec)
- While the qemu-io read above is running, use an eBPF tool to inspect kernel stacks. We see that the QEMU NVMe driver, through VFIO, ends up calling arm_smmu_* functions.
# yum install bcc
# /usr/share/bcc/tools/stackcount 'arm_smmu_*'
b'arm_smmu_tlb_inv_page_nosync'
b'__arm_lpae_unmap'
b'__arm_lpae_unmap'
b'__arm_lpae_unmap'
b'arm_lpae_unmap'
b'arm_smmu_unmap'
b'__iommu_unmap'
b'iommu_unmap_fast'
b'vfio_unmap_unpin'
b'vfio_remove_dma'
b'vfio_iommu_unmap_unpin_all'
b'vfio_iommu_type1_detach_group'
b'__vfio_group_unset_container'
b'vfio_group_try_dissolve_container'
b'vfio_group_fops_release'
b'__fput'
b'____fput'
b'task_work_run'
b'do_notify_resume'
b'work_pending'
b'[unknown]'
b'[unknown]'
b'[unknown]'
b'[unknown]'
b'[unknown]'
b'[unknown]'
b'[unknown]'
b'[unknown]'
b'[unknown]'
b'[unknown]'
262276
- Use SMMU perf events to verify that the SMMU is working:
# perf stat -e smmuv3_pmcg_27ffe0202/transaction/ -a sleep 1
Performance counter stats for 'system wide':
622224 smmuv3_pmcg_27ffe0202/transaction/
1.002288245 seconds time elapsed
# perf stat -e smmuv3_pmcg_27ffe0202/tlb_miss/ -a sleep 1
Performance counter stats for 'system wide':
38648 smmuv3_pmcg_27ffe0202/tlb_miss/
1.002244245 seconds time elapsed
- Boot the kernel with options iommu.passthrough=1 and iommu.strict=0
- Repeat the test above. IO performance improves:
# qemu-io -c 'read 0 1G' nvme://0003:04:00.0/1
WARNING: Image format was not specified for 'nvme://0003:04:00.0/1' and probing guessed raw.
Automatically detecting the format is dangerous for raw images, write operations on block 0 will be restricted.
Specify the 'raw' format explicitly to remove the restrictions.
read 1073741824/1073741824 bytes at offset 0
1 GiB, 1 ops; 00.39 sec (2.539 GiB/sec and 2.5392 ops/sec)
- There are more SMMU transactions per second when iommu.strict=0:
# perf stat -e smmuv3_pmcg_27ffe0202/transaction/ -a sleep 1
Performance counter stats for 'system wide':
4233791 smmuv3_pmcg_27ffe0202/transaction/ <--- compared with '622224' if iommu.strict=1
1.002524565 seconds time elapsed
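To put the two runs side by side, the ratios can be computed from the numbers printed above. This is a minimal sketch: the speedup helper is a hypothetical name, and the inputs are copied from the qemu-io and perf outputs of this test, with the first run taken as iommu.strict=1 (the default) and the second as iommu.strict=0.

```shell
#!/bin/sh
# Print new/old as a speedup factor with one decimal.
# speedup is a hypothetical helper name.
speedup() {
    awk -v old="$1" -v new="$2" 'BEGIN { printf "%.1fx\n", new / old }'
}

# Read throughput in MiB/s: 150.337 MiB/s vs 2.539 GiB/s (2.539 * 1024 = 2599.936)
speedup 150.337 2599.936
# SMMU transactions per second: 622224 vs 4233791
speedup 622224 4233791
```

This prints 17.3x for read throughput and 6.8x for SMMU transactions per second.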
Passthrough and bypass are confusing terms when talking about the SMMU (IOMMU). Quoting the original patch: https://lists.linuxfoundation.org/pipermail/iommu/2017-March/020818.html
The IOMMU core currently initialises the default domain for each group
to IOMMU_DOMAIN_DMA, under the assumption that devices will use
IOMMU-backed DMA ops by default. However, in some cases it is desirable
for the DMA ops to bypass the IOMMU for performance reasons
We see that the iommu.passthrough option sets the default domain type to IOMMU_DOMAIN_IDENTITY (iommu.passthrough=1) or IOMMU_DOMAIN_DMA (iommu.passthrough=0).
+static int __init iommu_set_def_domain_type(char *str)
+{
+ bool pt;
+
+ if (!str || strtobool(str, &pt))
+ return -EINVAL;
+
+ iommu_def_domain_type = pt ? IOMMU_DOMAIN_IDENTITY : IOMMU_DOMAIN_DMA;
+ return 0;
+}
The default domain is used by the kernel DMA API:
https://elixir.bootlin.com/linux/v5.12.8/source/drivers/iommu/dma-iommu.c#L1291
/*
* The IOMMU core code allocates the default DMA domain, which the underlying
* IOMMU driver needs to support via the dma-iommu layer.
*/
void iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 size)
{
struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
if (!domain)
goto out_err;
/*
* The IOMMU core code allocates the default DMA domain, which the
* underlying IOMMU driver needs to support via the dma-iommu layer.
*/
if (domain->type == IOMMU_DOMAIN_DMA) {
/* Note: if domain->type != IOMMU_DOMAIN_DMA, dev->dma_ops stays NULL and the DMA API takes the dma_map_direct() path. */
if (iommu_dma_init_domain(domain, dma_base, size, dev))
goto out_err;
dev->dma_ops = &iommu_dma_ops;
}
return;
out_err:
pr_warn("Failed to set up IOMMU for device %s; retaining platform DMA ops\n",
dev_name(dev));
}
However, VFIO allocates its own domain with type IOMMU_DOMAIN_UNMANAGED:
struct iommu_domain *iommu_domain_alloc(struct bus_type *bus)
{
return __iommu_domain_alloc(bus, IOMMU_DOMAIN_UNMANAGED);
}
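To summarize, the default domain is used by the kernel DMA API, and iommu.passthrough only changes the type of that default domain. VFIO allocates its own domain with type IOMMU_DOMAIN_UNMANAGED, so its DMA mappings still go through the SMMU regardless of iommu.passthrough.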
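On kernels that expose a type attribute under /sys/kernel/iommu_groups/<N>/ (an assumption: the attribute is described in the sysfs ABI documentation for newer kernels and may be absent on older ones), the default domain type of a group can be read directly. The sketch below uses a hypothetical function name, iommu_group_type, and a hypothetical SYSFS_ROOT override for testing. As far as we can tell, the attribute reports the default domain of the group and not a domain that VFIO attaches later.

```shell
#!/bin/sh
# Print the default domain type of an IOMMU group, e.g. "DMA" or "identity".
# SYSFS_ROOT defaults to /sys; the override exists only for testing.
# iommu_group_type is a hypothetical helper name.
iommu_group_type() {
    group=$1
    file="${SYSFS_ROOT:-/sys}/kernel/iommu_groups/$group/type"
    if [ ! -r "$file" ]; then
        echo "group $group has no readable type attribute" >&2
        return 1
    fi
    cat "$file"
}
```

With iommu.passthrough=1, iommu_group_type 47 would be expected to print identity for the group used in the test above, which matches IOMMU_DOMAIN_IDENTITY being chosen as the default domain type.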
The relevant kernel parameters, quoted from https://www.kernel.org/doc/html/latest/admin-guide/kernel-parameters.html:
iommu.passthrough=
[ARM64, X86] Configure DMA to bypass the IOMMU by default.
Format: { "0" | "1" }
0 - Use IOMMU translation for DMA.
1 - Bypass the IOMMU for DMA.
unset - Use value of CONFIG_IOMMU_DEFAULT_PASSTHROUGH.
iommu.strict= [ARM64] Configure TLB invalidation behaviour
Format: { "0" | "1" }
0 - Lazy mode.
Request that DMA unmap operations use deferred
invalidation of hardware TLBs, for increased
throughput at the cost of reduced device isolation.
Will fall back to strict mode if not supported by
the relevant IOMMU driver.
1 - Strict mode (default).
DMA unmap operations invalidate IOMMU hardware TLBs
synchronously.
cma=nn[MG]@[start[MG][-end[MG]]]
[KNL,CMA]
Sets the size of kernel global memory area for
contiguous memory allocations and optionally the
placement constraint by the physical address range of
memory allocations. A value of 0 disables CMA
altogether. For more information, see
kernel/dma/contiguous.c
cma_pernuma=nn[MG]
[ARM64,KNL,CMA]
Sets the size of kernel per-numa memory area for
contiguous memory allocations. A value of 0 disables
per-numa CMA altogether. And If this option is not
specificed, the default value is 0.
With per-numa CMA enabled, DMA users on node nid will
first try to allocate buffer from the pernuma area
which is located in node nid, if the allocation fails,
they will fallback to the global default memory area.