Commit
doc: add queue design
Signed-off-by: Frank Du <[email protected]>
frankdjx committed Dec 4, 2023
1 parent 6314a5f commit 356e845
Showing 6 changed files with 177 additions and 20 deletions.
2 changes: 1 addition & 1 deletion .markdown-lint.yml
@@ -22,7 +22,7 @@ MD004: false # Unordered list style
MD007:
indent: 2 # Unordered list indentation
MD013:
line_length: 400 # Line length 80 is far too short
line_length: 800 # Line length 80 is far too short
MD026:
punctuation: ".,;:!。,;:" # List of not allowed
MD029: false # Ordered list item prefix
106 changes: 101 additions & 5 deletions doc/design.md
@@ -84,11 +84,59 @@ The hugepages size is dependent on the workloads you wish to execute on the syst

In MTL, memory management is directly handled through DPDK's memory-related APIs, including mempool and mbuf. In fact, all internal data objects are constructed based on mbuf/mempool to ensure efficient lifecycle management.

## 4. TX Path
## 4. Data path

### 4.1 Backend layer

The library incorporates a virtual data path backend layer, designed to abstract various NIC implementations and provide a unified packet TX/RX interface to the upper network layer. It currently supports the following NIC device types:

* DPDK Poll-Mode Drivers (PMDs): these drivers fully bypass the kernel's networking stack, using the DPDK poll mode model.
* Native Linux Kernel Network Socket Stack: this option supports the full range of kernel ecosystems. Related code can be found in [mt_dp_socket.c](../lib/src/datapath/mt_dp_socket.c).
* AF_XDP (Express Data Path) with eBPF filter: AF_XDP represents a significant advancement in the Linux networking stack, striking a balance between raw performance and integration with the kernel's networking ecosystem. Please refer to [mt_af_xdp.c](../lib/src/dev/mt_af_xdp.c) for details.
* Native Windows Kernel Network Socket Stack: planned, not yet implemented.

MTL selects the backend NIC based on input from the application. Users should specify both of the following parameters in `struct mtl_init_params`: the port name should follow the format described below, and the PMD type can be fetched from the port name with `mtl_pmd_by_port_name`.

```c
/**
* MTL_PMD_DPDK_USER. Use PCIE BDF port, ex: 0000:af:01.0.
* MTL_PMD_KERNEL_SOCKET. Use kernel + ifname, ex: kernel:enp175s0f0.
* MTL_PMD_NATIVE_AF_XDP. Use native_af_xdp + ifname, ex: native_af_xdp:enp175s0f0.
*/
char port[MTL_PORT_MAX][MTL_PORT_MAX_LEN];
enum mtl_pmd_type pmd[MTL_PORT_MAX];
```
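
A minimal sketch of filling these two fields before calling `mtl_init` (the header path and the `MTL_PORT_P` primary-port index are assumptions here; check the public headers for the exact names):

```c
#include <stdio.h>
#include <mtl/mtl_api.h> /* assumed header path; adjust to your install */

static void setup_backend(struct mtl_init_params* p) {
  /* DPDK PMD backend: use the PCIe BDF as the port name, then derive the PMD
   * type from the name (MTL_PORT_P is assumed to be the primary port index) */
  snprintf(p->port[MTL_PORT_P], MTL_PORT_MAX_LEN, "%s", "0000:af:01.0");
  p->pmd[MTL_PORT_P] = mtl_pmd_by_port_name(p->port[MTL_PORT_P]);

  /* a kernel socket backend would use "kernel:enp175s0f0",
   * native AF_XDP would use "native_af_xdp:enp175s0f0" */
}
```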

### 4.2 Queue Manager

The library incorporates a queue manager layer, designed to abstract various queue implementations. Please refer to [mt_queue.c](../lib/src/datapath/mt_queue.c) for details.

#### 4.2.1 Tx

For transmitting (TX) data, there are two queue modes available:

Dedicated Mode: In this mode, each session exclusively occupies one TX queue resource.

Shared Mode: In contrast, shared mode allows multiple sessions to utilize the same TX queue. To ensure there is no conflict on the packet output path, a spin lock is employed. While this mode enables more efficient use of resources, there can be a performance trade-off due to the overhead of acquiring and releasing the lock.
TX queue shared mode is enabled by the `MTL_FLAG_SHARED_TX_QUEUE` flag. Please refer to [mt_shared_queue.c](../lib/src/datapath/mt_shared_queue.c) for details.
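
A one-line sketch of requesting this mode at initialization time, assuming `flags` is the option bitmask in `struct mtl_init_params` (as used by the other `MTL_FLAG_*` options):

```c
/* 'param' is the struct mtl_init_params later passed to mtl_init */
param.flags |= MTL_FLAG_SHARED_TX_QUEUE; /* share TX queues across sessions */
```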

#### 4.2.2 RX

For RX data, there are three queue modes available:

Dedicated Mode: each session is assigned a unique RX queue. Flow Director is utilized to filter and steer the incoming packets to the correct RX queue based on criteria such as IP address, port, protocol, or a combination of these.

Shared Mode: allows multiple sessions to utilize the same RX queue. Each session configures its own set of Flow Director rules to identify its specific traffic, but all these rules direct the matching packets to the same shared RX queue. Software then dispatches each received packet to the owning session while processing the queue.
RX queue shared mode is enabled by the `MTL_FLAG_SHARED_RX_QUEUE` flag. Please refer to [mt_shared_queue.c](../lib/src/datapath/mt_shared_queue.c) for details.

RSS mode: not all NICs support Flow Director. For those that don't, MTL employs Receive Side Scaling (RSS) to distribute receive processing efficiently across multiple queues, based on a hash calculated from packet header fields such as source and destination IP addresses and port numbers.
Please refer to [mt_shared_rss.c](../lib/src/datapath/mt_shared_rss.c) for details.
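
Similarly, a sketch of requesting the shared RX queue mode (again assuming the `flags` bitmask in `struct mtl_init_params`); when the NIC lacks Flow Director, the library falls back to the RSS path described above:

```c
param.flags |= MTL_FLAG_SHARED_RX_QUEUE; /* share RX queues; per-session flow rules still apply */
```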

### 4.3 ST2110 TX

After receiving a frame from an application, MTL constructs network packets from the frame in accordance with RFC 4175 <https://datatracker.ietf.org/doc/rfc4175/> and the ST2110-21 timing requirements.
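
For orientation, a simplified sketch of the RFC 4175 payload header that follows the RTP header in each packet (illustrative only, not the library's internal definition; all fields are big-endian on the wire):

```c
#include <stdint.h>

/* one entry per scan line (or partial line) carried by the packet */
struct rfc4175_line_entry {
  uint16_t length;      /* bytes of pixel data in this line segment */
  uint16_t field_line;  /* F bit (1) + line number (15) */
  uint16_t cont_offset; /* C bit (1) + pixel offset within the line (15) */
};

struct rfc4175_payload_hdr {
  uint16_t ext_seq_number;          /* high 16 bits of the RTP sequence number */
  struct rfc4175_line_entry line[]; /* flexible array member */
};
```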

### 4.1 Zero Copy Packet Build
#### 4.3.1 Zero Copy Packet Build

Most modern Network Interface Cards (NICs) support a multi-buffer descriptor feature, enabling the NIC to be programmed to transmit a packet assembled from multiple data segments. MTL utilizes this capability to achieve zero-copy transmission when a DPDK Poll Mode Driver (PMD) is used, thereby delivering unparalleled performance.
In one typical setup, sending approximately 50 Gbps (equivalent to 16 streams of 1080p YUV422 10-bit at 59.94 fps) requires only a single core.
@@ -101,7 +149,7 @@ Note that if the currently used NIC does not support the multi-buffer feature, t
<img src="png/tx_zero_copy.png" align="center" alt="TX Zero Copy">
</div>
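
A minimal DPDK sketch of the idea, not MTL's exact internal code (mempool setup, header population, and error handling are elided; the `shinfo` callback that returns the frame to the application is assumed to be prepared by the caller):

```c
#include <rte_mbuf.h>

/* Build one packet without copying the pixel data: a small mbuf carries the
 * Ethernet/IP/UDP/RTP/RFC4175 headers, a second mbuf is attached directly to
 * the application frame buffer, and the two are chained for the NIC to gather. */
static struct rte_mbuf* build_pkt_zero_copy(struct rte_mempool* hdr_pool,
                                            struct rte_mempool* ext_pool,
                                            void* payload, rte_iova_t payload_iova,
                                            uint16_t payload_len, uint16_t hdr_len,
                                            struct rte_mbuf_ext_shared_info* shinfo) {
  struct rte_mbuf* hdr = rte_pktmbuf_alloc(hdr_pool);
  struct rte_mbuf* chunk = rte_pktmbuf_alloc(ext_pool);
  if (!hdr || !chunk) return NULL; /* error handling elided */

  /* the protocol headers are small, they are written into the header mbuf */
  hdr->data_len = hdr_len;
  hdr->pkt_len = hdr_len;

  /* the payload mbuf only points at the frame memory; no copy happens */
  rte_pktmbuf_attach_extbuf(chunk, payload, payload_iova, payload_len, shinfo);
  chunk->data_len = payload_len;
  chunk->pkt_len = payload_len;

  rte_pktmbuf_chain(hdr, chunk); /* NIC transmits both segments as one packet */
  return hdr;
}
```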

### 4.2 ST2110-21 pacing
#### 4.3.2 ST2110-21 pacing

The specific standard ST2110-21 deals with the traffic shaping and delivery timing of uncompressed video. It defines how the video data packets should be paced over the network to maintain consistent timing and bandwidth utilization.

@@ -118,15 +166,63 @@ In the case that the rate-limiting feature is unavailable, TSC (Timestamp Counte
<img src="png/tx_pacing.png" align="center" alt="TX Pacing">
</div>
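
A simplified sketch of the TSC fallback (illustrative only; the real implementation batches packets and derives the per-packet spacing `trs_ns` from the ST2110-21 timing model):

```c
#include <stdint.h>
#include <rte_cycles.h>
#include <rte_ethdev.h>

/* Transmit one frame's packets with TSC-based pacing: each packet i is held
 * back until its scheduled time start_ns + i * trs_ns, then sent. */
static void tx_frame_tsc_paced(uint16_t port_id, uint16_t queue_id,
                               struct rte_mbuf** pkts, uint32_t nb_pkts,
                               uint64_t start_ns, double trs_ns) {
  uint64_t hz = rte_get_tsc_hz();
  for (uint32_t i = 0; i < nb_pkts; i++) {
    uint64_t target_ns = start_ns + (uint64_t)(trs_ns * i);
    /* busy wait until the scheduled transmit time for this packet */
    while ((double)rte_get_tsc_cycles() * 1e9 / (double)hz < (double)target_ns)
      ;
    while (rte_eth_tx_burst(port_id, queue_id, &pkts[i], 1) == 0)
      ; /* retry if the TX ring is temporarily full */
  }
}
```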

## 5. RX Path
### 4.4 ST2110 RX

The RX (Receive) packet classification in MTL includes two types: Flow Director and RSS (Receive Side Scaling). Flow Director is preferred if the NIC supports it, as it can feed the desired packets directly into the RX session packet handling function.
Once a packet is received and validated, the RX session copies the payload into the frame buffer and notifies the application when the last packet of the frame arrives.

### 5.1 RX DMA offload
#### 4.4.1 RX DMA offload

The process of copying data between packets and frames consumes a significant amount of CPU resources. MTL can be configured to use DMA to offload this copy operation, thereby enhancing performance. For detailed usage instructions, please refer to the [DMA guide](./dma.md).

<div align="center">
<img src="png/rx_dma_offload.png" align="center" alt="RX DMA Offload">
</div>

## 5. Control path

For the DPDK Poll Mode Driver backend, given its nature of fully bypassing the kernel, it is necessary to implement specific control protocols within MTL.

### 5.1 ARP

Address Resolution Protocol is a communication protocol used for discovering the link layer address, such as a MAC address, associated with a given internet layer address, typically an IPv4 address. This mapping is critical for local area network communication. The code can be found in [mt_arp.c](../lib/src/mt_arp.c).

### 5.2 IGMP

The Internet Group Management Protocol is a communication protocol used by hosts and adjacent routers on IPv4 networks to establish multicast group memberships. IGMP is used for managing the membership of Internet Protocol multicast groups and is an integral part of the IP multicast specification. MTL supports IGMPv3. The code can be found in [mt_mcast.c](../lib/src/mt_mcast.c).

### 5.3 DHCP

Dynamic Host Configuration Protocol is a network management protocol used on IP networks whereby a DHCP server dynamically assigns an IP address and other network configuration parameters to each device on a network, so they can communicate with other IP networks.
DHCP allows devices known as clients to get an IP address automatically, reducing the need for a network administrator or a user to manually assign IP addresses to all networked devices.

The DHCP option is not enabled by default; enable it by setting `net_proto` in `struct mtl_init_params` to `MTL_PROTO_DHCP`.

The code implementation can be found in [mt_dhcp.c](../lib/src/mt_dhcp.c).
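
A one-line sketch, assuming `net_proto` is a plain field of `struct mtl_init_params` as described above (verify the exact field shape in `mtl_api.h` for your version):

```c
param.net_proto = MTL_PROTO_DHCP; /* obtain IP configuration via DHCP instead of static setup */
```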

### 5.4 PTP

Precision Time Protocol, also known as IEEE 1588, is designed for accurate clock synchronization between devices on a network. PTP is capable of clock accuracy in the sub-microsecond range, making it ideal for systems where precise timekeeping is vital. PTP uses a master-slave architecture for time synchronization.
Typically, a PTP grandmaster is deployed within the network, and clients synchronize with it using tools like ptp4l.

MTL supports two types of PTP client settings: the built-in PTP client implementation inside MTL, or an external PTP time source provided by the application.

#### 5.4.1 Built-in PTP

This project includes built-in support for the PTP client protocol, based on the hardware timesync offload feature. This combination allows for achieving a PTP time clock source with an accuracy of approximately 30ns.

To enable this feature in the RxTxApp sample application, use the `--ptp` argument. The control for the built-in PTP feature is the `MTL_FLAG_PTP_ENABLE` flag in the `mtl_init_params` structure.
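
A one-line sketch of doing the same directly through the API, assuming the `flags` bitmask in `struct mtl_init_params`:

```c
param.flags |= MTL_FLAG_PTP_ENABLE; /* use the built-in PTP client as the MTL time source */
```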

Note: Currently, the VF (Virtual Function) does not support the hardware timesync feature. Therefore, for VF deployment, the timestamp of the transmitted (TX) and received (RX) packets is read from the CPU TSC (TimeStamp Counter) instead. In this case, it is not possible to obtain a stable delta in the PTP adjustment, and the maximum accuracy achieved will be up to 1us.

#### 5.4.2 Customized PTP time source by Application

Some setups may utilize external tools, such as `ptp4l`, for synchronization with a grandmaster clock. MTL provides an option `ptp_get_time_fn` within `struct mtl_init_params`, allowing applications to customize the PTP time source. In this mode, whenever MTL requires a PTP time, it will invoke this function to acquire the actual PTP time.
Consequently, it is the application's responsibility to retrieve the time from the PTP client configuration.
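
A hedged sketch of such a callback (the exact prototype of `ptp_get_time_fn` is an assumption here, a `priv` pointer in and nanoseconds out; verify it against `mtl_api.h`):

```c
#include <stdint.h>
#include <time.h>

/* Illustrative time source: read the kernel TAI clock, which ptp4l/phc2sys
 * keep synchronized to the grandmaster on a well-configured system. */
static uint64_t app_ptp_time(void* priv) {
  (void)priv;
  struct timespec ts;
  clock_gettime(CLOCK_TAI, &ts); /* if only CLOCK_REALTIME is synced, add the UTC->TAI offset */
  return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* before mtl_init: param.ptp_get_time_fn = app_ptp_time; */
```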

#### 5.4.3 37-second offset between UTC and TAI time

There's actually a difference of 37 seconds between Coordinated Universal Time (UTC) and International Atomic Time (TAI). This discrepancy is due to the number of leap seconds that have been added to UTC to keep it synchronized with Earth's rotation, which is gradually slowing down.

It is possible to observe a 37-second offset in some third-party timing equipment when using MTL in conjunction with an external ptp4l. This is typically caused by the time difference between Coordinated Universal Time (UTC) and International Atomic Time (TAI).
While PTP grandmasters disseminate the offset in their announce messages, this offset is not always accurately passed to the `ptp_get_time_fn` function. The RxTxApp provides a `--utc_offset` option, with a default value of 37 seconds, to compensate for this discrepancy. Consider adjusting the offset if you encounter similar issues.
82 changes: 72 additions & 10 deletions doc/dma.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@

## 1. Overview

The Intel® Media Transport Library supports a DMA feature to offload CPU memory copy for better RX video session density. The DMA device API is supported from DPDK version 21.11, and the DMA feature was introduced in version 0.7.2
The Intel® Media Transport Library features DMA support to enhance RX video session density by offloading memory copy operations from the CPU, thereby reducing CPU usage.

## 2. DMA driver bind to PMD(vfio-pci) mode

@@ -12,13 +12,13 @@ The Intel® Media Transport Library supports a DMA feature to offload CPU memory
dpdk-devbind.py -s | grep CBDMA
```

For DSA in SPR, pls search by idxd
For DSA in SPR, please search by idxd

```bash
dpdk-devbind.py -s | grep idxd
```

Pls check the output to find the VF BDF info, ex 0000:80:04.0 on the socket 1, 0000:00:04.0 on the socket 0, in below example.
Please review the output below to locate the Virtual Function's Bus/Device/Function (BDF) information, such as '0000:80:04.0' for socket 1 or '0000:00:04.0' for socket 0, as illustrated in the example.

```bash
0000:00:04.0 'Sky Lake-E CBDMA Registers 2021' drv=ioatdma unused=vfio-pci
@@ -41,31 +41,93 @@ Pls check the output to find the VF BDF info, ex 0000:80:04.0 on the socket 1, 0

### 2.2 Bind ports to PMD(vfio-pci)

Below example bind 0000:80:04.0,0000:80:04.1,0000:80:04.2 to PMD(vfio-pci) mode.
The example below demonstrates binding the devices '0000:80:04.0', '0000:80:04.1', and '0000:80:04.2' to Poll Mode Driver (PMD) using the vfio-pci module.

```bash
dpdk-devbind.py -b vfio-pci 0000:80:04.0
dpdk-devbind.py -b vfio-pci 0000:80:04.1
dpdk-devbind.py -b vfio-pci 0000:80:04.2
```

## 3. Pass the DMA port to RxTxApp
## 3. Pass the DMA configuration to lib

The argument --dma_dev is used to pass the DMA setup. In the following example, three DMA ports are bound to the application:
### 3.1 DMA configuration in RxTxApp

When utilizing the built-in application, simply use the `--dma_dev` argument to specify the DMA setup configuration. The following example demonstrates how to pass three DMA ports to the application:

```bash
--dma_dev 0000:80:04.0,0000:80:04.1,0000:80:04.2
```

The logs will display the DMA usage information as shown below:
### 3.2 DMA configuration in API

If you're directly interfacing with the API, the initial step involves incorporating the DMA information into the `struct mtl_init_params` before making the `mtl_init` call. The initialization routine will then attempt to parse and initialize the DMA device; if the device is ready, it will be added to the DMA list.

```c
/**
* Optional. Dma(CBDMA or DSA) device can be used in the MTL.
* DMA can be used to offload the CPU for copy the payload for video rx sessions.
* See more from ST20_RX_FLAG_DMA_OFFLOAD in st20_api.h.
* PCIE BDF path like 0000:80:04.0.
*/
char dma_dev_port[MTL_DMA_DEV_MAX][MTL_PORT_MAX_LEN];
/** Optional. The element number in the dma_dev_port array, leave to zero if no DMA */
uint8_t num_dma_dev_port;
```

To enable DMA offloading, set the `ST20_RX_FLAG_DMA_OFFLOAD` flag in the `st20_rx_create` call, or `ST20P_RX_FLAG_DMA_OFFLOAD` when operating in pipeline mode. During the creation of the RX session, the library will attempt to locate a DMA device.
However, be aware that this process may fail if a suitable DMA device is not available. For detailed information in case of failure, please consult the logs.
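
A hedged sketch of the session-creation side (the header path and surrounding field names are assumptions; see `st20_api.h` and the sample apps for the authoritative usage):

```c
#include <mtl/st20_api.h> /* assumed header path */

static st20_rx_handle create_rx_with_dma(mtl_handle mt, struct st20_rx_ops* ops) {
  /* 'ops' already carries the usual session parameters (addresses, port, video format) */
  ops->flags |= ST20_RX_FLAG_DMA_OFFLOAD; /* ask the lib to attach a DMA device to this session */
  return st20_rx_create(mt, ops);         /* check the logs below if no DMA device is attached */
}
```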

### 3.3 DMA logs

The logs below indicate that the PCI driver for the DMA device has been loaded successfully:

```bash
EAL: Probe PCI driver: dmadev_ioat (8086:b00) device: 0000:80:04.0 (socket 1)
IOAT: ioat_dmadev_probe(): Init 0000:80:04.0 on NUMA node 1
```

The log below shows that `0000:80:04.0` is registered into MTL:

```bash
MT: mt_dma_init(0), dma dev id 0 name 0000:80:04.0 capa 0x500000041 numa 1 desc 32:4096
```

The log below shows that the RX session is correctly attached to a DMA device:

```bash
MT: mt_dma_request_dev(0), dma created with max share 16 nb_desc 128
MT: rv_init_dma(0), succ, dma 0 lender id 0
```

The logs below display the DMA usage information:

```bash
ST: RX_VIDEO_SESSION(1,0): pkts 2589325 by dma copy, dma busy 0.000000
ST: DMA(0), s 2589313 c 2589313 e 0 avg q 1
```

Note that gtest also supports `--dma_dev`; please pass the DMA setup for DMA testing as well.

### 3.4 DMA socket

In a multi-socket system, each socket possesses its own DMA devices, similar to NICs. Cross-socket traffic incurs significant latency; therefore, an MTL RX session will only attempt to utilize a DMA device that resides on the same socket as the NIC.
## 3. DMA sample code for application usage

### 3.5 DMA per core

To maximize the utilization of DMA resources, the MTL architecture is designed to use the same DMA device for all sessions running within the same core. Sharing the DMA device is safe in this context because the sessions within a single core share CPU resources, eliminating the need for spin locks.

## 4. Public DMA API for application usage

Besides using the internal DMA capabilities for RX video offload, applications can also leverage DMA through the public API.
To learn how to utilize DMA in your application, refer to the sample code in [dma_sample.c](../app/sample/dma/dma_sample.c).

The major DMA APIs are listed below:

```c
mtl_udma_create
mtl_udma_free
mtl_udma_copy
mtl_udma_submit
mtl_udma_completed
```
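
A condensed sketch of the typical call sequence; the argument lists are simplified and may not match the exact prototypes, so treat this as pseudocode and rely on [dma_sample.c](../app/sample/dma/dma_sample.c) for authoritative usage:

```c
/* create a user-DMA context on the primary port ('mt' is the mtl_init handle) */
mtl_udma_handle dma = mtl_udma_create(mt, 128 /* nb_desc */, MTL_PORT_P);

/* enqueue one copy (both addresses are IOVAs, see the note below), then kick it */
mtl_udma_copy(dma, dst_iova, src_iova, len);
mtl_udma_submit(dma);

/* poll until the hardware reports the copy complete */
while (mtl_udma_completed(dma, 1) == 0)
  ; /* a real application would do other work instead of busy polling */

mtl_udma_free(dma);
```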
Refer to [dma_sample.c](../app/sample/dma/dma_sample.c) to learn how to use DMA on the application side. Use st_hp_virt2iova (for st_hp_malloc) or st_dma_map (for malloc) to obtain the IOVA address.
To obtain the IOVA (Input/Output Virtual Address) necessary for DMA operations with hardware, use st_hp_virt2iova when memory is allocated via st_hp_malloc, or st_dma_map when memory is allocated using the standard malloc.
2 changes: 1 addition & 1 deletion include/st20_api.h
@@ -1335,7 +1335,7 @@ struct st20_rx_ops {
* return:
* - 0: if app consume the frame successful. App should call st20_rx_put_framebuff
* to return the frame when it finish the handling
* < 0: the error code if app can't handle, lib will free the frame then.
* < 0: the error code if app can't handle, lib will call st20_rx_put_framebuff then.
* And only non-block method can be used in this callback as it run from lcore tasklet
* routine.
*/
2 changes: 1 addition & 1 deletion include/st20_redundant_api.h
@@ -106,7 +106,7 @@ struct st20r_rx_ops {
* return:
* - 0: if app consume the frame successful. App should call st20r_rx_put_frame
* to return the frame when it finish the handling
* < 0: the error code if app can't handle, lib will free the frame then.
* < 0: the error code if app can't handle, lib will call st20r_rx_put_frame then.
* Only for ST20_TYPE_FRAME_LEVEL/ST20_TYPE_SLICE_LEVEL.
* And only non-block method can be used in this callback as it run from lcore tasklet
* routine.
3 changes: 1 addition & 2 deletions include/st30_api.h
@@ -398,8 +398,7 @@ struct st30_rx_ops {
* return:
* - 0: if app consume the frame successful. App should call st30_rx_put_framebuff
* to return the frame when it finish the handling
* < 0: the error code if app can't handle, lib will free the frame then
* the consume of frame.
* < 0: the error code if app can't handle, lib will call st30_rx_put_framebuff then.
* And only non-block method can be used in this callback as it run from lcore tasklet
* routine.
*/
