LastUpdate: 2026-05-19, Author: HAO022
Concept
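- PD (Protection Domain): A container that groups the resources allowed to work together (e.g., QPs and Memory Regions); objects in different PDs cannot be used with one another.
- QP (Queue Pair): A Send Queue and a Receive Queue that together form one communication endpoint.
- SQ (Send Queue): The queue that holds Work Requests for outbound operations.
- RQ (Receive Queue): The queue that holds Work Requests describing buffers for incoming messages.
- CQ (Completion Queue): The queue on which the RNIC reports completed Work Requests.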
- WQE (Work Queue Entry): A Work Request submitted to the RNIC, describing specific send or receive information.

User-space Interface
The user-space interface is implemented by libibverbs.so (see linux-rdma/rdma-core).
- ibv_get_device_list: Returns a list of available RDMA devices.
- ibv_open_device: Opens a device and creates a device context.
- ibv_alloc_pd: Allocates a Protection Domain based on the device context.
- ibv_reg_mr: Registers a Memory Region within a Protection Domain.
- ibv_create_cq: Creates a Completion Queue based on the device context.
- ibv_create_qp: Creates a Queue Pair, including a Send Queue and a Receive Queue, based on parameters such as the Protection Domain and Completion Queues.
- ibv_post_recv: Posts a Work Request to the Receive Queue.
- ibv_post_send: Posts a Work Request to the Send Queue.
- ibv_poll_cq: Polls the Completion Queue for completed Work Requests.
Queue Pairs (QPs)
A Queue Pair (QP) is the core abstraction for data exchange. It comprises a Send Queue (SQ) and a Receive Queue (RQ). The sections below trace QP creation through user space, kernel space, and hardware.
struct ibv_qp *ibv_create_qp(struct ibv_pd *pd, struct ibv_qp_init_attr *qp_init_attr);

struct ibv_qp_init_attr {
    void *qp_context;
    struct ibv_cq *send_cq;
    struct ibv_cq *recv_cq;
    struct ibv_srq *srq;
    struct ibv_qp_cap cap;
    enum ibv_qp_type qp_type;
    int sq_sig_all;
};

struct ibv_qp_cap {
    uint32_t max_send_wr;
    uint32_t max_recv_wr;
    uint32_t max_send_sge;
    uint32_t max_recv_sge;
    uint32_t max_inline_data;
};
ibv_qp_init_attr is a structure that defines the basic attributes of a QP.
| Name | Description |
|---|---|
| send_cq | The Completion Queue associated with the Send Queue. (Created via ibv_create_cq. send_cq and recv_cq may refer to the same or different CQs.) |
| recv_cq | The Completion Queue associated with the Receive Queue. |
| cap | Describes attributes such as the size of the Queue Pair. |
| qp_type | IBV_QPT_RC: Reliable Connected. IBV_QPT_UC: Unreliable Connected. IBV_QPT_UD: Unreliable Datagram. IBV_QPT_RAW_PACKET: Allows user-defined packet headers, including Layer 2. |
ibv_qp_cap is a structure that defines the sizing attributes of a QP.
| Name | Description |
|---|---|
| max_send_wr | The maximum number of Work Requests that the Send Queue can hold. This value may be set to dev_cap.max_qp_wr. |
| max_recv_wr | The maximum number of Work Requests that the Receive Queue can hold. This value may be set to dev_cap.max_qp_wr. |
| max_send_sge | The maximum number of scatter-gather elements in a Work Request posted to the Send Queue. |
| max_recv_sge | The maximum number of scatter-gather elements in a Work Request posted to the Receive Queue. |
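As a rough illustration of how the sizing fields relate to device limits, the sketch below clamps a requested capability set to what a device reports. The types `toy_qp_cap` and `toy_dev_cap` and the helper `clamp_qp_cap` are hypothetical stand-ins written for this example; they only mirror the field names of `struct ibv_qp_cap` and the `max_qp_wr` / `max_sge` limits, and are not part of libibverbs.

```c
#include <stdint.h>

/* Hypothetical stand-in for the sizing fields of struct ibv_qp_cap. */
struct toy_qp_cap {
    uint32_t max_send_wr;
    uint32_t max_recv_wr;
    uint32_t max_send_sge;
    uint32_t max_recv_sge;
};

/* Hypothetical stand-in for the device limits (dev_cap in the text). */
struct toy_dev_cap {
    uint32_t max_qp_wr; /* deepest queue the device reports */
    uint32_t max_sge;   /* longest scatter-gather list the device reports */
};

static uint32_t min_u32(uint32_t a, uint32_t b)
{
    return a < b ? a : b;
}

/* Clamp each requested value to the matching device limit. */
static struct toy_qp_cap clamp_qp_cap(struct toy_qp_cap want,
                                      struct toy_dev_cap dev)
{
    struct toy_qp_cap out;

    out.max_send_wr  = min_u32(want.max_send_wr,  dev.max_qp_wr);
    out.max_recv_wr  = min_u32(want.max_recv_wr,  dev.max_qp_wr);
    out.max_send_sge = min_u32(want.max_send_sge, dev.max_sge);
    out.max_recv_sge = min_u32(want.max_recv_sge, dev.max_sge);
    return out;
}
```

Clamping is only one possible policy; an application may prefer to fail early when the device cannot satisfy the requested depth.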
User-space Implementation
ibv_create_qp invokes verbs_context_ops.create_qp().
struct verbs_context_ops mlx5_ctx_common_ops = {
    .alloc_pd = mlx5_alloc_pd,
    .create_qp = mlx5_create_qp,
    .create_qp_ex = mlx5_create_qp_ex,
    ...
};
The mlx5_create_qp function performs the following steps:
- Calculates the required memory size for the SQ and RQ Work Queue Entries (WQEs) by calling mlx5_calc_wq_size.
  - mlx5_calc_send_wqe
  - mlx5_calc_rq_size
- Allocates the buffer (wq_size) for the QP based on the calculated result via mlx5_alloc_qp_buf.
- Allocates memory resources for the doorbell record via mlx5_alloc_dbrec.
- Interacts with the kernel via ibv_cmd_create_qp_ex.
  - cmd.buf_addr…
  - cmd.db_addr…
- Updates the QP attributes.
mlx5_calc_sq_size:
1. wq_size = roundup_pow_of_two(attr->cap.max_send_wr * wqe_size);
2. qp->sq.wqe_cnt = wq_size / MLX5_SEND_WQE_BB;
3. qp->sq.wqe_shift = ...;
4. qp->sq.max_gs = cap.max_send_sge;
5. qp->sq.max_post is approximately equal to cap.max_send_wr, adjusted for alignment constraints.
mlx5_calc_rq_size:
1. mlx5_calc_rcv_wqe
2. qp->rq.wqe_cnt = ...
3. qp->rq.wqe_shift = ...
4. qp->rq.max_post = ...
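The first two steps of the send-queue calculation can be sketched in isolation. This is a simplified model, not the rdma-core code: `toy_sq`, `toy_roundup_pow_of_two`, and `toy_calc_sq_size` are hypothetical names, the WQE size is passed in rather than derived from the capabilities, and the basic-block size of 64 bytes is an assumption mirroring MLX5_SEND_WQE_BB.

```c
#include <stdint.h>

/* Assumed basic-block size, mirroring MLX5_SEND_WQE_BB. */
#define TOY_SEND_WQE_BB 64u

/* Hypothetical result of the send-queue sizing step. */
struct toy_sq {
    uint32_t wq_size; /* bytes reserved for the send queue */
    uint32_t wqe_cnt; /* number of basic blocks in the queue */
};

/* Round v up to the next power of two; v must be in 1..2^31. */
static uint32_t toy_roundup_pow_of_two(uint32_t v)
{
    uint32_t p = 1;

    while (p < v)
        p <<= 1;
    return p;
}

/* Steps 1 and 2 above: total bytes, then count of basic blocks.
 * The caller must keep max_send_wr * wqe_size within 2^31. */
static struct toy_sq toy_calc_sq_size(uint32_t max_send_wr, uint32_t wqe_size)
{
    struct toy_sq sq;

    sq.wq_size = toy_roundup_pow_of_two(max_send_wr * wqe_size);
    sq.wqe_cnt = sq.wq_size / TOY_SEND_WQE_BB;
    return sq;
}
```

For example, with max_send_wr = 100 and a 192-byte WQE, the product 19200 rounds up to 32768 bytes, which is 512 basic blocks. The power-of-two size is what lets a queue index wrap with a simple mask.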

Kernel-space Implementation
The kernel path does not implement complex logic; it directly calls the driver’s ops.create_qp callback.
struct ib_device_ops mlx5_ib_dev_ops = {
    .create_qp = mlx5_ib_create_qp, // ... calls create_user_qp
    ...
};
create_user_qp:
- Invokes ib_umem_get on the user-space virtual address: this function allocates memory, pins pages, and performs DMA address mapping.
- Populates the Memory Translation Table (MTT) entries for the WQEs via mlx5_ib_populate_pas.
- Maps the doorbell record via mlx5_ib_db_map_user:
  - Allocates a physical page via kmalloc.
  - Associates this physical page with the doorbell virtual address.
  - Invokes ib_umem_get on this physical page to pin it and perform DMA mapping.
- Constructs the Queue Pair Context (QPC) and submits the creation command to the hardware via mlx5_qpc_create_qp:
  - qpc.log_page_size
  - qpc.page_offset
  - qpc.log_sq_size
  - qpc.pd
  - qpc.uar_page
  - qpc.dbr_addr
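To make the page bookkeeping in the steps above more concrete, the following sketch computes how many pages a user buffer spans, its offset within the first page, and a per-page address array of the kind an MTT holds. All names (`toy_page_count`, `toy_page_offset`, `toy_populate_pas`) are hypothetical, a 4 KiB page is assumed, and the DMA addresses are faked as one contiguous range starting at `dma_base`; pinned pages in a real system are generally not contiguous, and the real driver encodes these values differently.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12u                    /* assumed 4 KiB pages */
#define TOY_PAGE_SIZE  (1ull << TOY_PAGE_SHIFT)

/* Number of pages touched by the byte range [addr, addr + len); len > 0. */
static size_t toy_page_count(uint64_t addr, uint64_t len)
{
    uint64_t first = addr >> TOY_PAGE_SHIFT;
    uint64_t last  = (addr + len - 1) >> TOY_PAGE_SHIFT;

    return (size_t)(last - first + 1);
}

/* Offset of the buffer start within its first page. */
static uint64_t toy_page_offset(uint64_t addr)
{
    return addr & (TOY_PAGE_SIZE - 1);
}

/* Fill pas[] with one (fake) DMA address per page of the buffer.
 * Returns the number of entries written, or 0 if pas[] is too small. */
static size_t toy_populate_pas(uint64_t addr, uint64_t len, uint64_t dma_base,
                               uint64_t *pas, size_t cap)
{
    size_t n = toy_page_count(addr, len);

    if (n > cap)
        return 0;
    for (size_t i = 0; i < n; i++)
        pas[i] = dma_base + i * TOY_PAGE_SIZE;
    return n;
}
```

A buffer of 8192 bytes that starts 0x40 bytes into a page spans three pages, which is why the page offset has to be recorded alongside the page list.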
The resulting memory relationships are illustrated in the following diagram:

Posting a Send Request
mlx5_post_send
for (nreq = 0; wr; ++nreq, wr = wr->next) {
    // mlx5_wqe_ctrl_seg
    // 1. Fills in the control segment based on the Work Request.
    // mlx5_wqe_data_seg
    // 2. Fills in the data segments based on wr.sg_list[i].
    // The buffer holding these mlx5 control/data segments is already mapped in the MTT.
}
post_send_db(qp, ..., ctrl); // Rings the doorbell.
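The loop above can be modelled with a small ring buffer. The sketch below is a toy under stated assumptions, not the mlx5 provider: every type and function (`toy_qp`, `toy_wqe`, `toy_send_wr`, `toy_post_send`) is hypothetical, each Work Request carries a single scatter-gather entry, and the doorbell is reduced to storing the producer counter in a plain field.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_WQE_CNT 8u /* ring size; must be a power of two */

struct toy_sge { uint64_t addr; uint32_t length; uint32_t lkey; };

/* Hypothetical Work Request: a linked list, one SGE each. */
struct toy_send_wr {
    struct toy_send_wr *next;
    uint32_t opcode;
    struct toy_sge sge;
};

/* Hypothetical WQE: a control part followed by a data part. */
struct toy_wqe {
    uint32_t ctrl_opcode;
    uint32_t ctrl_index;
    struct toy_sge data;
};

struct toy_qp {
    struct toy_wqe ring[TOY_WQE_CNT];
    uint32_t head;  /* WQEs posted so far */
    uint32_t tail;  /* WQEs already completed */
    uint32_t dbrec; /* stand-in for the doorbell record */
};

/* Copy each Work Request into the ring, then publish the new head once.
 * On overflow, *bad_wr points at the first request that was not posted. */
static int toy_post_send(struct toy_qp *qp, struct toy_send_wr *wr,
                         struct toy_send_wr **bad_wr)
{
    uint32_t nreq = 0;
    int err = 0;

    for (; wr; ++nreq, wr = wr->next) {
        if (qp->head - qp->tail >= TOY_WQE_CNT) {
            err = ENOMEM;
            *bad_wr = wr;
            break;
        }
        struct toy_wqe *wqe = &qp->ring[qp->head & (TOY_WQE_CNT - 1)];
        wqe->ctrl_opcode = wr->opcode;  /* control segment */
        wqe->ctrl_index  = qp->head;
        wqe->data        = wr->sge;     /* data segment */
        qp->head++;
    }
    if (nreq)
        qp->dbrec = qp->head;           /* "ring the doorbell" */
    return err;
}
```

Publishing the counter once after the loop, rather than per request, is the design point worth noting: a batch of Work Requests costs a single doorbell update.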
Observations
- The WQE buffer is associated with the QP; therefore, it can be considered a part of the QP.
- The Memory Translation Table (MTT) is used not only during Memory Region registration but also during QP creation. Its primary function in this context is to translate the virtual addresses of the WQEs, which are passed during the doorbell operation, into I/O Virtual Addresses (IOVAs) or Physical Addresses (PAs) that the hardware can access. This mechanism avoids copying WQEs during user-kernel context switches, thereby improving performance.
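The translation role described in the second observation can be illustrated with a toy lookup. This is a minimal sketch under assumptions, not the hardware's actual table format: `toy_mtt`, `toy_mtt_translate`, and `toy_wqe_offset` are hypothetical names, pages are assumed to be 4 KiB, and the table is a flat array holding one DMA address per pinned page.

```c
#include <stddef.h>
#include <stdint.h>

#define MTT_PAGE_SHIFT 12u /* assumed 4 KiB pages */
#define MTT_PAGE_MASK  ((1ull << MTT_PAGE_SHIFT) - 1)

/* Hypothetical flat translation table for one WQE buffer. */
struct toy_mtt {
    const uint64_t *pas; /* DMA address of each pinned page */
    size_t npages;
};

/* Translate a byte offset inside the WQE buffer to a DMA address.
 * Returns 0 on success, -1 if the offset lies outside the buffer. */
static int toy_mtt_translate(const struct toy_mtt *mtt, uint64_t offset,
                             uint64_t *dma)
{
    uint64_t idx = offset >> MTT_PAGE_SHIFT;

    if (idx >= mtt->npages)
        return -1;
    *dma = mtt->pas[idx] + (offset & MTT_PAGE_MASK);
    return 0;
}

/* Byte offset of WQE number idx in a ring of wqe_cnt entries
 * (a power of two), each 1 << wqe_shift bytes long. */
static uint64_t toy_wqe_offset(uint32_t idx, uint32_t wqe_cnt,
                               uint32_t wqe_shift)
{
    return (uint64_t)(idx & (wqe_cnt - 1)) << wqe_shift;
}
```

Because the device resolves the WQE location through such a table, the WQE bytes written by the application never need to be copied into a kernel buffer.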
