
RDMA: Queue Pairs

LastUpdate: 2026-05-19, Author: HAO022

Concept

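  • PD (Protection Domain): A container that groups the resources allowed to work together (e.g., QPs and MRs). Objects created in different PDs cannot be used with one another.
  • QP (Queue Pair): The basic communication endpoint, consisting of a Send Queue and a Receive Queue.
  • SQ (Send Queue): The queue that holds Work Requests for outgoing operations.
  • RQ (Receive Queue): The queue that holds Work Requests describing buffers for incoming messages.
  • CQ (Completion Queue): The queue through which the RNIC reports completed Work Requests.
  • WQE (Work Queue Entry): The in-queue form of a Work Request submitted to the RNIC, describing a specific send or receive operation.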
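
To make the relationships between these objects concrete, here is a minimal, purely illustrative model in plain C. It does not use libibverbs: every type and function name (toy_qp, toy_post_send, toy_hw_process, toy_poll_cq) is hypothetical, and the RNIC is simulated by an ordinary function call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_DEPTH 4u /* queue depth (hypothetical); must be a power of two */

/* A WQE describes one unit of work; a CQE reports its completion. */
struct toy_wqe { uint64_t wr_id; uint32_t length; };
struct toy_cqe { uint64_t wr_id; int status; };

/* Fixed-size rings; head counts produced entries, tail consumed ones. */
struct toy_wq { struct toy_wqe ring[TOY_DEPTH]; uint32_t head, tail; };
struct toy_cq { struct toy_cqe ring[TOY_DEPTH]; uint32_t head, tail; };

/* A QP bundles a Send Queue and a Receive Queue and refers to its CQ. */
struct toy_qp { struct toy_wq sq, rq; struct toy_cq *send_cq; };

/* Software side: post one WQE to the SQ. Returns -1 if the SQ is full. */
static int toy_post_send(struct toy_qp *qp, uint64_t wr_id, uint32_t length)
{
    assert(qp != NULL);
    if (qp->sq.head - qp->sq.tail == TOY_DEPTH)
        return -1;
    qp->sq.ring[qp->sq.head++ % TOY_DEPTH] = (struct toy_wqe){ wr_id, length };
    return 0;
}

/* Simulated RNIC: consume one WQE and write a CQE to the send CQ.
 * Returns 1 if a WQE was processed, 0 otherwise. */
static int toy_hw_process(struct toy_qp *qp)
{
    struct toy_cq *cq = qp->send_cq;
    if (qp->sq.head == qp->sq.tail || cq->head - cq->tail == TOY_DEPTH)
        return 0;
    struct toy_wqe wqe = qp->sq.ring[qp->sq.tail++ % TOY_DEPTH];
    cq->ring[cq->head++ % TOY_DEPTH] = (struct toy_cqe){ wqe.wr_id, 0 };
    return 1;
}

/* Software side: poll the CQ. Returns the number of CQEs copied (0 or 1). */
static int toy_poll_cq(struct toy_cq *cq, struct toy_cqe *out)
{
    if (cq->head == cq->tail)
        return 0;
    *out = cq->ring[cq->tail++ % TOY_DEPTH];
    return 1;
}
```

In this sketch the wr_id travels from the WQE to the CQE, which is how an application matches a completion to the request it posted.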

User-space Interface

The user-space interface is implemented by libibverbs.so (see linux-rdma/rdma-core).

  • ibv_get_device_list: Returns a list of available RDMA devices.
  • ibv_open_device: Opens a device and creates a device context.
  • ibv_alloc_pd: Allocates a Protection Domain based on the device context.
  • ibv_reg_mr: Registers a Memory Region within a Protection Domain.
  • ibv_create_cq: Creates a Completion Queue based on the device context.
  • ibv_create_qp: Creates a Queue Pair, including a Send Queue and a Receive Queue, based on parameters such as the Protection Domain and Completion Queues.
  • ibv_post_recv: Posts a Work Request to the Receive Queue.
  • ibv_post_send: Posts a Work Request to the Send Queue.
  • ibv_poll_cq: Polls the Completion Queue for completed Work Requests.

Queue Pairs (QPs)

A Queue Pair (QP) is the core abstraction for data exchange. It comprises a Send Queue (SQ) and a Receive Queue (RQ). The following describes the design across user space, kernel space, and hardware during the creation process.

struct ibv_qp *ibv_create_qp(struct ibv_pd *pd, struct ibv_qp_init_attr *qp_init_attr);

struct ibv_qp_init_attr {
    void *qp_context;
    struct ibv_cq *send_cq;
    struct ibv_cq *recv_cq;
    struct ibv_srq *srq;
    struct ibv_qp_cap cap;
    enum ibv_qp_type qp_type;
    int sq_sig_all;
};

struct ibv_qp_cap {
    uint32_t max_send_wr;
    uint32_t max_recv_wr;
    uint32_t max_send_sge;
    uint32_t max_recv_sge;
    uint32_t max_inline_data;
};

ibv_qp_init_attr is a structure that defines the basic attributes of a QP.

Name      Description
send_cq   The Completion Queue associated with the Send Queue. Created via ibv_create_cq; send_cq and recv_cq may refer to the same CQ or to different CQs.
recv_cq   The Completion Queue associated with the Receive Queue.
cap       Describes attributes such as the size of the Queue Pair.
qp_type   The transport type of the Queue Pair:
          IBV_QPT_RC: Reliable Connected.
          IBV_QPT_UC: Unreliable Connected.
          IBV_QPT_UD: Unreliable Datagram.
          IBV_QPT_RAW_PACKET: Allows user-defined packet headers, including Layer 2.

ibv_qp_cap is a structure that defines the sizing attributes of a QP.

Name           Description
max_send_wr    The maximum number of Work Requests that the Send Queue can hold. It may be set as high as the device limit max_qp_wr (reported by ibv_query_device).
max_recv_wr    The maximum number of Work Requests that the Receive Queue can hold. It may be set as high as the device limit max_qp_wr (reported by ibv_query_device).
max_send_sge   The maximum number of scatter-gather elements in a Work Request posted to the Send Queue.
max_recv_sge   The maximum number of scatter-gather elements in a Work Request posted to the Receive Queue.
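
As a small illustration of how these sizing fields relate to device limits, the sketch below clamps a requested capability set to the limits of a device. It is self-contained and does not include the verbs headers: struct cap_request is a local mirror of four ibv_qp_cap fields, and struct dev_limits and clamp_qp_cap are hypothetical names. It assumes the limits were obtained beforehand, for example from the max_qp_wr and max_sge fields returned by ibv_query_device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of the ibv_qp_cap fields used here (illustration only). */
struct cap_request {
    uint32_t max_send_wr;
    uint32_t max_recv_wr;
    uint32_t max_send_sge;
    uint32_t max_recv_sge;
};

/* Hypothetical holder for device limits obtained elsewhere. */
struct dev_limits {
    uint32_t max_qp_wr; /* max outstanding Work Requests per queue */
    uint32_t max_sge;   /* max scatter-gather elements per Work Request */
};

static uint32_t min_u32(uint32_t a, uint32_t b)
{
    return a < b ? a : b;
}

/* Clamp each requested value to the device limit.
 * Returns the number of fields that had to be reduced. */
static int clamp_qp_cap(struct cap_request *cap, const struct dev_limits *lim)
{
    assert(cap != NULL && lim != NULL);
    struct cap_request before = *cap;

    cap->max_send_wr  = min_u32(cap->max_send_wr,  lim->max_qp_wr);
    cap->max_recv_wr  = min_u32(cap->max_recv_wr,  lim->max_qp_wr);
    cap->max_send_sge = min_u32(cap->max_send_sge, lim->max_sge);
    cap->max_recv_sge = min_u32(cap->max_recv_sge, lim->max_sge);

    return (before.max_send_wr  != cap->max_send_wr)
         + (before.max_recv_wr  != cap->max_recv_wr)
         + (before.max_send_sge != cap->max_send_sge)
         + (before.max_recv_sge != cap->max_recv_sge);
}
```

Clamping before the call is one possible design choice; an application could equally pass the requested values unchanged and handle a creation failure.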

User-space Implementation

ibv_create_qp invokes verbs_context_ops.create_qp().

struct verbs_context_ops mlx5_ctx_common_ops = {
    .alloc_pd      = mlx5_alloc_pd,
    .create_qp     = mlx5_create_qp,
    .create_qp_ex  = mlx5_create_qp_ex,
    ...
};

The mlx5_create_qp function performs the following steps:

  1. Calculates the required memory size for the SQ and RQ Work Queue Entries (WQEs) by calling mlx5_calc_wq_size.
    1. mlx5_calc_sq_size
    2. mlx5_calc_rq_size
  2. Allocates the buffer (wq_size) for the QP based on the calculated result via mlx5_alloc_qp_buf.
  3. Allocates memory resources for the doorbell record via mlx5_alloc_dbrec.
  4. Interacts with the kernel via ibv_cmd_create_qp_ex.
    1. cmd.buf_addr
    2. cmd.db_addr
  5. Updates the QP attributes.

mlx5_calc_sq_size:

1. wq_size = roundup_pow_of_two(attr->cap.max_send_wr * wqe_size);
2. qp->sq.wqe_cnt = wq_size / MLX5_SEND_WQE_BB;
3. qp->sq.wqe_shift = ...;
4. qp->sq.max_gs = attr->cap.max_send_sge;
5. qp->sq.max_post is approximately equal to attr->cap.max_send_wr, adjusted for the rounding applied to wq_size.
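
The arithmetic in the steps above can be tried out with a small standalone sketch. It is not the rdma-core code: calc_sq_layout and struct sq_layout are hypothetical names, the per-WQE size is passed in rather than derived from the QP attributes, and max_post is modeled as wq_size / wqe_size, which is an assumption made for this example. SEND_WQE_BB stands in for MLX5_SEND_WQE_BB (64 bytes).

```c
#include <assert.h>
#include <stdint.h>

#define SEND_WQE_BB 64u /* size of one send WQE basic block, in bytes */

struct sq_layout {
    uint32_t wq_size;  /* bytes reserved for the Send Queue */
    uint32_t wqe_cnt;  /* number of basic blocks in the queue */
    uint32_t max_post; /* WQEs of wqe_size bytes that fit (assumption) */
};

/* Round v up to the next power of two; valid for 1 <= v <= 2^31. */
static uint32_t roundup_pow_of_two_u32(uint32_t v)
{
    assert(v >= 1 && v <= 0x80000000u);
    uint32_t p = 1;
    while (p < v)
        p <<= 1;
    return p;
}

/* Compute the queue layout for max_send_wr requests of wqe_size bytes each. */
static struct sq_layout calc_sq_layout(uint32_t max_send_wr, uint32_t wqe_size)
{
    /* wqe_size is assumed to be a non-zero multiple of the basic block. */
    assert(wqe_size != 0 && wqe_size % SEND_WQE_BB == 0);
    /* Keep the multiplication below from overflowing. */
    assert(max_send_wr != 0 && max_send_wr <= 0x7fffffffu / wqe_size);

    struct sq_layout l;
    l.wq_size  = roundup_pow_of_two_u32(max_send_wr * wqe_size);
    l.wqe_cnt  = l.wq_size / SEND_WQE_BB;
    l.max_post = l.wq_size / wqe_size;
    return l;
}
```

For example, 100 requests of 192 bytes need 19200 bytes, which rounds up to 32768 bytes, or 512 basic blocks. Under the assumption above, 170 WQEs of that size fit, which is why max_post can exceed the requested max_send_wr.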

mlx5_calc_rq_size:

1. mlx5_calc_rcv_wqe
2. qp->rq.wqe_cnt = ...
3. qp->rq.wqe_shift = ...
4. qp->rq.max_post = ...

Kernel-space Implementation

The kernel path does not implement complex logic; it directly calls the driver’s ops.create_qp callback.

struct ib_device_ops mlx5_ib_dev_ops = {
    .create_qp = mlx5_ib_create_qp, // ... calls create_user_qp
    ...
};

create_user_qp:

  1. Invokes ib_umem_get on the user-space virtual address of the WQE buffer: this function pins the user pages that back the buffer and maps them for DMA. The buffer itself was already allocated in user space.
  2. Populates the Memory Translation Table (MTT) entries for the WQEs via mlx5_ib_populate_pas.
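  3. Maps the doorbell record via mlx5_ib_db_map_user:
    1. Allocates a tracking structure for the doorbell page via kmalloc, unless the page that contains the doorbell record is already tracked for this context.
    2. Records the page-aligned user virtual address of the doorbell record in that structure.
    3. Invokes ib_umem_get on that user page to pin it and perform DMA mapping.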
  4. Constructs the Queue Pair Context (QPC) and submits the creation command to the hardware via mlx5_qpc_create_qp:
    1. qpc.log_page_size
    2. qpc.page_offset
    3. qpc.log_sq_size
    4. qpc.pd
    5. qpc.uar_page
    6. qpc.dbr_addr
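
The address arithmetic behind these steps can be sketched in plain C. This is an illustration only, not kernel code: pages_spanned and doorbell_dma_addr are hypothetical helpers, a 4 KiB page size is assumed, and the DMA address of the pinned page is passed in as a plain number.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12u /* assumes 4 KiB pages */
#define TOY_PAGE_SIZE  (1ull << TOY_PAGE_SHIFT)
#define TOY_PAGE_MASK  (~(TOY_PAGE_SIZE - 1))

/* Number of pages touched by [addr, addr + len). In this model each
 * page needs one translation entry. */
static uint64_t pages_spanned(uint64_t addr, uint64_t len)
{
    assert(len > 0);
    uint64_t first = addr & TOY_PAGE_MASK;
    uint64_t last  = (addr + len - 1) & TOY_PAGE_MASK;
    return ((last - first) >> TOY_PAGE_SHIFT) + 1;
}

/* Device-visible address of a doorbell record: the DMA address of the
 * pinned page plus the record's offset inside that page. */
static uint64_t doorbell_dma_addr(uint64_t page_dma, uint64_t db_user_virt)
{
    assert((page_dma & ~TOY_PAGE_MASK) == 0); /* must be page aligned */
    return page_dma + (db_user_virt & ~TOY_PAGE_MASK);
}
```

Note that a buffer shorter than one page can still span two pages when it straddles a page boundary, so the page count depends on the start address as well as the length.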

The resulting memory relationships are illustrated in the following diagram:

Posting a Send Request

mlx5_post_send
        for (nreq = 0; wr; ++nreq, wr = wr->next) {
                // 1. Fill the control segment (mlx5_wqe_ctrl_seg)
                //    from the Work Request.
                // 2. Fill one data segment (mlx5_wqe_data_seg)
                //    for each wr->sg_list[i].
                // The WQE buffer that holds these segments was already
                // mapped through the MTT when the QP was created.
        }

        post_send_db(qp, ..., ctrl); // Rings the doorbell.
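
The shape of this loop can be reproduced in a self-contained toy model. Everything here is hypothetical and heavily simplified (ring_post_send, struct toy_sq, struct toy_slot, the fixed two-element SGE limit): the real WQE segment layout, byte ordering, and the write to the device doorbell register are omitted, and the function returns the number of requests posted rather than an error code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_WQE_CNT 8u /* number of WQE slots; must be a power of two */
#define TOY_MAX_SGE 2  /* scatter-gather elements per slot in this model */

struct toy_sge { uint64_t addr; uint32_t length; uint32_t lkey; };

struct toy_send_wr {
    struct toy_send_wr *next; /* Work Requests are chained */
    struct toy_sge *sg_list;
    int num_sge;
    uint32_t opcode;
};

/* Simplified WQE slot: control fields followed by data pointers. */
struct toy_slot {
    uint32_t opcode;
    uint16_t wqe_index;
    int num_sge;
    struct toy_sge sge[TOY_MAX_SGE];
};

struct toy_sq {
    struct toy_slot buf[TOY_WQE_CNT];
    uint32_t cur_post;        /* producer counter, never wrapped */
    uint32_t completed;       /* consumer counter, advanced on completion */
    uint32_t doorbell_record; /* value the device would read */
};

/* Write each chained Work Request into the ring, then publish the new
 * producer counter once. Returns the number of requests posted; on
 * failure *bad_wr points at the first request that was not posted. */
static int ring_post_send(struct toy_sq *sq, struct toy_send_wr *wr,
                          struct toy_send_wr **bad_wr)
{
    int nreq;

    assert(sq != NULL && bad_wr != NULL);
    for (nreq = 0; wr; ++nreq, wr = wr->next) {
        if (wr->num_sge < 0 || wr->num_sge > TOY_MAX_SGE ||
            sq->cur_post - sq->completed >= TOY_WQE_CNT) {
            *bad_wr = wr;
            break;
        }
        struct toy_slot *slot = &sq->buf[sq->cur_post & (TOY_WQE_CNT - 1)];
        /* 1. Control part. */
        slot->opcode = wr->opcode;
        slot->wqe_index = (uint16_t)sq->cur_post;
        slot->num_sge = wr->num_sge;
        /* 2. Data part, one entry per scatter-gather element. */
        for (int i = 0; i < wr->num_sge; i++)
            slot->sge[i] = wr->sg_list[i];
        sq->cur_post++;
    }
    if (nreq)
        sq->doorbell_record = sq->cur_post & 0xffff; /* "ring" the doorbell */
    return nreq;
}
```

Publishing the counter once per batch, after all slots are written, mirrors the structure of the loop above: the doorbell is rung after the loop, not once per Work Request.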

Observations

  • The WQE buffer is associated with the QP; therefore, it can be considered a part of the QP.
  • The Memory Translation Table (MTT) is used not only during Memory Region registration but also during QP creation. In this context, its primary function is to translate locations within the WQE buffer into I/O Virtual Addresses (IOVAs) or Physical Addresses (PAs) that the hardware can access. Because the RNIC fetches WQEs directly from the user-space buffer, posting a request does not require copying WQEs across the user-kernel boundary, which improves performance.
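
The translation described in the second observation can be illustrated with a toy lookup table. This is a conceptual sketch, not the hardware format: struct toy_mtt and toy_mtt_translate are hypothetical names, 4 KiB pages are assumed, and the table simply stores one DMA address per pinned page of the queue buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define XLT_PAGE_SHIFT 12u /* assumes 4 KiB pages */
#define XLT_PAGE_SIZE  (1ull << XLT_PAGE_SHIFT)

/* Toy translation table: one DMA address per pinned page. */
struct toy_mtt {
    const uint64_t *page_dma;
    size_t npages;
};

/* Translate a byte offset within the queue buffer into a DMA address.
 * Returns 0 on success, -1 if the offset lies outside the buffer. */
static int toy_mtt_translate(const struct toy_mtt *mtt, uint64_t offset,
                             uint64_t *dma)
{
    assert(mtt != NULL && dma != NULL);
    uint64_t idx = offset >> XLT_PAGE_SHIFT;
    if (idx >= mtt->npages)
        return -1;
    *dma = mtt->page_dma[idx] + (offset & (XLT_PAGE_SIZE - 1));
    return 0;
}
```

The pages of the buffer are contiguous in user virtual memory but need not be contiguous in DMA address space, which is why a per-page table is needed at all.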

