RDMA: Doorbell, UAR

LastUpdate: 2026-04-15, Author: HAO022

UAR, User Access Region

UAR

The UAR is a region within the PCI address space that contains device memory and registers, through which users can control the device. Its address range is exposed through a PCI BAR and mapped into the kernel address space via memory-mapped I/O (MMIO). A UAR typically occupies one system page. At the driver layer, different UARs are mapped into different processes to achieve resource isolation. Because the CPU can access the mapped memory directly, user-mode code avoids switching into kernel mode, which lowers latency. A typical use of the UAR is the Doorbell Register: a user-mode process writes directly to it to trigger a hardware operation. Mappings of this kind are used in NIC communication, storage services, and GPU computing. The primary components of the Mellanox NIC UAR page layout are:

  • Completion WQ DoorBells: Stored in CQ registers
  • Event WQ DoorBells: Stored in EQ registers
  • Send DoorBells: Stored in Blue Flame registers

Memory Layout

Blue Flame Reg

The Blue Flame Register is a mechanism implemented by Mellanox for fast access to hardware resources. The user writes WQEs directly to the device, so the device does not need to fetch them from host memory, which reduces latency. In terms of implementation, BlueFlame comprises a set of uniformly sized registers, each 512 B. Each register is further divided into two equally sized Blue Flame Buffers.
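The split of one register into two buffers can be modeled in a few lines. The sketch below assumes that software alternates between the two buffers on successive BlueFlame writes; the text above only states that the register is divided in two, so the alternation policy and all `bf_*` names are illustrative assumptions rather than the driver's API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one BlueFlame register split into two buffers. */
struct bf_reg_model {
    size_t reg_size;  /* size of the whole register, e.g. 512 bytes */
    size_t buf_size;  /* size of one BlueFlame buffer: reg_size / 2 */
    size_t offset;    /* offset of the buffer to use for the next write */
};

static void bf_reg_init(struct bf_reg_model *bf, size_t reg_size)
{
    /* The XOR toggle below relies on a power-of-two register size. */
    assert(reg_size != 0 && (reg_size & (reg_size - 1)) == 0);
    bf->reg_size = reg_size;
    bf->buf_size = reg_size / 2;
    bf->offset = 0;
}

/* Returns the offset to write the next WQE to, then switches buffers
 * (assumed policy: strict alternation between the two halves). */
static size_t bf_next_offset(struct bf_reg_model *bf)
{
    size_t cur = bf->offset;
    bf->offset ^= bf->buf_size;  /* 0 -> 256 -> 0 -> ... for 512 bytes */
    return cur;
}
```

With a 512 B register, successive calls to `bf_next_offset` yield offsets 0, 256, 0, and so on, so a new WQE never overwrites the buffer used by the immediately preceding write.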

Doorbell Record

The Doorbell Record is located in host memory (system physical memory). When an RQ/SQ is created, the address of its Doorbell Record is handed to the hardware. The Doorbell Record primarily stores sq_wqebb_counter for the SQ and wqe_counter for the RQ and SRQ. These counters indicate how many WQEs have been submitted to the hardware queue.

Submitting a Request to a Work Queue

  • Write the WQE to the WQE buffer, sequentially after the previously posted WQE (at WQEBB granularity).
  • Update the Doorbell Record associated with that queue by writing sq_wqebb_counter (for an SQ) or wqe_counter (for an RQ).
  • For send requests, ring the doorbell by writing to the Doorbell Register field in the UAR associated with that queue. For performance-critical send WQEs, the doorbell can be rung using the BlueFlame mechanism.

Note:

  1. For send requests, the third step guarantees that the WQE will be executed. The hardware may already have executed the WQE once the second step is complete. For WQEs that have already been executed, ringing the doorbell again is harmless; the hardware ignores the event.
  2. Software may submit WQEs in batches in the first step. The counter written in the second step then covers all of them.
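The three steps can be sketched as a small in-memory simulation. Everything here is a stand-in: the queue, the `sim_*` names, the 64-byte WQEBB size, the 16-bit big-endian counter format, and the value written to the "register" are assumptions made for illustration. A real driver writes device-defined content to the Doorbell Register and uses device-appropriate memory barriers rather than plain C11 fences.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define SIM_WQEBB_SIZE 64  /* assumed size of one WQE basic block */
#define SIM_SQ_WQEBBS  8   /* hypothetical queue depth */

/* Convert a host-order 32-bit value to big-endian byte order. */
static uint32_t sim_to_be32(uint32_t v)
{
    uint8_t b[4] = { (uint8_t)(v >> 24), (uint8_t)(v >> 16),
                     (uint8_t)(v >> 8), (uint8_t)v };
    uint32_t out;
    memcpy(&out, b, sizeof(out));
    return out;
}

struct sim_sq {
    uint8_t  wqe_buf[SIM_SQ_WQEBBS][SIM_WQEBB_SIZE]; /* WQE buffer */
    uint32_t dbrec_snd;     /* Doorbell Record: sq_wqebb_counter */
    uint32_t doorbell_reg;  /* stand-in for the UAR Doorbell Register */
    uint32_t cur_post;      /* software producer counter, in WQEBBs */
    unsigned doorbells_rung;
};

/* Step 1: write the WQE right after the previously posted one. */
static void sim_post_send(struct sim_sq *sq, const void *wqe, size_t len)
{
    assert(len <= SIM_WQEBB_SIZE);
    memcpy(sq->wqe_buf[sq->cur_post % SIM_SQ_WQEBBS], wqe, len);
    sq->cur_post++;
}

/* Steps 2 and 3: publish the counter, then ring the doorbell once. */
static void sim_flush(struct sim_sq *sq)
{
    /* Order the WQE stores before the Doorbell Record update. */
    atomic_thread_fence(memory_order_release);
    sq->dbrec_snd = sim_to_be32(sq->cur_post & 0xffff);

    /* Order the Doorbell Record update before the register write. */
    atomic_thread_fence(memory_order_release);
    sq->doorbell_reg = sq->dbrec_snd;  /* placeholder register content */
    sq->doorbells_rung++;
}
```

Calling `sim_post_send` twice and `sim_flush` once mirrors the batching case in the note: a single Doorbell Record update and a single doorbell cover both WQEs.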

Doorbell Record Mapping

ibv_create_qp → mlx5_create_qp → mlx5_alloc_dbrec → kernel create_user_qp:
  • Doorbell Record memory is allocated.
  • The virtual address of this memory is passed to the kernel driver.
  • The driver performs physical allocation, page pinning, DMA mapping, etc., for this memory.
  • Finally, it is assigned to the QPC, and the hardware resource QP is created.

User-mode code: providers/mlx5/verbs.c

mlx5_create_qp
        // Returns a segment of virtual memory.
        // This memory will be used for the send and receive Doorbell Records in the future.
        qp->db = mlx5_alloc_dbrec
        qp->db[MLX5_RCV_DBR] = 0;
        qp->db[MLX5_SND_DBR] = 0;

        // Pass to kernel.
        cmd.db_addr  = (uintptr_t) qp->db;
        ...

Kernel driver code:

mlx5_ib_create_qp create_user_qp
        // Allocates physical memory, pins pages, maps.
        mlx5_ib_db_map_user
        // Assigns the DMA address of this memory to the hardware.
        MLX5_SET64(qpc, qpc, dbr_addr, qp->db.dma);
        ...

mlx5_ib_db_map_user(unsigned long virt ...)
        page = kmalloc();
        page->user_virt = (virt & PAGE_MASK);
        page->umem = ib_umem_get();
        db->dma = sg_dma_address();
        ...
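The address arithmetic behind this mapping can be illustrated in isolation. The sketch below assumes a 4K page and assumes that the DMA address handed to the hardware is the DMA address of the pinned page plus the record's offset within that page; the `sim_*` helpers are hypothetical and do not call any kernel API.

```c
#include <assert.h>
#include <stdint.h>

#define SIM_PAGE_SHIFT 12                              /* assumed 4K pages */
#define SIM_PAGE_SIZE  (UINT64_C(1) << SIM_PAGE_SHIFT)
#define SIM_PAGE_MASK  (~(SIM_PAGE_SIZE - 1))

/* Page-aligned base of the user virtual address (virt & PAGE_MASK). */
static uint64_t sim_db_page_base(uint64_t virt)
{
    return virt & SIM_PAGE_MASK;
}

/* Offset of the Doorbell Record inside its page. */
static uint64_t sim_db_page_offset(uint64_t virt)
{
    return virt & ~SIM_PAGE_MASK;
}

/* Assumed rule: DMA address of the record is the DMA address of the
 * pinned page plus the record's offset inside that page. */
static uint64_t sim_db_dma_addr(uint64_t page_dma, uint64_t virt)
{
    assert((page_dma & ~SIM_PAGE_MASK) == 0);  /* page-aligned */
    return page_dma + sim_db_page_offset(virt);
}
```

Because only the page base identifies the pinned page, several Doorbell Records that live in the same user page can share one pinned and DMA-mapped page and differ only in their offsets.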

Doorbell Register Mapping

Mapping the register to user mode involves two calls, ibv_open_device and ibv_create_qp:

  • ibv_open_device → mlx5_alloc_context → … → kernel mlx5_ib_alloc_ucontext:
    • The device context is allocated.
    • UARs are allocated from the hardware.
    • User space maps the UARs.
  • ibv_create_qp → mlx5_create_qp → … → kernel create_user_qp:
    • A UAR ID is associated with the QP.
    • The index of an available BFREG is returned.

1. ibv_open_device UAR Mapping

  • ibv_open_device User-mode Code (1):

    ibv_open_device mlx5_alloc_context
            // Number of requested registers, default is 16.
            // Configurable via environment variable MLX5_TOTAL_UUARS.
            req.total_num_bfregs = context->tot_uuars;
            // Number of requested low-latency registers, default is 4.
            // Configurable via environment variable MLX5_NUM_LOW_LAT_UUARS.
            req.num_low_latency_bfregs = context->low_lat_uuars;
            req.lib_caps |= (MLX5_LIB_CAP_4K_UAR | MLX5_LIB_CAP_DYN_UAR);
    
            // Calls kernel mlx5_ib_alloc_ucontext to return device attributes
            mlx5_cmd_get_context(req, resp ... ) ...
            // Initializes device context and UAR mapping based on the returned device attributes.
            mlx5_set_context
    
  • alloc_ucontext Kernel-mode Code (1):

    The IB device allocates the kernel-mode device context via ib_device_ops.alloc_ucontext. The context is stored in struct ib_ucontext, which is embedded in the driver-private struct mlx5_ib_ucontext. Its member bfregi, of type struct mlx5_bfreg_info, maintains the UAR information.

    struct mlx5_ib_ucontext {
            struct ib_ucontext ibucontext;
            struct mlx5_bfreg_info bfregi;
            ...
    };
    
    struct mlx5_bfreg_info {
            u32 *sys_pages; // UAR ID
            int num_low_latency_bfregs;
            unsigned int *count;
            u32 num_sys_pages;
            u32 num_static_sys_pages;
            u32 total_num_bfregs;
            u32 num_dyn_bfregs;
    };
    

    The NIC hardware may operate on different platforms, and the page size of a platform depends on the kernel build option CONFIG_PAGE_SHIFT (typically 4K pages). Each UAR holds NUM_BFREGS_PER_UAR (4) BFREGs, of which MLX5_NUM_NON_FP_BFREGS_PER_UAR (2) are used for the slow path (non fast path). In the kernel implementation, mlx5_ib_alloc_ucontext calls calc_total_bfregs to initialize mlx5_bfreg_info from the request and the hardware configuration, and then calls allocate_uars to allocate the hardware UARs. The command parameters and return values for the Mellanox hardware UAR resource creation command are shown in the following figure.

    calc_total_bfregs
            // One UAR corresponds to one system page (4K).
            // Each UAR contains NUM_BFREGS_PER_UAR (4) BFREGs,
            // two of which are used for the non-fast path MLX5_NUM_NON_FP_BFREGS_PER_UAR.
            // 
            // req->total_num_bfregs = 16 static registers (the number requested by user mode).
            // Since each UAR provides MLX5_NUM_NON_FP_BFREGS_PER_UAR (2) of these registers,
            // 16 / 2 = 8 UARs are required, i.e., 8 system pages.
            bfregi->num_static_sys_pages = 8;
            bfregi->num_dyn_bfregs = 1024; // Dynamic registers, default is 1024
            // Dynamic + Static registers: 1040
            bfregi->total_num_bfregs = req->total_num_bfregs + bfregi->num_dyn_bfregs;
            // Total system pages required: 8
            bfregi->num_sys_pages = req->total_num_bfregs / 2;
            bfregi->sys_pages = kcalloc(bfregi->num_sys_pages, ....);
    
    allocate_uars // Allocates num_static_sys_pages hardware UAR resources
            for (i = 0; i < bfregi->num_static_sys_pages; i++) {
                // sys_pages stores the UAR ID. This information is returned by the hardware.
                mlx5_cmd_alloc_uar(dev->mdev, &bfregi->sys_pages[i]);
            }
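The arithmetic in calc_total_bfregs can be checked with a standalone sketch. It reproduces only the numbers discussed above (16 requested registers, 2 counted registers per UAR, 1024 dynamic registers); rounding the request up to a whole number of pages is an assumption, and the `sim_*` names are hypothetical rather than kernel symbols.

```c
#include <assert.h>
#include <stdint.h>

#define SIM_NON_FP_BFREGS_PER_UAR 2     /* registers counted per UAR */
#define SIM_DYN_BFREGS            1024  /* default number of dynamic BFREGs */

struct sim_bfreg_info {
    uint32_t num_static_sys_pages;
    uint32_t num_dyn_bfregs;
    uint32_t total_num_bfregs;
};

/* Round v up to the next multiple of a. */
static uint32_t sim_align_up(uint32_t v, uint32_t a)
{
    return (v + a - 1) / a * a;
}

static void sim_calc_total_bfregs(uint32_t req_total,
                                  uint32_t uars_per_sys_page,
                                  struct sim_bfreg_info *out)
{
    uint32_t per_page = uars_per_sys_page * SIM_NON_FP_BFREGS_PER_UAR;

    assert(req_total > 0 && per_page > 0);

    /* Assumption: the request is rounded up to whole system pages. */
    uint32_t static_bfregs = sim_align_up(req_total, per_page);

    out->num_static_sys_pages = static_bfregs / per_page;
    out->num_dyn_bfregs = SIM_DYN_BFREGS;
    out->total_num_bfregs = static_bfregs + out->num_dyn_bfregs;
}
```

For a request of 16 registers with one UAR per system page, this yields 8 static system pages and 1040 registers in total, matching the comments in the trace above.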
    

  • ibv_open_device User-mode Code (2):

    The user-mode program completes the UAR resource mapping based on the information returned by the kernel:

    (gdb) p *resp
    $4 = {
            bf_reg_size = 512,              // BF register memory size
            tot_bfregs = 16,                // Number of static BF registers
            num_dyn_bfregs = 1024,          // Number of dynamic BF registers
            num_uars_per_page = 1,          // Number of UARs contained per system page
            log_uar_size = 12,              // UAR memory equals 4K
            qp_tab_size = 262144,
            cache_line_size = 64,
            max_sq_desc_sz = 1024,
            max_rq_desc_sz = 512,
            max_send_wqebb = 32768,
            max_recv_wr = 32768,
            max_srq_recv_wr = 32768,
            num_ports = 1,
            flow_action_flags = 0,
            comp_mask = 3,
            response_length = 72,
            cqe_version = 1 '\001',
            cmds_supp_uhw = 3 '\003',
            eth_min_inline = 2 '\002',
            clock_info_versions = 1 '\001',
            hca_core_clock_offset = 0,
            dump_fill_mkey = 1792
    }
    
    ibv_open_device ... mlx5_alloc_context ... mlx5_set_context
            ...
            // Total 32 BFREG registers
            gross_uuars = context->tot_uuars / MLX5_NUM_NON_FP_BFREGS_PER_UAR * NUM_BFREGS_PER_UAR;
            context->bfs = calloc(gross_uuars, sizeof(*context->bfs));
    
            num_sys_page_map = context->tot_uuars / (context->num_uars_per_page * MLX5_NUM_NON_FP_BFREGS_PER_UAR);
    
            // 8 UAR memory mappings
            for (i = 0; i < num_sys_page_map; ++i) {
                // context->uar mapping
                mlx5_mmap(&context->uar[i], i, cmd_fd, ... MLX5_UAR_TYPE_REGULAR);
            }
    
            // Flattens context->uar[].reg into the array context->bfs[].reg
            ...
    
  • mlx5_mmap Kernel-mode Code (2):

    mmap exposes hardware resources (such as registers and memory regions) as addresses accessible to user space. This lets the user-mode program bypass the kernel and operate on these resources directly, avoiding context switches and data copies, which reduces latency and increases throughput. In the Mellanox implementation, mmap performs the mapping of UAR resources in three steps: the index passed from user mode is extracted, the PCIe address is looked up from this index, and the device address is remapped onto the user-mode vma address.

    ib_device_ops.mmap = mlx5_ib_mmap
    
    int mlx5_ib_mmap(struct ib_ucontext *ibcontext, struct vm_area_struct *vma)
            // The offset contains the operation type and the UAR INDEX.
            command = get_command(vma->vm_pgoff);
            case MLX5_IB_MMAP_REGULAR_PAGE:
            // 1. The command operation is obtained according to vm_pgoff.
            // 2. The UAR index is obtained according to vm_pgoff, and the device physical address is derived from it.
            // 3. The physical address and the vma virtual address are remapped.
            uar_mmap(dev, command, vma, context);
            ...
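How a command and a UAR index can travel together in the mmap offset is easy to show with bit operations. The layout below (command in bits 8 to 15 of the page offset, index in the low 8 bits) is an assumption chosen for illustration, and the `sim_*` names are hypothetical; consult the driver headers for the actual encoding.

```c
#include <assert.h>
#include <stdint.h>

#define SIM_MMAP_CMD_SHIFT 8     /* assumed position of the command field */
#define SIM_MMAP_CMD_MASK  0xff  /* assumed width of the command field */
#define SIM_MMAP_IDX_MASK  0xff  /* assumed width of the index field */

/* User side: pack command and UAR index into a page offset. */
static uint64_t sim_encode_pgoff(unsigned cmd, unsigned idx)
{
    assert(cmd <= SIM_MMAP_CMD_MASK && idx <= SIM_MMAP_IDX_MASK);
    return ((uint64_t)cmd << SIM_MMAP_CMD_SHIFT) | idx;
}

/* Kernel side: recover the command from vma->vm_pgoff. */
static unsigned sim_get_command(uint64_t pgoff)
{
    return (unsigned)((pgoff >> SIM_MMAP_CMD_SHIFT) & SIM_MMAP_CMD_MASK);
}

/* Kernel side: recover the UAR index from vma->vm_pgoff. */
static unsigned sim_get_index(uint64_t pgoff)
{
    return (unsigned)(pgoff & SIM_MMAP_IDX_MASK);
}

/* mmap(2) takes a byte offset, so the page offset is scaled by the
 * page size before the call; the kernel sees it again as vm_pgoff. */
static uint64_t sim_mmap_offset(uint64_t pgoff, uint64_t page_size)
{
    return pgoff * page_size;
}
```

Encoding the request in the offset lets a single mmap entry point serve several mapping types without an extra system call to select the resource first.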
    
  • Mapping Relationship
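The relationship between mapped UAR pages and the flat bfs[] array in mlx5_set_context can be sketched with plain index arithmetic. The counts follow the formulas quoted above; the flat-index formula and the byte-offset formula are assumptions about how the flattening is laid out, `bf_base` is a hypothetical parameter for where the BlueFlame area starts inside a UAR, and the `sim_*` names are not library symbols.

```c
#include <assert.h>
#include <stddef.h>

#define SIM_BFREGS_PER_UAR        4
#define SIM_NON_FP_BFREGS_PER_UAR 2
#define SIM_UAR_SIZE              4096  /* 1 << log_uar_size (12) */

/* Total BFREG slots tracked in user space (32 for tot_uuars = 16). */
static unsigned sim_gross_uuars(unsigned tot_uuars)
{
    return tot_uuars / SIM_NON_FP_BFREGS_PER_UAR * SIM_BFREGS_PER_UAR;
}

/* Number of system pages to mmap (8 for tot_uuars = 16). */
static unsigned sim_num_sys_page_map(unsigned tot_uuars,
                                     unsigned uars_per_page)
{
    assert(uars_per_page > 0);
    return tot_uuars / (uars_per_page * SIM_NON_FP_BFREGS_PER_UAR);
}

/* Assumed flat index of BFREG k in UAR `uar` of mapped page `page`. */
static unsigned sim_bf_index(unsigned page, unsigned uar, unsigned k,
                             unsigned uars_per_page)
{
    assert(uar < uars_per_page && k < SIM_BFREGS_PER_UAR);
    return (page * uars_per_page + uar) * SIM_BFREGS_PER_UAR + k;
}

/* Assumed byte offset of that BFREG inside the mapped page. */
static size_t sim_bf_offset(unsigned uar, unsigned k,
                            size_t bf_base, size_t bf_reg_size)
{
    return (size_t)uar * SIM_UAR_SIZE + bf_base + (size_t)k * bf_reg_size;
}
```

With the values from the gdb dump (tot_bfregs = 16, num_uars_per_page = 1, bf_reg_size = 512), this gives 8 mapped pages and 32 slots, with the last slot at flat index 31.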

2. ibv_create_qp UAR QP Association

mlx5_create_qp completes the association of the QP with a specific UAR.

  • User-mode code:

    mlx5_create_qp
            // Creates QP.
            ibv_cmd_create_qp_ex
    
            // 1. The newly created QP is not associated with every UAR resource.
            //    The UAR to use is determined by the returned bfreg_index.
            // 2. That register is assigned to the QP: qp->bf = &ctx->bfs[uuar_index];
            //    (16 registers are available after ibv_open_device.)
            map_uuar(context, qp, resp_drv->bfreg_index, bf);
    
  • Kernel code:

    create_user_qp
            // 1. Selects an available bfreg from the already allocated context->bfregi.
            bfregn = alloc_bfreg(dev, &context->bfregi);
            // 2. Looks up the sys_pages entry for this bfreg; the entry holds the UAR ID.
            uar_index = bfregn_to_uar_index(dev, &context->bfregi, bfregn, false);
            // 3. Associates the QP with the UAR ID.
            MLX5_SET(qpc, qpc, uar_page, uar_index);
            // 4. Returns the associated index.
            resp->bfreg_index = ...;
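Step 2 above, turning a BFREG number into a UAR ID, can be sketched as a table lookup. The sketch assumes that each system page contributes `uars_per_sys_page * 2` BFREG numbers and that UAR IDs within one system page are consecutive; the function name and parameters are hypothetical, and dynamic BFREGs are left out.

```c
#include <assert.h>
#include <stdint.h>

#define SIM_NON_FP_BFREGS_PER_UAR 2

/* sys_pages[i] holds the UAR ID returned by the hardware for page i. */
static uint32_t sim_bfregn_to_uar_index(const uint32_t *sys_pages,
                                        uint32_t num_sys_pages,
                                        uint32_t uars_per_sys_page,
                                        uint32_t bfregn)
{
    uint32_t per_page = uars_per_sys_page * SIM_NON_FP_BFREGS_PER_UAR;
    uint32_t page = bfregn / per_page;

    /* Which UAR inside the page (always 0 when there is one UAR per page). */
    uint32_t uar_in_page = bfregn % per_page / SIM_NON_FP_BFREGS_PER_UAR;

    assert(page < num_sys_pages);
    return sys_pages[page] + uar_in_page;
}
```

With one UAR per page, BFREG numbers 0 and 1 resolve to the first UAR ID, 2 and 3 to the second, and so on, which is why two QPs can end up on the same UAR page while using different registers.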
    

Diagram

Considerations

  • During ibv_open_device, the mlx5 driver by default allocates and maps 8 UARs providing 16 registers. If ibv_create_qp is called more than 16 times within the same ibv_context, BFREGs will be reused across QPs.
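The reuse can be illustrated with a per-register usage counter, similar in spirit to the count array in mlx5_bfreg_info. This is a deliberately simplified least-used policy and not the driver's actual allocator, which may treat some registers differently (for example the low-latency ones); all names are hypothetical.

```c
#include <assert.h>

#define SIM_NUM_BFREGS 16  /* static registers available by default */

struct sim_bfreg_pool {
    unsigned count[SIM_NUM_BFREGS];  /* QPs currently using each BFREG */
};

/* Pick the least-used BFREG (lowest index wins ties) and take a reference. */
static int sim_alloc_bfreg(struct sim_bfreg_pool *p)
{
    int best = 0;

    for (int i = 1; i < SIM_NUM_BFREGS; i++) {
        if (p->count[i] < p->count[best])
            best = i;
    }
    p->count[best]++;
    return best;
}

/* Drop the reference taken by sim_alloc_bfreg. */
static void sim_free_bfreg(struct sim_bfreg_pool *p, int n)
{
    assert(n >= 0 && n < SIM_NUM_BFREGS && p->count[n] > 0);
    p->count[n]--;
}
```

In this model the first 16 QPs each get their own register, and the 17th shares register 0 with the first QP, so writes from those two QPs to that register have to be serialized by software.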
