LastUpdate: 2026-04-15, Author: HAO022
UAR, User Access Region
UAR
The UAR is a region within the PCI address space that contains device memory and registers; users control the device through these registers. The UAR's address range is exposed as a PCI BAR and mapped into kernel space via memory-mapped I/O (MMIO). A UAR typically occupies one system page of memory. At the driver layer, different UARs are mapped for different processes to achieve resource isolation. Because the CPU can access the mapped memory directly, a user-mode process avoids switching between user mode and kernel mode, which lowers latency. The most typical use of the UAR is the Doorbell Register: a user-mode process writes directly to the Doorbell Register to trigger a hardware operation. UARs are used in scenarios such as NIC communication, storage services, and GPU computing. The primary components of the Mellanox NIC UAR page layout are:
- Completion WQ DoorBells: Stored in CQ registers
- Event WQ DoorBells: Stored in EQ registers
- Send DoorBells: Stored in Blue Flame registers
Memory Layout

Blue Flame Reg
The Blue Flame Register is a mechanism implemented by Mellanox for fast access to hardware resources. The user writes WQEs directly to the device, so the device does not need to fetch them from host memory, which reduces latency. In terms of implementation, BlueFlame comprises a set of uniformly sized registers, each 512B. Each register is further divided into two equally sized Blue Flame Buffers.
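The register layout described above can be modeled with a few lines of C. This is only a sketch: the struct and function names are hypothetical, no device memory is touched, and alternating between the two buffers on successive writes is an assumption of the sketch rather than something stated in this article.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one Blue Flame register (names are illustrative). */
struct bf_reg_model {
    uint32_t reg_size;  /* size of the whole register, 512 B in the text */
    uint32_t buf_size;  /* size of one Blue Flame buffer: reg_size / 2 */
    uint32_t offset;    /* offset of the buffer used for the next write */
};

static void bf_reg_init(struct bf_reg_model *bf, uint32_t reg_size)
{
    assert(reg_size != 0 && reg_size % 2 == 0);
    bf->reg_size = reg_size;
    bf->buf_size = reg_size / 2;
    bf->offset = 0;
}

/* Return the offset for the next WQE write, then switch to the other buffer
 * (assumed policy: strict alternation between the two halves). */
static uint32_t bf_reg_next_offset(struct bf_reg_model *bf)
{
    uint32_t cur = bf->offset;

    bf->offset ^= bf->buf_size;  /* toggles between 0 and buf_size */
    return cur;
}
```

With a 512B register, successive calls to `bf_reg_next_offset` yield offsets 0, 256, 0, and so on.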
Doorbell Record
The Doorbell Record is located in system physical memory. When an RQ/SQ is created, the memory address of its Doorbell Record is handed to the hardware. The Doorbell Record primarily stores sq_wqebb_counter for the SQ and wqe_counter for the RQ and SRQ. These counters indicate the number of WQEs that have been submitted to the hardware queue.
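As a rough illustration of what such a record could look like in memory, the sketch below keeps one receive counter and one send counter. The field order, the 16-bit counter wrap, and the big-endian encoding are assumptions of this sketch, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory image of one Doorbell Record. */
struct dbrec_model {
    uint8_t rcv[4];  /* wqe_counter for the RQ/SRQ */
    uint8_t snd[4];  /* sq_wqebb_counter for the SQ */
};

/* Store a 32-bit value in big-endian byte order (assumed device format). */
static void store_be32(uint8_t out[4], uint32_t v)
{
    out[0] = (uint8_t)(v >> 24);
    out[1] = (uint8_t)(v >> 16);
    out[2] = (uint8_t)(v >> 8);
    out[3] = (uint8_t)v;
}

static uint32_t load_be32(const uint8_t in[4])
{
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8) | (uint32_t)in[3];
}

/* Publish the number of WQEBBs posted to the SQ (assumed to wrap at 16 bits). */
static void dbrec_set_send(struct dbrec_model *db, uint32_t posted_wqebbs)
{
    assert(db != NULL);
    store_be32(db->snd, posted_wqebbs & 0xffffu);
}

/* Publish the number of WQEs posted to the RQ (same assumed wrap). */
static void dbrec_set_recv(struct dbrec_model *db, uint32_t posted_wqes)
{
    assert(db != NULL);
    store_be32(db->rcv, posted_wqes & 0xffffu);
}
```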

Submitting a Request to a Work Queue
- Write the WQE to the WQE buffer, sequentially after the previously posted WQE (at WQEBB granularity).
- Update the Doorbell Record associated with that queue by writing sq_wqebb_counter or wqe_counter, for the send queue and RQ respectively.
- For a send request, ring the DoorBell by writing to the Doorbell Register field in the UAR associated with that queue. For performance-critical send WQEs, the DoorBell can be rung using the BlueFlame mechanism.
Note: 1. For send requests, the third step is performed to guarantee that the WQE is executed. The hardware may already have executed the WQE after the second step. Ringing the doorbell for WQEs that have already been executed does no harm; the hardware ignores the event. 2. Software may write a batch of WQEs in the first step; the counter updated in the second step then covers all of them.
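The three steps and the two notes can be walked through with a small software model. This is a sketch under stated assumptions: the queue, the "hardware", and every name below are hypothetical stand-ins, one `uint32_t` represents one WQEBB, and the memory barriers a real driver needs between the steps are omitted.

```c
#include <assert.h>
#include <stdint.h>

#define SQ_DEPTH 8u  /* hypothetical queue depth in WQEBBs */

struct sq_model {
    uint32_t wqe[SQ_DEPTH];  /* stand-in for the WQE buffer */
    uint32_t head;           /* WQEBBs written by software so far */
    uint32_t dbrec;          /* stand-in for sq_wqebb_counter in the Doorbell Record */
    uint32_t doorbell;       /* stand-in for the Doorbell Register in the UAR */
    uint32_t hw_executed;    /* WQEBBs the modeled hardware has executed */
};

/* Step 1: write WQEs sequentially after the previously posted ones. */
static void sq_write_wqes(struct sq_model *sq, const uint32_t *wqes, uint32_t n)
{
    assert(n <= SQ_DEPTH - (sq->head - sq->hw_executed));
    for (uint32_t i = 0; i < n; i++)
        sq->wqe[(sq->head + i) % SQ_DEPTH] = wqes[i];
    sq->head += n;
}

/* Step 2: publish the counter; it covers every WQE written so far. */
static void sq_update_dbrec(struct sq_model *sq)
{
    sq->dbrec = sq->head;
}

/* Modeled hardware: execute everything up to the Doorbell Record counter
 * and report how many WQEBBs were newly executed. */
static uint32_t hw_process(struct sq_model *sq)
{
    uint32_t n = sq->dbrec - sq->hw_executed;

    sq->hw_executed = sq->dbrec;
    return n;
}

/* Step 3: ring the doorbell, which makes the modeled hardware look at the queue. */
static uint32_t sq_ring_doorbell(struct sq_model *sq)
{
    sq->doorbell = sq->head;
    return hw_process(sq);
}
```

In this model, posting a batch of three WQEs needs only one Doorbell Record update and one doorbell. If `hw_process` already ran after step 2, the later `sq_ring_doorbell` call returns 0, which mirrors the note that a doorbell for already-executed WQEs is ignored.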
Doorbell Record Mapping
ibv_create_qp … mlx5_create_qp … mlx5_alloc_dbrec … kernel create_user_qp:
- Doorbell Record memory is allocated.
- The virtual address of this memory is passed to the kernel driver.
- The driver performs physical allocation, page pinning, DMA mapping, etc., for this memory.
- Finally, it is assigned to the QPC, and the hardware resource QP is created.
User-mode code:
providers/mlx5/verbs.c
mlx5_create_qp
// Returns a segment of virtual memory.
// This memory will be used for the send and receive Doorbell Records in the future.
qp->db = mlx5_alloc_dbrec(...);
qp->db[MLX5_RCV_DBR] = 0;
qp->db[MLX5_SND_DBR] = 0;
// Pass to kernel.
cmd.db_addr = (uintptr_t) qp->db;
...
Kernel driver code:
mlx5_ib_create_qp create_user_qp
// Allocates physical memory, pins pages, maps.
mlx5_ib_db_map_user
// Assigns the DMA address of this memory to the hardware.
MLX5_SET64(qpc, qpc, dbr_addr, qp->db.dma);
...
mlx5_ib_db_map_user(unsigned long virt ...)
page = kmalloc();
page->user_virt = (virt & PAGE_MASK);
page->umem = ib_umem_get();
db->dma = sg_dma_address();
...
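The address handling in mlx5_ib_db_map_user can be pictured as splitting the user virtual address into a page base and an in-page offset. The sketch below is a user-space model only: the 4K page size, the names, and the step of adding the in-page offset to the page's DMA address are assumptions of the sketch, and the DMA address is supplied by the caller rather than obtained from a real mapping.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096u  /* assumed 4K system page */
#define MODEL_PAGE_MASK (~(uintptr_t)(MODEL_PAGE_SIZE - 1))

/* Hypothetical bookkeeping for one pinned user page. */
struct db_page_model {
    uintptr_t user_virt;  /* page-aligned user virtual address (virt & PAGE_MASK) */
    uint64_t  page_dma;   /* DMA address of that page, given by the caller */
};

/* DMA address of a Doorbell Record located at user address virt. */
static uint64_t db_dma_address(const struct db_page_model *page, uintptr_t virt)
{
    /* The record must live inside the page this entry describes. */
    assert((virt & MODEL_PAGE_MASK) == page->user_virt);
    return page->page_dma + (uint64_t)(virt & ~MODEL_PAGE_MASK);
}
```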
Doorbell Register Mapping
Mapping the register to user mode involves both ibv_open_device and ibv_create_qp:
ibv_open_device … mlx5_alloc_context … kernel mlx5_ib_alloc_ucontext:
- Device context is allocated.
- Hardware allocates UAR.
- User maps UAR.
ibv_create_qp … mlx5_create_qp … kernel create_user_qp:
- UAR ID is associated with QP.
- An available UAR index is returned.
1. ibv_open_device UAR Mapping
- ibv_open_device User-mode Code (1):
ibv_open_device
mlx5_alloc_context
// Number of requested registers, default is 16.
// Configurable via environment variable MLX5_TOTAL_UUARS.
req.total_num_bfregs = context->tot_uuars;
// Number of requested low-latency registers, default is 4.
// Configurable via environment variable MLX5_NUM_LOW_LAT_UUARS.
req.num_low_latency_bfregs = context->low_lat_uuars;
req.lib_caps |= (MLX5_LIB_CAP_4K_UAR | MLX5_LIB_CAP_DYN_UAR);
// Calls kernel mlx5_ib_alloc_ucontext to return device attributes
mlx5_cmd_get_context(req, resp ... )
...
// Initializes device context and UAR mapping based on the returned device attributes.
mlx5_set_context
- alloc_ucontext Kernel-mode Code (1):
The IB device allocates the kernel-mode device context via ib_device_ops.alloc_ucontext. This information is stored in the structure ib_ucontext or the private structure mlx5_ib_ucontext. The mlx5_ib_ucontext member bfregi, of type mlx5_bfreg_info, maintains UAR information.
struct mlx5_ib_ucontext {
    struct ib_ucontext ibucontext;
    struct mlx5_bfreg_info bfregi;
    ...
};
struct mlx5_bfreg_info {
    u32 *sys_pages; // UAR ID
    int num_low_latency_bfregs;
    unsigned int *count;
    u32 num_sys_pages;
    u32 num_static_sys_pages;
    u32 total_num_bfregs;
    u32 num_dyn_bfregs;
};
The NIC hardware may operate on different platforms. The page size of a platform is set by the kernel build option CONFIG_PAGE_SHIFT; pages are typically 4K. Note also that the total number of BFREGs per UAR is NUM_BFREGS_PER_UAR (4). Among these, MLX5_NUM_NON_FP_BFREGS_PER_UAR (2) are used for the slow path (non-fast path). In the kernel implementation, mlx5_ib_alloc_ucontext calls calc_total_bfregs to initialize mlx5_bfreg_info based on the request and hardware configuration information. Finally, mlx5_ib_alloc_ucontext calls allocate_uars to allocate hardware UARs. The command parameters and return values for the Mellanox hardware UAR resource creation command are shown in the following figure.
calc_total_bfregs
// One UAR corresponds to one system page (4K).
// Each UAR contains NUM_BFREGS_PER_UAR (4) BFREGs,
// two of which are used for the non-fast path MLX5_NUM_NON_FP_BFREGS_PER_UAR.
//
// total_num_bfregs: 16 static registers (the number of registers requested by user mode).
// Since each UAR supports MLX5_NUM_NON_FP_BFREGS_PER_UAR (2) slow-path registers,
// 8 UARs are required, i.e., 8 system pages.
bfregi->num_static_sys_pages = 8;
bfregi->num_dyn_bfregs = 1024; // Dynamic registers, default is 1024
// Dynamic + Static registers: 1040
bfregi->total_num_bfregs = req->total_num_bfregs + bfregi->num_dyn_bfregs;
// Total system pages required: 8
bfregi->num_sys_pages = req->total_num_bfregs / 2;
bfregi->sys_pages = kcalloc(bfregi->num_sys_pages, ....);
allocate_uars
// Allocates num_static_sys_pages hardware UAR resources
for (i = 0; i < bfregi->num_static_sys_pages; i++) {
    // sys_pages stores the UAR ID. This information is returned by the hardware.
    mlx5_cmd_alloc_uar(dev->mdev, &bfregi->sys_pages[i]);
}
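The arithmetic in calc_total_bfregs can be checked with a small worked example. The function and struct below are hypothetical and only reproduce the numbers quoted in the comments above (16 requested registers, 1024 dynamic registers, one UAR per system page); they are not the kernel routine itself.

```c
#include <assert.h>
#include <stdint.h>

#define BFREGS_PER_UAR        4u  /* NUM_BFREGS_PER_UAR in the text */
#define NON_FP_BFREGS_PER_UAR 2u  /* MLX5_NUM_NON_FP_BFREGS_PER_UAR in the text */

/* Hypothetical result of the sizing calculation. */
struct bfreg_plan {
    uint32_t num_static_sys_pages;  /* UARs (system pages) for the static registers */
    uint32_t total_num_bfregs;      /* static + dynamic registers */
    uint32_t gross_bfregs;          /* all BFREGs that exist on the static UARs */
};

static struct bfreg_plan plan_bfregs(uint32_t requested, uint32_t num_dyn,
                                     uint32_t uars_per_page)
{
    uint32_t per_page = uars_per_page * NON_FP_BFREGS_PER_UAR;
    struct bfreg_plan p;

    /* The sketch assumes the request is already a multiple of per_page. */
    assert(per_page != 0 && requested % per_page == 0);
    p.num_static_sys_pages = requested / per_page;
    p.total_num_bfregs = requested + num_dyn;
    p.gross_bfregs = requested / NON_FP_BFREGS_PER_UAR * BFREGS_PER_UAR;
    return p;
}
```

For `plan_bfregs(16, 1024, 1)` this gives 8 static system pages, 1040 registers in total, and 32 BFREGs on the static UARs, matching the figures used in this article.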

- ibv_open_device User-mode Code (2):
The user-mode program completes the UAR resource mapping based on the information returned by the kernel:
(gdb) p *resp
$4 = {
  bf_reg_size = 512,       // BF register memory size
  tot_bfregs = 16,         // Number of static BF registers
  num_dyn_bfregs = 1024,   // Number of dynamic BF registers
  num_uars_per_page = 1,   // Number of UARs contained per system page
  log_uar_size = 12,       // UAR memory equals 4K
  qp_tab_size = 262144,
  cache_line_size = 64,
  max_sq_desc_sz = 1024,
  max_rq_desc_sz = 512,
  max_send_wqebb = 32768,
  max_recv_wr = 32768,
  max_srq_recv_wr = 32768,
  num_ports = 1,
  flow_action_flags = 0,
  comp_mask = 3,
  response_length = 72,
  cqe_version = 1 '\001',
  cmds_supp_uhw = 3 '\003',
  eth_min_inline = 2 '\002',
  clock_info_versions = 1 '\001',
  hca_core_clock_offset = 0,
  dump_fill_mkey = 1792
}
ibv_open_device
...
mlx5_alloc_context
...
mlx5_set_context
...
// Total 32 BFREG registers
gross_uuars = context->tot_uuars / MLX5_NUM_NON_FP_BFREGS_PER_UAR * NUM_BFREGS_PER_UAR;
context->bfs = calloc(gross_uuars, sizeof(*context->bfs));
num_sys_page_map = context->tot_uuars / (context->num_uars_per_page * MLX5_NUM_NON_FP_BFREGS_PER_UAR);
// 8 UAR memory mappings
for (i = 0; i < num_sys_page_map; ++i) {
    // context->uar mapping
    mlx5_mmap(&context->uar[i], i, cmd_fd, ... MLX5_UAR_TYPE_REGULAR);
}
// Flattens context->uar[].reg into the array context->bfs[].reg
...
- mlx5_mmap Kernel-mode Code (2):
mmap exposes hardware resources (such as registers and memory regions) as addresses accessible to user space. This lets the user-mode program bypass the kernel and operate on these hardware resources directly, which avoids context switches and data copying, reducing latency and increasing throughput. In the Mellanox implementation, mmap implements the mapping of UAR resources. The UAR mapping proceeds as follows: the index passed from user mode is obtained; the PCIe address is derived from this index; finally, the device address is remapped to the user-mode vma address.
ib_device_ops.mmap = mlx5_ib_mmap
int mlx5_ib_mmap(struct ib_ucontext *ibcontext, struct vm_area_struct *vma)
// The offset contains the operation type and the UAR INDEX.
command = get_command(vma->vm_pgoff);
case MLX5_IB_MMAP_REGULAR_PAGE:
    // 1. The command operation is obtained according to vm_pgoff.
    // 2. The UAR index is obtained according to vm_pgoff, and the device physical address is derived from it.
    // 3. The physical address and the vma virtual address are remapped.
    uar_mmap(dev, command, vma, context);
...
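The idea that the mmap offset carries both an operation type and a UAR index can be sketched as a simple bit-packing scheme. The bit positions, masks, and helper names below are assumptions chosen for illustration; they are not taken from the driver headers.

```c
#include <assert.h>
#include <stdint.h>

#define MMAP_CMD_SHIFT 8u     /* assumed: command sits above an 8-bit index */
#define MMAP_CMD_MASK  0xffu
#define MMAP_IDX_MASK  0xffu

/* User side: pack command and UAR index into a page offset. */
static uint64_t mmap_pgoff_encode(uint32_t command, uint32_t index)
{
    assert(command <= MMAP_CMD_MASK && index <= MMAP_IDX_MASK);
    return ((uint64_t)command << MMAP_CMD_SHIFT) | index;
}

/* Kernel side: recover the command from the page offset. */
static uint32_t mmap_pgoff_command(uint64_t pgoff)
{
    return (uint32_t)((pgoff >> MMAP_CMD_SHIFT) & MMAP_CMD_MASK);
}

/* Kernel side: recover the UAR index from the page offset. */
static uint32_t mmap_pgoff_index(uint64_t pgoff)
{
    return (uint32_t)(pgoff & MMAP_IDX_MASK);
}

/* mmap() takes a byte offset, so the page offset is scaled by the page size. */
static uint64_t mmap_byte_offset(uint64_t pgoff, uint32_t page_size)
{
    return pgoff * page_size;
}
```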
Mapping Relationship
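The relationship between the mapped UAR pages and the flattened bfs[] array can be sketched as follows. The sketch assumes one UAR per system page (as in the response shown above), and it assumes that the BlueFlame area starts at offset 0x800 inside a UAR; that offset and all names are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BFREGS_PER_UAR 4u      /* NUM_BFREGS_PER_UAR in the text */
#define BF_AREA_OFFSET 0x800u  /* assumed start of the BlueFlame area in a UAR */

/* Byte offset, within one mapped UAR page, of BFREG number k. */
static size_t bfreg_offset_in_uar(uint32_t k, uint32_t bf_reg_size)
{
    assert(k < BFREGS_PER_UAR);
    return BF_AREA_OFFSET + (size_t)k * bf_reg_size;
}

/* Flatten (UAR page, BFREG k) into an index of the bfs[] array. */
static uint32_t bfs_index(uint32_t uar_page, uint32_t k)
{
    assert(k < BFREGS_PER_UAR);
    return uar_page * BFREGS_PER_UAR + k;
}

/* Fill bfs_reg[] with register addresses derived from the per-page mappings. */
static void flatten_uars(uint8_t *const uar_base[], uint32_t num_pages,
                         uint32_t bf_reg_size, uint8_t *bfs_reg[])
{
    for (uint32_t i = 0; i < num_pages; i++)
        for (uint32_t k = 0; k < BFREGS_PER_UAR; k++)
            bfs_reg[bfs_index(i, k)] =
                uar_base[i] + bfreg_offset_in_uar(k, bf_reg_size);
}
```

With 8 mapped pages and 4 BFREGs per UAR, the last entry is `bfs_index(7, 3) == 31`, which agrees with the 32 gross registers computed in mlx5_set_context.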

2. ibv_create_qp UAR QP Association
mlx5_create_qp completes the association of the QP with a specific UAR.
- User-mode code:
mlx5_create_qp
// Creates QP.
ibv_cmd_create_qp_ex
// 1. The newly created QP is not associated with all UAR resources.
//    Instead, the UAR to be used is determined based on the returned bfreg_index.
//
// 2. Assigns this register to qp:
//    qp->bf = &ctx->bfs[uuar_index];
//    In fact, 16 registers are supported after ibv_open_device.
map_uuar(context, qp, resp_drv->bfreg_index, bf);
- Kernel code:
create_user_qp
// 1. Selects an available bfreg from the already allocated context->bfregi.
bfregn = alloc_bfreg(dev, &context->bfregi);
// 2. Obtains the sys_page based on bfreg, storing the UAR ID.
uar_index = bfregn_to_uar_index(dev, &context->bfregi, bfregn, false);
// 3. Associates the QP with the UAR ID.
MLX5_SET(qpc, qpc, uar_page, uar_index);
// 4. Returns the associated index.
resp->bfreg_index = ...;
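Step 2 of the kernel code, turning a BFREG number into the UAR ID stored in sys_pages, can be sketched as a division and a lookup. The model below is hypothetical: it covers only static registers, its names are illustrative, and it assumes two usable (non-fast-path) BFREGs per UAR as described earlier.

```c
#include <assert.h>
#include <stdint.h>

#define NON_FP_BFREGS_PER_UAR 2u  /* MLX5_NUM_NON_FP_BFREGS_PER_UAR in the text */

/* Hypothetical subset of the per-context UAR bookkeeping. */
struct bfreg_info_model {
    const uint32_t *sys_pages;   /* UAR IDs returned by the hardware, one per system page */
    uint32_t num_sys_pages;
    uint32_t uars_per_sys_page;  /* 1 in the example response shown earlier */
};

/* Map a static BFREG number to the UAR ID that owns it. */
static uint32_t bfregn_to_uar_id(const struct bfreg_info_model *bi, uint32_t bfregn)
{
    uint32_t per_page = bi->uars_per_sys_page * NON_FP_BFREGS_PER_UAR;
    uint32_t page = bfregn / per_page;
    uint32_t uar_in_page = bfregn % per_page / NON_FP_BFREGS_PER_UAR;

    assert(page < bi->num_sys_pages);
    return bi->sys_pages[page] + uar_in_page;
}
```

With 8 system pages, BFREGs 0 and 1 resolve to the first UAR ID, BFREGs 2 and 3 to the second, and BFREG 15 to the eighth.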
Diagram

Considerations
- During the execution of ibv_open_device, Mellanox actually allocates and maps 8 UARs and 16 registers. If ibv_create_qp is called more than 16 times within the same ibv_context, reuse of BFREGs will occur.
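The reuse effect can be demonstrated with a toy allocator. The least-used selection policy below is an assumption made for the sketch; the real driver's policy (which also distinguishes low-latency registers) is not reproduced here, and the names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_BFREGS 16u  /* static BFREGs mapped by ibv_open_device in the text */

/* Hypothetical per-context usage counters, one per BFREG. */
struct bfreg_alloc_model {
    uint32_t count[NUM_BFREGS];
};

/* Pick the BFREG with the fewest QPs attached (lowest index wins ties)
 * and record the new user. */
static uint32_t bfreg_pick(struct bfreg_alloc_model *a)
{
    uint32_t best = 0;

    assert(a != NULL);
    for (uint32_t i = 1; i < NUM_BFREGS; i++)
        if (a->count[i] < a->count[best])
            best = i;
    a->count[best]++;
    return best;
}
```

In this model the first 16 QPs each receive their own BFREG, and the 17th QP is handed BFREG 0 again, so two QPs then share one register.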
HUATUO is an operating system observability project open-sourced by DiDi and incubated under the China Computer Federation (CCF).
