As compute workloads become more data-intensive and latency-sensitive, accelerators (e.g., GPUs, FPGAs, and SmartNICs) require more flexible and efficient access to host memory. Traditional direct memory access (DMA) models rely on pre-pinned (locked) memory regions mapped into the device address space, which reduces available system memory, increases fragmentation, and forces data to be staged into those regions up front, adding latency.
To address these limitations, the PCIe® specification defines two complementary mechanisms — address translation service (ATS) and the page request interface (PRI) — that allow devices to translate and request host virtual memory pages dynamically. When combined, ATS and PRI enable devices to access host virtual memory with less reliance on pinned buffers, improving memory utilization and enabling more dynamic, virtualized and accelerator-driven system architectures.
Address translation service (ATS), when used together with the page request interface (PRI), enables a PCIe device to perform virtual-to-physical address translation by directly querying the system’s input-output memory management unit (IOMMU)*. Instead of relying on the CPU or operating system (OS) to pre-pin memory and provide physical addresses, an ATS-capable device can issue an address translation request (ATR)* for a given virtual address on demand. The IOMMU translates the virtual address using OS-managed page tables and returns the corresponding physical address to the device.
Once a translation is obtained, the device may cache the result in its address translation cache (ATC)*, reducing latency for subsequent accesses to the same memory region. This mechanism allows devices to operate within the same virtual address space as the application that allocated the memory, supporting true zero-copy access to user-space buffers without repeated kernel intervention.
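To make the control flow concrete, the following C sketch models a device-side translation path: check the ATC, fall back to a translation request on a miss, and cache the returned mapping. It is a conceptual model only; the names (atc_entry, issue_translation_request, device_translate) are hypothetical, and the translation agent is reduced to a toy identity mapping rather than a real page-table walk.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define ATC_ENTRIES 64                        /* illustrative ATC size */
#define PAGE_SHIFT  12                        /* 4 KiB pages           */
#define PAGE_MASK   ((1ull << PAGE_SHIFT) - 1)

/* One cached translation: virtual page number -> physical page number. */
struct atc_entry {
    uint64_t vpage;
    uint64_t ppage;
    bool     valid;
};

static struct atc_entry atc[ATC_ENTRIES];

/* Toy stand-in for the translation request sent to the IOMMU's translation
 * agent. A real agent walks the OS-managed page tables; an identity mapping
 * keeps the sketch self-contained. */
static bool issue_translation_request(uint64_t vpage, uint64_t *ppage_out)
{
    *ppage_out = vpage;   /* pretend virtual page N maps to physical page N */
    return true;
}

/* Translate a device virtual address, consulting the ATC first. */
static bool device_translate(uint64_t vaddr, uint64_t *paddr_out)
{
    uint64_t vpage = vaddr >> PAGE_SHIFT;
    size_t   slot  = (size_t)(vpage % ATC_ENTRIES);  /* direct-mapped for simplicity */

    /* ATC hit: reuse the cached translation, no round trip to the IOMMU. */
    if (atc[slot].valid && atc[slot].vpage == vpage) {
        *paddr_out = (atc[slot].ppage << PAGE_SHIFT) | (vaddr & PAGE_MASK);
        return true;
    }

    /* ATC miss: query the translation agent, then cache the result. */
    uint64_t ppage;
    if (!issue_translation_request(vpage, &ppage))
        return false;   /* page not mapped; PRI covers this case (see below) */

    atc[slot] = (struct atc_entry){ .vpage = vpage, .ppage = ppage, .valid = true };
    *paddr_out = (ppage << PAGE_SHIFT) | (vaddr & PAGE_MASK);
    return true;
}

int main(void)
{
    uint64_t paddr;
    if (device_translate(0x2A048, &paddr))   /* first access: ATC miss, then fill */
        printf("translated to 0x%llx\n", (unsigned long long)paddr);
    if (device_translate(0x2A100, &paddr))   /* same page: served from the ATC    */
        printf("translated to 0x%llx\n", (unsigned long long)paddr);
    return 0;
}
```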
PRI complements ATS by enabling a device to request host intervention when a required memory page is not currently resident or mapped in system memory. If a device attempts to access a virtual address whose page is not present, such as when the page has been swapped out or has not yet been allocated, the IOMMU returns a translation failure.
With PRI support, the device can issue a page request message (PRM)* instead of treating this condition as a fatal error. The OS or hypervisor handles the request in a manner analogous to a CPU page fault. The host allocates the required page, updates the relevant page tables, and then signals completion to the device. This on-demand paging mechanism enables devices to access virtual memory dynamically and removes the need to pre-pin all potential memory regions.
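The retry behavior that PRI enables can be sketched in the same style. In the hypothetical model below, iommu_translate() fails for a non-resident page, host_handle_page_request() stands in for the OS fault handler that services the page request message, and the device retries the translation afterwards; none of these names correspond to real driver or hardware APIs. The essential point is the shape of the control flow: a translation failure becomes a recoverable event rather than a fatal error.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_PAGES 8

/* Toy residency state: page_present[i] says whether virtual page i is
 * currently backed by host memory. */
static bool page_present[MAX_PAGES];

/* Toy IOMMU translation: fails when the page is not resident. */
static bool iommu_translate(uint64_t vpage, uint64_t *ppage_out)
{
    if (vpage >= MAX_PAGES || !page_present[vpage])
        return false;                  /* translation failure */
    *ppage_out = vpage + 100;          /* arbitrary physical page number */
    return true;
}

/* Host-side handling of a page request message: analogous to a CPU page
 * fault. The host allocates or swaps in the page, updates the page tables,
 * and signals completion (here, simply by returning). */
static void host_handle_page_request(uint64_t vpage)
{
    if (vpage < MAX_PAGES) {
        printf("host: faulting in virtual page %llu\n", (unsigned long long)vpage);
        page_present[vpage] = true;
    }
}

/* Device access path: a translation failure becomes a page request plus a
 * retry instead of a fatal error. */
static bool device_access(uint64_t vpage)
{
    uint64_t ppage;
    if (!iommu_translate(vpage, &ppage)) {
        host_handle_page_request(vpage);         /* page request + completion */
        if (!iommu_translate(vpage, &ppage))
            return false;                        /* genuinely invalid address */
    }
    printf("device: vpage %llu -> ppage %llu\n",
           (unsigned long long)vpage, (unsigned long long)ppage);
    return true;
}

int main(void)
{
    device_access(3);   /* not yet resident: triggers on-demand paging */
    device_access(3);   /* now resident: translates immediately        */
    return 0;
}
```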
In modern server and accelerator-centric system designs, ATS has become a key PCIe feature for improving I/O efficiency by enabling autonomous address translation at the device level. With ATS, a PCIe device can request translations from the host and hold them in its on-device ATC, reducing reliance on the CPU to pre-pin memory and provide physical mappings.
When combined with PRI, devices gain the additional ability to trigger on-demand page handling by the host. This allows the operating system to avoid aggressive memory pinning while still enabling accelerators to safely access pageable memory regions, significantly improving overall memory utilization and system scalability.
Figure 2 illustrates a sample host architecture in which a translation agent (TA) and address translation and protection tables (ATPTs) reside within the IOMMU, enabling coordinated virtual memory management between the host and PCIe devices.
The process address space ID (PASID) is a key mechanism that enables PCIe devices to operate across multiple user address spaces. By tagging each device transaction with a PASID, the IOMMU can identify which process page tables should be used for address translation. This allows accelerators, GPUs, and storage devices to directly access per-process virtual addresses rather than relying on pinned, system-wide buffers. As a result, PASID enables secure multi-process isolation for device-initiated DMA.
ATS further enhances performance by allowing devices to cache virtual-to-physical address translations locally in the ATC. PASID ensures that these cached translations are correctly scoped to the corresponding process address space, preventing translation conflicts when a single device concurrently serves multiple applications. Together, PASID and ATS provide both correctness through process isolation and efficiency through reduced translation latency.
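One way to picture this scoping is to key the ATC on the pair (PASID, virtual page) rather than on the virtual page alone. The C sketch below is illustrative only, with hypothetical names (atc_lookup, atc_fill) and a simplistic hash; it shows why two processes that use the same virtual address cannot collide in the cache.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define ATC_ENTRIES 64

/* A cached translation is scoped by PASID as well as by virtual page, so
 * identical virtual addresses from different processes cannot alias each
 * other's physical mappings. */
struct atc_entry {
    uint32_t pasid;
    uint64_t vpage;
    uint64_t ppage;
    bool     valid;
};

static struct atc_entry atc[ATC_ENTRIES];

static size_t atc_index(uint32_t pasid, uint64_t vpage)
{
    /* Simple hash over both key components (illustrative only). */
    uint64_t h = ((uint64_t)pasid * 2654435761u) ^ vpage;
    return (size_t)(h % ATC_ENTRIES);
}

/* Look up a translation for a given process address space. */
static bool atc_lookup(uint32_t pasid, uint64_t vpage, uint64_t *ppage_out)
{
    struct atc_entry *e = &atc[atc_index(pasid, vpage)];
    if (e->valid && e->pasid == pasid && e->vpage == vpage) {
        *ppage_out = e->ppage;
        return true;
    }
    return false;   /* miss: the device must issue an ATS translation request */
}

/* Install a translation returned by the IOMMU for this PASID. */
static void atc_fill(uint32_t pasid, uint64_t vpage, uint64_t ppage)
{
    atc[atc_index(pasid, vpage)] =
        (struct atc_entry){ .pasid = pasid, .vpage = vpage, .ppage = ppage, .valid = true };
}

int main(void)
{
    uint64_t p;

    /* Two processes (PASIDs 1 and 2) map the same virtual page to different
     * physical pages without interfering with each other's cached entries. */
    atc_fill(1, 0x400, 0x9000);
    atc_fill(2, 0x400, 0x7500);

    if (atc_lookup(1, 0x400, &p))
        printf("PASID 1: vpage 0x400 -> ppage 0x%llx\n", (unsigned long long)p);
    if (atc_lookup(2, 0x400, &p))
        printf("PASID 2: vpage 0x400 -> ppage 0x%llx\n", (unsigned long long)p);
    return 0;
}
```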
The ATPT is a memory-resident data structure used by the IOMMU to resolve device address translations. While its structure is conceptually similar to CPU page tables, the ATPT additionally incorporates hardware-managed features and isolation controls. The ATPT maps each PASID to its corresponding translation context and access permissions. Each PASID entry references the page tables for that process and defines the protection attributes applied to device accesses. In this model, PASID provides identity, ATS delivers performance, and ATPT enforces translation policy and protection.
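A minimal sketch of such a per-PASID record, assuming only the elements named above (a page-table reference, access permissions, and validity), might look as follows in C. Real IOMMU implementations define their own table formats; this struct is purely illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

/* Access permissions applied to device-initiated requests for this PASID. */
enum atpt_perm {
    ATPT_PERM_READ  = 1u << 0,
    ATPT_PERM_WRITE = 1u << 1,
    ATPT_PERM_EXEC  = 1u << 2,
};

/* One ATPT entry: maps a PASID to its translation context.
 * (Simplified; actual field layouts are implementation-specific.) */
struct atpt_entry {
    uint32_t pasid;            /* process address space identifier              */
    uint64_t pgtable_root;     /* physical address of this process's page-table
                                  root, used to start the translation walk      */
    uint32_t permissions;      /* allowed access types (enum atpt_perm bits)    */
    bool     valid;            /* entry is active and may be used               */
    bool     ats_allowed;      /* device may cache translations in its ATC      */
};

/* Policy check the IOMMU applies while resolving a device translation. */
static inline bool atpt_access_allowed(const struct atpt_entry *e,
                                       uint32_t requested_perm)
{
    return e->valid && (e->permissions & requested_perm) == requested_perm;
}
```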
Table 1 summarizes the major ATPT fields referenced by the IOMMU during address translation.
In a shared accelerator environment, such as when multiple applications utilize a single GPU, each application is assigned a unique PASID. The device tags its DMA requests with the appropriate PASID, and the IOMMU uses the ATPT to map each PASID to the correct page tables and protection attributes. ATS allows the GPU to cache these translations locally, minimizing repeated IOMMU lookups. If the OS updates a process’s memory mapping, it issues invalidation messages to ensure stale ATC entries are removed. Together, PASID, ATS, and ATPT enable secure and efficient shared virtual addressing across CPUs, GPUs, and PCIe devices.
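The invalidation step can be modeled as the host naming a PASID and a virtual-address range, and the device dropping every matching ATC entry before acknowledging. The sketch below reuses the PASID-keyed cache model from earlier; atc_invalidate_range() is a hypothetical name, not a real API.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

#define ATC_ENTRIES 64

struct atc_entry {
    uint32_t pasid;
    uint64_t vpage;
    uint64_t ppage;
    bool     valid;
};

static struct atc_entry atc[ATC_ENTRIES];

/* Drop every cached translation for `pasid` whose virtual page falls in
 * [start_vpage, end_vpage]. The host waits for the device's completion
 * response before it reuses or frees the underlying pages. */
void atc_invalidate_range(uint32_t pasid, uint64_t start_vpage, uint64_t end_vpage)
{
    for (size_t i = 0; i < ATC_ENTRIES; i++) {
        if (atc[i].valid &&
            atc[i].pasid == pasid &&
            atc[i].vpage >= start_vpage &&
            atc[i].vpage <= end_vpage) {
            atc[i].valid = false;   /* stale translation removed */
        }
    }
    /* A completion response back to the host would follow here
       (omitted; shown only to mark where the handshake ends). */
}
```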
This section outlines how the IOMMU performs guest address translation in virtualized environments.
Guest address translation involves converting guest virtual addresses (GVA) to guest physical addresses (GPA), and then to system physical addresses (SPA) through the IOMMU. This process is essential in virtualized environments where multiple guest operating systems run on a single physical platform.
Key address components involved in guest address translation include:
- Guest virtual address (GVA): the virtual address used by software running inside the guest OS.
- Guest physical address (GPA): the address the guest OS treats as physical, itself managed and virtualized by the hypervisor.
- System physical address (SPA): the actual physical address used to access host system memory.
The guest address translation process is performed as follows:
1. The device issues a memory request tagged with its PASID and a guest virtual address.
2. The IOMMU uses the PASID to locate the guest's page tables and translates the GVA to a GPA.
3. The IOMMU then walks the hypervisor-managed tables to translate the GPA to an SPA.
4. The resulting SPA is used to access system memory, and the completed translation may be cached for reuse.
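A simplified model of this two-stage walk, with toy lookup tables standing in for the guest and hypervisor page tables, is sketched below; translate_gva_to_gpa() and translate_gpa_to_spa() are hypothetical names used only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PAGES 4

/* Toy guest page table (selected via PASID): guest virtual page -> guest physical page. */
static const uint64_t guest_pt[PAGES] = { 7, 2, 5, 0 };

/* Toy hypervisor-managed table: guest physical page -> system physical page. */
static const uint64_t host_pt[8] = { 40, 41, 42, 43, 44, 45, 46, 47 };

/* First stage: walk the guest's page tables. */
static bool translate_gva_to_gpa(uint64_t gv_page, uint64_t *gp_page)
{
    if (gv_page >= PAGES) return false;
    *gp_page = guest_pt[gv_page];
    return true;
}

/* Second stage: walk the hypervisor-managed tables to reach system memory. */
static bool translate_gpa_to_spa(uint64_t gp_page, uint64_t *sp_page)
{
    if (gp_page >= 8) return false;
    *sp_page = host_pt[gp_page];
    return true;
}

int main(void)
{
    uint64_t gpa, spa;
    uint64_t gva_page = 2;

    /* Nested (two-level) translation: GVA -> GPA -> SPA. */
    if (translate_gva_to_gpa(gva_page, &gpa) && translate_gpa_to_spa(gpa, &spa))
        printf("GVA page %llu -> GPA page %llu -> SPA page %llu\n",
               (unsigned long long)gva_page, (unsigned long long)gpa,
               (unsigned long long)spa);
    return 0;
}
```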
Depending on the virtualization configuration, the IOMMU can operate in different address translation modes.
Table 2 summarizes the operational differences between one-level and two-level address translation.
ATS and PRI significantly change how devices interact with host memory by enabling shared virtual memory between the CPU and PCIe devices. With this capability, both the CPU and the device can operate within a unified virtual address space, improving flexibility and efficiency in data movement.
This model provides the following key benefits:
- Reduced memory pinning, which frees system memory and limits fragmentation.
- Zero-copy access to user-space buffers without repeated kernel intervention.
- On-demand paging, allowing devices to safely access pageable memory regions.
- Improved memory utilization and scalability in multi-process and virtualized environments.
By removing the constraints of pinned memory and manual address management, ATS and PRI allow devices to participate more directly in the system’s virtual memory model.
These capabilities are increasingly important for emerging workloads such as AI/ML, high-speed networking, and virtualized computing, where performance, scalability, and memory efficiency are critical.