[Advanced Memory Systems Ⅰ] Realizing ATS and PRI for Efficient Data Access in NVMe SSD

Written by Karthik Balan, Solution PE Architecture / SSIR


As compute workloads become more data-intensive and latency-sensitive, accelerators (e.g., GPUs, FPGAs and SmartNICs) require more flexible and efficient access to host memory. Traditional direct memory access (DMA) models rely on pre-pinned (locked) memory regions mapped into the device address space, which reduces available system memory, increases fragmentation, and forces pre-loading that adds latency.

To address these limitations, the PCIe® specification defines two complementary mechanisms — address translation service (ATS) and the page request interface (PRI) — that allow devices to translate and request host virtual memory pages dynamically. When combined, ATS and PRI enable devices to access host virtual memory with less reliance on pinned buffers, improving memory utilization and enabling more dynamic, virtualized and accelerator-driven system architectures.

 

Part 1: Fundamentals of ATS and PRI in system architecture

 

Address translation service (ATS)

Address translation service (ATS), when used together with page request interface (PRI), enables a PCIe device to perform virtual-to-physical address translation by directly querying the system’s input-output memory management unit (IOMMU)*. Instead of relying on the CPU or operating system (OS) to pre-pin memory and provide physical addresses, an ATS-capable device can issue an address translation request (ATR)* for a given virtual address on demand. The IOMMU translates the virtual address using OS-managed page tables and returns the corresponding physical address to the device.

* IOMMU: A hardware component that translates device-visible virtual addresses into physical memory addresses and enforces memory protection for DMA operations.

* ATR: A request issued by a PCIe device to the IOMMU to translate a virtual address into a physical address using host-managed address translation and protection tables.

Once a translation is obtained, the device may cache the result in its address translation cache (ATC)*, reducing latency for subsequent accesses to the same memory region. This mechanism allows devices to operate within the same virtual address space as the application that allocated the memory, supporting true zero-copy access to user-space buffers without repeated kernel intervention.

* ATC: A local cache in the device that stores recent virtual-to-physical address translation results to reduce translation latency.
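For a concrete reference point, the Linux kernel exposes helpers that let a driver turn on a device's ATS capability. The fragment below is a minimal sketch assuming a kernel built with CONFIG_PCI_ATS; in practice the IOMMU driver usually enables ATS on the device's behalf, so this is illustrative rather than something most drivers need to do.

/*
 * Minimal sketch: enabling ATS on a PCIe function from a Linux driver.
 * Assumes a kernel built with CONFIG_PCI_ATS; on many platforms the
 * IOMMU driver enables ATS itself, so this is illustrative only.
 */
#include <linux/pci.h>
#include <linux/pci-ats.h>

static int example_enable_ats(struct pci_dev *pdev)
{
    int ret;

    if (!pci_ats_supported(pdev))
        return -ENODEV; /* no ATS capability, or the IOMMU forbids it */

    /* PAGE_SHIFT requests the smallest translation granule (4 KB). */
    ret = pci_enable_ats(pdev, PAGE_SHIFT);
    if (ret)
        dev_err(&pdev->dev, "failed to enable ATS: %d\n", ret);

    return ret;
}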

 

Page request interface (PRI)

PRI complements ATS by enabling a device to request host intervention when a required memory page is not currently resident or mapped in system memory. If a device attempts to access a virtual address whose page is not present—such as when the page has been swapped out or has not yet been allocated—the IOMMU returns a translation failure.

With PRI support, the device can issue a page request message (PRM)* instead of treating this condition as a fatal error. The OS or hypervisor handles the request in a manner analogous to a CPU page fault. The host allocates the required page, updates the relevant page tables, and then signals completion to the device. This on-demand paging mechanism enables devices to access virtual memory dynamically and removes the need to pre-pin all potential memory regions.

* PRM: A message generated by a device to request that the host resolve a missing or unmapped memory page.
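Linux pairs the ATS helper with one for PRI. The sketch below shows the enable sequence a kernel component might follow; the outstanding-request budget of 32 is an illustrative value, and real code would size it from the device's PRI capability.

/*
 * Minimal sketch: enabling PRI on top of ATS so the device may fault
 * pages in on demand. The outstanding-request budget of 32 is purely
 * illustrative; real code derives it from the device's PRI capability.
 */
#include <linux/pci.h>
#include <linux/pci-ats.h>

static int example_enable_ats_pri(struct pci_dev *pdev)
{
    int ret;

    /* PRI presumes ATS: the device must be able to request translations. */
    ret = pci_enable_ats(pdev, PAGE_SHIFT);
    if (ret)
        return ret;

    ret = pci_enable_pri(pdev, 32); /* allow 32 outstanding page requests */
    if (ret)
        pci_disable_ats(pdev);

    return ret;
}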
Figure 1. Virtual-to-physical address translation overview
ATS and PRI advantage scenarios and use cases

In modern server and accelerator-centric system designs, ATS has become a key PCIe feature for improving I/O efficiency by enabling autonomous address translation at the device level. With ATS, a PCIe device can obtain translated memory addresses through its on-device ATC, reducing reliance on the CPU to pre-pin memory and provide physical mappings.

When combined with PRI, devices gain the additional ability to trigger on-demand page handling by the host. This allows the operating system to avoid aggressive memory pinning while still enabling accelerators to safely access pageable memory regions, significantly improving overall memory utilization and system scalability.

Figure 2 illustrates a sample host architecture in which a translation agent (TA) and address translation and protection tables (ATPTs) reside within the IOMMU, enabling coordinated virtual memory management between the host and PCIe devices.

Figure 2. Host IOMMU with TA and ATPT support
Shared virtual addressing with PASID, ATS and ATPT

The process address space ID (PASID) is a key mechanism that enables PCIe devices to operate across multiple user address spaces. By tagging each device transaction with a PASID, the IOMMU can identify which process page tables should be used for address translation. This allows accelerators, GPUs, and storage devices to directly access per-process virtual addresses rather than relying on pinned, system-wide buffers. As a result, PASID enables secure multi-process isolation for device-initiated DMA.
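On Linux, this model surfaces to kernel drivers as the shared virtual addressing (SVA) API, which binds a process's memory descriptor to a device and hands back a PASID. The sketch below follows the two-argument form of iommu_sva_bind_device() found in recent kernels; older kernels used a three-argument variant, so treat the exact signature as version-dependent.

/*
 * Minimal sketch: binding the current process's address space to a
 * device with the Linux SVA API. The IOMMU layer allocates a PASID for
 * the mm; the signature of iommu_sva_bind_device() has changed across
 * kernel versions (older kernels took a third drvdata argument).
 */
#include <linux/iommu.h>
#include <linux/sched.h>
#include <linux/err.h>

static int example_bind_sva(struct device *dev, struct iommu_sva **out)
{
    struct iommu_sva *sva;

    sva = iommu_sva_bind_device(dev, current->mm);
    if (IS_ERR(sva))
        return PTR_ERR(sva);

    /* Tag the device's ATS/DMA traffic with this PASID. */
    dev_info(dev, "bound to PASID %u\n", iommu_sva_get_pasid(sva));

    *out = sva;
    return 0; /* later: iommu_sva_unbind_device(sva) releases the PASID */
}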

ATS further enhances performance by allowing devices to cache virtual-to-physical address translations locally in the ATC. PASID ensures that these cached translations are correctly scoped to the corresponding process address space, preventing translation conflicts when a single device concurrently serves multiple applications. Together, PASID and ATS provide both correctness through process isolation and efficiency through reduced translation latency.

The ATPT is a memory-resident data structure used by the IOMMU to resolve device address translations. While its structure is conceptually similar to CPU page tables, the ATPT additionally incorporates hardware-managed features and isolation controls. The ATPT maps each PASID to its corresponding translation context and access permissions. Each PASID entry references the page tables for that process and defines the protection attributes applied to device accesses. In this model, PASID provides identity, ATS delivers performance, and ATPT enforces translation policy and protection.
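To make these roles concrete, the structure below is a purely conceptual model of an ATPT entry assembled from the fields discussed here. It is not the in-memory format defined by any vendor's IOMMU specification.

/*
 * Purely conceptual model of an ATPT entry, showing the information the
 * IOMMU consults during a translation. Real entry formats are defined
 * by the vendor IOMMU specification and differ from this sketch.
 */
#include <stdbool.h>
#include <stdint.h>

struct atpt_entry {
    uint32_t pasid;      /* identifies the owning process address space */
    uint64_t pgtbl_root; /* root pointer of that process's page tables */
    bool     readable;   /* device reads permitted through this context */
    bool     writable;   /* device writes permitted through this context */
    bool     valid;      /* entry is populated and may be used */
};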

Table 1 summarizes the major ATPT fields referenced by the IOMMU during address translation.

Table 1. ATPT entry fields used in IOMMU address translation

In a shared accelerator environment, such as when multiple applications utilize a single GPU, each application is assigned a unique PASID. The device tags its DMA requests with the appropriate PASID, and the IOMMU uses the ATPT to map each PASID to the correct page tables and protection attributes. ATS allows the GPU to cache these translations locally, minimizing repeated IOMMU lookups. If the OS updates a process’s memory mapping, it issues invalidation messages to ensure stale ATC entries are removed. Together, PASID, ATS, and ATPT enable secure and efficient shared virtual addressing across CPUs, GPUs, and PCIe devices.

 

Guest address translation in virtualized environments

This section outlines how the IOMMU performs guest address translation in virtualized environments.

Guest address translation involves converting guest virtual addresses (GVA) to guest physical addresses (GPA), and then to system physical addresses (SPA) through the IOMMU. This process is essential in virtualized environments where multiple guest operating systems run on a single physical platform.

Key address components involved in guest address translation include:

  • Guest virtual address (GVA): The virtual address used by the guest operating system
  • Guest physical address (GPA): The physical address as seen by the guest OS, translated from the GVA
  • System physical address (SPA): The final physical address used by the hardware after IOMMU translation
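The relationship between these three addresses is a composition of two table walks: the output of the guest (stage-1) walk feeds the host (stage-2) walk. The sketch below illustrates that composition; walk_guest_tables() and walk_host_tables() are hypothetical stand-ins for the real nested page-table walks.

/* Conceptual two-stage translation: GVA -> GPA -> SPA. */
#include <stdint.h>

/* Hypothetical stand-ins for the IOMMU's page-table walks. */
static uint64_t walk_guest_tables(uint64_t gcr3, uint64_t gva)
{
    (void)gcr3;
    return gva; /* stub: a real walk traverses the guest page tables */
}

static uint64_t walk_host_tables(uint64_t host_root, uint64_t gpa)
{
    (void)host_root;
    return gpa; /* stub: a real walk traverses the host page tables */
}

uint64_t translate_gva(uint64_t gcr3, uint64_t host_root, uint64_t gva)
{
    uint64_t gpa = walk_guest_tables(gcr3, gva); /* stage 1: GVA -> GPA */
    return walk_host_tables(host_root, gpa);     /* stage 2: GPA -> SPA */
}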

The guest address translation process is performed as follows:

  • Nested paging: The IOMMU supports nested paging to enable multi-level address translation. This includes:
      ◦ Using the host page table root pointer for GPA-to-SPA translation
      ◦ Accessing the guest translation tables referenced by the GCR3 register for GVA-to-GPA translation
  • Control bits: Translation behavior is controlled by specific memory-mapped I/O (MMIO) registers:
      ◦ MMIO Offset 0030h [GTSup]: Indicates whether guest translation is supported
      ◦ MMIO Offset 0018h [GTEn]: Enables guest translation
      ◦ MMIO Offset 0030h [GLXSup]: Controls the guest-level translation mode
  • Address alignment: Guest buffer base addresses must be aligned to 4 KB to ensure correct address translation (a minimal sketch of the feature check and alignment test follows Figure 3).
Figure 3. Nested address spaces
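As a rough illustration of how host software might use these controls, the sketch below gates guest translation on GTSup, sets GTEn, and checks the 4 KB alignment rule. The register offsets come from the list above, but the bit masks are placeholders; the authoritative bit positions are defined in the AMD I/O Virtualization Technology (IOMMU) Specification [4].

/*
 * Illustrative sketch: gate guest translation on GTSup, set GTEn, and
 * check the 4 KB alignment rule. Offsets follow the list above; the bit
 * masks are placeholders -- see the AMD IOMMU specification [4] for the
 * authoritative bit positions.
 */
#include <stdbool.h>
#include <stdint.h>

#define MMIO_EFR_OFFSET  0x0030 /* extended feature register (GTSup, GLXSup) */
#define MMIO_CTRL_OFFSET 0x0018 /* control register (GTEn) */
#define EFR_GTSUP_MASK   (1ULL << 4)  /* placeholder bit position */
#define CTRL_GTEN_MASK   (1ULL << 16) /* placeholder bit position */

static uint64_t mmio_read64(volatile uint8_t *base, uint32_t off)
{
    return *(volatile uint64_t *)(base + off);
}

static void mmio_write64(volatile uint8_t *base, uint32_t off, uint64_t val)
{
    *(volatile uint64_t *)(base + off) = val;
}

bool enable_guest_translation(volatile uint8_t *iommu_mmio)
{
    if (!(mmio_read64(iommu_mmio, MMIO_EFR_OFFSET) & EFR_GTSUP_MASK))
        return false; /* guest translation not supported */

    mmio_write64(iommu_mmio, MMIO_CTRL_OFFSET,
                 mmio_read64(iommu_mmio, MMIO_CTRL_OFFSET) | CTRL_GTEN_MASK);
    return true;
}

/* Guest buffer base addresses must be 4 KB aligned. */
bool guest_buffer_aligned(uint64_t addr)
{
    return (addr & 0xFFFULL) == 0;
}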

 

Depending on the virtualization configuration, the IOMMU can operate in different address translation modes:

  • Legacy one-level translation: Basic device address translation without guest OS support
Figure 4. One-level translation: Guest CR3 Table, 1 Level

 

  • Two-level translation: Supports guest virtual–to–guest physical address translation for fully virtualized environments
Figure 5. Two-level translation: Guest CR3 Table, 2 Level

 

Table 2 summarizes the operational differences between one-level and two-level address translation.

Table 2. Comparison of one-level and two-level address translation
System-level benefits of ATS and PRI

ATS and PRI significantly change how devices interact with host memory by enabling shared virtual memory between the CPU and PCIe devices. With this capability, both the CPU and the device can operate within a unified virtual address space, improving flexibility and efficiency in data movement.

This model provides the following key benefits:

  • Zero-copy data sharing: Devices can directly access user-space memory buffers without intermediate copies. This reduces memory bandwidth consumption, lowers latency, and decreases CPU overhead.
  • On-demand paging for devices: With PRI, devices can dynamically request memory pages as needed. This eliminates the need to pre-allocate large pinned buffers, improving overall memory utilization.
  • Simplified application and driver design: Applications can use standard virtual memory allocation (e.g., malloc) without managing physical addresses, and device drivers can map virtual memory to devices dynamically at runtime (a hypothetical user-space sketch follows this list).
  • Virtualized environments: In virtual machines, devices can use ATS/PRI to access guest virtual memory directly. This enables high-performance device pass-through while preserving isolation and flexibility, which is particularly beneficial in cloud environments using SR-IOV or GPU virtualization.
  • Multi-context systems: With PASID support, a single device can access multiple address spaces concurrently. This is essential for multi-tenant systems and parallel workloads where multiple processes share the same accelerator.
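To ground the zero-copy point, the hypothetical user-space sketch below hands an ordinary malloc() buffer straight to an accelerator. The device node /dev/example_accel and the EXAMPLE_IOC_SUBMIT ioctl are invented for illustration; a real driver defines its own interface, with the device resolving the buffer's virtual addresses through ATS and faulting pages in through PRI.

/*
 * Hypothetical user-space sketch: the buffer comes straight from
 * malloc() and its virtual address is handed to the device.
 * "/dev/example_accel" and EXAMPLE_IOC_SUBMIT are invented for
 * illustration; a real driver defines its own ioctl interface.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define EXAMPLE_IOC_SUBMIT 0x4E5601UL /* invented ioctl request code */

struct example_job {
    void   *buf; /* user virtual address; resolved by the device via ATS/PRI */
    size_t  len;
};

int main(void)
{
    struct example_job job = { .len = 1 << 20 }; /* 1 MB pageable buffer */
    int fd = open("/dev/example_accel", O_RDWR);

    if (fd < 0) {
        perror("open");
        return 1;
    }

    job.buf = malloc(job.len); /* ordinary memory; nothing is pinned */
    if (!job.buf) {
        close(fd);
        return 1;
    }

    /* The driver binds this process (PASID) and queues the job; pages
     * not yet resident are faulted in on demand through PRI. */
    if (ioctl(fd, EXAMPLE_IOC_SUBMIT, &job) < 0)
        perror("ioctl");

    free(job.buf);
    close(fd);
    return 0;
}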

By removing the constraints of pinned memory and manual address management, ATS and PRI allow devices to participate more directly in the system’s virtual memory model.

These capabilities are increasingly important for emerging workloads such as AI/ML, high-speed networking, and virtualized computing, where performance, scalability, and memory efficiency are critical.

 


 

References

[1] “CXL Ecosystem Innovation Leveraging QEMU-based Emulation” 
 
[2] “IOMMUFD — The Linux Kernel documentation”
 
[3] “PCI Express 6.0 Specification”
 
[4] “AMD I/O Virtualization Technology (IOMMU) Specification”
 
 

 
 
* The contents of this blog are provided for informational purposes only. No representation or warranty (whether express or implied) is made by Samsung or any of its affiliates and their respective officers, advisers, agents, or employees (collectively, "Samsung") as to the accuracy, reasonableness or completeness of the information, statements, opinions, or matters contained in this blog, and they are provided on an "AS-IS" basis. Samsung will not be responsible for any damages arising out of the use of, or otherwise relating to, the contents of this blog. Nothing in this blog grants you any license or rights in or to information, materials, or contents provided in this blog, or any other intellectual property.
 
* The contents of this blog may also include forward-looking statements. Forward-looking statements are not guarantees of future performance and that the actual developments of Samsung, the market, or the industry in which Samsung operates may differ materially from those made or suggested by the forward-looking statements contained in this blog.
 