
[Advanced Memory Systems Ⅱ] Realizing ATS and PRI for Efficient Data Access in NVMe SSD

Written by Karthik Balan, Solution PE Architecture / SSIR
Arun Bosco J, Controller Val / SSIR


Part 2: Emulation and profiling of ATS in PCIe NVMe® devices

 

Need for device-side ATS and ATC in PCIe I/O functions

Traditionally, DMA transactions issued by I/O devices contain virtual or untranslated memory addresses that must be resolved to physical addresses before memory can be accessed. This virtual-to-physical address translation is typically performed by a DMA translation agent within the host system, such as the IOMMU.

Depending on the system implementation, DMA access latency can increase significantly due to the time required for address translation. If each transaction involves multiple memory accesses—e.g., during page table walks—the associated overhead and memory traffic can become substantial.

To mitigate these effects, systems commonly deploy ATCs within the translation agent to cache recent address translations and reduce lookup latency.

However, when the ATC is implemented only at the host-side I/O translation agent (TA) and shared across multiple I/O functions, several challenges can arise:

  • Increased resource pressure on the TA
  • Higher probability of ATC thrashing
  • Potential performance bottlenecks for DMA-intensive applications
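These shared-cache effects can be illustrated with a toy model rather than a spec-accurate simulation: a small LRU cache stands in for the TA-side ATC, and four devices issue round-robin DMA over disjoint working sets. All sizes, the LRU policy, and the access pattern below are hypothetical choices for illustration only.

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache standing in for an address-translation cache (ATC)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def lookup(self, page):
        if page in self.entries:
            self.entries.move_to_end(page)        # refresh LRU position
            self.hits += 1
        else:
            self.misses += 1
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)  # evict least recently used
            self.entries[page] = True

def round_robin_dma(num_devices, pages_per_device, rounds, caches):
    """Each device repeatedly walks its own working set of pages."""
    for _ in range(rounds):
        for dev in range(num_devices):
            for page in range(pages_per_device):
                # Tag pages by device so the working sets do not overlap.
                caches[dev % len(caches)].lookup((dev, page))

# Case 1: one shared host-side cache (64 entries) serving 4 devices.
shared = LRUCache(64)
round_robin_dma(4, 32, 10, [shared] * 4)

# Case 2: each device has its own private 64-entry ATC.
private = [LRUCache(64) for _ in range(4)]
round_robin_dma(4, 32, 10, private)

shared_rate = shared.hits / (shared.hits + shared.misses)
private_hits = sum(c.hits for c in private)
private_total = sum(c.hits + c.misses for c in private)
print(f"shared cache hit rate:   {shared_rate:.2f}")
print(f"private caches hit rate: {private_hits / private_total:.2f}")
```

Because the combined working set (128 pages) exceeds the shared 64-entry cache, the cyclic access pattern evicts every entry before reuse and the shared cache never hits, while the same capacity split per device caches each working set completely after the first pass.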
Figure 6. Bottlenecks of centralized address translation for multi-device DMA

To address these limitations, the PCIe specification defines mechanisms to optionally enable ATS and device-side ATCs within PCIe I/O functions. This allows an I/O device to participate directly in the address translation process and to cache translations locally.

The key benefits of implementing ATC within the device include:

  • Reduced translation load on the host-side TA by distributing caching responsibility, lowering the risk of cache thrashing
  • Decreased performance dependency on the size of the system ATC
  • Improved and more predictable access latency by issuing pre-translated requests directly to the PCIe root complex
Figure 7. PCIe endpoint with integrated ATC

QEMU-based ATC evaluation

QEMU, a widely adopted open-source emulator, was used to enable and evaluate ATC functionality at the PCIe I/O function (endpoint) level. ATC experiments were conducted to assess the impact of device-side address translation caching under high-traffic workloads targeting an NVMe SSD. The entire evaluation environment, including host, IOMMU, and device components, was emulated using QEMU.

The following host- and device-level parameters were varied to evaluate their impact on ATS/ATC behavior:

  • OS page size / granularity (e.g., 4 KB, 2 MB)
  • Smallest translation unit (STU) configuration (multiple of 4 KB or OS page size)
  • IOMMU ATC / I/O translation lookaside buffer (IOTLB)* size
  • Cache type used in the device-side ATC
  • Device-side ATC size
  • Selected caching algorithm
  • Pinned memory allocation vs. dynamic memory allocation
* IOTLB: A cache inside the IOMMU that stores recent device address translations to reduce DMA address translation latency.
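The interplay between page size, STU, and ATC capacity can be made concrete with simple arithmetic. The sketch below assumes the translation granularity equals one STU and uses hypothetical buffer and cache sizes:

```python
# Hypothetical sizes; PCIe ATS only requires completions to be a
# multiple of the STU, here assumed equal to the page size.
KB, MB = 1024, 1024**2

def atc_entries_needed(buffer_bytes, translation_size):
    """Entries an ATC must hold to fully map one DMA buffer."""
    return -(-buffer_bytes // translation_size)   # ceiling division

def atc_reach(num_entries, translation_size):
    """Total address range covered by a fully populated ATC."""
    return num_entries * translation_size

buf = 16 * MB
print(atc_entries_needed(buf, 4 * KB))    # 4096 entries with 4 KB pages
print(atc_entries_needed(buf, 2 * MB))    # 8 entries with 2 MB huge pages
print(atc_reach(256, 4 * KB) // MB)       # 1   (MB covered by 256 entries)
print(atc_reach(256, 2 * MB) // MB)       # 512 (MB covered by 256 entries)
```

This is why the page-size and STU parameters above matter: with huge pages, the same number of ATC entries covers orders of magnitude more address space, directly reducing miss pressure.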
Figure 8. QEMU-based experimental environment for ATS and ATC evaluation

ATS and ATC profiling and performance characterization

Using the QEMU-based setup described in Figure 8, on-device ATC and ATS were enabled on an emulated NVMe device attached to a QEMU VM.

Profiling and performance characterization were conducted by measuring IOTLB hit/miss events and on-device ATC hit/miss events while running random read/write I/O workloads with the flexible I/O tester (FIO) on the NVMe device, with ATS and ATC both enabled and disabled for comparison.

Basic setup details are as follows:

  • A nested virtualization setup in which the Level-1 (L1) guest is configured with emulated NVMe devices augmented with on-device ATC
  • A virtual IOMMU (emulating Intel VT-d) augmented to support configurable IOMMU IOTLB (host ATC)
  • QEMU debug enabled using backend logging via '--enable-trace-backends=log' (trace output to stdout)
  • QEMU tracing enabled during VM launch to capture:
  • vtd_iotlb_*: Events related to the host-side IOTLB (IOMMU ATC)
  • pcie_atc_*: Events related to PCIe on-device ATC
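With the logging backend, these trace events arrive as plain text on stdout. The sketch below shows the kind of post-processing that tallies hit/miss counts per cache; the sample lines and field layout are illustrative only, since exact event names vary by QEMU version and the pcie_atc_* events come from the authors' augmented device model:

```python
import re
from collections import Counter

# Illustrative trace output only; real event names and fields differ.
sample_log = """\
vtd_iotlb_page_hit iova=0x7f1000 slpte=0x8000
vtd_iotlb_page_miss iova=0x7f2000
pcie_atc_hit addr=0x7f1000
pcie_atc_miss addr=0x7f3000
pcie_atc_hit addr=0x7f1000
"""

def count_events(log):
    """Tally hit/miss counters per cache from QEMU trace lines."""
    counts = Counter()
    for line in log.splitlines():
        m = re.match(r"(vtd_iotlb|pcie_atc)_(\w*?)(hit|miss)\b", line)
        if m:
            cache, _, outcome = m.groups()
            counts[(cache, outcome)] += 1
    return counts

counts = count_events(sample_log)
print(dict(counts))
```

Running the same tally over full workload runs, with ATS toggled on and off, yields the hit/miss totals compared in Figure 9.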

Configuration and execution parameters:

  • A series of benchmarks were executed to analyze IOMMU IOTLB thrashing behavior under concurrent I/O traffic
  • On-device ATC page sizes were varied (2 MB huge pages vs. 4 KB standard pages) to evaluate ATC efficiency and capacity impact
  • Host IOTLB and device ATC lookup, hit, and miss events were collected for quantitative analysis
  • Multiple configuration combinations were evaluated on the QEMU VM:
  • ATS enabled / disabled
  • Different ATC sizes (number of ATC entries)
  • Workload type (FIO); the job file used is shown below

[global]
name=fiotest
ioengine=libaio
direct=1
iodepth=32
group_reporting
time_based=1
runtime=60
startdelay=5
 
[random-rw-test1]
rw=randrw
bs=4k
size=16G
numjobs=8
filename=/dev/nvme0n1
Figure 9. IOTLB and on-device ATC hit/miss profiling results
* IOTLB size and ATC size are expressed as the number of cache entries.
 

 

Impact of ATS and ATC presence

In the absence of PCIe function–level ATCs, the number of IOMMU IOTLB lookups and hits increases significantly and scales with the number of attached and active devices. A high volume of IOTLB lookups is a critical concern, as it can lead to head-of-line blocking in real hardware implementations where physical circuitry constraints apply.

By contrast, introducing function-level ATCs dramatically reduces the number of IOMMU IOTLB lookups, thereby mitigating the risk of such bottlenecks.

As shown in Figure 9, when PCIe ATC is disabled and a random read/write workload is executed on the NVMe device, a large number of IOTLB lookups and hits are observed across different IOTLB sizes. When PCIe ATC is enabled, however, most address translations are handled by the device’s local ATC. As a result, IOMMU IOTLB lookups are significantly reduced across all ATC size configurations.

A key outcome of this analysis is that these improvements are achieved without any modification to host software. The Linux kernel natively leverages ATS/ATC capabilities within the root complex, TA, and PCIe functions, including cache management operations such as invalidation and eviction.

One limitation of the QEMU-based analysis is that it does not capture the additional latency associated with in-function ATC lookups. In practice, random I/O workloads may incur some latency overhead due to these local cache checks. However, this latency is now handled at the device level, eliminating the need for the TA to perform additional address lookups—thereby removing a major source of contention from the host IOMMU.
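This trade-off can be framed as a simple expected-latency model. All latency numbers below are hypothetical placeholders, not measurements, and real values depend heavily on the platform; the point is only the shape of the calculation:

```python
# Hypothetical per-step latencies in nanoseconds (NOT measured values).
IOTLB_HIT = 20        # IOMMU IOTLB hit
PAGE_WALK = 300       # multi-level page table walk on IOTLB miss
ATC_HIT = 5           # local, on-device ATC lookup
ATS_ROUND_TRIP = 400  # ATS Translation Request/Completion to the TA

def avg_translation_ns(atc_enabled, atc_hit_rate, iotlb_hit_rate):
    """Expected per-request translation latency under the simple model."""
    if atc_enabled:
        # Hit: local lookup only. Miss: lookup plus an ATS round trip.
        return (atc_hit_rate * ATC_HIT
                + (1 - atc_hit_rate) * (ATC_HIT + ATS_ROUND_TRIP))
    # Without a device ATC, every request queries the IOMMU IOTLB.
    return (iotlb_hit_rate * IOTLB_HIT
            + (1 - iotlb_hit_rate) * (IOTLB_HIT + PAGE_WALK))

print(f"{avg_translation_ns(False, 0.0, 0.6):.0f} ns without device ATC")
print(f"{avg_translation_ns(True, 0.95, 0.0):.0f} ns with 95% ATC hit rate")
```

Even though every device-side request now pays the small local lookup cost, a reasonable ATC hit rate keeps the expected latency well below the IOMMU-only path, and, just as importantly, the misses are the only traffic the TA ever sees.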

 


 


 
 
* The contents of this blog are provided for informational purposes only. No representation or warranty (whether express or implied) is made by Samsung or any of its affiliates and their respective officers, advisers, agents, or employees (collectively, "Samsung") as to the accuracy, reasonableness or completeness of the information, statements, opinions, or matters contained in this blog, and they are provided on an "AS-IS" basis. Samsung will not be responsible for any damages arising out of the use of, or otherwise relating to, the contents of this blog. Nothing in this blog grants you any license or rights in or to information, materials, or contents provided in this blog, or any other intellectual property.
 
* The contents of this blog may also include forward-looking statements. Forward-looking statements are not guarantees of future performance, and the actual developments of Samsung, the market, or the industry in which Samsung operates may differ materially from those made or suggested by the forward-looking statements contained in this blog.
 