Write amplification (WA) has long been one of the critical technical challenges in SSDs: the volume of data an SSD physically writes consistently exceeds the volume of the write requests issued by the host. This mismatch fundamentally stems from the nature of NAND flash memory, which must be erased in block units, whereas host writes occur in page units that are often scattered across multiple blocks. When the proportion of valid data in a block falls below a threshold, the SSD controller initiates garbage collection (GC), migrating the remaining valid data to a newly erased block so that the original block can be reused. This process inevitably introduces additional write operations, which degrade system-level performance by reducing throughput and increasing latency, and which accelerate SSD wear-out.
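The degree of WA is commonly quantified by the write amplification factor (WAF), the metric used in the evaluation later in this article:

WAF = (data physically written to NAND) / (data written by the host)

A WAF of 1 is the theoretical minimum, reached only when the device performs no writes beyond those requested by the host.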
Addressing WA has therefore become a pivotal research focus in the SSD domain, driving the development of advanced data placement technologies. Such technologies require hardware/software co-design to optimize the physical mapping of host data to NAND media. The primary objective is to minimize unnecessary GC, thereby fundamentally suppressing the mechanisms that generate WA.
Flexible data placement (FDP) is defined in the NVM Express® Base Specification Revision 2.1. Building on the strengths of earlier data placement approaches, FDP significantly reduces the excess write traffic caused by unnecessary GC, while also keeping storage-stack adaptation complexity low through its command interface. In addition, FDP SSDs maintain backward compatibility, enabling reuse of existing software stacks. This balance establishes a practical tradeoff between mitigating WA and preserving software ecosystem compatibility.
FDP overcomes the limitations of conventional SSDs’ passive data management by exposing device resource information to the host and providing a data classification interface. Through this interface, the host can actively classify and place data (e.g., cold and hot data) into different storage units on the SSD according to their characteristics. This mechanism enables data segregation and fine-grained layout optimization of storage units, fundamentally reducing redundant data migrations caused by the mixed placement of data with varying lifetimes.
For an in-depth analysis of FDP implementation mechanisms, please refer to the technical white paper “Introduction to Flexible Data Placement: A New Era of Optimized Data Management.”
Starting with kernel version 4.13, the Linux virtual file system (VFS) provides a file lifetime hint mechanism that categorizes files into four tiers: SHORT (short-lived), MEDIUM (medium-lived), LONG (long-lived), and EXTREME (extremely long-lived). Applications can explicitly assign these lifetime hints to files using the fcntl() system call. In addition, the VFS defines two special values: NOT_SET, the default applied when a file's lifetime has not been declared, and NONE, which explicitly indicates that the file is not associated with any lifetime. When integrated with FDP SSDs, this mechanism maps files of different lifetimes to predefined data streams within the FDP, thereby reducing WA.
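As an illustration, the sketch below shows how an application might tag a file as short-lived via fcntl(). It assumes Linux 4.13 or later and a C library that exposes F_SET_RW_HINT and the RWH_WRITE_LIFE_* constants; the file name is arbitrary.

```cpp
#include <cstdint>
#include <cstdio>
#include <fcntl.h>    // open(), F_SET_RW_HINT, RWH_WRITE_LIFE_* (glibc >= 2.27)
#include <unistd.h>   // close()

int main() {
  // Example file; a WAL-like, short-lived file is used for illustration.
  int fd = open("wal.log", O_CREAT | O_WRONLY | O_APPEND, 0644);
  if (fd < 0) {
    perror("open");
    return 1;
  }

  // Declare the expected write lifetime of this file's data. On an FDP SSD,
  // the kernel can map this hint to one of the device's data streams.
  uint64_t hint = RWH_WRITE_LIFE_SHORT;
  if (fcntl(fd, F_SET_RW_HINT, &hint) == -1) {
    perror("fcntl(F_SET_RW_HINT)");
  }

  // Subsequent writes through fd inherit the declared lifetime.
  close(fd);
  return 0;
}
```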
As a representative use case, consider RocksDB, an open-source, high-performance embedded key-value store built on the log-structured merge-tree (LSM-Tree), which organizes data into levels. RocksDB relies on append-only writes for high throughput and on background compaction to maintain low read latency and storage efficiency. The sorted string table (SSTable) is its foundational file format for persistent storage. SSTables store key-value pairs in sorted order and are organized into levels: newer data resides in level 0, while older data is progressively merged into higher levels through compaction.
RocksDB's default data classification strategy aligns with the VFS file lifetime hints as follows: write-ahead log (WAL) files are marked as SHORT; level 0 and level 1 SSTable files as MEDIUM; level 2 files as LONG; and level 3 and higher as EXTREME. Other files (including MANIFEST, CURRENT, and checkpoint logs) are not explicitly marked and default to NOT_SET.
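To make the default policy concrete, the following sketch expresses it as a level-to-hint function using RocksDB's public Env::WriteLifeTimeHint enum. The function name is ours; this is an illustration of the policy, not RocksDB's internal implementation.

```cpp
#include "rocksdb/env.h"  // rocksdb::Env::WriteLifeTimeHint (WLTH_*)

// Default policy as described above. WAL files are hinted separately as
// WLTH_SHORT; SSTable hints depend only on the LSM-Tree level.
rocksdb::Env::WriteLifeTimeHint DefaultSstWriteHint(int level) {
  if (level <= 1) return rocksdb::Env::WLTH_MEDIUM;  // levels 0 and 1
  if (level == 2) return rocksdb::Env::WLTH_LONG;    // level 2
  return rocksdb::Env::WLTH_EXTREME;                 // level 3 and higher
}
```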
Our analysis of real-world RocksDB workloads shows that SSTable files display distinct lifetime distribution characteristics across the LSM-Tree hierarchy. Files in levels 0 to 3 typically exhibit shorter residence times, while those in levels 4 and above persist much longer.
Building on these findings, we propose an optimized lifetime classification strategy that replaces the default configuration. In this scheme, WAL file markings remain unchanged. Level 0 to 3 files are uniformly marked as MEDIUM, level 4 files as LONG, level 5 and above files as EXTREME, and all other unclassified files are marked as NOT_SET. This revised mapping better reflects the actual lifetimes of RocksDB data, as summarized in Table 1.
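Expressed in the same form as the DefaultSstWriteHint() sketch above, the optimized mapping becomes:

```cpp
#include "rocksdb/env.h"  // rocksdb::Env::WriteLifeTimeHint (WLTH_*)

// Optimized policy: levels 0-3 have short residence times and share one hint,
// level 4 is long-lived, and level 5 and above are extremely long-lived.
// WAL files keep the WLTH_SHORT hint.
rocksdb::Env::WriteLifeTimeHint OptimizedSstWriteHint(int level) {
  if (level <= 3) return rocksdb::Env::WLTH_MEDIUM;  // levels 0-3
  if (level == 4) return rocksdb::Env::WLTH_LONG;    // level 4
  return rocksdb::Env::WLTH_EXTREME;                 // level 5 and higher
}
```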
We conducted a comparative evaluation of our optimized RocksDB classification scheme on a Samsung PM9D3a FDP SSD (U.2, 7.68 TB) with the XFS file system. Both the native and optimized schemes were tested on FDP SSDs as well as regular SSDs, loading and updating 200 million records with the Yahoo! Cloud Serving Benchmark (YCSB). The results showed that the native scheme reduced WAF by roughly 8% on FDP SSDs compared to regular SSDs, while the optimized scheme achieved a 30% WAF reduction, a 10% increase in operations per second (OPS), and a 55% improvement in p99.9 tail latency.
Beyond file system-level support, we developed an end-to-end solution for RocksDB on FDP SSDs. RocksDB provides wrapper APIs for different storage backends, which we extended to implement TorFS—an FDP-based RocksDB plugin. TorFS integrates FDP features to enable data routing, supports multiple I/O paths through the open-source xNVMe library, and provides a standardized I/O interface for developers to build customized I/O paths.
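For illustration, the sketch below shows how such a FileSystem plugin could be wired into RocksDB. The NewTorFS() factory and the device address are hypothetical placeholders (here stubbed with the default FileSystem so the sketch builds), and it assumes a recent RocksDB release in which NewCompositeEnv() returns a std::unique_ptr<Env>.

```cpp
#include <memory>
#include <string>

#include "rocksdb/db.h"
#include "rocksdb/env.h"          // NewCompositeEnv()
#include "rocksdb/file_system.h"  // rocksdb::FileSystem

// Stand-in for the TorFS factory (name hypothetical). A real build would link
// against the plugin, which would open the FDP device through xNVMe; here we
// fall back to the default FileSystem so the sketch compiles and runs.
rocksdb::Status NewTorFS(const std::string& /*device_uri*/,
                         std::shared_ptr<rocksdb::FileSystem>* fs) {
  *fs = rocksdb::FileSystem::Default();
  return rocksdb::Status::OK();
}

int main() {
  std::shared_ptr<rocksdb::FileSystem> fs;
  rocksdb::Status s = NewTorFS("0000:03:00.0", &fs);  // illustrative device address
  if (!s.ok()) return 1;

  // Wrap the custom FileSystem in an Env and hand it to RocksDB.
  std::unique_ptr<rocksdb::Env> env = rocksdb::NewCompositeEnv(fs);

  rocksdb::Options options;
  options.create_if_missing = true;
  options.env = env.get();

  rocksdb::DB* db = nullptr;
  s = rocksdb::DB::Open(options, "/tmp/torfs_demo", &db);
  if (!s.ok()) return 1;

  // ... normal RocksDB usage; with the real plugin, writes are routed by
  // TorFS onto the appropriate FDP data streams ...

  delete db;
  return 0;
}
```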
Evaluation of TorFS showed that RocksDB can effectively suppress WA, achieving a WAF close to the theoretical minimum of 1, while simultaneously delivering high throughput and low latency.
Our experiments confirm that FDP improves storage resource utilization and overall system performance through data lifetime management, while also extending SSD lifespan. These results highlight FDP’s potential to align data placement strategies with underlying media characteristics, offering significant benefits for hyperscale data center architectures.