NVMe FDP - A promising new SSD data placement approach
Written by Arun George, Global Open-ecoSystem Team (GOST)
The Flexible Data Placement (FDP) technology has garnered significant attention in the NVMe SSD world lately. This blog post delves into the details of NVMe FDP and introduces its key concepts. We will discuss how FDP fits into the NVMe ecosystem and what it takes to integrate FDP into a software stack. This blog post is for you if you are an engineer, architect, manager, marketing professional, or simply a tech enthusiast interested in recent innovations.
1. Recap on WAF and data placement
Write amplification is a well-known problem in the field of SSDs. The Write Amplification Factor (WAF) indicates the extra NAND media writes the SSD performs in response to host writes (WAF = total NAND media writes / total host issued writes). These additional media writes are a consequence of how NAND flash handles writes: a NAND page, once programmed, cannot be overwritten unless the entire flash block (1 block = N pages) is erased again. Since erase is a costly operation, the SSD firmware avoids unnecessary erases by handling writes in a log-structured manner: every overwrite is redirected to a new flash page and the old page is marked invalid.

Over time, many such invalid pages accumulate. The Garbage Collection (GC) process of the SSD reclaims them by moving the remaining valid pages to a new flash block, releasing the old blocks for erase and, eventually, for new writes. The relocation of valid pages adds media writes on top of the ongoing host writes, and this is the root cause of the WAF problem in SSDs. The severity of the problem varies with the active host workload. For example, a sequential workload may cause little write amplification because it aligns with the SSD's log-structured write design, while a random workload with many overwrites can cause high WAF.
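As a back-of-the-envelope illustration of the formula above (the numbers are made up, not measurements from any particular drive), consider a drive whose GC relocates 400 GiB of valid data while servicing 1000 GiB of host writes:

```python
# Toy WAF calculation with illustrative, made-up numbers.
host_writes_gib = 1000            # data written by the host
gc_relocated_gib = 400            # valid pages moved by garbage collection
media_writes_gib = host_writes_gib + gc_relocated_gib

waf = media_writes_gib / host_writes_gib
print(f"WAF = {media_writes_gib} / {host_writes_gib} = {waf:.2f}")   # -> WAF = 1400 / 1000 = 1.40
```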
SSDs normally employ a certain level of overprovisioned blocks to cushion the impact of GC and reduce WAF. Overprovisioning (OP) is the practice of keeping more NAND media capacity than the capacity exposed to the host. Furthermore, certain host ecosystems extend this device OP with host-level overprovisioning, achieved by limiting SSD utilization to well below 100% of its capacity. But this results in inefficient utilization of SSD capacity.
SSD data placement technologies attempt to solve the WAF problem by coordinating data placement between the host and the SSD. Some of them, like Open-Channel SSD and NVMe Streams, did not find much traction in the industry, while others, like Zoned Namespace (ZNS), faced challenges because they required the host to modify its data patterns and software stack to achieve the WAF reduction. So none of them gained enough traction to become the solution the industry had been looking for. The previous blog post in this data placement series is an excellent resource for an in-depth history of data placement technologies.
2. FDP
The NVMe Flexible Data Placement (FDP) specification, ratified as TP4146, is a new approach in the realm of SSD data placement technologies. FDP is the result of the experience the industry has gained so far on this topic. It attempts to provide most of the WAF benefits to the host with minimal changes to its stack, while providing enough hooks for data placement.
The concepts behind FDP's efficient data placement are detailed in the whitepaper "Introduction to Flexible Data Placement: A New Era of Optimized Data Management". We will focus only on the core concepts here for simplicity. Beginners can bootstrap on FDP with the whitepaper "Getting Started with Flexible Data Placement (FDP)", which details the host ecosystem and the utilities needed to start playing around with an FDP drive.
The key ideas in FDP are summarized in the following diagram. While a conventional namespace (CNS) SSD exposes only a range of logical block addresses (LBAs) to the host, a data placement enabled SSD also exposes attributes pertinent to the media topology. An FDP SSD exposes enough attributes to let the host segregate its data streams and align its data structures to media boundaries. At the same time, it does not put any additional burden on the host for garbage collection of media blocks or any other media management.
2.1 Simplified view of Media Topology
FDP allows the host to have a simplified view and moderate awareness of media topology.
An FDP SSD retains control over the logical-to-physical mapping, garbage collection and bad block management of the NAND media, just like a conventional drive. By comparison, a ZNS drive requires the host to take control of the logical-to-physical mapping and garbage collection. The host addresses an FDP drive using the same logical addressing (LBAs) as a conventional drive, except for an optional hint about data placement. An FDP drive exposes a few key constructs that the host can take advantage of for data placement:
- Reclaim Unit (RU): An RU is the set of NAND blocks onto which host-written data is programmed. An RU may be equal to a superblock or a sub-superblock, depending on the device firmware design (a superblock is a collection of NAND erase blocks spanning all channels, banks, dies and planes). The RU is the granularity at which the host can track writes and GC events. RU sizes are typically a few GBs in current implementations.
- Reclaim Group (RG): An RG is a collection of RUs. The RG concept allows the host to isolate data at the die level. A typical RG consists of one die or multiple dies.
2.2 Host's ability to segregate data streams
FDP provides a construct called the Reclaim Unit Handle (RUH), which allows the host to write to multiple RUs at the same time. This is similar to the erstwhile NVMe Streams. Note that a conventional SSD always writes to a single head block (or superblock); only when the current head block is fully programmed is the next one picked. FDP provides multiple append points, in the form of multiple head blocks, for incoming host data. The host selects an append point by specifying the RUH the data should be programmed into. The number of RUHs depends on the device configuration and typically ranges from 1 to 128 based on product requirements.
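As a rough sketch of how a host might exploit multiple append points, the snippet below maps data classes to RUH indices. The class names and the three-handle layout are assumptions made for illustration; a real integration would derive the available handle count from the drive's reported FDP configuration.

```python
# Hypothetical mapping of host data classes to Reclaim Unit Handles (RUHs).
# The handle count would normally be read from the drive's FDP configuration.
RUH_BY_CLASS = {
    "metadata": 0,   # small, frequently overwritten structures
    "hot":      1,   # user data with short lifetime / many overwrites
    "cold":     2,   # user data that is rarely rewritten
}

def placement_handle(data_class: str) -> int:
    """Pick the RUH (append point) for a write of the given class."""
    # Unknown classes simply default to handle 0.
    return RUH_BY_CLASS.get(data_class, 0)

print(placement_handle("hot"))   # -> 1
```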
2.3 Feedback from the device
One of the key factors differentiating FDP from similar technologies like NVMe Streams is that the device provides active feedback to the host on the effectiveness of the placement hints it receives. Typical feedback is expressed in terms of the garbage collection (GC) events produced. This feedback can be collected periodically from the FDP log pages and used to fine-tune the placement decisions. Keep in mind that the feedback mechanism might not be in the SSD's performance path and hence should not be relied on in the critical IO path of the host software.
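A minimal sketch of such a feedback loop is shown below. It assumes the host periodically samples two counters, along the lines of host bytes written and media bytes written, from the FDP statistics log page (for example via nvme-cli or a passthrough get-log-page; the reading and parsing are omitted here) and uses the deltas to estimate the device-side WAF, which can then drive adjustments to the placement policy.

```python
from dataclasses import dataclass

@dataclass
class FdpStatsSample:
    """One sample of FDP statistics counters, e.g. parsed from an
    FDP statistics log page read via nvme-cli or NVMe passthrough."""
    host_bytes_written: int
    media_bytes_written: int

def device_waf(prev: FdpStatsSample, cur: FdpStatsSample) -> float:
    """Estimate device-side WAF between two samples of the counters."""
    host_delta = cur.host_bytes_written - prev.host_bytes_written
    media_delta = cur.media_bytes_written - prev.media_bytes_written
    return media_delta / host_delta if host_delta else float("nan")

# Illustrative numbers only: 100 GiB host writes caused ~105 GiB media writes.
before = FdpStatsSample(host_bytes_written=0, media_bytes_written=0)
after = FdpStatsSample(host_bytes_written=100 << 30, media_bytes_written=105 << 30)
print(f"device WAF ~ {device_waf(before, after):.2f}")   # -> ~1.05
```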
3. What does FDP mean for the SSD?
It is interesting to take a peek at what it means to enable FDP at the SSD device level. The first diagram explains the architecture of a conventional drive: part A) shows a simplified high-level media layout of the drive, with a single head block for host writes. Part B) breaks this down into channels and dies and shows how the write append point is managed across them. The key point is that each block in the combination has a write position, in terms of NAND pages, which corresponds to the head superblock's append point. Note that this is a simplified view and particular implementations may vary.
The FDP drive architecture diagram below gives a slightly different view, though the implementation is largely similar to that of a conventional drive. An FDP drive exposes multiple append points in the form of Reclaim Unit Handles, while a conventional drive has only one append point for host writes. The following diagram shows the case where the size and composition of a Reclaim Unit equal those of a superblock. This need not be true in all implementations and may vary based on product requirements. Some implementations might choose a sub-superblock Reclaim Unit size, which spans only a fraction of the <Channel, Die, Plane> combination.
3.1 Multiple append points as RUHs
Since an FDP SSD provides multiple append points as Reclaim Unit Handles, it might have to reserve additional resources, such as write buffer stripes, to support them. These overheads, together with the needs of a particular deployment, determine the number of RUHs a given SSD can support. One way to support many RUHs is to configure the Reclaim Unit to be sub-superblock sized, which allows multiple host append points within the same superblock. This might have performance implications, since a superblock-sized configuration is designed to deliver maximum throughput by programming across <Channel, Die, Plane> combinations concurrently. So selecting the number of append points (RUHs) is a design decision that should be made at SSD design time, weighing the performance needs against the flexibility required.
It can be summarized as:
- small number of append points - better performance guarantees with lower flexibility for the host
- large number of append points - lower performance guarantees with higher flexibility for the host
3.2 What is the optimal RU, RG configuration?
Since a sub-superblock RU size allows for more RUHs, this is the configuration considered for many FDP based SSDs. While it may seem better to have as many RUHs and RGs as possible for efficient host control over data placement, doing so can have performance ramifications. It is therefore an SSD architect's design choice to pick between <small RU, many RUHs> and <big RU, few RUHs>, depending on whether the SSD should support finer host control over data placement at lower performance, or coarser placement control at maximum performance. The same argument applies to the RG configuration: more RGs allow finer placement control, but the responsibility for achieving maximum throughput then shifts to the host. Note that an FDP enabled SSD might support only a few <RG, RUH> configurations to select from (it can be as few as one).
So,
- <small RU, many RUHs> or <many RGs> - more host control with weaker performance guarantees from the SSD
- <big RU, few RUHs> or <few RGs> - less host control with stronger performance guarantees from the SSD
3.3 Initially Isolated vs Persistently Isolated RUHs
FDP allows the SSD to define Initially Isolated and Persistently Isolated RUHs. With an Initially Isolated RUH, data segregation is guaranteed at host write time, but not throughout the life of the data: a later GC pass may intermix this data with data from other RUHs. With a Persistently Isolated RUH, the data is guaranteed to stay segregated from the data of other RUHs for its entire lifetime. Persistently Isolated RUHs look good in theory, but they require more resources in the SSD to implement. So expect the initial FDP devices to roll out with Initially Isolated RUHs, which achieve the basic segregation at lower cost. The FDP integration results on Meta's CacheLib show that Initially Isolated RUHs are enough to achieve a WAF close to 1.
4. What does FDP mean for the host?
One key aspect of FDP is that it is backward compatible and optional. An FDP enabled SSD continues to work with an FDP agnostic software stack as before, without any overheads. Likewise, an FDP enabled software stack works seamlessly with a non-FDP SSD; the only downside is that the added WAF benefits of FDP are then unavailable. Since FDP is an optional feature, an FDP enabled SSD processes the host-provided hints only when they are enabled in the IO command (the DTYPE and DSPEC fields carry them).
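As an illustration of how the hint travels in the IO command, the sketch below packs the directive fields into the write command dwords the way the NVMe base specification lays them out (DTYPE in bits 23:20 of CDW12, DSPEC in bits 31:16 of CDW13); the directive type value of 2h for data placement comes from TP4146. The surrounding values are placeholders, and the field layout is worth verifying against the specification revision you target.

```python
# Sketch: packing an FDP placement hint into NVMe write command dwords.
# Field positions follow the NVMe base spec (DTYPE: CDW12[23:20],
# DSPEC: CDW13[31:16]); double-check against your spec revision.
DTYPE_DATA_PLACEMENT = 0x2   # Data Placement directive type (TP4146)

def write_cdw12_cdw13(nlb: int, dtype: int, dspec: int) -> tuple[int, int]:
    """Build CDW12/CDW13 for an NVMe write carrying a placement hint.

    nlb   : number of logical blocks (0-based), lower 16 bits of CDW12
    dtype : directive type (0 = none, 2h = data placement)
    dspec : directive specific value, i.e. the placement hint
    """
    cdw12 = (nlb & 0xFFFF) | ((dtype & 0xF) << 20)
    cdw13 = (dspec & 0xFFFF) << 16
    return cdw12, cdw13

# Hypothetical example: an 8-block write directed at placement hint 3.
print([hex(d) for d in write_cdw12_cdw13(nlb=7, dtype=DTYPE_DATA_PLACEMENT, dspec=3)])
```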
4.1 When should we choose FDP
FDP is an ideal technology if the host wants to get most of the data placement benefits without much engineering effort. By 'most' we mean an SSD write amplification factor (WAF) very close to the ideal value of 1 under most scenarios. Achieving the ideal WAF of 1 is quite possible with FDP, depending on the workload.
If the host wants to achieve an SSD WAF of 1 in all scenarios, then mechanisms with strict sequential write interfaces, such as ZNS, might be a better fit. But if the host software stack was not designed for sequential IO, it has to be re-engineered to produce IO patterns that suit those interfaces. That re-engineering adds application-level WAF, which negates the device-level WAF gains, leaving the end-to-end WAF the same. So strict write interfaces work well only if the host software was designed from scratch to adhere to 'WAF = 1'. If the host wants most of the WAF benefits without much engineering effort and without any impact on application-level WAF, then FDP is the ideal choice.
4.2 Design Considerations for FDP
4.2.1 Which workloads could benefit from RUH segregation?
The first step is to analyze the workload and determine whether it can benefit from RUH segregation. A workload with multiple distinct IO patterns and hot data (many overwrites) will typically cause high WAF and will certainly benefit from FDP's RUH based segregation.
For example, the following workloads could easily benefit from FDP:
- Workloads with inherent hot / cold data separation, like SSD caches
- Workloads which require tenant-based isolation
- Workloads with significant metadata overhead (where metadata and data can be segregated for WAF and tail latency improvements)
- Workloads that can be identified to have separate write threads
4.2.2 When should the host use RG segregation?
The Reclaim Group concept of FDP provides an interesting level of placement control for the host. If the FDP SSD provides one RG per die (or per set of dies), the host can place data that requires isolation into separate RGs. One scenario where such isolation is useful is separating tenants in a multi-tenant use case: it insulates one tenant from the performance impact (GC events and so on) of the others. But it may come at the cost of reduced performance, because smaller RGs offer correspondingly lower bandwidth due to the reduced parallelism in the SSD's internal write operations. In many scenarios, a simple scheme of one RG per device, letting the SSD derive the best performance from its internal parallelism, is sufficient.
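As a conceptual sketch of such a multi-tenant policy (it models the placement decision only, not the exact placement identifier encoding defined by the spec), the snippet below pins each tenant to its own reclaim group and picks a handle within that group; the group and handle counts are hypothetical device parameters.

```python
# Conceptual multi-tenant placement policy: one Reclaim Group per tenant.
# The (rg, ruh) pair is a model of the placement decision, not the exact
# on-the-wire encoding defined by the FDP specification.
NUM_RECLAIM_GROUPS = 4        # hypothetical device configuration
RUHS_PER_GROUP = 2            # e.g. one handle for data, one for metadata

def tenant_placement(tenant_id: int, is_metadata: bool) -> tuple[int, int]:
    """Return (reclaim_group, ruh) for a tenant's write."""
    rg = tenant_id % NUM_RECLAIM_GROUPS          # isolate tenants per die group
    ruh = 1 if is_metadata else 0                # segregate metadata within the group
    return rg, ruh

print(tenant_placement(tenant_id=5, is_metadata=False))   # -> (1, 0)
```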
4.2.3 Should the host track writes in an RU?
FDP allows the host to track the writes going into an RU. Using the FDP spec, the host can learn how many bytes remain in an RU, and it can even instruct the SSD to start a new RU for a given RUH when needed. While host architects might rejoice at such a provision, it needs to be handled with care. The commands for RU tracking might be implemented as admin commands and can be significantly slower than IO commands. So it is better to design RU write tracking in the control path, as an offline verification mechanism, rather than as an online IO tracking method. RU write tracking also becomes difficult when multiple threads write to the same RUH at the same time. One more factor to consider is the possibility of command reordering in the SSD controller when an IO queue depth greater than 1 is employed.
So the host is recommended to do RU write tracking only when:
- Only a single thread is writing to a given RUH
- NVMe IO queue depth = 1 is used
In short, optimizations using RU level tracking make sense when the host has consistent RU-sized (or similarly sized) data movements. Otherwise, the tracking overheads might outweigh the benefits. One also needs to check whether RU tracking would be complex to implement in existing applications, as it involves modifying IO path behavior. And it will usually be expensive in the 'QD = 1' case due to the increased latencies.
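For completeness, here is a minimal sketch of such offline RU accounting under the single-writer, queue-depth-1 assumptions listed above. The RU size is a placeholder; in practice it would come from the drive's reported FDP configuration.

```python
# Minimal RU write accounting, valid only under the single-writer /
# queue-depth-1 assumptions discussed above. RU size is a placeholder.
RU_SIZE_BYTES = 6 * (1 << 30)   # hypothetical ~6 GiB reclaim unit

class RuTracker:
    def __init__(self, ru_size: int = RU_SIZE_BYTES):
        self.ru_size = ru_size
        self.bytes_in_current_ru = 0
        self.ru_switches = 0

    def on_write(self, nbytes: int) -> None:
        """Account a completed write against the current reclaim unit."""
        self.bytes_in_current_ru += nbytes
        while self.bytes_in_current_ru >= self.ru_size:
            self.bytes_in_current_ru -= self.ru_size
            self.ru_switches += 1     # the device has moved on to a new RU

tracker = RuTracker()
tracker.on_write(4 * (1 << 30))
tracker.on_write(3 * (1 << 30))
print(tracker.ru_switches, tracker.bytes_in_current_ru >> 30)   # 1 switch, ~1 GiB into the next RU
```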
5. How does FDP actually work?
It is interesting to see how FDP helps achieve a low Write Amplification Factor (WAF). Let us take the case of RUH based segregation with 3 different applications and analyze the SSD's GC behavior.
It is clear from this example that FDP based segregation alleviates the load on the SSD's GC process and lets blocks be released through the workloads' own invalidation patterns. FDP makes sure that this release is not held back by data from a different stream or application with a different lifetime. Thus FDP helps the SSD maintain a WAF of ~1 in the above situation.
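To make this concrete, below is a small self-contained toy simulation: a page-mapped FTL with greedy GC (a deliberate simplification, not any particular firmware) replays three applications with different overwrite rates, first through a single append point and then segregated across three RUH-like append points. The geometry, workload mix and resulting numbers are illustrative only; the point is simply that the segregated run relocates far fewer valid pages and therefore lands much closer to a WAF of 1.

```python
import random
from collections import defaultdict

PAGES_PER_BLOCK, NUM_BLOCKS, GC_THRESHOLD = 128, 64, 4

class ToyFtl:
    """Toy page-mapped FTL: one open (head) block per stream, greedy GC."""
    def __init__(self, num_streams):
        self.free = list(range(NUM_BLOCKS))
        self.valid = defaultdict(set)                 # block -> live LBAs
        self.where = {}                               # lba -> block holding it
        self.fill = defaultdict(int)                  # block -> programmed pages
        self.heads = [self.free.pop() for _ in range(num_streams)]
        self.host_writes = self.media_writes = 0

    def write(self, lba, stream=0):
        self.host_writes += 1
        self._program(lba, stream)
        if len(self.free) < GC_THRESHOLD:             # low on free blocks: reclaim one
            self._gc(stream)

    def _program(self, lba, stream):
        if lba in self.where:                         # overwrite invalidates the old copy
            self.valid[self.where[lba]].discard(lba)
        blk = self.heads[stream]
        self.valid[blk].add(lba)
        self.where[lba] = blk
        self.fill[blk] += 1
        self.media_writes += 1
        if self.fill[blk] == PAGES_PER_BLOCK:         # head full: open a fresh block
            self.heads[stream] = self.free.pop()

    def _gc(self, stream):
        closed = [b for b in self.valid
                  if self.fill[b] == PAGES_PER_BLOCK and b not in self.heads]
        victim = min(closed, key=lambda b: len(self.valid[b]))
        for lba in list(self.valid[victim]):          # relocate still-valid pages
            self._program(lba, stream)
        del self.valid[victim]
        self.fill[victim] = 0
        self.free.append(victim)

def run(num_streams):
    """Three apps with different overwrite temperatures, ~73% logical utilization."""
    random.seed(0)
    apps = [(0, 1000, 5), (1000, 2000, 3), (3000, 3000, 2)]   # (first LBA, span, weight)
    ftl = ToyFtl(num_streams)
    for _ in range(50_000):
        a = random.choices(range(3), weights=[w for _, _, w in apps])[0]
        base, span, _ = apps[a]
        ftl.write(base + random.randrange(span), stream=a % num_streams)
    return ftl.media_writes / ftl.host_writes

print(f"mixed, 1 append point : WAF ~ {run(1):.2f}")
print(f"segregated, 3 RUHs    : WAF ~ {run(3):.2f}")
```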
6. Exciting early results
At Samsung, we enabled FDP on Meta (Facebook)'s CacheLib deployments using the Samsung PM9D3 SSD, and this work has been merged into the mainstream CacheLib code. The integration results were quite exciting. Originally, CacheLib deployments were facing a high WAF of ~3.5 when the SSD was fully utilized, so deployments were forced to reduce SSD utilization to 50% (that is, a host OP of 50%) to keep the SSD WAF below 1.3. With FDP, the CacheLib KV Cache workload maintains a WAF of ~1 even at full SSD utilization. This results in more efficient capacity utilization, better latencies (due to fewer GCs) and lower SSD power consumption (again due to fewer GCs). Please watch out for the next blog post in this series, which details the FDP integration in the CacheLib ecosystem and its sustainability benefits.
7. What is next?
NVMe FDP is an exciting new SSD data placement technology. It is the result of years of industry experience in trying different approaches to solve the WAF problem in SSDs. While providing a moderate level of host control over data placement, it makes sure that the data placement objectives are met without putting any additional burden on the host software stack. The benefits of FDP manifest as lower SSD write amplification, lower tail latencies and reduced SSD power consumption.
The host ecosystem for FDP is ready, and Linux kernel support is available through the io_uring passthrough mechanism. FDP support in the regular Linux block layer path is in the final stages of development as of this writing. Please refer to the whitepaper for more details on the ecosystem support for FDP.
As more people in industry and academia get interested in this new technology, more intriguing use cases are expected to emerge shortly. The future indeed looks bright for FDP.
8. References
1. "Introduction to Flexible Data Placement: A New Era of Optimized Data Management"
https://download.semiconductor.samsung.com/resources/white-paper/FDP_Whitepaper_102423_Final.pdf
2. "Getting Started with Flexible Data Placement"
https://download.semiconductor.samsung.com/resources/white-paper/getting-started-with-fdp-v4.pdf
3. "Nuances in FDP Implementation"
https://www.sniadeveloper.org/events/agenda/session/697
4. "A Brief History of Data Placement Technologies"
https://semiconductor.samsung.com/news-events/tech-blog/a-brief-history-of-data-placement-technologies/