This is part 2 of our series on addressing support for HC SSDs in the software ecosystem.
The demand for larger capacity drives requires the adoption of larger IUs. This creates a challenge for software stacks: how can drives with larger IUs be optimally adopted without any modifications to software applications, and is this even possible? Intel's 2018 white paper on the topic for QLC, titled "Achieving optimal performance & endurance on coarse indirection unit SSDs", suggests that applications should use direct I/O instead of buffered I/O to align writes to the IU, and that applications should allocate their write buffers with allocators that honor explicit alignment requirements, such as libc's posix_memalign() (a sketch of this pattern follows below). Intel's suggestions require software applications to be modified to be aware of the IU, and they leave buffered I/O unsupported. Many workloads need buffered I/O, provided by the Linux page cache, either because of software limitations, as in the case of PostgreSQL, or because of other requirements, such as when working with certain large data set AI workloads.

One way to grow support for large IUs is to require support for larger LBA formats. However, that would first require confidence, gained through I/O introspection, that existing workloads align their writes to the IU; analysis is required to prove this is possible first. Using a larger LBA format is also a non-backward-compatible and non-scalable solution: drive capacities would have to be reduced if smaller LBA formats were to be used, and a complete analysis of the changes required in standards would be needed. With ever larger HC SSDs requiring an ever-increasing LBA size, host software would need to be written to operate dynamically across numerous SSD LBA sizes concurrently. Can anything be done to avoid any software application changes while also supporting buffered I/O, without requiring the industry to move to a new LBA format?
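For reference, the kind of application change Intel's paper calls for looks roughly like the following sketch: allocate an IU-aligned buffer with posix_memalign() and issue IU-sized, IU-aligned direct I/O. The 16 KiB IU and the file path are assumptions for illustration only.

```c
/* Sketch of an IU-aware direct I/O write, assuming a 16 KiB IU. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define IU_SIZE (16 * 1024)	/* assumed indirection unit */

int main(void)
{
	void *buf;

	/* Buffer alignment matches the IU so O_DIRECT writes stay aligned. */
	if (posix_memalign(&buf, IU_SIZE, IU_SIZE))
		return 1;
	memset(buf, 0, IU_SIZE);

	int fd = open("/mnt/data/file.db", O_WRONLY | O_CREAT | O_DIRECT, 0600);
	if (fd < 0)
		return 1;

	/* Offset and length are both multiples of the IU. */
	ssize_t ret = pwrite(fd, buf, IU_SIZE, 0);

	close(fd);
	free(buf);
	return ret == IU_SIZE ? 0 : 1;
}
```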
Linux supports multiple filesystems, and each filesystem designs how its data is written on disk. To protect against power failure, filesystems have different strategies they can use to ensure writes remain consistent. One of these strategies is to embrace a filesystem journal: the idea is that you write to the journal before considering data written on disk. On journal-based filesystems such as XFS and EXT4, in case of power failure the filesystem can replay the journal for unfinished writes. Copy on Write filesystems, such as btrfs, take a slightly different approach: data is written to a new location and the change is then linked in, and changes are not committed until the last write completes. With this strategy, if a power failure occurs the uncommitted portion of data is lost. The btrfs filesystem already relies on 16 KiB for metadata, so if it could write that entire 16 KiB atomically it could simplify its writes in the future. Regardless of which strategy a filesystem follows, this is purely a filesystem design question; it is up to filesystem developers to implement a solution. In order to design a solution, a filesystem needs to know the minimum I/O size it can use to write atomically in case of power failure. This size is known as the filesystem sector size: the minimum size the filesystem can rely on to write without concern for power failure. The actual layout of how a filesystem writes to its journal or metadata varies by filesystem.
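As a toy illustration of the write-ahead idea only (real filesystems lay out their journals very differently), the sketch below makes a sector-sized journal record durable before the in-place write is allowed to begin; the file names and the 4 KiB sector size are assumptions.

```c
/* Toy write-ahead pattern at sector granularity; illustration only. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define SECTOR_SIZE 4096	/* assumed filesystem sector size */

/* Persist a sector-sized journal record before the in-place data write. */
static int journal_then_apply(int journal_fd, int data_fd,
			      const char *record, off_t data_off)
{
	char sector[SECTOR_SIZE] = { 0 };

	strncpy(sector, record, sizeof(sector) - 1);

	/* Step 1: the journal record must be durable first. */
	if (pwrite(journal_fd, sector, SECTOR_SIZE, 0) != SECTOR_SIZE ||
	    fdatasync(journal_fd) != 0)
		return -1;

	/* Step 2: only then may the in-place write start; if power is lost
	 * here, replaying the journal record recovers the update. */
	if (pwrite(data_fd, sector, SECTOR_SIZE, data_off) != SECTOR_SIZE)
		return -1;
	return fdatasync(data_fd);
}

int main(void)
{
	int jfd = open("journal.bin", O_RDWR | O_CREAT, 0600);
	int dfd = open("data.bin", O_RDWR | O_CREAT, 0600);

	if (jfd < 0 || dfd < 0)
		return 1;
	return journal_then_apply(jfd, dfd, "update block 7", 7 * SECTOR_SIZE) ? 1 : 0;
}
```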
The filesystem sector size, then, is the minimum I/O size a filesystem uses when writing its own data: it is the unit the filesystem relies on for writes to the journal and to metadata.
The NVMe parameter Namespace Preferred Write Granularity (NPWG) represents the smallest recommended write granularity. In practice today, most HC NVMe SSDs should be reporting the IU through the NPWG, for two reasons. First, the author of the NVMe specification change that introduced NPWG had the IU in mind as an appropriate value. Second, it's required as part of the Open Compute Project NVMe Cloud Specification. Starting with the v6.15 release of Linux you can query for this with a simple stat --print=%o call on the respective block device.
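The value behind stat --print=%o is st_blksize; a minimal sketch of reading it programmatically follows, with the caveat that on kernels older than v6.15 the value for a block device may reflect the filesystem backing /dev rather than the drive itself. The device path is illustrative.

```c
/* Sketch: read the preferred I/O size that `stat --print=%o` reports. */
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
	const char *dev = argc > 1 ? argv[1] : "/dev/nvme0n1";
	struct stat st;

	if (stat(dev, &st) != 0) {
		perror("stat");
		return 1;
	}
	/* On v6.15+ block devices this reports the NPWG-derived granularity. */
	printf("%s: preferred I/O size: %ld bytes\n", dev, (long)st.st_blksize);
	return 0;
}
```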
■ NPWG as IU
The value of NPWG only gives us the IU of a drive. Operating system developers should take care to ensure that NPWG is not used to imply that a write will be atomic. Atomic writes are handled through other NVMe parameters, described next.
■ AWUN for normal operation
The NVMe Atomic Write Unit Normal, AWUN, tells us the controller's atomic write size during normal operation; that is, writes up to this size are atomic to the NVM with respect to other read and write operations. If a write is larger than this size, atomicity is not guaranteed by the controller. Normal here describes the case where we are not considering power failure. Power failure is a very special case for SSDs to support and requires its own separate parameter, described next.
■ AWUPF for power fail consideration
The NVMe parameter Atomic Write Unit Power Fail (AWUPF) represents the maximum write size that is guaranteed to be written atomically to the NVM even in case of power failure.
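For completeness, recent kernels (v6.11 and later, where the atomic write work landed) expose the atomic write limits they derive from these parameters through sysfs. The sketch below reads them, assuming those sysfs attributes are present on your kernel and using an illustrative device name; for NVMe they should reflect the power fail parameters (AWUPF or its per-namespace counterpart NAWUPF) and the atomic boundary.

```c
/* Sketch: print the block layer's advertised atomic write limits. */
#include <stdio.h>

static void print_limit(const char *attr)
{
	char path[256], buf[64];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/block/nvme0n1/queue/%s", attr);
	f = fopen(path, "r");
	if (!f)
		return;		/* attribute not present on this kernel/device */
	if (fgets(buf, sizeof(buf), f))
		printf("%-28s %s", attr, buf);
	fclose(f);
}

int main(void)
{
	print_limit("atomic_write_unit_min_bytes");
	print_limit("atomic_write_unit_max_bytes");
	print_limit("atomic_write_max_bytes");
	print_limit("atomic_write_boundary_bytes");
	return 0;
}
```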
■ Leveraging NVMe parameters for larger filesystem sector sizes
Based on the review in the prior sections we can now evaluate which of these parameters can be used to support larger filesystem sector sizes. There are two mechanisms by which an NVMe drive can allow filesystems to leverage a larger filesystem sector size. The first is a larger LBA format: a larger filesystem sector size follows naturally from a larger LBA format, but that requires all users of the drive to support the larger sector size. The second is a sufficiently large power fail safe write size: allowing users to create filesystems with a filesystem sector size matching the IU size is only possible with support from an SSD whose maximum power fail safe write size is greater than or equal to the IU size.
■ How AWUPF ≥ NPWG = IU is flexible
NVMe drives which follow an AWUPF ≥ NPWG = IU strategy can provide filesystems on large IU SSDs with write protection against power failure. This strategy is also flexible for users, in that they can opt in to a larger sector size at filesystem creation time. A namespace can have its own power fail atomic write value, NAWUPF, in which case the same relation applies as NAWUPF ≥ NPWG = IU; for simplicity we refer to the concept as AWUPF ≥ NPWG = IU throughout. For example, users of an NVMe drive with a 4 KiB LBA format, an AWUPF of 16 KiB and an NPWG of 16 KiB can create a 16 KiB block size XFS filesystem either with the default sector size or with a matching 16 KiB sector size (for example, mkfs.xfs -b size=16k versus mkfs.xfs -b size=16k -s size=16k); the only difference between the two invocations is the sector size.
The flexibility comes from the fact that users don't need to leverage the larger sector size: they can opt in to it only if they wish to and if the filesystem supports it. This puts the power in the hands of users to adopt larger sector sizes when and if they're ready.
■ Determinism of AWUPF ≥ NPWG = IU
Supporting an AWUPF matching the IU also enables users to benefit from large atomics. Hyperscalers have been enabling large atomics for databases for at least 6 years now using custom storage solutions. Given that an API for large atomics was only merged as of v6.13, this raises the question of how hyperscalers were able to support large atomics without a filesystem API for it. The answer lies in manual code vetting of the block layer and filesystem. The point of a filesystem API is to enable the operating system to help provide guarantees over the requirements. While the industry has been relying on ext4 with the bigalloc feature and 16 KiB cluster sizes, software vetting is still required to ensure proper functionality. The operating system API also enables NVMe users to request an error from the kernel if a write does not meet the criteria to be atomic. This matters because, contrary to SCSI, NVMe does not require a special write command for a write to be atomic: as long as your write follows the requirements for being atomic, NVMe will write it atomically. If you want the kernel's assistance with vetting those requirements, you can adopt the atomic write API in your user space application.
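A minimal sketch of adopting that API from user space follows, assuming a kernel and filesystem with RWF_ATOMIC support and a device whose atomic write limits cover a 16 KiB write; the file path and sizes are illustrative only.

```c
/* Sketch: a 16 KiB atomic write using pwritev2() with RWF_ATOMIC. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_ATOMIC
#define RWF_ATOMIC 0x00000040	/* from include/uapi/linux/fs.h */
#endif

int main(void)
{
	const size_t len = 16 * 1024;	/* one 16 KiB atomic write */
	void *buf;

	/* Write-size-aligned buffer satisfies the O_DIRECT alignment rules. */
	if (posix_memalign(&buf, len, len))
		return 1;
	memset(buf, 0xab, len);

	int fd = open("/mnt/data/redo.log", O_WRONLY | O_DIRECT);
	if (fd < 0)
		return 1;

	struct iovec iov = { .iov_base = buf, .iov_len = len };

	/*
	 * The kernel fails the request instead of silently splitting it if
	 * the write cannot be done atomically (misaligned offset, size
	 * outside the device/filesystem atomic limits, and so on).
	 */
	ssize_t ret = pwritev2(fd, &iov, 1, 0 /* offset, aligned to len */,
			       RWF_ATOMIC);
	if (ret < 0)
		perror("pwritev2(RWF_ATOMIC)");

	close(fd);
	free(buf);
	return ret == (ssize_t)len ? 0 : 1;
}
```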
An important but overlooked requirement for vetting correctness when using atomics is to respect the SSD's hardware atomic boundary sizes. For NVMe there are two values to consider: the Namespace Atomic Boundary Size Normal (NABSN) and the Namespace Atomic Boundary Size Power Fail (NABSPF). One matters for normal operation, the other in case of power failure. For HC SSDs with a 16 KiB IU, both NABSN and NABSPF may also be 16 KiB. An XFS filesystem with a 16 KiB filesystem block size but a 4 KiB sector size on these drives will ensure most writes are aligned to the 16 KiB boundary. Some writes are not aligned, so they cannot take advantage of the atomic writes feature and may incur a read-modify-write, making them less performant. I/O introspection reveals that in XFS these unaligned writes are metadata writes. It also reveals that by leveraging a 16 KiB sector size we get deterministic alignment to NABSN / NABSPF / NPWG. Similarly, I/O introspection of ext4 with the bigalloc feature and 16 KiB cluster sizes reveals that some writes can still be 4 KiB. At this year's LSFMM this came up during the ext4 atomic writes talk, and Ted Ts'o explained that these are due to metadata writes. The topic of how ext4 could support 16 KiB sector sizes also came up, and the path forward to enable that would be for ext4 to eventually support LBS as well.
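To make the boundary requirement concrete: a write is only eligible for hardware atomicity when it does not straddle the device's atomic boundary. The small check below is an illustration only, assuming a 16 KiB boundary as in the example above.

```c
/* Illustration of the atomic boundary rule (NABSN/NABSPF in bytes). */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static bool crosses_boundary(uint64_t offset, uint64_t len, uint64_t boundary)
{
	/* First and last byte of the write must land in the same window. */
	return (offset / boundary) != ((offset + len - 1) / boundary);
}

int main(void)
{
	const uint64_t boundary = 16 * 1024;

	/* A 16 KiB write at a 16 KiB aligned offset stays within one window. */
	printf("offset 32 KiB: %s\n",
	       crosses_boundary(32768, 16384, boundary) ? "crosses" : "ok");
	/* A 16 KiB write at a 4 KiB offset straddles a boundary. */
	printf("offset  4 KiB: %s\n",
	       crosses_boundary(4096, 16384, boundary) ? "crosses" : "ok");
	return 0;
}
```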
Supporting a larger sector size through LBS gives a filesystem deterministically aligned writes both for the larger IU and for large atomics. The AWUPF ≥ NPWG = IU strategy empowers users to opt in to larger sector sizes when and if they are ready for them.
This is part 3 of our series on addressing support for HC SSDs in the software ecosystem.
Beyond helping align writes for HC SSDs, are there other benefits today's software and filesystems can gain from large atomics?
We presented our findings last year at the Open Compute Project in the talk titled "Enabling large block sizes to facilitate adoption of large capacity QLC SSDs". We also wrote automation to reproduce our findings on bare metal and in the cloud through kdevops, on both MySQL and PostgreSQL. We define TPS variability as the square of the standard deviation, that is, the variance. We define outliers as TPS values more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile. To provide an easy to reproduce baseline we used AWS i4i.4xlarge instances with 4 KiB IU NVMe Nitro drives for our evaluation. A summary of our findings:
LBS relies on the fact that you can simply write one filesystem block atomically, if your hardware supports it. The XFS filesystem first supported large atomics through LBS. Although LBS was written to help support large IU drives, you can still leverage LBS on 4 KiB IU drives to take advantage of large hardware atomics. Leveraging large hardware atomics for MySQL through LBS on 4 KiB IU drives has been discussed in the community. The performance evaluation in the community reveals that if you leverage at least 10 threads for MySQL, then LBS works well on a 4 KiB IU drive. If you use fewer than 10 threads, the MySQL redo log may write 512 bytes at a time followed by periodic syncs, and this can cause a performance regression compared to 4 KiB XFS filesystem workloads. One solution to this problem is to place the redo log in a separate directory on another drive with another filesystem that can write 512 bytes at a time efficiently (for example via MySQL's innodb_log_group_home_dir). However, this would not be suitable if you wanted to leverage snapshotting with all of the MySQL data and the redo log on the same filesystem.

To help with that corner case on 4 KiB IU drives there has been community development effort on supporting atomic writes larger than the filesystem block size; a v9 series supporting this is now out for review. The only caveat with this support is that extent allocations by the filesystem are not deterministic when the atomic write is larger than the filesystem block size. To address this, a software fallback for large atomics is needed, using CoW in case a write is not aligned or the allocation granularity is not guaranteed. One concern raised at LSFMM this year was the possible impact this might have on performance variability; however, performance evaluation so far by community stakeholders seems to yield good results. And so, while this is the case for 4 KiB IU drives, LBS provides a cleaner requirement for 16 KiB IU drives, since you ideally want all writes aligned to 16 KiB. This currently puts the onus of the redo log architecture for fewer than 10 threads on HC SSDs on MySQL development, unless of course you can place the redo log in a separate directory on a separate filesystem. A future possible solution to evaluate would be small file embedding, where a 16 KiB write would include both data and metadata.
Despite the software challenges, most hyperscalers have been supporting large atomic writes in the cloud for databases for at least 6 years now, and have been doing so by leveraging ext4 with the bigalloc feature and 16 KiB cluster sizes for MySQL with direct I/O. LBS provides a unified strategy for filesystems to support large atomics on HC SSDs. The advances with the atomic write API in the Linux kernel, support through LBS for larger sector sizes to provide complete atomic alignment determinism, and the community's recent work on supporting atomic writes larger than the filesystem block size will only improve the situation further. At this year's LSFMM we also started to review how we could leverage large hardware atomics for databases with buffered I/O, given the clear gains observed with PostgreSQL.
How filesystems can leverage large atomics is a topic of future R&D in the community that is ripe for exploration. For example, can small writes be embedded atomically with metadata? Can we avoid the journal for atomic writes?