This is part 4 of our series on addressing support for HC SSDs in the software ecosystem.
Users of filesystems expect buffered I/O support. Linux implements buffered I/O through the page cache. For decades, the largest data block size a filesystem could support has been limited by the CPU's base page size, PAGE_SIZE, which on x86_64 is 4 KiB. Since system calls work with files, and files are memory mapped in the page cache, atomicity of data in the page cache has been supported in units of PAGE_SIZE. With a maximum base page granularity of 4 KiB, buffered I/O on x86_64 therefore also limited the maximum filesystem data block size to 4 KiB. Systems such as ARM64 and PowerPC have supported filesystem block sizes larger than 4 KiB for years, thanks to their larger base page sizes, however filesystems created on those systems could not be used on x86_64.

Linux has historically supported compound pages, but the mechanisms for dealing with large allocations for filesystems have changed over the years: first hugetlbfs, and then transparent huge pages (THP). These APIs only supported working with huge pages, which are extremely large: 2 MiB and 1 GiB on x86_64, and even larger on ARM64. The only filesystem which ever got support for THP was tmpfs. The Linux page cache has needed an overhaul to provide proper memory management and filesystem APIs that can leverage compound pages of different granularities without always requiring huge pages. Without a solution to this, buffered I/O cannot be supported on IUs larger than 4 KiB with full determinism on alignment.
In computing, managing memory efficiently is crucial for performance. Traditional memory management in operating systems uses "pages", small chunks of memory, typically 4 KiB in size. This method involves keeping track of many individual pages, which can become cumbersome and inefficient, especially when dealing with large amounts of data. Memory folios are a more modern approach designed to improve efficiency. Instead of managing many small pages, a folio can combine multiple pages into a larger, single unit, reducing the overhead associated with handling each page individually. An order-0 folio is the basic unit, equivalent to a standard memory page, usually 4 KiB in size. Large folios are made up of multiple contiguous pages, today up to an arbitrary maximum size of 2 MiB.
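As a rough sketch of the arithmetic involved, the relationship between a folio's order, its page count, and its size looks as follows. This is a user space illustration under the assumption of a 4 KiB base page, not kernel code; the kernel's own accessors for these values are folio_order(), folio_nr_pages(), and folio_size().

#include <stdio.h>

/* Assumes a 4 KiB base page, as on x86_64. */
#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)

int main(void)
{
	/* A folio of order N is made up of 2^N contiguous base pages. */
	for (unsigned int order = 0; order <= 9; order++) {
		unsigned long nr_pages = 1UL << order;
		unsigned long size = nr_pages * PAGE_SIZE;

		printf("order-%u folio: %lu pages, %lu KiB\n",
		       order, nr_pages, size / 1024);
	}
	/* order-0 is a single 4 KiB page; order-9 is 512 pages, i.e. 2 MiB. */
	return 0;
}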
The preliminary work for folios was introduced in Linux v5.14 (Memory folios) and merged in v5.16. Since then, the Linux kernel has been transitioning from pages to folios with the goal of making Linux more effective at managing memory. Although this transition was an operating system memory management advancement, some manufacturers such as AMD and ARM have specific mechanisms, such as PTE Coalescing (AMD) and Hardware Page Aggregation (HPA, ARM), which also enable more efficient hardware use of the contiguous memory provided by large folios.
A key enabler in the Linux kernel to support large block sizes (LBS), where the block size can be greater than the system page size, is the adoption of struct xarray and its use in the page cache. The Linux kernel v4.20 release was the first to sport this new key data structure. The Linux kernel Xarray documentation suffices to get an understanding of the general goal of the Xarray; it does not, however, provide much information about the advanced API. The advanced API is how the Linux page cache uses struct xarray, and it is also what enables LBS. The page cache used to use a customized radix tree, and so one of the goals behind the Xarray was to allow it to be used as a drop-in replacement wherever radix trees were used. The Xarray is essentially a radix tree replacement whose access mechanism and user API have been modified to fit the actual use case: a flexible array. Other Linux radix tree users besides the page cache were also converted to the Xarray; one of the first was the ID allocation (IDA) API, which is used to hand out unique IDs to different subsystems.
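For readers new to the Xarray, here is a minimal sketch of the simple API, assuming a hypothetical kernel module context (the function name and index below are made up). Unlike the advanced API used by the page cache, xa_store(), xa_load(), and xa_erase() handle locking and memory allocation internally.

#include <linux/errno.h>
#include <linux/xarray.h>

/* A statically initialized Xarray; entries are void * indexed by unsigned long. */
static DEFINE_XARRAY(my_array);

static int example_simple_xarray_use(void)
{
	static int value = 42;
	void *entry;
	int ret;

	/* The simple API takes care of locking and memory allocation for us. */
	ret = xa_err(xa_store(&my_array, 7, &value, GFP_KERNEL));
	if (ret)
		return ret;

	entry = xa_load(&my_array, 7);	/* returns &value */
	if (entry != &value)
		return -EINVAL;

	xa_erase(&my_array, 7);		/* removes the entry at index 7 */
	return 0;
}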
The adoption of the Xarray in the page cache enables embracing large folios in the page cache for I/O. Operating systems can use architecture specific huge pages through different means. In Linux the first mechanism supported was hugetlbfs, however it required dedicating huge pages at boot. An evolution of this was the transparent huge pages (THP) API. Although large I/O leveraging the page cache was originally supported using THP, only one filesystem ever got support for huge pages: tmpfs. Supporting huge pages also required filesystem modifications. The adoption of folios for arbitrary higher order compound pages and the use of the Xarray in the page cache allow us to extend the page cache to handle large folios for I/O in a truly transparent manner. The only compromise is that filesystems still need to adopt APIs which work with folios and support high order folios. Filesystems can either work with folios directly or use filesystem library helpers such as iomap for local filesystems or netfs for network based filesystems.
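As a rough sketch of what this opt-in can look like, assume a hypothetical local filesystem whose address space operations already handle folios of any order (for example by using iomap). The inode setup helper below is made up for illustration; mapping_set_large_folios() is the pagemap helper a filesystem calls to tell the page cache it may allocate high order folios for a file's mapping.

#include <linux/fs.h>
#include <linux/pagemap.h>

/*
 * Hypothetical inode setup for a filesystem whose aops cope with folios
 * of any order. Without this call the page cache sticks to order-0 folios
 * for the mapping.
 */
static void example_fs_setup_inode(struct inode *inode)
{
	mapping_set_large_folios(inode->i_mapping);
}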
The Xarray has two APIs available for users, the simple API and the advanced API. The advanced API is what the page cache uses. Both support an optional feature of the Xarray called multi-index support, enabled with CONFIG_XARRAY_MULTI. The Linux Kconfig help text for it describes it well: "Support for entries which occupy multiple consecutive indices in the Xarray". There are example uses of this in the Xarray's lib/test_xarray.c kernel self-test, but one striking use case was missing: the advanced API. We have extended lib/test_xarray.c to mimic the advanced API use in the page cache; this helps demonstrate the API, the expectations, and the requirements it places on LBS support. The relationship between supporting LBS and large folios in the page cache is also clearly illustrated by this effort: both share the API which ensures entries can occupy multiple consecutive indices in the Xarray.
We can simplify the page cache's use of the advanced API into the following abstraction for adding and deleting an entry in the Xarray.
#ifdef CONFIG_XARRAY_MULTI
static noinline void check_xa_multi_store_adv_add(struct xarray *xa,
						  unsigned long index,
						  unsigned int order,
						  void *p)
{
	XA_STATE(xas, xa, index);

	xas_set_order(&xas, index, order);

	do {
		xas_lock_irq(&xas);
		xas_store(&xas, p);
		xas_unlock_irq(&xas);
		/*
		 * The only failure we expect here is -ENOMEM; xas_nomem()
		 * below allocates the required memory and lets us retry.
		 */
		XA_BUG_ON(xa, xas_error(&xas) && xas_error(&xas) != -ENOMEM);
	} while (xas_nomem(&xas, GFP_KERNEL));

	XA_BUG_ON(xa, xas_error(&xas));
	XA_BUG_ON(xa, xa_load(xa, index) != p);
}
static noinline void check_xa_multi_store_adv_delete(struct xarray *xa,
						      unsigned long index,
						      unsigned int order)
{
	unsigned int nrpages = 1UL << order;
	unsigned long base = round_down(index, nrpages);
	XA_STATE(xas, xa, base);

	xas_set_order(&xas, base, order);

	/* xas_store() and xas_init_marks() expect the xa_lock to be held. */
	xas_lock_irq(&xas);
	xas_store(&xas, NULL);
	xas_init_marks(&xas);
	xas_unlock_irq(&xas);
}
To use this API, and to demonstrate how an entry occupies multiple indices, we provide a helper routine below which adds elements. It takes a file offset as input and works out the corresponding indices from the order used.
static unsigned long some_val = 0xdeadbeef;
static unsigned long some_val_2 = 0xdeaddead;

/* mimics the page cache */
static noinline void check_xa_multi_store_adv(struct xarray *xa,
					      unsigned long pos,
					      unsigned int order)
{
	unsigned int nrpages = 1UL << order;
	unsigned long index, base, next_index, next_next_index;
	unsigned int i;

	index = pos >> PAGE_SHIFT;
	base = round_down(index, nrpages);
	next_index = round_down(base + nrpages, nrpages);
	next_next_index = round_down(next_index + nrpages, nrpages);

	check_xa_multi_store_adv_add(xa, base, order, &some_val);

	for (i = 0; i < nrpages; i++)
		XA_BUG_ON(xa, xa_load(xa, base + i) != &some_val);

	XA_BUG_ON(xa, xa_load(xa, next_index) != NULL);

	/* Use order 0 for the next item */
	check_xa_multi_store_adv_add(xa, next_index, 0, &some_val_2);
	XA_BUG_ON(xa, xa_load(xa, next_index) != &some_val_2);

	...
}
The above is sufficient for a cursory review of how the page cache uses Xarray to enable both large folios and LBS. The only piece which is relevant to LBS is the use of
base = round_down(index, nrpages);
We need this for LBS because LBS requires large folios of a minimum size; for a 16 KiB block size on a 4 KiB base page system, each folio is order-2 and so holds 4 pages, which means folios must be added at indices that are multiples of 4, that is, aligned to 4. Large folios in general carry no strict requirement to be aligned to a specific number of pages, but to support LBS this alignment is critical, as the index must always advance in multiples of the number of pages covered by the block size. Both addition and removal of an entry use xas_set_order(&xas, base, order); this is the component which enables both large order folios and LBS.
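To make the arithmetic concrete, here is a small user space sketch assuming a 4 KiB base page (PAGE_SHIFT of 12) and order-2 folios for a 16 KiB block size; the offset is arbitrary and round_down() below is a simplified stand-in for the kernel macro.

#include <stdio.h>

#define PAGE_SHIFT	12				/* 4 KiB base pages, as on x86_64 */

/* Round down x to a multiple of y, where y is a power of two. */
#define round_down(x, y)	((x) & ~((unsigned long)(y) - 1))

int main(void)
{
	unsigned int order = 2;				/* 16 KiB folios: 4 pages of 4 KiB */
	unsigned long nrpages = 1UL << order;		/* 4 */
	unsigned long pos = 0x5000;			/* a file offset of 20 KiB */
	unsigned long index = pos >> PAGE_SHIFT;	/* page index 5 */
	unsigned long base = round_down(index, nrpages);	/* aligned index 4 */

	/* The folio covering offset 20 KiB occupies indices 4, 5, 6 and 7. */
	printf("pos=%#lx index=%lu base=%lu covers [%lu, %lu]\n",
	       pos, index, base, base, base + nrpages - 1);
	return 0;
}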
Support for LBS in the Linux kernel provides us with a solution to the large IU storage stack challenge, enabling large IU SSDs to be used seamlessly with standard Linux block devices and POSIX filesystems. With LBS support, the filesystem block size can be increased beyond the page size of the system, forcing data to be written in larger filesystem block size chunks that can align with the IU of the device.
From a software stack perspective, changes are only required in the host kernel; software applications require no changes. Although LBS is Linux specific, it can also help other operating system developers consider the changes needed to ensure seamless support for large IUs. Support for HC SSDs with large IUs was, in the end, simply an operating system filesystem and memory management problem.
This is part 5 and the last in our series on addressing support for HC SSDs in the software ecosystem.
First off, it didn't really happen that fast! It took about 17 years. To put that in perspective, it is useful to provide a timeline of some of the important relevant features, which helps convey the amount of work required so far. This has been a major community collaboration effort.
There have been a lot of advances over the years, but to name just a few, perhaps the most important changes in Linux are the adoption of folios for memory management and of iomap as a new filesystem library. As silly as writing community OKRs on a public spreadsheet may seem, even if not all developers are using them, using them as the basis for tracking long term community dialog and for structuring our internal OKRs at Samsung certainly helped drive this effort home. Things you don't see on the public OKRs include, for example, public education and building appreciation for the value of large atomics and their relationship to LBS. One example of this is the 2024 Open Compute Project talk Dan Helmick and I gave, "Enabling large block sizes to facilitate adoption of large capacity QLC SSDs". Another example of education was the v6.12 Linux large block size support LWN article which Pankaj Raghav contributed. But education comes only after you've built the code and have high confidence in it, which is why part of the public OKRs is a huge laundry list of items under the "Testing" group.
Testing filesystems and memory management is difficult though. To help with this and make it easier for others, we have advanced the development and test automation framework kdevops and have also helped fix upstream fstests to support larger block sizes and LBS. It turns out that many failing tests were test bugs, which were also failing on larger page size systems using larger block sizes. We've also added new memory pressure tests, like generic/750, which stresses a filesystem while regularly forcing compaction. The patches for LBS support on XFS were tested over 10 different ways in which you can create an XFS filesystem, which we refer to as XFS profiles. Likewise, for the recent support to enable large folios on the block cache with buffer-heads, we also tested 10 different ext4 profiles.

The automation work we have put into this has enabled us to support new kdevops kernel-ci efforts. We currently allow the XFS maintainer to push branches to kick off testing over all supported XFS profiles. We have also leveraged the variability possible in kdevops through its use of Kconfig to allow testing variability through GitHub Actions; this effectively lets an XFS developer, for example, use a web browser to pick and choose which profiles and which tests they may want to run. We have also started integrating our CI with patchwork in collaboration with Meta's kernel-patches-daemon developers, so as to enable different subsystems to leverage kdevops for full kernel testing; we refer to this as kdevops kernel-ci kpd integration. Likewise, the amazing kernel.org admins have helped us integrate a lei patchwork instance which can help enable testing of smaller subsystems. This effectively provides a map of kernel code to its respective CI pipeline.

Our expectation for the future is that, in collaboration with the community, patches posted to the filesystem and memory management mailing lists will be tested automatically using either bare metal or cloud services. To this end, test results are now automatically published as well through the new automated kdevops dashboard, and we are coordinating with the kernel-ci folks on integration into kernel-ci.
As with any large endeavor, the more you divide and conquer, the easier the effort becomes. OKRs have played a crucial role not only as our North Star for doing our part, but also in giving us clarity on the necessary difficult discussions to have with the community at LSFMM and on the mailing lists. The more you apply OKRs to sub-components, the easier it becomes to track, discuss, do, and chug on.
Other long term Linux kernel efforts should be able to benefit from drafting and sharing public OKRs, and we encourage them to do so. For example, given the complexity of some of the memory management bugs we've dealt with, I have a strong appreciation for the contributions Rust might make in the future to memory management on Linux. And so the Rust for Linux project is one example of a long term Linux project which might also benefit from public OKRs.