High-Capacity SSDs for AI/ML using Disaggregated Storage Solution: Performance Test Results Show Promise

Modern organizations produce substantial amounts of unstructured data, a reality that enterprise storage providers did not anticipate a decade ago. At the time, traditional file systems and storage solutions provided an interface for the computer to predictably and consistently find data stored on disks, and all you needed to do was rotate hard drives to meet your storage needs. Back then, SSDs occupied a very niche part of the storage market.
Over time, however, the HDD and SSD markets began to intersect: both technologies support block interfaces to the operating system, so any software that works on HDDs can also work on SSDs. Even so, the differences between the two media have widened. Developers of software for HDDs made assumptions about the media’s performance that don’t apply to SSDs. It’s common practice to stack as many HDDs on a server as possible to achieve the desired performance. Over the past decade, the price of SSDs has dropped, and developers continue to find new ways to exploit the performance that SSDs can deliver.
While SSDs continue to penetrate deeper into the HDD market, one use case has resisted adoption: large-scale storage. However, use cases for all-flash, large-scale storage in data analytics, and especially AI/ML, are emerging at an enormous rate. These use cases give organizations new opportunities to leverage AI/ML for predictive analysis and proactive decision-making. The benefits this brings to a company outweigh the costs and result in a much higher ROI.
The total cost of ownership (TCO) of using SSDs is also compelling. Fewer servers are needed, which means both capital expenses (CAPEX) and operational expenses (OPEX) decrease. Moving to higher-capacity SSDs results in significant savings due to simpler management and reduced power consumption. This raises the question: Why have people not already moved to 16 or 32 TB SSDs for large-scale use cases?
In general, Samsung’s discussions with customers and partners have revealed that the main reason for not transitioning completely to SSDs is that large-scale, all-flash storage is still a novel concept. Most organizations continue to use traditional storage solutions, which were originally designed for HDDs and later retrofitted for initial SSD adoption. Unfortunately, these solutions were not designed for large-scale, all-flash use cases. To make effective use of that capacity, organizations must ensure SSDs provide enough system-level performance. Can they deliver, and if not, what is holding them back?

Solving SSD Performance Challenges with Disaggregated Storage

Samsung’s Memory Solutions Lab (MSL) specializes in examining system-level issues impacting modern storage solutions. It is currently involved in multiple projects related to disaggregated/composable architecture, computation acceleration, storage for ML and streaming, network-based heterogeneous computing, and storage virtualization/containerization involving CXL, computational storage, Ethernet SSDs, object storage and large-scale storage.
According to MSL’s Senior Director Mayank Saxena, there are several critical issues with POSIX-based file systems, and even parallel file systems (pNFS). “Most of the issues deal with the metadata and the difficulty in scaling to accommodate many petabytes of data,” he said. “While NFS may work well at a small scale – for example, at less than 1 petabyte – they begin to falter as the capacity of a storage system scales up.” The table below illustrates the relative degradation in performance:
In an effort to solve this challenge, MSL has been exploring alternative storage solutions with one customer on their large-scale (hundreds of petabytes) AI/ML training project, which will require very high sustained bandwidth over high-speed networking with the ability to scale over time. The customer needs a solution that delivers not only performance and capacity, but also a small footprint. As it turns out, Samsung’s open-source Disaggregated Storage Solution (DSS) is able to meet these requirements. This project provides a standard Amazon S3-compatible interface for object storage and is designed with exabyte-level scalability in mind, to make the most of commodity hardware and high-capacity SSDs.
Why did MSL choose to go with object storage? For starters, many of the newest AI applications handle machine-generated data, which is often stored as objects. Instead of grappling with the complexities of managing a single namespace across multiple storage nodes, MSL tested a different approach: leaving each node with its own file system and managing the data through a single object store. In this way, an external orchestration system such as Kubernetes can coordinate data distribution.
In this use case, the problem with achieving the desired performance doesn’t lie with the SSDs; rather, the metadata and file systems slow it down. With object storage, the user doesn’t have to manage large blocks of data, but rather smaller data stores, which can be more easily distributed and replicated across multiple disks. Managing storage in this way simplifies the process of applying data protection logic to the data itself, rather than to the drives.

Validating the Performance of DSS at Scale

To validate the performance of this approach, MSL worked closely with the customer to understand the characteristics of their data and how it would be transferred between the storage and the GPU.
With this information, they created tools and an environment for generating traffic representative of the desired training system, keeping in mind the goal of eventually scaling to thousands of GPUs. Next, they tested two different storage solutions on identical node configurations: DSS S3 and NFS. The results were as follows:
MSL then ran tests using six servers in two different server configurations, without any erasure coding or RAID:

• DSS v0.6 - CentOS 7.8.2003 (kernel 3.10.0-1127.el7.x86_64)
• NFS v4 - Ubuntu 20.04.3 LTS (kernel 5.4.0-100-generic)

It was important to compare the solutions not just at the storage layer, but also at the application layer (i.e., AI training). The team leveraged an AI benchmarking tool for storage solutions that uses TensorFlow and PyTorch, two well-known AI frameworks, to measure storage performance in terms of data load time, aggregated listing time, throughput, latency and other parameters against the customer’s AI training algorithm and dataset. The number of AI training instances per client node was varied to demonstrate performance as parallel workloads increased. The graph below illustrates the results:
During the test, the performance of DSS was significantly higher than that of NFS and remained high, even as the number of AI training instances, as well as the number of client nodes running them, increased. Next, the team tested the performance of the solution when scaling to a full rack with 10 storage nodes, leaving ample room and power for networking equipment. The graph below illustrates the results:
DSS was able to achieve about 270 GB/s of bandwidth from a full rack of storage nodes, suggesting that even as the system’s total capacity increases, the solution will maintain high performance, without the need to constantly rebalance the data. Finally, the team ran a test to see how DSS would perform during scaling on a per-node basis. As the graph below illustrates, throughput scales linearly with the number of storage nodes, making scaling in a disaggregated fashion easy. Using DSS, the customer can leverage the full potential of large-scale SSDs to mitigate the risk of storage becoming a performance bottleneck.
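As a rough illustration of the arithmetic behind these figures, the sketch below divides the reported rack bandwidth (about 270 GB/s) by the 10 storage nodes in the rack and then extrapolates linearly. The function names are hypothetical, and the linear projection is an assumption based on the scaling behavior described above, not a measured result for larger clusters.

```python
RACK_BANDWIDTH_GBPS = 270.0  # approximate bandwidth reported for a full rack, in GB/s
NODES_PER_RACK = 10          # storage nodes in the tested rack


def per_node_bandwidth(rack_bandwidth: float, nodes: int) -> float:
    """Average bandwidth contributed by each storage node, in GB/s."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return rack_bandwidth / nodes


def projected_bandwidth(nodes: int) -> float:
    """Linear extrapolation to `nodes` storage nodes (an assumption, not a measurement)."""
    return per_node_bandwidth(RACK_BANDWIDTH_GBPS, NODES_PER_RACK) * nodes


if __name__ == "__main__":
    print(f"per node: {per_node_bandwidth(RACK_BANDWIDTH_GBPS, NODES_PER_RACK):.1f} GB/s")
    for n in (10, 20, 40):
        print(f"{n} nodes: {projected_bandwidth(n):.1f} GB/s (projected)")
```

Under these assumptions, each node contributes roughly 27 GB/s, so doubling the node count would be expected to roughly double aggregate bandwidth, provided the network and clients keep up.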

The Future of DSS

While Samsung’s customer is happy with the progress made in leveraging DSS for its data-intensive AI applications, this is not the end of the journey. Several additional tasks remain to further increase the overall value of the system:

• Perform additional large-scale testing with different data parameters (dataset size, variety, number of client nodes, etc.).
• Test with GPU servers performing actual machine learning operations, as AI/ML training workloads are not like other workloads.
• Increase SSD capacity per host in one of two ways:
  ◦ Increase SSD capacity from 32 to 64 TB.
  ◦ Find servers that will use 24 or more SSDs rather than only 16, which can be challenging.
• Upgrade to new CPU generations to test the performance ceiling and determine the impact of the new speeds provided by Samsung’s next-generation PCIe Gen 5 SSDs and DDR5 memory on the servers.
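
To make the capacity options in the list above concrete, the following sketch computes raw capacity per host for each one. It assumes the current configuration is 16 SSDs of 32 TB each, which is how the list reads but is not stated outright, and it uses decimal units (1 PB = 1,000 TB). The 100 PB target is only an illustrative stand-in for the "hundreds of petabytes" scale mentioned earlier, the helper names are hypothetical, and data protection overhead is ignored.

```python
import math

# (SSDs per host, TB per SSD); the first entry is an assumed baseline.
OPTIONS = {
    "assumed current (16 x 32 TB)": (16, 32),
    "larger SSDs (16 x 64 TB)": (16, 64),
    "more SSDs per host (24 x 32 TB)": (24, 32),
}


def raw_capacity_tb(ssds_per_host: int, ssd_tb: int) -> int:
    """Raw (unprotected) capacity of one host, in TB."""
    return ssds_per_host * ssd_tb


def hosts_needed(target_pb: float, ssds_per_host: int, ssd_tb: int) -> int:
    """Hosts required to reach target_pb of raw capacity (1 PB = 1,000 TB)."""
    return math.ceil(target_pb * 1000 / raw_capacity_tb(ssds_per_host, ssd_tb))


if __name__ == "__main__":
    target_pb = 100  # illustrative target only
    for label, (count, size_tb) in OPTIONS.items():
        capacity = raw_capacity_tb(count, size_tb)
        hosts = hosts_needed(target_pb, count, size_tb)
        print(f"{label}: {capacity} TB/host, {hosts} hosts for {target_pb} PB")
```

Under these assumptions, doubling SSD capacity halves the number of hosts needed for a given raw capacity, while moving from 16 to 24 drives per host reduces it by about a third.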

While there are still some unknowns, the ability of DSS to deliver high-performance throughput for extremely large-scale, data-intensive workloads, regardless of the hardware configuration, is quite promising.

Learn More

If you are interested in taking a closer look at this architecture, we recommend you visit↗. Stay tuned for new results as additional parameters are tested. For more information about Samsung’s market-leading memory solutions, visit