
Where Latency Dares Not Tread: How High-Powered SSDs Enable Novel Storage Architecture

Dropbox logo shining on Samsung SSD products
From public cloud to cloud provider

Dropbox is a global collaboration platform that allows users to upload, access, edit, share, and synchronize files. As of July 2021, they boast over 700 million total users and store more than 550 billion pieces of content.

One of Dropbox’s most well-known technical milestones occurred when they transformed their infrastructure from a primarily public cloud model to a hybrid, primarily on-prem model—and they did it quickly. When Dropbox launched in 2007, their storage was split between Amazon Web Services Simple Storage Service (commonly known as AWS S3) and Dropbox’s own servers. But back in 2013, when Dropbox was only a few years old and already wildly successful, company leaders realized that their scale placed Dropbox in the upper echelon of storage providers worldwide. Instead of relying heavily on the public cloud, they determined that hosting petabytes of user data on-premises would allow them to optimize the end user experience while also maximizing unit economics, reliability, and security. Dropbox decided to build their own proprietary, multi-exabyte-scale storage infrastructure, which they dubbed Magic Pocket.

Much digital ink has been spilled on the topic of their original creation of custom architecture[1], but Dropbox’s continuing success is the result of refining what works and rebuilding what doesn’t, with the end goal of efficiently improving performance, year over year. In 2018, Dropbox decided to re-engineer their storage servers to utilize a new and novel type of storage drive. This transition to a new drive architecture required major innovations at the software and firmware levels, and is the reason that Samsung SSDs were introduced into the Dropbox infrastructure.

How to think of Dropbox data

First, a quick primer on Dropbox data. Dropbox stores both user files (documents, spreadsheets, and more) and information about those files and users (metadata). The amount of storage required for user files is enormous, not only because of the sheer number of files stored, but because Dropbox also replicates each user file across several physical locations to ensure file accessibility even in the event of massive power outages in a given region.

The metadata stack includes information about the user accounts and files (who uploaded or accessed a file, when, how many times, what they changed, etc.) across all Dropbox services and applications. Each Dropbox application has a unique performance profile: different read-write mixes, queue depths, bandwidth, and latency requirements. The metadata stack also supports the filesystem—that is, the mapping of where and how user files are stored.
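To make the file/metadata split concrete, here is a minimal, purely illustrative sketch of the two kinds of records involved. The field names and the BlockPointer mapping below are assumptions for illustration, not Dropbox’s actual schema:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BlockPointer:
    """Hypothetical mapping entry: where one chunk of a user file lives in block storage."""
    cell: str        # physical storage cell / region
    volume_id: int   # volume within the cell
    offset: int      # byte offset inside the volume
    length: int      # chunk size in bytes


@dataclass
class FileMetadata:
    """Hypothetical metadata record: everything about a file except its bytes."""
    file_id: str
    owner_account: str
    last_modified_by: str
    revision: int
    access_count: int
    blocks: List[BlockPointer] = field(default_factory=list)  # the filesystem mapping


# The file contents themselves (documents, photos, videos) live as blocks in
# Magic Pocket; the metadata stack holds records like FileMetadata and has to
# answer lookups and updates in single-digit milliseconds.
```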
Metadata Servers Infrastructure Infographic
Samsung SSDs are used in both the storage and database tiers at Dropbox.

Novel storage architecture

The files that users upload to Dropbox—documents, spreadsheets, photos, videos, executables—are stored on Magic Pocket. For Dropbox, the density of storage racks is of utmost importance. In 2018, Dropbox became the first major company to adopt SMR (Shingled Magnetic Recording) hard drive technology for user file storage. SMR technology currently offers the greatest storage capacity at the lowest cost per unit—exactly the magic combination that Dropbox needed to maximize density.

With SMR drives, data must be written sequentially in fixed zone sizes, resulting in slower write speeds[2]. To take advantage of SMR drive density, Dropbox needed to develop a system for managing write latency that would make their SMR drives as efficient as the perpendicular magnetic recording (PMR) drives they had used previously.
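As a rough mental model of that constraint, an SMR zone behaves like an append-only log: writes are accepted only at the zone’s current write pointer, so out-of-order or in-place updates have to be buffered somewhere else and laid down sequentially later. The sketch below is a simplified illustration, not Dropbox’s code; the zone size and error handling are invented:

```python
class SMRZone:
    """Simplified model of an SMR zone: appends only, at the write pointer."""

    def __init__(self, zone_size: int = 256 * 1024 * 1024):  # illustrative 256 MiB zone
        self.zone_size = zone_size
        self.write_pointer = 0
        self.data = bytearray()

    def append(self, buf: bytes) -> int:
        """Sequential write at the write pointer; returns the offset written."""
        if self.write_pointer + len(buf) > self.zone_size:
            raise IOError("zone full: a new zone must be opened")
        offset = self.write_pointer
        self.data += buf
        self.write_pointer += len(buf)
        return offset

    def write_at(self, offset: int, buf: bytes) -> None:
        """In-place (random) writes are not allowed on an SMR zone."""
        raise IOError("random writes unsupported: data must be staged and rewritten sequentially")
```

Because in-place updates are rejected, incoming writes have to be held somewhere fast until they can be written out in order—which is exactly the role of the SSD-backed staging layer described below.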
SMR Data Writing & Caching Infographic
They re-engineered their storage software code to create a robust, powerful staging/caching layer[3] for data writes—that is, data held in temporary memory until it has been fully written to the SMR drive (and replicated across locations). That staging/cache layer needed a high-availability, low-latency SSD to manage variable workloads across a range of queue depths. Enter the Samsung PM1733 NVMe SSD.

“We needed a drive that checked all the boxes for supporting our novel architecture,” said Ali Zafar, Sr. Director for Platform, Strategy and Operations at Dropbox. “In our storage cache, throughput, latency and overall write endurance matter. That’s why we chose Samsung’s PM1733.”

At Dropbox, not only is their “average” data write larger than the average defined by JEDEC, but it’s also replicated 4x within a cell to ensure data integrity and durability. “When we decommission servers, it involves moving petabytes of data to a new cluster,” explained Zafar. “Given that we quadruplicate every workload, and every workload is larger than the norm, it’s essential that we get maximum write cycles from the drive before it reaches its lifetime endurance expectation.”

Metadata stack—no latency allowed

The metadata stack at Dropbox supports several large categories of databases, which span many thousands of servers and comprise several petabytes of data. It drives the core infrastructure for metadata storage across Dropbox applications, including filesystem and application metadata.

While latency is a concern in Dropbox’s storage tier, its impact in the metadata realm is often more obvious to end users. Latency can pop up nearly anywhere, but due to the sheer variety of operations performed in the database stack, its effects are felt most acutely in the hot zone. “Metadata must be highly available with low latency; accessing and updating can take no longer than single-digit milliseconds,” said Yang Lim, Director of Engineering at Dropbox.

Due to the sheer number of operations being undertaken at any given time, applications like the Dropbox home page are extremely sensitive to metadata latencies. Performance is measured through TTI (time to interactivity) and must not exceed very low single-digit seconds. A small lag in one area of an application can impact the end user experience by many seconds, a sort of butterfly effect of latency.
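To see why a few milliseconds of metadata latency matter, consider a back-of-the-envelope sketch: if completing a user action requires a chain of metadata operations that execute serially, every extra millisecond is paid once per operation. The operation count and latency below are invented for illustration only, chosen to match the order of magnitude of the folder-sharing example described next:

```python
def end_to_end_slowdown(serial_ops: int, extra_latency_ms: float) -> float:
    """Total added time, in seconds, when each metadata operation in a serial chain gets slower."""
    return serial_ops * extra_latency_ms / 1000.0


# Hypothetical numbers: a request that performs ~265 serial metadata operations
# and picks up an extra 10 ms on each one slows down by about 2.65 seconds.
print(end_to_end_slowdown(serial_ops=265, extra_latency_ms=10))  # ~2.65
```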
Web Inband Share (P95) infographic
Small latencies can have big effects

To illustrate how critical even minor latencies can be to Dropbox, here’s an example of latency in action: in June 2020, the average time of a folder sharing function rose unexpectedly from 8.08 seconds to 10.73 seconds at P95, an increase of roughly a third. This 2.65 second difference was caused by a 10ms latency in a discrete set of operations that access Dropbox’s metadata stack.

That’s why, when selecting storage for the metadata stack, Dropbox chose Samsung’s PM983 NVMe PCIe SSD, which offers four times the IOPS of SATA SSDs and consistent performance across a range of operating parameters. To meet the demand for high-utilization, high-duty-cycle data centers, the PM983 SSD firmware prioritizes quality of service for sustained random workloads. Optimized for always-on, always-busy workloads, the PM983 was developed to help data centers deploy NVMe SSDs cost-effectively, at scale.

“Samsung’s PM983 SSD runs on very unique and variable block sizes and workloads. Its throughput and latency performance have met all MySQL and Service Owner requirements at Dropbox,” said Lim.

Beyond data management and storage, Dropbox also decided to adopt Samsung SSDs in their transition from SATA HDD to SSD, selecting the Samsung U.2 PM983 NVMe for booting the Compute tier.

A partnership built on continuous innovation

Both Dropbox and Samsung see strength in the ongoing partnership between the two companies. Asked to describe the relationship, Zafar stated, “Dropbox strongly values our ongoing partnership with Samsung, and our continuous collaboration opportunities as we look for new ways to develop our infrastructure to meet ever-evolving customer needs.

“The Samsung Enterprise and Datacenter SSD roadmap meets the infrastructure demands that Dropbox requires to deliver high performance in both the hot and cold tiers. That’s why we’re planning on deploying Samsung’s next-gen SSDs in our data centers in the future.”

In their shared quest to continually build better user experiences, Dropbox and Samsung both devote countless hours to enabling efficient cloud infrastructure breakthroughs.

Donovan Hwang, Senior Director of Customer and Market Insights in Memory at Samsung Semiconductor, concurs: “Dropbox is a really exciting company to partner with, in part because of the constantly evolving, innovative nature of their engineering teams. They keep searching for opportunities to increase rack densities and optimize TCO, while maintaining or improving flexibility as they scale.

“Samsung has a similar goal, at a different level—finding new ways to create the fastest, densest, most reliable memory in the industry, so innovators like Dropbox have the power they need to do amazing things.”

Case study prepared in collaboration with Siddharth Anand, Senior Commodity Manager at Dropbox.
[1] Dropbox still employs AWS storage and compute power for a variety of projects, including their overseas storage.
[2] Typically no greater than 10 Mbps, as opposed to 100 Mbps for a PMR drive.
[3] The staging layer is a bit like the waiting area by an airport gate. Eventually, all the passengers (data) will be on (written to) the plane (SMR disk), but there’s going to be a backed-up line until everyone is finally recorded (sitting) in the correct block (seat).