Communications of the ACM

Practice

A File System All Its Own


Illustration credit: Joel Ormsby

In the past five years, flash memory has progressed from a promising accelerator,7 whose place in the data center was still uncertain, to an established enterprise component for storing performance-critical data.4,9 With solid-state drives (SSDs), flash arrived in a form optimized for compatibility: just replace a hard drive with an SSD for radically better performance. But the properties of the NAND flash memory used by SSDs differ significantly from those of the magnetic media in the hard drives they often displace.2 While SSDs have become more pervasive in a variety of uses, the industry has only just started to design storage systems that embrace the nuances of flash memory. As flash escapes the confines of compatibility, significant improvements are possible in the areas of performance, reliability, and cost.

The native operations of NAND flash memory are quite different from those required of a traditional block device. The flash translation layer (FTL), as the name suggests, translates the block-device commands into operations on flash memory. This translation is by no means trivial; both the granularity and the fundamental operations differ. SSD controllers compete in subspecialties such as garbage collection, write amplification, wear leveling, and error correction.2 The algorithms used by modern SSDs are growing increasingly sophisticated despite the seemingly simple block-read and block-write operations that they must support. A very common use of a block device is to host a file system. File systems, of course, perform their own type of translation: from file creations, opens, reads, and writes within a directory hierarchy to block reads and writes. There is nothing innate about file-system operations that makes them well served by the block interface; it is just the dominant standard for persistent storage, and it has existed for decades.

Layering the file-system translation on top of the flash translation is inefficient and impedes performance. Sophisticated applications such as databases have long circumvented the file system—again, layers upon layers—to attain optimal performance. The information lost between abstraction layers impedes performance, longevity, and capacity. A file system may "know" that a file is being copied, but the FTL sees each copied block as discrete and unique. File systems also optimize for the physical realities of a spinning disk, but placing data on the sectors that spin the fastest does not make sense when they do not spin at all. Volume managers, software that presents collections of disks as a block device, led to similar inefficiencies in disk-based storage, obscuring information from the file system.

Modern file systems such as Write Anywhere File Layout (WAFL),5 ZFS, and B-tree file system (Btrfs)1 integrated the responsibilities previously assigned to volume managers and reorganized the layers of abstraction. The resulting systems were more efficient and easier to manage. Poorly optimized software mattered when operations were measured in milliseconds; it matters much more on flash devices whose operations are measured in microseconds. To take full advantage of flash, users need software expressly designed for the native operations and capabilities of NAND flash.

The State of SSDs

For many years SSDs were almost exclusively built to seamlessly replace hard drives; they not only supported the same block-device interface, but also had the same form factor (for example, a 2.5- or 3.5-inch hard drive) and communicated using the same protocols (for example, SATA, SAS, or FC). This is a bit like connecting an iPod to a car stereo using a tape adapter; now it seems that 30-pin iPod connectors are more common in new cars than tape decks are. Recently SSDs have started to break away from the old constraints on compatibility: some laptops now use a custom form-factor SSD for compactness, and many vendors produce PCI-attached SSDs for lower latency.

The majority of SSDs still emulate the block interface of hard drives: reading and writing an arbitrary series of sectors (512-byte or 4KB regions). The native operations of NAND flash memory are different enough to create some substantial challenges. Reads and writes happen at the granularity of a page (usually around 8KB) with the significant caveats that writes can occur only to erased pages, and pages are erased exclusively in blocks of 32 to 64 pages (256KB to 512KB). While a detailed description of how an FTL presents a block interface from flash primitives is beyond the scope of this article, it is easy to get a sense of its complexity. Consider the case of a block in which all pages have been written, and the device receives an operation to logically overwrite the contents of one page. The FTL could copy the block into memory, modify the page, erase the block, and rewrite it in its entirety, but this would be very slow, slower even than a hard drive! In addition, each write or erase operation wears out NAND flash. Chips are rated for a certain number of such operations, anywhere from 500 to 50,000 cycles today depending on the type and quality, and those numbers are shrinking as the chips themselves shrink. A naive approach to block management would quickly wear down the media; and to compound the problem, a frequently overwritten region would wear out before other regions. For these reasons, FTLs use an indirection layer that allows data to be written at arbitrary locations and implements wear leveling, the process of distributing writes uniformly across the media.2
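To make the indirection layer concrete, here is a minimal sketch in Python of a toy FTL. It is an illustration only, not how any particular SSD controller works: the class name `ToyFTL`, the four-page blocks, and the victim-selection policy are all assumptions chosen for brevity. It enforces the two NAND constraints described above (pages are programmed only when erased, and erasure happens a whole block at a time), redirects each logical overwrite to a fresh page, and prefers the least-worn block when it needs space.

```python
class ToyFTL:
    """Toy FTL: out-of-place writes, garbage collection, and wear leveling."""

    def __init__(self, num_blocks, pages_per_block=4):
        self.ppb = pages_per_block
        self.data = {}    # (block, page) -> payload, for every programmed page
        self.owner = {}   # (block, page) -> logical page, for live pages only
        self.l2p = {}     # indirection layer: logical page -> (block, page)
        self.erase_counts = [0] * num_blocks
        self.free_blocks = set(range(num_blocks))
        self.active = self._take_free_block()   # block currently being filled
        self.cursor = 0                         # next erased page in that block
        self.physical_writes = 0

    def _take_free_block(self):
        # Wear leveling: prefer the erased block with the fewest erase cycles.
        block = min(self.free_blocks, key=lambda b: (self.erase_counts[b], b))
        self.free_blocks.remove(block)
        return block

    def _collect(self):
        # Pick the block with the fewest live pages (ties: least worn), keep
        # its live pages in memory, and erase the block as a whole.
        blocks = range(len(self.erase_counts))
        live = {b: [p for p in range(self.ppb) if (b, p) in self.owner]
                for b in blocks}
        victim = min(blocks, key=lambda b: (len(live[b]), self.erase_counts[b]))
        if len(live[victim]) == self.ppb:
            raise RuntimeError("no stale pages left to reclaim")
        survivors = [(self.owner[(victim, p)], self.data[(victim, p)])
                     for p in live[victim]]
        for p in range(self.ppb):
            self.data.pop((victim, p), None)
            self.owner.pop((victim, p), None)
        self.erase_counts[victim] += 1
        self.free_blocks.add(victim)
        return survivors

    def write(self, lpn, payload):
        # A logical overwrite never touches the old page; it only marks it stale.
        old = self.l2p.pop(lpn, None)
        if old is not None:
            del self.owner[old]
        pending = [(lpn, payload)]
        while pending:
            if self.cursor == self.ppb:          # active block is full
                if not self.free_blocks:
                    pending.extend(self._collect())
                self.active, self.cursor = self._take_free_block(), 0
            page_lpn, page_data = pending.pop()
            loc = (self.active, self.cursor)
            assert loc not in self.data, "pages can be programmed only when erased"
            self.data[loc] = page_data
            self.owner[loc] = page_lpn
            self.l2p[page_lpn] = loc
            self.cursor += 1
            self.physical_writes += 1

    def read(self, lpn):
        return self.data[self.l2p[lpn]]
```

Overwriting a single logical page repeatedly on this toy device spreads the erases across all blocks instead of hammering one. By contrast, the copy-modify-erase-rewrite approach described above would rewrite an entire block (64 pages of 8KB, or 512KB, at the upper end of the figures given) for every 8KB logical overwrite, and would erase the same block every time.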

Bridging the Gap

The algorithms that make up an FTL are highly complex but no more so than those of a modern file system. Indeed, the FTL and file system have much in common. Both track allocated versus free regions, both implement a logical-to-physical mapping, and both translate one operation set to another. Newer FTLs even include facilities such as compression and deduplication, still marquee features for modern file systems. FTLs and file systems are usually built in isolation. The idea of a dramatic integration and reorganization of the responsibilities of the FTL and file system represents a classic conundrum: who will write software for nonexistent hardware, and who will build hardware to enable heretofore-unwritten software?

Most SSD vendors are focused on a volume market where requiring a new file system on the host would be an impediment rather than an advantage. SSD vendors could enable the broader file-system developer community by providing different interfaces or opening up their firmware, but again—and without an obvious and compelling file system—there is little incentive. The exception was Indilinx's participation in the OpenSSD10 project, but the primary focus was FTL development and experimentation within conventional bounds. OpenSSD became effectively defunct when OCZ acquired Indilinx. There seems to be no momentum and only vague incentive for vendors to give developers the level of visibility and control they most want. Mainstream efforts to build flash awareness into file systems have led to more modest modifications to the interface between file system and SSD.

The most publicized interface between the file system and SSD is the ATA TRIM command or its counterpart, the SCSI UNMAP command. TRIM and UNMAP convey the same meaning to a device: the given region is no longer in use. One of the challenges with an FTL is efficient space management; and the more space that is available, the easier it is to perform that task. As free space is exhausted, FTLs have less latitude to migrate data, and they need to keep data in an increasingly compact form; with lots of free space FTLs can be far sloppier.

For both performance and redundancy, almost all SSDs "overprovision." They include more flash memory capacity than the advertised capacity of the SSD by anywhere from 10% to 100%. File systems have the notion of allocated and free blocks, but there is not a means—or a reason—to communicate that information to a hard drive. To let SSDs reap the benefits of free storage, modern file systems use the TRIM or UNMAP commands to indicate that logical regions are no longer in use. Some SSDs—particularly those designed for the consumer market—greatly benefit from file systems that support TRIM and UNMAP. Of course, for a file system whose steady state is close to full, TRIM and UNMAP have very little impact because there are not many free blocks.
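The effect of TRIM and UNMAP on the device can be sketched with a small model. The Python below is a hedged illustration: `ValidPageTracker`, `gc_copy_cost`, and `raw_capacity_gb` are hypothetical names, and the model ignores everything about a real FTL except which logical pages it believes are still live. Without a TRIM, the device must assume that every page it was ever asked to write still matters, so garbage collection copies pages the file system has already freed.

```python
class ValidPageTracker:
    """Tracks which logical pages the device believes still hold live data."""

    def __init__(self):
        self.valid = set()

    def write(self, lpn):
        self.valid.add(lpn)

    def trim(self, first_lpn, count):
        # TRIM/UNMAP: the host declares this logical range no longer in use.
        self.valid.difference_update(range(first_lpn, first_lpn + count))

    def gc_copy_cost(self, block_lpns):
        # Pages that must be relocated before the block can be erased.
        return sum(1 for lpn in block_lpns if lpn in self.valid)


def raw_capacity_gb(advertised_gb, overprovision_fraction):
    # Overprovisioning: flash present beyond the advertised capacity.
    return advertised_gb * (1 + overprovision_fraction)
```

In this model, a 64-page block whose pages were all written costs 64 page copies to reclaim. If the file system deletes a file covering 48 of those pages and issues a TRIM for them, the cost drops to 16. The same reasoning explains the last point above: a file system that stays nearly full has few ranges to trim, so the device sees little relief.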

Incremental Revolution

While many companies participate in incremental improvements, the most likely candidates to create a flash-optimized file system are those that build both SSDs and software that runs on the host. The singular popularized example thus far is DirectFS6 from FusionIO. Here, the flash storage provides more expressive operations for the file system. Rather than solely using the legacy block interface, DirectFS interacts with a virtualized flash storage layer. That layer manages the flash media much as a traditional FTL does but offers greater visibility and an expanded set of operations to the file system above it.

DirectFS achieves significant performance improvements not by supplanting intelligence in the hardware controller, but by reorganizing responsibilities between the file system and flash controller. For example, FusionIO has proposed extensions to the SCSI standard that perform scattered reads and writes atomically.3 These are easily supported by the FTL, but dramatically simplify the logic required in a file system to ensure metadata consistency in the face of a power failure. DirectFS also relies on storage that provides a "sparse address space," which effectively transfers allocation and block mapping responsibilities from the file system to the FTL, a task the FTL already must do. A 2010 article by William Josephson et al. states that "novel layers of abstraction specifically for flash memory can yield substantial benefits in software simplicity and system performance."6
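A small sketch shows why an atomic scattered write simplifies file-system logic. The Python below is an assumption-laden toy, not the command format from the T10 proposals and not DirectFS code: `ToyDevice`, `write_each`, and `atomic_scatter_write` are hypothetical names, and power loss is simulated with an exception. The point is only the contrast between updating related metadata blocks one at a time and updating them as a single all-or-nothing step.

```python
class PowerFailure(Exception):
    """Simulated loss of power in the middle of an update."""


class ToyDevice:
    def __init__(self, blocks):
        self.blocks = dict(blocks)   # logical block address -> contents

    def write_each(self, updates, fail_after=None):
        # Conventional path: one block write at a time; a crash can tear the set.
        for done, (lba, data) in enumerate(updates.items()):
            if fail_after is not None and done == fail_after:
                raise PowerFailure()
            self.blocks[lba] = data

    def atomic_scatter_write(self, updates, fail_before_commit=False):
        # Hypothetical atomic path: stage every block, then switch in one step.
        staged = dict(self.blocks)
        staged.update(updates)
        if fail_before_commit:
            raise PowerFailure()
        self.blocks = staged
```

With `write_each`, a failure after the first block leaves a new inode next to an old allocation bitmap, which is the inconsistency that file systems guard against with journaling or similar machinery. With `atomic_scatter_write`, the device holds either the complete old state or the complete new state. Because an FTL already writes out of place and switches its mapping afterward, this kind of operation is plausibly cheap for it to offer.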

As with TRIM, incrementally adding expressiveness and functionality to the existing storage interfaces allows file systems to take advantage of new facilities on devices that provide them. Storage system designers can choose whether to require devices that provide those interfaces or to implement a work-alike facility that they disable when it is not needed. Device vendors can decide whether supporting a richer interface represents a sufficient competitive advantage. Though this approach may never lead to an optimal state, it may allow the industry to navigate monotonically to a sufficient local maximum.

The Chicken and the Egg

There are still other ways to construct a storage system around flash. A more radical approach is to go further than DirectFS, assigning additional high-level responsibilities to the file system such as block management, wear leveling, read-disturb awareness, and error correction. This would allow for a complete reorganization of the software abstractions in the storage system, ensuring sufficient information for proper optimization where today's layers must cope with suboptimal information and communication. Again, this approach requires a vendor that can assert broad control over the whole system—from the file system to the interface, controller, and flash media. It is certainly tenable for closed proprietary systems—indeed, several vendors are pursuing this approach—but for it to gain traction as a new open standard would be difficult.

The SSDs that exist today for the volume market are cheap and fast, but they exhibit performance that is inconsistent and reliability that is insufficient. Higher-level software designed with full awareness of those shortcomings could turn that commodity iron into gold. Without redesigning part or all of the I/O interface, those same SSDs could form the basis of a high-performing and highly reliable storage system.

Rather than designing a file system around the properties of NAND flash, this approach would treat the commodity SSDs themselves as the elementary unit of raw storage. NAND flash memory already has complicated intrinsic properties; the emergent properties of an SSD are even more obscure and varied. A common pathology with SSDs, for example, is variable performance when servicing concurrent or interleaved read and write operations. Understanding these pathologies sufficiently and creating higher-level software to accommodate them would represent the flash version of an existential software parable: enterprise quality from commodity components. It is a phenomenon that the storage world has seen before with disks; software such as ZFS from Sun has produced fast, reliable systems from cheap components.

The only easy part of this transmutation is finding the base material. Building such a software system given a single, unchanging SSD would already be complicated; doing it amid the changing diversity of the SSD market further complicates the task. The properties of flash differ between types and fabrication processes, but change happens at the rate of hardware evolution. SSDs change not only to accommodate the underlying media and controller hardware, but also at the speed of software, fixing bugs and improving algorithms. Still, some vendors are pursuing11 this approach because, while it is more complex than designing for purpose-built hardware, it has the potential to produce superlative systems that ride the economic curve of volume SSDs.

Next for Flash

The lifespan of flash as a relevant technology is a topic of vigorous debate. While flash has ridden its price and density trends to a position of relevance, some experts anticipate fast-approaching limits to the physics of scaling NAND flash memory. Others foresee several decades of flash innovation. Whether it is flash or some other technology, nonvolatile solid-state memory will be a permanent part of the storage hierarchy, having filled the yawning gap between hard-drive and CPU speeds.8

The next evolutionary stage should see file systems designed explicitly for the properties of solid-state media rather than relying on an intermediate layer to translate. The various approaches are each imperfect. Incremental changes to the storage interface may never reach the true acme. Creating a new interface for flash might be untenable in the market. Treating SSDs as the atomic unit of storage may be just another half-measure, and a technically difficult one at that.

Some companies today are betting on the relevance of flash at least in the near term—some working within the confines of today's devices, others building, augmenting, or replacing the existing interfaces. The performance of flash memory has whetted the computer industry's appetite for faster and cheaper persistent storage. The experimentation phase is long over; it is time to build software for flash memory and embrace the specialization needed to realize its full potential.

Related articles on queue.acm.org

Anatomy of a Solid-state Drive
Michael Cornwell
http://queue.acm.org/detail.cfm?id=2385276

Enterprise SSDs
Mark Moshayedi, Patrick Wilkison
http://queue.acm.org/detail.cfm?id=1413263

Flash Disk Opportunity for Server Applications
Jim Gray, Bob Fitzgerald
http://queue.acm.org/detail.cfm?id=1413261

References

1. Btrfs wiki; https://btrfs.wiki.kernel.org/index.php/Main_Page

2. Cornwell, M. Anatomy of a solid-state drive. ACM Queue 10, 10 (2012); http://queue.acm.org/detail.cfm?id=2385276.

3. Elliott, R. and Batwara, A. Notes to T10 Technical Committee. 11-229r4 SBC-4 SPC-5 Atomic writes and reads; http://www.t10.org/cgi-bin/ac.pl?t=d&f=11229r4.pdf; 12-086r2 SBC-4 SPC-5 Scattered writes, optionally atomic; http://www.t10.org/cgi-bin/ac.pl?t=d&f=12-086r2.pdf; 12-087r2 SBC-4 SPC-5 Gathered reads—Optionally atomic; http://www.t10.org/cgi-bin/ac.pl?t=d&f=12-087r2.pdf

4. Gray, J. and Fitzgerald, B. Flash disk opportunity for server applications. ACM Queue 6, 4 (2008); http://queue.acm.org/detail.cfm?id=1413261

5. Hitz, D., Lau, J. and Malcolm, M. File system design for an NFS file server appliance. WTEC '94 USENIX Winter 1994 Technical Conference; http://dl.acm.org/citation.cfm?id=1267093

6. Josephson, W.K., Bongo, L.A., Li, K. and Flynn, D. DFS: A file system for virtualized flash storage. ACM Transactions on Storage 6, 3 (2010). http://dl.acm.org/citation.cfm?id=1837922

7. Leventhal, A. Flash storage today. ACM Queue 6, 4 (2008); http://queue.acm.org/detail.cfm?id=1413262

8. Leventhal, A. Triple-parity RAID and beyond. ACM Queue 7, 11 (2009); http://queue.acm.org/detail.cfm?id=1670144

9. Moshayedi, M. and Wilkison, P. Enterprise SSDs. ACM Queue 6, 4 (2008); http://queue.acm.org/detail.cfm?id=1413263

10. The OpenSSD Project; http://www.openssd-project.org/wiki/The_OpenSSD_Project

11. PureStorage FlashArray; http://www.purestorage.com/flash-array/purity.html

Author

Adam H. Leventhal is the CTO at Delphix, a database virtualization company. Previously he served as Lead Flash Engineer for Sun and then Oracle where he designed flash integration in the ZFS Storage Appliance, Exadata, and other products.

Figures

Figure. Pricing trends.


©2013 ACM  0001-0782/13/05

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from [email protected] or fax (212) 869-0481.

The Digital Library is published by the Association for Computing Machinery. Copyright © 2013 ACM, Inc.


Comments


Melissa Cayer

"...In addition, each write or erase operation wears out NAND flash. Chips are rated for a certain number of such operations, anywhere from 500 to 50,000 cycles today depending on the type and quality, and those numbers are shrinking as the chips themselves shrink..."

How would I, the end user, notice the wear out? I make periodic backups of some of my files.

-Melissa Cayer


Adam Leventhal

Hi Melissa,

Depending on the file system and the SSD, you might notice the wear in different ways. High-quality SSDs will turn on the "predict fail" flag far in advance of actual data loss, recover at-risk data, and move it to safe locations. After a failure, SSDs may percolate bad data back to the file system. A modern file system (like ZFS) will identify the bad data and report an error; a typical consumer file system such as Apple's HFS+ will just use the bad data, leading to system crashes or visible data corruption.

