On File Systems

Update: see the follow-up post, More Linux File Systems

Tango File Manager

Introduction

Although the file system is one of the most important pieces of an operating system, we generally put little thought into it these days.  Put bits in, pull bits out.  It usually works well enough for desktop systems – until the power fails – but even that is usually pretty painless now.

On Linux, there are many contenders in the file system arena.  ext2 had been the standard for many years, but from around 2001 onward a few other choices became mainstream.  Without delving into too much history: journaling support was added to ext2 in the form of ext3, ReiserFS was released, SGI ported XFS, and IBM ported JFS, in no particular order.  For a few reasons, mostly political, ext3 became the de facto file system for Linux.

UNIX – Live Free or Die

Classic File Systems

In what I will refer to as “classic” file systems, the idea is basically the same: they essentially bolt journaling onto the traditional UNIX file system layout (a minimal sketch of the journaling idea follows this list).  Here are the highlights of each:

  • XFS is hailed by some for its excellent support of large files and large file systems, and has some nice modern amenities like extents, delayed allocation, and online defragmentation.  [http://en.wikipedia.org/wiki/Xfs]

    XFS is not without its fair share of faults, though.  In my opinion it is a somewhat half-hearted port: mainly due to its IRIX roots, performance is usually comparable to the other file systems of this era, but CPU usage is relatively high [http://www.debian-administration.org/articles/388].  Data atrocities after power outages or machine crashes seem to be common, and are arguably even a design decision.  This LKML posting really struck a nerve with me: http://lkml.org/lkml/2007/3/28/316.  A lot can go wrong on the hardware side of things, and a file system that is not privy to this is a recipe for disaster.  ZFS is a hero here with checksumming, which I will touch on lightly later.

    It is worth noting that XFS is still under active development and has a decent roadmap forward [http://www.xfs.org].

  • ReiserFS (Reiser3) was one of the first journaling file systems for Linux.  It had some initial growing pains, but was quite a nice file system by kernel 2.6.

    Performance is excellent with small files, but its scalability has been questioned by many of the Linux elite [http://en.wikipedia.org/wiki/Reiserfs].  Hans Reiser went on to work on Reiser4, which is essentially a rewrite.  SUSE keeps a few developers on ReiserFS, but it has become pretty clear that this is a doomed file system [http://www.linux.com/feature/57788].

  • IBM’s JFS is another of the UNIX ports.  JFS traces its lineage back to IBM’s AIX in 1990.  An IBM team ported and improved it for use in OS/2, and later released the code as free, open source software (FOSS).

    The resulting Linux file system is noted for being scalable, resilient, and, in particular, easy on the CPU [http://en.wikipedia.org/wiki/JFS_(file_system)].  It also includes extent support.  For whatever reason, the kernel community and distributions never really latched on to it, and JFS has basically just been slowly maintained throughout its life cycle.

  • ext3 is a journaling extension of the Linux-native ext2.  Of all the file systems in this generation, it is probably the least technologically advanced, lacking features like extents.  What makes up for this is the easy upgrade path for ext2 users, a relatively simple code base, and broad upstream adoption. [http://en.wikipedia.org/wiki/Ext3]

    That broad upstream adoption made ext3 the winner for most users and distributions, and it is now the most stable and best supported Linux file system.
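
To make the “bolt journaling on” idea concrete, here is a minimal sketch of write-ahead metadata journaling.  This is illustrative only – a toy in Python with an invented record format, not the on-disk layout of ext3, XFS, JFS, or ReiserFS – but it shows the core trick: make the intended update durable in a log before touching the main metadata, then replay the log at mount time after a crash.

    # Toy write-ahead metadata journal (illustrative only; the record format
    # and file names are invented).  The intent record is made durable with
    # fsync() *before* the real metadata is modified, so a crash in between
    # can be repaired by replaying the journal at mount time.
    import json, os

    JOURNAL = "journal.log"       # stands in for the on-disk journal area
    METADATA = {}                 # stands in for the real metadata blocks

    def journaled_update(key, value):
        record = json.dumps({"op": "set", "key": key, "value": value})
        with open(JOURNAL, "a") as j:
            j.write(record + "\n")
            j.flush()
            os.fsync(j.fileno())  # commit point: the intent is now durable
        METADATA[key] = value     # only now touch the "real" metadata

    def replay_on_mount():
        # After a crash, re-apply every committed record.  Updates are
        # idempotent, so replaying an already-applied record is harmless.
        # (A real journal would also checkpoint and reclaim this space.)
        if not os.path.exists(JOURNAL):
            return
        with open(JOURNAL) as j:
            for line in j:
                record = json.loads(line)
                if record["op"] == "set":
                    METADATA[record["key"]] = record["value"]

    replay_on_mount()
    journaled_update("inode 42 size", 4096)

All of the classic file systems above journal metadata in roughly this fashion; ext3 can optionally journal file data as well (its data=journal mode).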

In hindsight it seems somewhat tragic that JFS or even XFS didn’t gain the traction that ext3 did to pull us through the “classic” era, but ext3 has proven very reliable and has received consistent care and feeding to keep it performing decently.

Disk Layout

Nextgen File Systems

In 2005, Sun Microsystems released the bombshell ZFS file system, ushering in the era of what I will call “nextgen” file systems.  As hard disks have grown larger, strategies for backup, integrity checking, and support for large files have become much more important.  These file systems also aim to ease management by blurring the traditional VFS line or offering tight integration with LVM and RAID.  Silent corruption caused by bad hardware is also cause for alarm, and checksumming has been baked into some of these “nextgen” file systems to counter it.
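
As an aside, here is a minimal sketch of what block checksumming buys you.  It is a toy in Python with an invented block store, not the actual on-disk format of ZFS or Btrfs: the point is simply that a checksum stored alongside each block lets the file system detect a flipped bit on read – and, given redundancy, repair it from a good copy – instead of silently handing bad data back to the application.

    # Toy checksummed block store (illustrative only; not the ZFS or Btrfs
    # on-disk format).  Every block is stored with a checksum, and the
    # checksum is verified on every read.
    import hashlib

    class ChecksummedStore:
        def __init__(self):
            self.blocks = {}      # block number -> (checksum, data)

        def write(self, blkno, data):
            self.blocks[blkno] = (hashlib.sha256(data).digest(), data)

        def read(self, blkno):
            checksum, data = self.blocks[blkno]
            if hashlib.sha256(data).digest() != checksum:
                # A real file system would try a mirror copy or a RAID
                # reconstruction here before giving up.
                raise IOError("silent corruption detected in block %d" % blkno)
            return data

    store = ChecksummedStore()
    store.write(7, b"important data")
    # Simulate bad hardware flipping a bit behind the file system's back:
    checksum, _ = store.blocks[7]
    store.blocks[7] = (checksum, b"imp0rtant data")
    try:
        store.read(7)
    except IOError as err:
        print(err)                # the corruption is detected, not returned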

In many ways, Linux was caught completely off guard; most developers weren’t thinking very hard about the future of file systems prior to the ZFS release.  Reiser4 explored some interesting ideas and aimed to be a killer file system (okay, I’m really tasteless…), but Hans Reiser enjoyed a particularly bitter relationship with other kernel developers.  Luckily, some even more advanced file systems have come into existence recently.

  • Reiser4 was the first effort at a next-generation Linux file system.  Introduced in 2004, it seemed to have some excellent new technology, including transactions, delayed allocation, and an interesting plugin architecture for adding features like encryption and compression.  Hans Reiser even advertised using the file system directly as a database with advanced metadata.

    Reiser, the primary developer, often rubbed other kernel developers the wrong way when championing his new file system.  Hans seemed to get defensive when questioned about code style and design decisions – particularly those around the plugin architecture.  I think that a lot of this was due to misunderstanding and bad tempers, but to this day Reiser4 has yet to enter Linus’s kernel tree.  With Hans Reiser’s murder conviction in 2008, the future of Reiser4 was frequently called into question.  At this point it seems unlikely that Reiser4 will ever see upstream adoption, but some of the ideas it explored have already been integrated into other “nextgen” file systems [http://en.wikipedia.org/wiki/Reiser4#Integration_with_Linux].

  • ext4 was started as an effort to make a 64-bit ext3 to support large file systems.  Later, others (Lustre, IBM, Bull – see Theodore Ts’o’s comment below) got involved and added extents, delayed allocation, online defragmentation, and more. [http://en.wikipedia.org/wiki/Ext4]

    ext4 enjoys forward compatibility with ext3, and limited backward compatibility if extents are not enabled.  Again the clear advantage here is that it improves upon the stable ext3 base, provides an easy migration path, and has many great developers working on it.  However, it needs to be said that ext4 is still somewhat of a “classic” file system and doesn’t have the level of features and scalability that the other “nextgen” file systems do.

  • Btrfs is clearly Linux’s response to ZFS.  Started by Oracle, the project has gained backing from all of the major Linux corporations.  Traffic on LKML suggests that this will be the file system to carry the torch on from ext4.

    Btrfs’ key design feature is copy-on-write, which allows for inexpensive snapshots useful for backup and recovery (see the sketch after this list).  The goal, however, is to completely surpass ZFS, and many exciting features such as data and metadata checksums, tight device-mapper integration, built-in RAID, online fsck, SSD optimization options, and even in-place ext3 upgrades are being worked on [http://btrfs.wiki.kernel.org/].

  • On the heels of Btrfs, another advanced file system called Tux3 has been announced.  The project’s developers have been making use of FUSE to quickly prototype and test ideas.  The initial work on a kernel port has just been posted [http://lkml.org/lkml/2008/11/26/13].

    This project aims to do away with traditional journaling, instead playing back its logs (in effect, recovering) on every mount.  It will also feature inexpensive snapshots and versioning.  The project’s developers seem to be quite good at championing their ideas, but since coding has really just begun I predict it will be up to three years before we see this ready for mainstream use [http://tux3.org/].
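
Since copy-on-write is the headline feature of most of these “nextgen” designs, here is a toy sketch of why it makes snapshots so cheap.  Again this is illustrative Python with invented structures, not how Btrfs or ZFS actually organize their trees: a snapshot is just another reference to the current block map, and data is only duplicated when one side writes.

    # Toy copy-on-write snapshots (illustrative only).  A snapshot shares the
    # existing blocks; a write after the snapshot only replaces the entry
    # being written, so the snapshot keeps seeing the old data for free.
    class CowVolume:
        def __init__(self, blocks=None):
            self.blocks = blocks if blocks is not None else {}

        def snapshot(self):
            # Share all current block references.  (A real CoW file system
            # shares the tree nodes themselves and copies only the modified
            # path, which keeps snapshots cheap even for huge trees.)
            return CowVolume(dict(self.blocks))

        def write(self, blkno, data):
            self.blocks[blkno] = data    # only this mapping diverges

    live = CowVolume()
    live.write(1, b"original contents")
    snap = live.snapshot()               # cheap: no data is copied
    live.write(1, b"new contents")       # live diverges; the snapshot does not
    assert snap.blocks[1] == b"original contents"
    assert live.blocks[1] == b"new contents"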

Conclusions

With ext4 coming out in kernel 2.6.28, we should have a nice holdover until Btrfs or Tux3 begins to stabilize.  The Btrfs developers have been working on a development sprint, and it is likely that the code will be merged into Linus’s kernel within the next cycle or two [http://www.heise-online.co.uk/news/Kernel-Log-Ext4-completes-development-phase-as-interim-step-to-btrfs--/111742].

It seems pretty clear that Solid State Disks (SSDs) are here to stay.  Theoretically they should blow magnetic storage away in terms of speed.  We are already starting to see competitive write performance, and random access and IOPS are very impressive with the latest Intel SSDs.  It is good to know that Btrfs plans to incorporate SSD optimizations from the start, but these new devices may warrant yet another file system to achieve maximum speed.  I personally think that wear leveling and FAT emulation are holding these devices back and would be better handled by the file system.

P.S.:
I’d been meaning to write this article for a while, but things have been changing rapidly with the introduction of ZFS, Btrfs, and Tux3.  I plan on doing benchmarks soon with kernel 2.6.28 against all the classic file systems, ext4, and Btrfs so subscribe to my RSS feed if you are interested.  Any comments, corrections, or questions would also be appreciated!

P.P.S.:
WordPress seems to mangle the format of this post so the bullet justifications are off.

39 thoughts on “On File Systems”

  1. Actually, ext4 was not a Bull project. The Bull developers are one of the companies involved with the ext4 development, but they were certainly by no means the primary contributors. A number of the key ext4 advancements, especially the extents work, were pioneered by the Clusterfs folks, who used it in production for their Lustre filesystem (Lustre is a cluster filesystem that used ext3 with enhancements, which they supported commercially as an open source product); a number of their enhancements went on to become adopted as part of ext4.

    I was the e2fsprogs maintainer and, especially in the last year, as the most experienced upstream kernel developer, have been responsible for patch quality assurance and pushing the patches upstream. Eric Sandeen from Red Hat did a lot of work making sure everything was put together well for a distribution to use (there are lots of miscellaneous pieces for full filesystem support by a distribution, such as grub support, etc.). Mingming Cao from IBM did a lot of coordination work, and was responsible for putting together some of the OLS ext4 papers. Kawai-san from Hitachi supplied a number of critical patches to make sure we handled disk errors robustly; some folks from Fujitsu have been working on the online defragmentation support. Aneesh Kumar from IBM wrote the 128->256 inode migration code, as well as doing a lot of the fixups on the delayed allocation code in the kernel. Val Henson from Red Hat has been working on the 64-bit support for e2fsprogs in the kernel.

    So there were a lot of people, from a lot of different companies, all helping out. And that is one of the huge strengths of ext4: we have a large developer base, from many different companies. I believe that this wide base of developer support is one of the reasons why ext3 was more successful than, say, JFS or XFS, which had a much smaller base of developers, primarily from a single employer.

    – Ted

  2. LFS, the log-structured filesystem, was in BSD/OS in 1995 and is definitely what you call next-generation. It’s copy-on-write and includes an lfs_cleanerd process to free up space taken by old copies of changed data.

    It was the inspiration or the outright ancestor of NetApp’s WaFL (hard to tell because the BSD license doesn’t require NetApp to share their source). WaFL provides more-or-less everything ZFS does: multiple snapshottable entities inside a single pool (which they call aggr’s), writeable snapshots, the ability to store RAID-like volumes inside the WaFL aggr rather than just files, and export these volumes performantly with iSCSI or FC, and snapshot and clone volumes like filesystems (you could use this with a Linux box and FC host adapter, XFS, and xfs_freeze. A lot of people use it with NTFS and Exchange—NetApp gives you an xfs_freeze equivalent for NTFS, if you pay them). They also do things ZFS doesn’t do, like performant replication that actually works, and deduplication. And they have checksums. ZFS kool-aid drinkers will argue about where the checksums are and brag that sometimes ZFS checksums can protect you against bad memory but-not-always—I think it’s very silly; checksums are important, but ZFS’s aren’t meaningfully better than the checksums NetApp, Hitachi, and EMC are doing.

    NetApp is very hard to work with, though. They insist their software licenses are “non-transferrable”, so if you buy a used system, there is no good way to get even the most basic support for it, basic as in, like, “an install CD for the software I rightfully own thanks to first-sale doctrine.” You cannot even get manuals because they are all on NOW website protected by a password. AIUI the only reasonable way to get WaFL is to buy supported hardware and consider it leased for your support period, then go through the usual game of contract extensions and upgrade discounts.

    FFS/UFS, a “first generation” filesystem in your terms, supports snapshots, and snapshots are actually implemented in Solaris and FreeBSD but I think they do not survive a reboot. XFS in Linux does also, by cooperating with the RAID layer. They provide a tool called xfs_freeze that makes the filesystem instantaneously consistent on disk, then you take a block-level snapshot with LVM or your storage subsystem, and mount that read-only. Snapshots are not such a big deal, but WaFL/ZFS/BtrFS might make them a bit easier to manage because the size of the dataset you’re snapshotting doesn’t need to be fixed.

    Your comments about ZFS surpassing XFS in resilience to data loss caused by bad hardware or broken iSCSI stacks are dangerous, and dead-wrong, and you would realize this instantly if you had as much experience with ZFS as you need to have with XFS to encounter 1 problem. If you read the zfs-discuss list on opensolaris.org, you’ll find a lot of horror stories from people with this level of experience, and admins of large sites saying “we’ve lost tons of pools with ZFS. It is much more prone to this than vxfs or UFS. We just restore from backup, but the problem that’s messing us up is, with larger pool sizes this takes too long.” This is exactly the point—to efficiently use X4500-size systems with higher capacity/performance ratios than we previously had with storage, restoring from backup is less and less of an option.

    We do not know where all the bugs are so far, but one of the Sun people seems to think ZFS is pretty robust to scribbling randomly onto disks—it has good spatial redundancy—but has an achilles heel when it comes to temporal redundancy. If it loses write access to the storage for a contiguous stretch of time, then gets it back, the pool could become corrupt.

    In general I think ZFS’s claim it’s “always consistent on disk” is bogus. It’s just as always consistent as any of the what you call “first generation” filesystems that have logs, like UFS+logging, ext3, XFS, JFS, HFS+journal, FFS+wapbl. It’s just that ZFS is *entirely dependent* on this log, and comes with no fsck tool. It’s very pedantic about noticing some inconsistency and then just refusing to touch the pool in any way. Many of the recovery operations on the list are just rewriting various superblock equivalents or disabling pedantry checks.

    There are still a lot of fixes going into ZFS, but it is no longer such a young filesystem, and I find the level of flakiness really concerning. I don’t think it’ll be hard for BtrFS to compete with it, which is sad for me because I’ve invested a lot of effort into ZFS. The other thing that’s concerning is the ZFS fixes are taking almost a year to make it from OpenSolaris/SXCE into stable Solaris 10, so there is really no up-to-date Solaris equivalent to the stabilifying/regression-testing in RedHat/CentOS.

    My advice to anyone using ZFS is: you always need filesystem-level backups, meaning onto tape, or onto another RAIDset. ZFS snapshots and RAID5 and RAID6 and such are *not* a substitute for the age-old requirement of backups, which protect you from your own mistakes and from filesystem bugs. You always knew you needed this, but if you want to experiment with ZFS, *now* is the time to start budgeting for it. You can keep the backup on another ZFS zpool if you like, or use LVM2+XFS or whatever, but do NOT use ZFS without this backup, because of the corruption bugs and some other paint-yourself-into-a-corner problems it has.

  3. Hi Ted,

    Thank you for the clarification and further details on ext4. I have updated the article. I had meant to describe your involvement. I do remember Bull having pages on ext4 very early on – I want to say 2003 or even before? Thank you for your work on getting ext4 out the door!

    Kevin

  4. Can you comment on why FreeBSD’s soft-updates approach with UFS2 has not been adopted by other filesystems? We switched from Linux/XFS to FreeBSD/UFS2 and saw very impressive performance gains for an I/O bound environment. Our previous efforts were mostly failures as they didn’t solve the problem (ext3->XFS, hardware upgrades, rewrites), but UFS2 did! I don’t care if I’m using FreeBSD or Linux, but it’s a shame that we had to switch operating systems to get decent FS performance.

  5. You might also have a look at DragonFlyBSD’s HAMMER file system, which is under active development and working quite well.

  6. Cypherpunks — what I’d really like to hear is “I am using HAMMER for XXX and it’s working well,” like Ben in the softdep comment above. In blogs I get really frustrated by kool-aid drinking, a bunch of parrots advising each other to use things they’ve never tried.

    Ben — on Solaris, ZFS usually beats UFS, but for database loads it still doesn’t. I think what’s going on is that, for big database files with no metadata updating, UFS is able to make itself very lightweight, almost like the LVM2 “extent” pseudofs, while ZFS has to do COW, which means double-writes and fragmentation. What was your workload’s access pattern, and how big of an improvement did you get from softdep that you considered “successful”?

    My impression until your comment was that journaling filesystems usually beat softdep in spite of the double-write of using a journal, because seeks are so expensive compared to streaming bandwidth (a maxim that might fail if you redid the tests on SSD). Page 5 of this paper seems to agree, with gjournal beating softdep for most things, but not for streaming write:

    http://www.usenix.org/publications/login/2007-06/pdfs/dawidek.pdf

    So I wonder what this workload was that you found, where softdep beats a filesystem that journals metadata only—I can’t guess one offhand.

    One more thing, which I hate to bring up because it ought to be totally tangential but has been a gigantic factor (like 2 – 3x gigantic) in some tests people have posted to mailing lists, is the write cache in the disk drive. I wonder, if you compared UFS+softdep to ext3-over-LVM, whether ext3 would suddenly be extremely fast because LVM discards cache flush commands.

    http://lwn.net/Articles/283161/

    If so, this wouldn’t be legitimately superior technology, just politics and sloppiness, but now you have to go rigorously hunt down this sloppiness everywhere before you can use your test results to decide which direction of filesystem design is most hopeful.

  7. Admittedly I watched this from afar and the problem was non-critical. We have a slow build system (Java+ANT), very slow build machines (Hudson), and Perforce would lock up due to how project branches were integrated. So the problem was for the developers who suffer long build times and the SCM group who monitor the CI builds.

    To solve this they bought expensive workstations with higher-end WD drives. They recently tried to cut back for new machines, buying more CPU power but very cheap disks, for a large net negative. Our build system has been partially rewritten so many times that, due to the mess, it requires a full rebuild unless dangerously customized. The build machines are thrashing their disks as they’ve been overloaded in an attempt to cut back on IT expenses, such that we may at best get a CI build report once a day.

    There was only one person who tried to take a methodical approach, who is now gone; the rest shoot from the hip. They tried changing to XFS and a lot of time was spent trying to tune it. When I got fed up with it, as a developer, I switched to FreeBSD and my builds took 1/3 the time. Under Red Hat the CPUs would be poorly utilized due to disk, but under BSD they were maxed to 100%. They were beginning to test the migration of Perforce and the build machines until layoffs redirected the focus. So we still have problems, just that fewer people are concerned about them.

    So my experience is very subjective and not scientific. However, every time we tried using UFS2 instead of ext3/XFS the disk stopped being our bottleneck and performance was acceptable.

  8. Re: the comments on ZFS on-disk state being always consistent.

    ZFS does not actually need its log (called ZIL), and it actually can be safely disabled. The log is only there to provide assurance for applications like databases that when they call fsync(), the data is actually written to the log with blocking I/O.

    With or without the log, ZFS stores a minimum of two copies of every piece of metadata, on top of and in addition to whatever RAID (1/5/6) you are already using. As it updates its tree of blocks, the superblock becomes the last one to be updated, and because it has checksums, it knows after a crash which copy of the superblock (or any other block) is the valid one.

    So for ZFS to lose your pool you need to have multiple combined hardware failures which will knock out all copies and all redundancy of the metadata.

    Even then, it makes a good effort at letting you get off what you can in read-only mode and returning I/O errors for files which are lost.

    That’s the theory – of course, there were some nasty bugs that cropped up that meant data loss for some unfortunate users. AFAIK, most of these have been fixed and ZFS is now extremely stable on Solaris.

  9. @Miles Nordin — Two big upsides for ZFS in the ZFS vs. NetApp debate are 1) you can get ZFS ‘in the cloud’, which is impossible with NetApp because it’s a h/w platform, and 2) the CLI interface to ZFS is really just quite excellent.

    I think #2 really gets overlooked a lot and I consider it a tremendous asset. If some of the issues you mention re: ZFS get resolved, which I have no reason to doubt, then it’s a great filesystem to build solutions on top of.

    At the very least, if it pushes the envelope on what’s acceptable, I’ll be very happy. VxFS and LVM CLI interfaces are… well, completely unacceptable.

  10. Miles/Randy: I do agree that WaFL is an advanced file system and the hardware is quite nice too. However, NetApp as a company is a royal pain in the ass to deal with. I predict that PCs running file systems such as Btrfs and ZFS will sooner or later displace their entry- and mid-sized product lines. Systems like Sun’s Thumper would make excellent filers, and using SCST (Linux) or Comstar (Solaris) you can even plug other systems in over iSCSI, real SCSI/SAS/SATA, or Fibre Channel(!) and they will treat it like a normal disk.

    Also, I find Linux LVM quite easy to work with if you have an understanding of what a PV, a VG, and an LV are. Bash tab completion is a big help in finding the available operations for each of these. But in another case of hindsight, EVMS was clearly superior in features and management but was cast aside due to NIH syndrome. [http://en.wikipedia.org/wiki/Enterprise_Volume_Management_System]

  11. Also, re SSDs: ZFS can intelligently use mixed storage devices via an extension of the ARC (cache) to make it multilevel: first level in RAM, second level SSD, with persistence to disk; this is in addition to log storage (ZIL) on SSD. All of which amounts to a large cost/benefit ratio from plugging a few SSDs on top of an array of hard disks, in terms of IOPS performance. The latest Sun storage server lineup is based on these mixed storage solutions. The Sun-sponsored OpenStorage summit has a nice video on the subject: http://blogs.sun.com/storage/en_US/entry/flash_performance_in_storage_systems

    On the Linux side, I’m holding out for Btrfs.

  12. Use of alternative filesystems such as reiserfs, xfs, and jfs has been discouraged through rumor, ego, and artificial restrictions in boot loaders and partitioning tools. The current activity against ZFS is a classic example of technical alpha male headbutting over licensing issues and performance misrepresentations; instead of repairing what doesn’t work right, developers are going off and creating something totally new from scratch. ZFS is not perfect but it’s here now, it works, and it looks like it’s a small number of development months away from being ready for production on Linux.

    With filesystems like jfs/xfs around, as far as I can tell, there really is no clear mission for ext4. The rather thin argument of only having a small number of developers at one company for jfs/xfs is only true because of politics. There is no organizational or technical reason why developers could not work on those filesystems.

    kev009’s comment about EVMS is right on target. EVMS was superior to LVM. I think it didn’t win mostly because of politics, but also in part because it had a very user-hostile interface.

    I came to the attitude of “functionality trumps politics” because of a disability. I use Windows because I must use speech recognition. I wish that Linux developers could take a similarly pragmatic approach to filesystems and focus on supplying what the end user needs instead of what fits the political landscape.

  13. I work with VMware a lot and have a bunch of VMware machines. The virtual disk files are enormous, and suspending and resuming virtual machines requires huge amounts of disk I/O. My experience is that XFS is MUCH better than any of the available alternatives in this application. Deleting huge files on ext3 is just painful. I run XFS on a huge RAID 0 partition and performance is quite good. I keep good backups so I don’t worry about data loss. I rotate new drives into the RAID array every year or so; the old ones get recycled into test machines. I use only Seagate drives and I haven’t had a data loss in many years.

  14. annoyed_slightly,

    I’m not convinced that ZFS is the correct answer. Look up through the comments here and you will see just some of its weaknesses. I personally think the binding tie to Sun (not just the CDDL, but development-wise) will keep it from reaching the same level as Btrfs, which has a more open community with many developers and corporations interested.

    It is unfortunate that a BSD licensed file system could not have filled the void from the start so we could have a ubiquitous file system for UNIX. I have some hope for HAMMER, but right now it looks like Linux will be Btrfs and Solaris/FreeBSD/OS X will be ZFS in the next couple of years.

  15. @Fran Taylor: Wait til you try vmware images on top of ZFS! They compress to about half the size which means half the disk I/O needed, as well as doubling your space.

  16. @Fran Taylor: If you are running a RAID 0 array .. how do you cycle the disks?? Break array, replace drive, rebuild array, restore from tape?? Or did you mean RAID 1?

  17. When I “cycle” the drives, I just reformat the array from scratch and restore from my backups. I usually do it at the same time as an OS upgrade.

  18. Wasn’t there a TuxFS that was being worked on in the late ’90s or early 2K, but never really saw the light of day because of a lawsuit?!

  19. annoyed_slightly Said (on November 29th, 2008 at 10:46 pm):

    “Use of alternative filesystems such as reiserfs, xfs, and jfs have been discouraged through rumor, ego, and artificial restrictions”

    It was Hans Reiser himself who told me I shouldn’t use Reiser4 in a commercial webserver, and that I probably shouldn’t use ReiserFS either. This was a few years ago, when he was visiting a users group in Pasadena. Of course that same night, afterwards, he seemed very civil towards his wife.

    Now his filesystems appear to be as dead as his wife, and both are great losses.

    Hans Reiser Said on November 29th, 2008 at 2:50 pm:

    “So I killed my wife. What’s the big fucking deal?”

    The bfd, if I may respectfully submit, is that we’ve lost some good stuff because you’ve killed your wife. You lose, your kids lose, your family loses, her family loses, and we all lose continued development on ReiserFS and on Reiser4, all because you couldn’t accept that your wife fell out of love with you.

  20. Very informative article.

    PPPS
    You should probably learn how to use CSS to style your bullets instead of whining about it.

    Especially:
    list-style-type
    list-style-position
    padding
    margin

  21. Andrew, some of your facts are wrong or arguable. ZFS’s copy-on-write mantra means that the whole filesystem is somewhat log-structured. It has the same log-structured problems as LFS and WaFL with excessive fragmentation when more than ~80% full, or with lots of certain kinds of random-write to the middles of files. And this is the log I meant—it’s entirely dependent on the corruption resilience implied by this ordered pattern of writing-without-overwriting and the quick O(1) recovery step on import, and it DOES include an lfs-like O(1) log roll on import, and it ships with no fsck/recovery tool.

    It does not degrade to a read-only mode on certain kinds of errors—I’m not sure from what other blog you picked up that idea, but I think maybe you’re thinking of ext3, or maybe Solaris’s UFS.

    But in general, I think you are getting lost in bulleted-feature-list arcana. It doesn’t matter how many copies of some piece of metadata exist when the ZFS code which exists right now declares the whole pool corrupt and refuses to read any of the copies, which is what it often does. It’s also common to get into a situation where the pool seems to be working ok, reporting and recovering from errors, but if you reboot you’ll never be able to import that pool again. Given the ZFS format’s well-advertised sorts of spatial redundancy it would be possible to write a really aggressive fsck tool or copy-out forensic tool, but THAT TOOL DOES NOT EXIST RIGHT NOW, so it is dangerously misleading to write about ZFS’s resilience as if it can do anything the disk format could do assuming perfect software that we don’t have—we don’t approach anywhere near such software fantasies how many years (two, three, four?) after ZFS’s original release in Nevada.

    The simple user interface is also a liability in some situations. If you’ll have a look at the zfs-discuss list you’ll see a variety of problems—different kinds of operation are called “resilver”, not just device recovery; the output of ‘zpool status’ is often out-of-date or less informative than what you can get by observing ZFS’s internal status state machine indirectly; certain kinds of status like mirror dirtiness don’t survive reboot and should; it’s impossible to resilver two devices at once, which is a problem for really big pools or for certain kinds of pool gymnastics; someone has a pool with a device that can’t be replaced because he interrupted a device replacement halfway through; many different errors are compressed to the phrase “no valid replicas”; and the ONLINE/DEGRADED/FAULTED/OFFLINE statuses of devices are also compressed from a larger number of true internal statuses, hiding information in the name of simplicity.

    A few serious bugs are still open, and many serious bugs were fixed recently and are not backported to Solaris 10 stable version. There seems to be a lag of almost a full year in backporting Nevada/Opensolaris fixes to plain Solaris, and they’ve stated they will only backport fixes when a Sol10 contract-holder complains about that specific problem, not proactively.

    The original cause of the routine corruption from bouncing iSCSI targets is only speculatively explained and is *NOT* fixed. I mean, in that scenario, it’s (speculatively) a bad iSCSI stack causing the corruption, not necessarily ZFS bugs (maybe bugs in responding to the failure, like not resilvering enough, but maybe not), but what’s sure is that other filesystems consistently deal with it with more grace than ZFS, so ZFS needs more robustness here. This is not fixed. And there are probably other things making pools non-importable.

    Bob Paulson: I’m using SXCE (Solaris Express Community Edition), which is like OpenSolaris but is a larger DVD, is not redistributable, comes in a SPARC version, works with LiveUpgrade instead of IPS, and has source code for a smaller percentage of the binaries delivered than OpenSolaris. Your best bet might be Nexenta, because they roll their own stable releases outside Sun, they’re better than Sun at proactively backporting ZFS bug fixes from Nevada to their stable release, and you get most of the source. But I haven’t tried it myself, and I heard some goofy restrictiveness about the NexentaStor license that I don’t understand. ZFS is also in Mac OS X, FreeBSD, and Linux (FUSE), but I’m not sure it’s a good idea to use it there because of the number of bug fixes going in, such that even Solaris 10 ZFS is too old IMHO—you may rightly want the absolute latest version, for bug fixes not features, so SXCE, OpenSolaris, and Nexenta will all be newer than Mac/BSD/Linux.

    ZFS is not all bad, and I’m not telling people to stay away from it. My message is (1) backups, as in on another filesystem, are MANDATORY with ZFS, and (2) I’m tired of reading overblown hype chatter on blogs about ZFS, people with minimal experience telling each other how they heard it was supposed to behave so marvelously. Yes, ZFS still has hope, but it is currently right NOW much more corruption prone than UFS, ext3, XFS, and it is not getting more stable very quickly.

  22. Fran Taylor: you can’t, but what you can do is export ZFS contents to VMware ESX over either iSCSI or NFS. I’ve not done this with ESX myself, so I’m kind of breaking my own rule by writing about it. But part of the point is, you can have several ZFS servers and several ESX servers with a mesh of network switches in between, and you can migrate VMs among the ESX servers at will to balance the load. You can have storage-heavy guests and compute-heavy guests. The idea is to use ZFS in the same way you might use a NetApp/Hitachi/EMC SAN.

    Both iSCSI and NFS have some problems. The Solaris iSCSI target is very buggy, but sounds like it may be workable for some people on the list. One advantage over NFS is that you can export zvol’s directly to the VM guests, so you can use ZFS to snapshot and clone the guest volumes rather than ESX. ZFS might be faster at destroying snapshots than ESX, and it also has some primitive replication ability through ‘zfs send’ and ‘zfs recv’, but all this needs testing before you count on it.

    NFS is substantially less buggy than iSCSI, but it still has problems. I’ve had problems where one failed/failing zpool will lock up NFS service for the whole system, not just for the pool that’s failing.

    I think speed is not a clear win on either side. NFS might be about as fast as iSCSI for big files—it’s quite slow when there are lots of tiny files being opened and closed, but for big files, it sounds like sites are always migrating in one direction or another for who knows what reason. So few people use this stuff, you can’t really count on anything until you test it yourself.

    I’m not sure what Andrew is talking about with the “half the size” comment. ZFS can make sparse volumes if you use the iSCSI approach, where you get a volume filled with zeroes that takes no space from the pool, and it starts to take pool space as the guest writes things other than zeroes into it, but (a) VMware can do that natively/over-NFS with thin-provisioned volumes, analogous to the usual non-flat .vmdk’s in Workstation, (b) by default zvols get a reservation for their whole size, so they don’t really take up less space unless you disable this and allow overcommitting, and (c) this doesn’t mean less I/O—the zeroes don’t have to be read or written, whether you allocate space for them or not. Andrew might have been thinking of ZFS’s compression feature, but if this is okay with you, why not enable filesystem-level compression on the guest and save the ESX-to-ZFS bandwidth? There have been some reports on the list of choppy non-realtime-friendly behavior with ZFS’s gzip compression, and gradual slowdowns to virtual lockups over a few days. The less-tight lzjb compression has fewer problem reports, but I think both are not well-tested with ESX, so I’d wait for some real positive reports before counting on them to work well in a big VMware setup. What’s usually wanted for big VMware setups is deduplication, which ZFS doesn’t do yet.

    If you want desktop virtualization you can use VirtualBox instead of VMware, under SXCE. Just be sure to get a 64-bit CPU for the extra kernel address space, and loads of RAM.

  23. Nathan,

    I made brief mention of NILFS in my follow-up article (http://www.kev009.com/wp/2008/12/more-linux-file-systems/). The short version: it is a log-structured FS, for better or worse. It offers many of the features of ZFS (which I think is log-influenced) and Btrfs. You probably haven’t heard of it because the TODO list is quite long (nilfs.org) and the devs and NTT (the company behind it) haven’t been doing much of the social work required to get code reviewed, tested, improved, and ultimately merged upstream. According to the Wikipedia article, log-structured file systems should be best on SSD storage, where seeks are less of a problem.

    Indeed, NILFS could be a sleeper and add another contender to the ext4-Btrfs-Tux3 trifecta that is shaping up right now.

  24. How about AdvFS on Linux? It looks to be a good FS. Any idea what state it is in on Linux and what it would provide? It seems HP is open-sourcing it under the GPL.

  25. All the references I’ve seen to AdvFS stated that the code would be there to research and analyze, but I haven’t heard of a porting effort (correct me if I’m wrong). Under the GPL, any useful code could be moved into a current FS, but probably more important than the code would be patent indemnification. All in all it’s a noble move by HP, but I don’t think their goal was to get people to use AdvFS, rather just to offload the work so others can learn from it.
