On File Systems

Update: see the follow-up post, More Linux File Systems

Tango File Manager

Introduction

Although the file system is one of the most important pieces of an operating system, we generally put little thought into it these days.  Put bits in, pull bits out.  It usually works well enough for desktop systems – until the power fails – but even that is usually pretty painless now.

On Linux, there are many contenders in the file system arena.  ext2 had been the standard for many years, but from around 2001 onward a few other choices became mainstream.  Without delving into too much history: journaling support was added to ext2 in the form of ext3, ReiserFS was released, SGI ported XFS, and IBM ported JFS, in no particular order.  For a few reasons, mostly political, ext3 became the de facto file system for Linux.

UNIX – Live Free or Die

Classic File Systems

In what I will refer to as “classic” file systems, the idea is basically the same: they essentially bolt journaling onto the traditional UNIX file system layout (a minimal sketch of the journaling idea follows this list).  Here are the highlights of each:

  • XFS is hailed by some for its excellent support of large files and large file systems, and has some nice modern amenities like extents, delayed allocation, and online defragmentation.  [http://en.wikipedia.org/wiki/Xfs]

    XFS is not without its fair share of faults, though.  In my opinion it is a somewhat half-hearted port: mainly due to its IRIX roots, performance is usually comparable to the other file systems of this era, but CPU usage is relatively high [http://www.debian-administration.org/articles/388].  Data atrocities after power outages or machine crashes seem to be common, and are arguably even a design decision.  This LKML posting really struck a nerve with me: http://lkml.org/lkml/2007/3/28/316.  A lot can go wrong on the hardware side of things, and a file system that is not privy to this is a recipe for disaster.  ZFS is a hero here with checksumming, which I will touch on lightly later.

    It is worth noting that XFS is still under active development and has a decent roadmap forward [http://www.xfs.org].

  • ReiserFS (Reiser3) was one of the first journaling file systems for Linux.  It had some initial growing pains, but was quite a nice file system by kernel 2.6.

    Performance is excellent with small files, but its scalability has been questioned by many of the Linux elite [http://en.wikipedia.org/wiki/Reiserfs].  Hans Reiser went on to work on Reiser4, which is essentially a rewrite.  SUSE keeps a few developers on ReiserFS, but it has become pretty clear that this is a doomed file system [http://www.linux.com/feature/57788].

  • IBM’s JFS is another of the UNIX ports.  JFS traces its lineage back to IBM’s AIX in 1990.  An IBM team ported and improved it for use in OS/2, and later released the code as free, open source software (FOSS).

    The resulting Linux file system is noted for being scalable, resilient, and, in particular, easy on the CPU [http://en.wikipedia.org/wiki/JFS_(file_system)].  It also includes extent support.  For whatever reason, the kernel community and distributions never really latched on to it, and JFS has basically just been slowly maintained throughout its life cycle.

  • ext3 is a journaling extension of the Linux-native ext2.  Of all the file systems in this generation, it is probably the least technologically advanced, lacking features like extents.  What makes up for this is the easy upgrade path for ext2 users, a relatively simple code base, and broad upstream adoption. [http://en.wikipedia.org/wiki/Ext3]

    That broad upstream adoption made ext3 the winner for most users and distributions, and it is now the most stable and best supported Linux file system.
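
To make the “bolt journaling on” idea concrete, here is a minimal sketch of write-ahead metadata journaling.  This is illustrative only – a toy in Python with an invented record format, not the on-disk layout of ext3, XFS, JFS, or ReiserFS – but it shows the core trick: make the intended update durable in a log before touching the main metadata, then replay the log at mount time after a crash.

    # Toy write-ahead metadata journal (illustrative only; the record format
    # and file names are invented).  The intent record is made durable with
    # fsync() *before* the real metadata is modified, so a crash in between
    # can be repaired by replaying the journal at mount time.
    import json, os

    JOURNAL = "journal.log"       # stands in for the on-disk journal area
    METADATA = {}                 # stands in for the real metadata blocks

    def journaled_update(key, value):
        record = json.dumps({"op": "set", "key": key, "value": value})
        with open(JOURNAL, "a") as j:
            j.write(record + "\n")
            j.flush()
            os.fsync(j.fileno())  # commit point: the intent is now durable
        METADATA[key] = value     # only now touch the "real" metadata

    def replay_on_mount():
        # After a crash, re-apply every committed record.  Updates are
        # idempotent, so replaying an already-applied record is harmless.
        # (A real journal would also checkpoint and reclaim this space.)
        if not os.path.exists(JOURNAL):
            return
        with open(JOURNAL) as j:
            for line in j:
                record = json.loads(line)
                if record["op"] == "set":
                    METADATA[record["key"]] = record["value"]

    replay_on_mount()
    journaled_update("inode 42 size", 4096)

All of the classic file systems above journal metadata in roughly this fashion; ext3 can optionally journal file data as well (its data=journal mode).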

In hindsight it seems somewhat tragic that JFS or even XFS didn’t gain the traction that ext3 did to pull us through the “classic” era, but ext3 has proven very reliable and has received consistent care and feeding to keep it performing decently.

Disk Layout

Nextgen File Systems

In 2005, Sun Microsystems released the bombshell ZFS file system, ushering in the era of what I will call “nextgen” file systems.  As hard disks have grown larger, strategies for backup, integrity checking, and support for large files have become much more important.  These file systems also aim to ease management by blurring the traditional VFS line or offering tight integration with LVM and RAID.  Silent corruption caused by bad hardware is also cause for alarm, and checksumming has been baked into some of these “nextgen” file systems to counter it.
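
As an aside, here is a minimal sketch of what block checksumming buys you.  It is a toy in Python with an invented block store, not the actual on-disk format of ZFS or Btrfs: the point is simply that a checksum stored alongside each block lets the file system detect a flipped bit on read – and, given redundancy, repair it from a good copy – instead of silently handing bad data back to the application.

    # Toy checksummed block store (illustrative only; not the ZFS or Btrfs
    # on-disk format).  Every block is stored with a checksum, and the
    # checksum is verified on every read.
    import hashlib

    class ChecksummedStore:
        def __init__(self):
            self.blocks = {}      # block number -> (checksum, data)

        def write(self, blkno, data):
            self.blocks[blkno] = (hashlib.sha256(data).digest(), data)

        def read(self, blkno):
            checksum, data = self.blocks[blkno]
            if hashlib.sha256(data).digest() != checksum:
                # A real file system would try a mirror copy or a RAID
                # reconstruction here before giving up.
                raise IOError("silent corruption detected in block %d" % blkno)
            return data

    store = ChecksummedStore()
    store.write(7, b"important data")
    # Simulate bad hardware flipping a bit behind the file system's back:
    checksum, _ = store.blocks[7]
    store.blocks[7] = (checksum, b"imp0rtant data")
    try:
        store.read(7)
    except IOError as err:
        print(err)                # the corruption is detected, not returned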

In many ways, Linux was caught completely off guard; most developers weren’t thinking very hard about the future of file systems prior to the ZFS release.  Reiser4 explored some interesting ideas and aimed to be a killer file system (okay, I’m really tasteless…), but Hans Reiser enjoyed a particularly bitter relationship with other kernel developers.  Luckily, some even more advanced file systems have come into existence recently.

  • Reiser4 was the first effort at a next-generation Linux file system.  Introduced in 2004, it seemed to have some excellent new technology, including transactions, delayed allocation, and an interesting plugin architecture for adding features like encryption and compression.  Hans Reiser even advertised using the file system directly as a database with advanced metadata.

    Reiser, the primary developer, often rubbed other kernel developers the wrong way when championing his new file system.  Hans seemed to get defensive when questioned about code style and design decisions – particularly those around the plugin architecture.  I think that a lot of this was due to misunderstanding and bad tempers, but to this day Reiser4 has yet to enter Linus’s kernel tree.  With Hans Reiser’s murder conviction in 2008, the future of Reiser4 was frequently called into question.  At this point it seems unlikely that Reiser4 will ever see upstream adoption, but some of the ideas it explored have already been integrated into other “nextgen” file systems [http://en.wikipedia.org/wiki/Reiser4#Integration_with_Linux].

  • ext4 was started as an effort to make a 64-bit ext3 to support large file systems.  Later, others (Lustre, IBM, Bull – see Theodore Ts’o’s comment below) got involved and added extents, delayed allocation, online defragmentation, and more. [http://en.wikipedia.org/wiki/Ext4]

    ext4 enjoys forward compatibility with ext3, and limited backward compatibility if extents are not enabled.  Again the clear advantage here is that it improves upon the stable ext3 base, provides an easy migration path, and has many great developers working on it.  However, it needs to be said that ext4 is still somewhat of a “classic” file system and doesn’t have the level of features and scalability that the other “nextgen” file systems do.

  • Btrfs is clearly Linux’s response to ZFS.  Started by Oracle, the project has gained backing from all of the major Linux corporations.  Traffic on LKML suggests that this will be the file system to carry the torch on from ext4.

    Btrfs’ key design feature is copy-on-write, which allows for inexpensive snapshots useful for backup and recovery (see the sketch after this list).  The goal, however, is to completely surpass ZFS, and many exciting features such as data and metadata checksums, tight device-mapper integration, built-in RAID, online fsck, SSD optimization options, and even in-place ext3 upgrades are being worked on [http://btrfs.wiki.kernel.org/].

  • On the heels of Btrfs, another advanced file system called Tux3 has been announced.  The project’s developers have been making use of FUSE to quickly prototype and test ideas.  The initial work on a kernel port has just been posted [http://lkml.org/lkml/2008/11/26/13].

    This project aims to do away with traditional journaling, instead playing back its logs (in effect, recovering) on every mount.  It will also feature inexpensive snapshots and versioning.  The project’s developers seem to be quite good at championing their ideas, but since coding has really just begun I predict it will be up to three years before we see this ready for mainstream use [http://tux3.org/].
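
Since copy-on-write is the headline feature of most of these “nextgen” designs, here is a toy sketch of why it makes snapshots so cheap.  Again this is illustrative Python with invented structures, not how Btrfs or ZFS actually organize their trees: a snapshot is just another reference to the current block map, and data is only duplicated when one side writes.

    # Toy copy-on-write snapshots (illustrative only).  A snapshot shares the
    # existing blocks; a write after the snapshot only replaces the entry
    # being written, so the snapshot keeps seeing the old data for free.
    class CowVolume:
        def __init__(self, blocks=None):
            self.blocks = blocks if blocks is not None else {}

        def snapshot(self):
            # Share all current block references.  (A real CoW file system
            # shares the tree nodes themselves and copies only the modified
            # path, which keeps snapshots cheap even for huge trees.)
            return CowVolume(dict(self.blocks))

        def write(self, blkno, data):
            self.blocks[blkno] = data    # only this mapping diverges

    live = CowVolume()
    live.write(1, b"original contents")
    snap = live.snapshot()               # cheap: no data is copied
    live.write(1, b"new contents")       # live diverges; the snapshot does not
    assert snap.blocks[1] == b"original contents"
    assert live.blocks[1] == b"new contents"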

Conclusions

With ext4 coming out in kernel 2.6.28, we should have a nice holdover until Btrfs or Tux3 begins to stabilize.  The Btrfs developers have been working on a development sprint, and it is likely that the code will be merged into Linus’s kernel within the next cycle or two [http://www.heise-online.co.uk/news/Kernel-Log-Ext4-completes-development-phase-as-interim-step-to-btrfs--/111742].

It seems pretty clear that Solid State Disks (SSDs) are here to stay.  Theoretically they should blow magnetic storage away in terms of speed.  We are already starting to see competitive write performance, and random access and IOPS are very impressive with the latest Intel SSDs.  It is good to know that Btrfs plans to incorporate SSD optimizations from the start, but these new devices may warrant yet another file system to achieve maximum speed.  I personally think that wear leveling and FAT emulation are holding these devices back and would be better handled by the file system.

P.S.:
I’d been meaning to write this article for a while, but things have been changing rapidly with the introduction of ZFS, Btrfs, and Tux3.  I plan on doing benchmarks soon with kernel 2.6.28 against all the classic file systems, ext4, and Btrfs so subscribe to my RSS feed if you are interested.  Any comments, corrections, or questions would also be appreciated!

P.P.S.:
WordPress seems to mangle the format of this post so the bullet justifications are off.

39 thoughts on “On File Systems”

  1. Actually, ext4 was not a Bull project. The Bull developers are one of the companies involved with the ext4 development, but they were certainly by no means the primary contributors. A number of the key ext4 advancements, especially the extents work, were pioneered by the Clusterfs folks, who used it in production for their Lustre filesystem (Lustre is a cluster filesystem that used ext3 with enhancements, which they supported commercially as an open source product); a number of their enhancements went on to become adopted as part of ext4.

    I was the e2fsprogs maintainer and, especially in the last year, as the most experienced upstream kernel developer, have been responsible for patch quality assurance and pushing the patches upstream. Eric Sandeen from Red Hat did a lot of work making sure everything was put together well for a distribution to use (there are lots of miscellaneous pieces for full filesystem support by a distribution, such as grub support, etc.). Mingming Cao from IBM did a lot of coordination work, and was responsible for putting together some of the OLS ext4 papers. Kawai-san from Hitachi supplied a number of critical patches to make sure we handled disk errors robustly; some folks from Fujitsu have been working on the online defragmentation support. Aneesh Kumar from IBM wrote the 128->256 inode migration code, as well as doing a lot of the fixups on the delayed allocation code in the kernel. Val Henson from Red Hat has been working on the 64-bit support for e2fsprogs in the kernel.

    So there were a lot of people, from a lot of different companies, all helping out. And that is one of the huge strengths of ext4: we have a large developer base, from many different companies. I believe that this wide base of developer support is one of the reasons why ext3 was more successful than, say, JFS or XFS, which had a much smaller base of developers, primarily from a single employer.

    – Ted

  2. LFS, the log-structured filesystem, was in BSD/OS in 1995 and is definitely what you call next-generation. It’s copy-on-write and includes an lfs_cleanerd process to free up space taken by old copies of changed data.

    It was the inspiration or the outright ancestor of NetApp’s WaFL (hard to tell because the BSD license doesn’t require NetApp to share their source). WaFL provides more-or-less everything ZFS does: multiple snapshottable entities inside a single pool (which they call aggr’s), writeable snapshots, the ability to store RAID-like volumes inside the WaFL aggr rather than just files, and export these volumes performantly with iSCSI or FC, and snapshot and clone volumes like filesystems (you could use this with a Linux box and FC host adapter, XFS, and xfs_freeze. A lot of people use it with NTFS and Exchange—NetApp gives you an xfs_freeze equivalent for NTFS, if you pay them). They also do things ZFS doesn’t do, like performant replication that actually works, and deduplication. And they have checksums. ZFS kool-aid drinkers will argue about where the checksums are and brag that sometimes ZFS checksums can protect you against bad memory but-not-always—I think it’s very silly; checksums are important, but ZFS’s aren’t meaningfully better than the checksums NetApp, Hitachi, and EMC are doing.

    NetApp is very hard to work with, though. They insist their software licenses are “non-transferrable”, so if you buy a used system, there is no good way to get even the most basic support for it, basic as in, like, “an install CD for the software I rightfully own thanks to first-sale doctrine.” You cannot even get manuals because they are all on NOW website protected by a password. AIUI the only reasonable way to get WaFL is to buy supported hardware and consider it leased for your support period, then go through the usual game of contract extensions and upgrade discounts.

    FFS/UFS, a “first generation” filesystem in your terms, supports snapshots, and snapshots are actually implemented in Solaris and FreeBSD but I think they do not survive a reboot. XFS in Linux does also, by cooperating with the RAID layer. They provide a tool called xfs_freeze that makes the filesystem instantaneously consistent on disk, then you take a block-level snapshot with LVM or your storage subsystem, and mount that read-only. Snapshots are not such a big deal, but WaFL/ZFS/BtrFS might make them a bit easier to manage because the size of the dataset you’re snapshotting doesn’t need to be fixed.

    Your comments about ZFS surpassing XFS in resilience to data loss caused by bad hardware or broken iSCSI stacks are dangerous, and dead-wrong, and you would realize this instantly if you had as much experience with ZFS as you need to have with XFS to encounter 1 problem. If you read the zfs-discuss list on opensolaris.org, you’ll find a lot of horror stories from people with this level of experience, and admins of large sites saying “we’ve lost tons of pools with ZFS. It is much more prone to this than vxfs or UFS. We just restore from backup, but the problem that’s messing us up is, with larger pool sizes this takes too long.” This is exactly the point—to efficiently use X4500-size systems with higher capacity/performance ratios than we previously had with storage, restoring from backup is less and less of an option.

    We do not know where all the bugs are so far, but one of the Sun people seems to think ZFS is pretty robust to scribbling randomly onto disks—it has good spatial redundancy—but has an achilles heel when it comes to temporal redundancy. If it loses write access to the storage for a contiguous stretch of time, then gets it back, the pool could become corrupt.

    In general I think ZFS’s claim it’s “always consistent on disk” is bogus. It’s just as always consistent as any of the what you call “first generation” filesystems that have logs, like UFS+logging, ext3, XFS, JFS, HFS+journal, FFS+wapbl. It’s just that ZFS is *entirely dependent* on this log, and comes with no fsck tool. It’s very pedantic about noticing some inconsistency and then just refusing to touch the pool in any way. Many of the recovery operations on the list are just rewriting various superblock equivalents or disabling pedantry checks.

    There are still a lot of fixes going into ZFS, but it is no longer such a young filesystem, and I find the level of flakiness really concerning. I don’t think it’ll be hard for BtrFS to compete with it, which is sad for me because I’ve invested a lot of effort into ZFS. The other thing that’s concerning is the ZFS fixes are taking almost a year to make it from OpenSolaris/SXCE into stable Solaris 10, so there is really no up-to-date Solaris equivalent to the stabilifying/regression-testing in RedHat/CentOS.

    My advice to anyone using ZFS is: you always need filesystem-level backups, meaning onto tape, or onto another RAIDset. ZFS snapshots and RAID5 and RAID6 and such are *not* a substitute for the age-old requirement of backups, which protect you from your own mistakes and from filesystem bugs. You always knew you needed this, but if you want to experiment with ZFS, *now* is the time to start budgeting for it. You can keep the backup on another ZFS zpool if you like, or use LVM2+XFS or whatever, but do NOT use ZFS without this backup, because of the corruption bugs and some other paint-yourself-into-a-corner problems it has.

  3. Hi Ted,

    Thank you for the clarification and further details on ext4. I have updated the article. I had meant to describe your involvement. I do remember Bull having pages on ext4 very early on – I want to say 2003 or even before? Thank you for your work on getting ext4 out the door!

    Kevin

  4. Can you comment on why FreeBSD’s soft-updates approach with UFS2 has not been adopted by other filesystems? We switched from Linux/XFS to FreeBSD/UFS2 and saw very impressive performance gains for an I/O bound environment. Our previous efforts were mostly failures as they didn’t solve the problem (ext3->XFS, hardware upgrades, rewrites), but UFS2 did! I don’t care if I’m using FreeBSD or Linux, but it’s a shame that we had to switch operating systems to get decent FS performance.

  5. You might also have a look at DragonFlyBSD’s HAMMER file system, which is under active development and working quite well.

  6. Cypherpunks — what I’d really like to hear is “I am using HAMMER for XXX and it’s working well,” like Ben in the softdep comment above. In blogs I get really frustrated by kool-aid drinking, a bunch of parrots advising each other to use things they’ve never tried.

    Ben — on Solaris, ZFS usually beats UFS, but for database loads it still doesn’t. I think what’s going on is that, for big database files with no metadata updating, UFS is able to make itself very lightweight, almost like the LVM2 “extent” pseudofs, while ZFS has to do COW, which means double-writes and fragmentation. What was your workload’s access pattern, and how big of an improvement did you get from softdep that you considered “successful”?

    My impression until your comment was that journaling filesystems usually beat softdep in spite of the double-write of using a journal, because seeks are so expensive compared to streaming bandwidth (a maxim that might fail if you redid the tests on SSD). Page 5 of this paper seems to agree, with gjournal beating softdep for most things, but not for streaming write:

    http://www.usenix.org/publications/login/2007-06/pdfs/dawidek.pdf

    So I wonder what this workload was that you found, where softdep beats a filesystem that journals metadata only—I can’t guess one offhand.

    One more thing, which I hate to bring up because it ought to be totally tangential but has been a gigantic factor (like 2 – 3x gigantic) in some tests people have posted to mailing lists, is the write cache in the disk drive. I wonder, if you compared UFS+softdep to ext3-over-LVM, whether ext3 would suddenly be extremely fast because LVM discards cache flush commands.

    http://lwn.net/Articles/283161/

    If so, this wouldn’t be legitimately superior technology, just politics and sloppiness, but now you have to go rigorously hunt down this sloppiness everywhere before you can use your test results to decide which direction of filesystem design is most hopeful.

  7. Admittedly I watched this from afar and the problem was non-critical. We have a slow build system (Java+ANT), very slow build machines (Hudson), and Perforce would lock up due to how project branches were integrated. So the problem was for the developers who suffer long build times and the SCM group who monitor the CI builds.

    To solve this they bought expensive workstations with higher-end WD drives. They recently tried to cut back for new machines, buying more CPU power but very cheap disks, for a large net negative. Our build system has been partially rewritten so many times that, due to the mess, it requires a full rebuild unless dangerously customized. The build machines are thrashing their disks as they’ve been overloaded in an attempt to cut back on IT expenses, such that we may at best get a CI build report once a day.

    There was only one person who tried to take a methodical approach, who is now gone; the rest shoot from the hip. They tried changing to XFS and a lot of time was spent trying to tune it. When I got fed up with it, as a developer, I switched to FreeBSD and my builds took 1/3 the time. Under Red Hat the CPUs would be poorly utilized due to disk, but under BSD they were maxed to 100%. They were beginning to test the migration of Perforce and the build machines until layoffs redirected the focus. So we still have problems, just that fewer people are concerned about them.

    So my experience is very subjective and not scientific. However, every time we tried using UFS2 instead of ext3/XFS the disk stopped being our bottleneck and performance was acceptable.

  8. Re: the comments on ZFS on-disk state being always consistent.

    ZFS does not actually need its log (called ZIL), and it actually can be safely disabled. The log is only there to provide assurance for applications like databases that when they call fsync(), the data is actually written to the log with blocking I/O.

    With or without the log, ZFS stores a minimum of two copies of every piece of metadata, on top of and in addition to whatever RAID (1/5/6) you are already using. As it updates its tree of blocks, the superblock becomes the last one to be updated, and because it has checksums, it knows after a crash which copy of the superblock (or any other block) is the valid one.

    So for ZFS to lose your pool you need to have multiple combined hardware failures which will knock out all copies and all redundancy of the metadata.

    Even then, it makes a good effort at letting you get off what you can in read-only mode and returning I/O errors for files which are lost.

    That’s the theory – of course, there were some nasty bugs that cropped up that meant data loss for some unfortunate users. AFAIK, most of these have been fixed and ZFS is now extremely stable on Solaris.

  9. @Miles Nordin — Two big upsides for ZFS in the ZFS vs. NetApp debate are 1) you can get ZFS ‘in the cloud’, which is impossible with NetApp because it’s a h/w platform, and 2) the CLI interface to ZFS is really just quite excellent.

    I think #2 really gets overlooked a lot and I consider it a tremendous asset. If some of the issues you mention re: ZFS get resolved, which I have no reason to doubt, then it’s a great filesystem to build solutions on top of.

    At the very least, if it pushes the envelope on what’s acceptable, I’ll be very happy. VxFS and LVM CLI interfaces are… well, completely unacceptable.

  10. Miles/Randy: I do agree that WaFL is an advanced file system and the hardware is quite nice too. However, NetApp as a company is a royal pain in the ass to deal with. I predict that PCs running file systems such as Btrfs and ZFS will sooner or later displace their entry- and mid-sized product lines. Systems like Sun’s Thumper would make excellent filers, and using SCST (Linux) or Comstar (Solaris) you can even plug other systems in over iSCSI, real SCSI/SAS/SATA, or Fibre Channel(!) and they will treat it like a normal disk.

    Also, I find Linux LVM quite easy to work with if you have an understanding of what a PV, a VG, and an LV are. Bash tab completion is a big help in finding the available operations for each of these. But in another case of hindsight, EVMS was clearly superior in features and management but was cast aside due to NIH syndrome. [http://en.wikipedia.org/wiki/Enterprise_Volume_Management_System]

  11. Also, re SSDs: ZFS can intelligently use mixed storage devices via an extension of the ARC (cache) to make it multilevel: first level in RAM, second level SSD, with persistence to disk; this is in addition to log storage (ZIL) on SSD. All of which amounts to a large cost/benefit ratio from plugging a few SSDs on top of an array of hard disks, in terms of IOPS performance. The latest Sun storage server lineup is based on these mixed storage solutions. The Sun-sponsored OpenStorage summit has a nice video on the subject: http://blogs.sun.com/storage/en_US/entry/flash_performance_in_storage_systems

    On the Linux side, I’m holding out for Btrfs.

  12. Use of alternative filesystems such as reiserfs, xfs, and jfs has been discouraged through rumor, ego, and artificial restrictions in boot loaders and partitioning tools. The current activity against ZFS is a classic example of technical alpha male headbutting over licensing issues and performance misrepresentations; instead of repairing what doesn’t work right, developers are going off and creating something totally new from scratch. ZFS is not perfect but it’s here now, it works, and it looks like it’s a small number of development months away from being ready for production on Linux.

    With filesystems like jfs/xfs around, as far as I can tell, there really is no clear mission for ext4. The rather thin argument of only having a small number of developers at one company for jfs/xfs is only true because of politics. There is no organizational or technical reason why developers could not work on those filesystems.

    kev009’s comment about EVMS is right on target. EVMS was superior to LVM. I think it didn’t win mostly because of politics, but also in part because it had a very user-hostile interface.

    I came to the attitude of “functionality trumps politics” because of a disability. I use Windows because I must use speech recognition. I wish that Linux developers could take a similarly pragmatic approach to filesystems and focus on supplying what the end user needs instead of what fits the political landscape.

  13. I work with VMware a lot and have a bunch of VMware machines. The virtual disk files are enormous, and suspending and resuming virtual machines requires huge amounts of disk I/O. My experience is that XFS is MUCH better than any of the available alternatives in this application. Deleting huge files on ext3 is just painful. I run XFS on a huge RAID 0 partition and performance is quite good. I keep good backups so I don’t worry about data loss. I rotate new drives into the RAID array every year or so; the old ones get recycled into test machines. I use only Seagate drives and I haven’t had a data loss in many years.

  14. annoyed_slightly,

    I’m not convinced that ZFS is the correct answer. Look up through the comments here and you will see just some of its weaknesses. I personally think the binding tie to Sun (not just the CDDL, but development-wise) will keep it from reaching the same level as Btrfs, which has a more open community with many developers and corporations interested.

    It is unfortunate that a BSD licensed file system could not have filled the void from the start so we could have a ubiquitous file system for UNIX. I have some hope for HAMMER, but right now it looks like Linux will be Btrfs and Solaris/FreeBSD/OS X will be ZFS in the next couple of years.

  15. @Fran Taylor: Wait til you try vmware images on top of ZFS! They compress to about half the size which means half the disk I/O needed, as well as doubling your space.

  16. @Fran Taylor: If you are running a RAID 0 array .. how do you cycle the disks?? Break array, replace drive, rebuild array, restore from tape?? Or did you mean RAID 1?

  17. When I “cycle” the drives, I just reformat the array from scratch and restore from my backups. I usually do it at the same time as an OS upgrade.

  18. Wasn’t there a TuxFS that was being worked on in the late ’90s or early 2K, but never really saw the light of day because of a lawsuit?!

  19. annoyed_slightly Said (on November 29th, 2008 at 10:46 pm):

    “Use of alternative filesystems such as reiserfs, xfs, and jfs have been discouraged through rumor, ego, and artificial restrictions”

    It was Hans Reiser himself who told me I shouldn’t use Reiser4 in a commercial webserver, and that I probably shouldn’t use ReiserFS either. This was a few years ago, when he was visiting a users group in Pasadena. Of course that same night, afterwards, he seemed very civil towards his wife.

    Now his filesystems appear to be as dead as his wife, and both are great losses.

    Hans Reiser Said on November 29th, 2008 at 2:50 pm:

    “So I killed my wife. What’s the big fucking deal?”

    The bfd, if I may respectfully submit, is that we’ve lost some good stuff because you’ve killed your wife. You lose, your kids lose, your family loses, her family loses, and we all lose continued development on ReiserFS and on Reiser4, all because you couldn’t accept that your wife fell out of love with you.

  20. Very informative article.

    PPPS
    You should probably learn how to use CSS to style your bullets instead of whining about it.

    Especially:
    list-style-type
    list-style-position
    padding
    margin

  21. Andrew, some of your facts are wrong or arguable. ZFS’s copy-on-write mantra means that the whole filesystem is somewhat log-structured. It has the same log-structured problems as LFS and WaFL with excessive fragmentation when more than ~80% full, or with lots of certain kinds of random-write to the middles of files. And this is the log I meant—it’s entirely dependent on the corruption resilience implied by this ordered pattern of writing-without-overwriting and the quick O(1) recovery step on import, and it DOES include an lfs-like O(1) log roll on import, and it ships with no fsck/recovery tool.

    It does not degrade to a read-only mode on certain kinds of errors—I’m not sure from what other blog you picked up that idea, but I think maybe you’re thinking of ext3, or maybe Solaris’s UFS.

    But in general, I think you are getting lost in bulleted-feature-list arcana. It doesn’t matter how many copies of some piece of metadata exist when the ZFS code which exists right now declares the whole pool corrupt and refuses to read any of the copies, which is what it often does. It’s also common to get into a situation where the pool seems to be working ok, reporting and recovering from errors, but if you reboot you’ll never be able to import that pool again. Given the ZFS format’s well-advertised sorts of spatial redundancy it would be possible to write a really aggressive fsck tool or copy-out forensic tool, but THAT TOOL DOES NOT EXIST RIGHT NOW, so it is dangerously misleading to write about ZFS’s resilience as if it can do anything the disk format could do assuming perfect software that we don’t have—we don’t approach anywhere near such software fantasies how many years (two, three, four?) after ZFS’s original release in Nevada.

    The simple user interface is also a liability in some situations. If you’ll have a look at the zfs-discuss list you’ll see a variety of problems—different kinds of operation are called “resilver”, not just device recovery; the output of ‘zpool status’ is often out-of-date or less informative than what you can get by observing ZFS’s internal status state machine indirectly; certain kinds of status like mirror dirtiness don’t survive reboot and should; it’s impossible to resilver two devices at once, which is a problem for really big pools or for certain kinds of pool gymnastics; someone has a pool with a device that can’t be replaced because he interrupted a device replacement halfway through; many different errors are compressed to the phrase “no valid replicas”; and the ONLINE/DEGRADED/FAULTED/OFFLINE statuses of devices are also compressed from a larger number of true internal statuses, hiding information in the name of simplicity.

    A few serious bugs are still open, and many serious bugs were fixed recently and are not backported to Solaris 10 stable version. There seems to be a lag of almost a full year in backporting Nevada/Opensolaris fixes to plain Solaris, and they’ve stated they will only backport fixes when a Sol10 contract-holder complains about that specific problem, not proactively.

    The original cause of the routine corruption from bouncing iSCSI targets is only speculatively explained and is *NOT* fixed. I mean, in that scenario, it’s (speculatively) a bad iSCSI stack causing the corruption, not necessarily ZFS bugs (maybe bugs in responding to the failure, like not resilvering enough, but maybe not), but what’s sure is that other filesystems consistently deal with it with more grace than ZFS, so ZFS needs more robustness here. This is not fixed. And there are probably other things making pools non-importable.

    Bob Paulson: I’m using SXCE (Solaris Express Community Edition), which is like OpenSolaris but is a larger DVD, is not redistributable, comes in a SPARC version, works with LiveUpgrade instead of IPS, and has source code for a smaller percentage of the binaries delivered than OpenSolaris. Your best bet might be Nexenta, because they roll their own stable releases outside Sun, they’re better than Sun at proactively backporting ZFS bug fixes from Nevada to their stable release, and you get most of the source. But I haven’t tried it myself, and I heard some goofy restrictiveness about the NexentaStor license that I don’t understand. ZFS is also in Mac OS X, FreeBSD, and Linux (FUSE), but I’m not sure it’s a good idea to use it there because of the number of bug fixes going in, such that even Solaris 10 ZFS is too old IMHO—you may rightly want the absolute latest version, for bug fixes not features, so SXCE, OpenSolaris, and Nexenta will all be newer than Mac/BSD/Linux.

    ZFS is not all bad, and I’m not telling people to stay away from it. My message is (1) backups, as in on another filesystem, are MANDATORY with ZFS, and (2) I’m tired of reading overblown hype chatter on blogs about ZFS, people with minimal experience telling each other how they heard it was supposed to behave so marvelously. Yes, ZFS still has hope, but it is currently right NOW much more corruption prone than UFS, ext3, XFS, and it is not getting more stable very quickly.

  22. Fran Taylor: you can’t, but what you can do is export ZFS contents to VMware ESX over either iSCSI or NFS. I’ve not done this with ESX myself, so I’m kind of breaking my own rule by writing about it. But part of the point is, you can have several ZFS servers and several ESX servers with a mesh of network switches in between, and you can migrate VMs among the ESX servers at will to balance the load. You can have storage-heavy guests and compute-heavy guests. The idea is to use ZFS in the same way you might use a NetApp/Hitachi/EMC SAN.

    Both iSCSI and NFS have some problems. The Solaris iSCSI target is very buggy, but sounds like it may be workable for some people on the list. One advantage over NFS is that you can export zvol’s directly to the VM guests, so you can use ZFS to snapshot and clone the guest volumes rather than ESX. ZFS might be faster at destroying snapshots than ESX, and it also has some primitive replication ability through ‘zfs send’ and ‘zfs recv’, but all this needs testing before you count on it.

    NFS is substantially less buggy than iSCSI, but it still has problems. I’ve had problems where one failed/failing zpool will lock up NFS service for the whole system, not just for the pool that’s failing.

    I think speed is not a clear win on either side. NFS might be about as fast as iSCSI for big files—it’s quite slow when there are lots of tiny files being opened and closed, but for big files, it sounds like sites are always migrating in one direction or another for who knows what reason. So few people use this stuff, you can’t really count on anything until you test it yourself.

    I’m not sure what Andrew is talking about with the “half the size” comment. ZFS can make sparse volumes if you use the iSCSI approach, where you get a volume filled with zeroes that takes no space from the pool, and it starts to take pool space as the guest writes things other than zeroes into it, but (a) VMware can do that natively/over-NFS with thin-provisioned volumes, analogous to the usual non-flat .vmdk’s in Workstation, (b) by default zvols get a reservation for their whole size, so they don’t really take up less space unless you disable this and allow overcommitting, and (c) this doesn’t mean less I/O—the zeroes don’t have to be read or written, whether you allocate space for them or not. Andrew might have been thinking of ZFS’s compression feature, but if this is okay with you, why not enable filesystem-level compression on the guest and save the ESX-to-ZFS bandwidth? There have been some reports on the list of choppy non-realtime-friendly behavior with ZFS’s gzip compression, and gradual slowdowns to virtual lockups over a few days. The less-tight lzjb compression has fewer problem reports, but I think both are not well-tested with ESX, so I’d wait for some real positive reports before counting on them to work well in a big VMware setup. What’s usually wanted for big VMware setups is deduplication, which ZFS doesn’t do yet.

    If you want desktop virtualization you can use VirtualBox instead of VMware, under SXCE. Just be sure to get a 64-bit CPU for the extra kernel address space, and loads of RAM.

  23. Nathan,

    I made brief mention of NILFS in my follow-up article (http://www.kev009.com/wp/2008/12/more-linux-file-systems/). The short version: it is a log-structured FS, for better or worse. It offers many of the features of ZFS (which I think is log-influenced) and Btrfs. You probably haven’t heard of it because the TODO list is quite long (nilfs.org) and the devs and NTT (the company behind it) haven’t been doing much of the social work required to get code reviewed, tested, improved, and ultimately merged upstream. According to the Wikipedia article, log-structured file systems should be best on SSD storage, where seeks are less of a problem.

    Indeed, NILFS could be a sleeper and add another contender to the ext4-Btrfs-Tux3 trifecta that is shaping up right now.

  24. How about AdvFS on Linux? It looks to be a good FS. Any idea what state it is in on Linux and what it would provide? It seems HP is open-sourcing it under the GPL.

  25. All the references I’ve seen to AdvFS stated that the code would be there to research and analyze, but I haven’t heard of a porting effort (correct me if I’m wrong). Under the GPL, any useful code could be moved into a current FS, but probably more important than the code would be patent indemnification. All in all it’s a noble move by HP, but I don’t think their goal was to get people to use AdvFS, rather just to offload the work so others can learn from it.
