FS-Cache merged in Kernel 2.6.30

FS-Cache has been merged into the upcoming kernel 2.6.30.  It provides a generic caching interface in the kernel that other file systems can use.  For example, you can use local hard disks to cache data accessed via NFS, AFS, or CD-ROM.  Since these tend to be high latency while local disks are low latency, it should provide a nice speedup.

I contacted the maintainer, David Howells of Red Hat, about a case of particular interest to me.  I asked whether this infrastructure would help with large disk image files stored on NFS, a common though not particularly efficient setup for VMware, Xen, KVM, etc.  His exact response was “Quite feasible.  As long as you have a local disk on which to cache the files.”

I am quite happy as I run this setup at work for some production VMs since it allows for easy migration and backup without the complexity and cost of a SAN or cluster FS.  I look forward to testing when 2.6.30 hits the stable tree.
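As a rough sketch of how the NFS case might be wired up, the two fragments below show a cache configuration for the userspace cachefilesd daemon and an fstab entry using the NFS "fsc" mount option.  The server name, export path, and mount point are made up for illustration, the cull thresholds are only example values, and all of this assumes your kernel is built with FS-Cache and NFS caching support and that your distro ships cachefilesd.

```text
# /etc/cachefilesd.conf (sketch)
# Directory on a local disk that holds the cached data
dir /var/cache/fscache
tag mycache
# Culling thresholds as a percentage of free blocks on the cache file system
brun 10%
bcull 7%
bstop 3%

# /etc/fstab (sketch; hypothetical server and paths)
# The fsc option asks the NFS client to use the local cache for this mount
nfs.example.com:/export/vmimages  /var/lib/xen/images  nfs  fsc  0 0
```

The cache only helps with reads that are repeated, so the first pass over a disk image still goes over the wire.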

Welcome back, Palm! The New Palm Pre

I’ve always been a fan of Palm hardware, and even had a soft spot for PalmOS until it was left to rot for so long.

Indeed, my Palm 800w is a fine piece of hardware even though it was quickly eclipsed by the Treo Pro, but Windows Mobile has been really painful. It is slow, bloated, and crashes frequently. It is not at all intuitive, and single-handed operation is out of the question. Worst of all, it lacks the useful, quality apps that were abundant on PalmOS.  I am a bit mad that I paid $500 for such a lemon.

Enter the Pre, which looks like just the thing to get me back on board with Palm and get them back in the game.

The hardware design looks beautiful, giving a large screen but keeping the tactile feel and speed of a keyboard (this is the iPhone’s biggest weak spot IMHO).  It looks like a typical form factor for this style of phone, similar to the new Blackberry, so nothing earth shattering there, but the OS looks top notch and is based on Linux.  I think if Palm is able to deliver, they will once again be competitive, and they will keep me as a customer.

Take a look at these links for a nice report:
http://arstechnica.com/news.ars/post/20090108-palm-launches-new-handset-pre-operating-system-at-ces.html
http://www.palm.com/us/products/phones/pre/index.html

I dream of pervasive virtualization…

I dream of a day where virtualization is pervasive.

Instead of thinking about services in terms of servers, CPUs or directly mapped resources, I should be able to add virtual machines in terms of guaranteed throughput rate over a whole grid.  Scaling out should be as easy as adding a blade or racking another server.

At the low level, I should have the option of running N+N redundancy.  That is, the VM should run in lockstep across multiple machines – so if it is running on 2 vcpus, 4 in total would be used.  This would allow for any node to fail.  And the VM should be an aggregate of the low level hardware – e.g. a VM grid across 4 8-core servers should scale near-linearly when a single OS instance is running 32 processes.

Current solutions only attempt to do some of the tasks above, and most fail miserably.  IBM mainframes have been doing it for ages.

If I had the time, I know I could build software to do this better than anyone else.  All the puzzle pieces are there, especially the tough ones like hypervisors and Infiniband.  This could have been done at least 3 years ago.  I bet it will take the industry 3-4 years yet to get anywhere close.

This is a real virtual datacenter.

Xen 3.3 in RHEL/CentOS 5 and more Link Aggregation Fun

RHEL 5 includes the now ancient Xen 3.0 hypervisor.  A lot has been improved since then, especially in the current 3.3 release.  Additionally, Red Hat now owns the company behind KVM, so it is unlikely they will spend much time backporting Xen stuff for RHEL 5.3 or the like.

Why Xen?

Xen is a proven hypervisor.  It works well on lots of hardware, including servers without hardware virtualization and older 64-bit Opterons that won’t run 64-bit guests in the likes of VMware.  Since the OS is usually paravirtualized, performance is top notch.  By making an OS aware of the environment it is running in, you can optimize it for virtualization.  KVM is playing catchup here, realizing that paravirtualization is still ideal for many things.

How..

Okay, so we are using or want to use Xen. Others have already built the packages we need, thankfully!

Head over to http://www.gitco.de/repo/ and grab the repo for your arch.  (Most likely wget http://www.gitco.de/repo/CentOS5-GITCO_x86_64.repo in /etc/yum.repos.d/ for the uninitiated).

If you already have Xen installed, you may need to remove and re-add it.

yum groupremove Virtualization
yum groupinstall Virtualization

You’ll also get some updated tools like Virtual Machine Monitor 0.6.0 that make it easier to install newer guests such as Fedora 10 or Ubuntu.  Sweet!

Double check /etc/sysconfig/kernel.  DEFAULTKERNEL should be set to kernel-xen.  Likewise, check /boot/grub/grub.conf and make sure that the Xen kernel is the default entry, in case the kernel was installed before that setting was in place.
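If you would rather not eyeball the GRUB config, something like the following could report which entry will boot by default.  This is only a sketch: the function name is made up, and it assumes a GRUB legacy style file with a "default=N" line and one "title" line per entry.

```shell
# Print the title of the entry that GRUB legacy will boot by default.
# Assumes entries are counted from 0 in the order their "title" lines appear.
grub_default_title() {
    awk '
        BEGIN { want = 0; n = 0 }
        /^[ \t]*default[ \t=]/ { sub(/^[ \t]*default[ \t=]+/, ""); want = $0 + 0 }
        /^[ \t]*title[ \t]/    { titles[n++] = $0 }
        END { t = titles[want]; sub(/^[ \t]*title[ \t]+/, "", t); print t }
    ' "$1"
}

# Usage:
#   grep '^DEFAULTKERNEL=' /etc/sysconfig/kernel
#   grub_default_title /boot/grub/grub.conf
```

If the title it prints does not mention xen, change the default= line to point at the Xen entry before rebooting.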

Reboot!

Xen 3.3 and Link Bonding

See my previous post for general information, but it gets harder.

This one is a nightmare.  In my previous post, I detailed how to get Xen to work with link aggregation with Xen 3.0.  Well, it doesn’t work in 3.3.  Xen decides that it still owns eth0 and completely destroys your bond0 setup.

Like these people, I’ve come to the conclusion that the integrated network scripts suck.  This is alarming since you’d think link bonded setups would be the norm for Xen setups.

The quick fix is to let the OS handle networking.  We do that like so: add a br0 interface and tell the bond to bridge with it.

File /etc/sysconfig/network-scripts/ifcfg-br0

DEVICE=br0
ONBOOT=yes
BOOTPROTO=none
IPADDR=10.0.6.201
NETMASK=255.255.255.0
GATEWAY=10.0.6.1
NO_ALIASROUTING=yes
TYPE=Bridge

Then, edit your /etc/sysconfig/network-scripts/ifcfg-bond0, add “BRIDGE=br0”, and comment out any IP-related information (since you are now defining that in the bridge).  Head over to /etc/sysctl.conf and add:

net.ipv4.ip_forward = 1
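For reference, an edited ifcfg-bond0 might end up looking roughly like the sketch below.  The BONDING_OPTS line is an assumption on my part (802.3ad mode with a 100 ms link monitor); some setups carry the bonding options in /etc/modprobe.conf instead, so keep whatever your working bond already uses.  The commented-out addresses simply mirror the ones moved into ifcfg-br0 above.

```shell
# /etc/sysconfig/network-scripts/ifcfg-bond0 (sketch)
DEVICE=bond0
ONBOOT=yes
BOOTPROTO=none
# Assumed bonding options; keep whatever your existing bond uses
BONDING_OPTS="mode=802.3ad miimon=100"
# IP settings now live in ifcfg-br0, so they are commented out here
#IPADDR=10.0.6.201
#NETMASK=255.255.255.0
#GATEWAY=10.0.6.1
# Attach the bond to the bridge
BRIDGE=br0
```

Running sysctl -p as root applies the /etc/sysctl.conf change without waiting for the reboot.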

Now, edit your Xen VMs in /etc/xen/ or /etc/xen/auto and change xenbr0 to br0:

vif = [ 'mac=ee:cc:aa:88:66:44, bridge=br0', ]

Okay, now disable the Xen networking garbage.  Open /etc/xen/xend-config.sxp and comment out anything that looks like (network-script ….).
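If you have a few boxes to do, the commenting-out can be scripted.  The helper below is a hypothetical sketch (the function name is mine): it prefixes every active (network-script ...) line with # and leaves a .bak copy of the original file next to it.

```shell
# Comment out every active (network-script ...) line in a xend config file.
# Lines that are already commented, and other directives, are left alone.
disable_xen_network_script() {
    sed -i.bak 's/^[[:space:]]*(network-script /#&/' "$1"
}

# Usage (as root):
#   disable_xen_network_script /etc/xen/xend-config.sxp
```

Note that this deliberately does not touch the (vif-script ...) line, which is what attaches guest interfaces to the bridge.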

Almost done, but wait!  RHEL 5.2 has a bug that prevents the bridge coming up on a bonded interface.  Hopefully this will make the 5.3 cut or be pushed to 5.2, but until then go here.  Download the new patch into /etc/sysconfig/network-scripts/ and run patch -p0 < ifup-eth.patch for instance.

Finish

Reboot.  You now have Xen 3.3 goodness on a big Ethernet channel!  Post a comment if you have any trouble or questions.

Link Bonding Craziness in RHEL/CentOS 5

I just went through hell in a handbasket trying to get 802.3ad link aggregation set up on a CentOS 5.2 Xen box.  Setting up link aggregation itself isn’t that bad (see http://wiki.centos.org/TipsAndTricks/BondingInterfaces for a simple guide, once your managed switch is configured), but whatever I did, I was unable to get both interfaces simultaneously active.

About the only useful debugging info I got was that the MAC was in use.  I was puzzled because as far as I know, link agg takes over the primary MAC and sets that up for both NICs.  Furthermore, the same exact hardware was working great on Fedora 10.

bonding: bond0: Warning: the permanent HWaddr of eth0
 - [MAC ADDR]- is still in use by bond0. Set the HWaddr of eth0
to a different address to avoid conflicts.
bonding: bond0: releasing active interface eth0
bonding: bond0: making interface eth1 the new active one.
bonding: bond0: Removing slave eth1
bonding: bond0: releasing active interface eth1
ADDRCONF(NETDEV_UP): bond0: link is not ready
bonding: bond0: Adding slave eth0.

Luckily, I stumbled across this bug report.  If you scroll down to the last comment, this appears to be a Xen specific issue.  By default Xen tries to set its bridge up on eth0, and I assume this prevents the kernel bonding driver from taking over the NIC.  By opening up /etc/xen/xend-config.sxp and adding:

(network-script 'network-bridge netdev=bond0')

Xen will bridge to the bond0 interface, and everything will work as expected.

Another trick I had to do was add a start delay to the networking scripts.  This is useful if your hardware is crap (cough Broadcom), you need a DHCP lease and it fails, or you are running STP, link aggregation, etc.  On Fedora, RHEL, and derivatives this is accomplished by adding the NETWORKDELAY directive to /etc/sysconfig/network:

NETWORKING=yes
NETWORKDELAY=31

If you need more granularity, you can set delays for specific adapters in the /etc/sysconfig/network-scripts/ifcfg-{x} files with the LINKDELAY directive.
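A per-adapter delay might look something like this.  Only the LINKDELAY line is the point; the other settings and the 10 second value are illustrative assumptions, not something I measured.

```shell
# /etc/sysconfig/network-scripts/ifcfg-eth0 (sketch)
DEVICE=eth0
ONBOOT=yes
BOOTPROTO=dhcp
# Seconds to wait for link negotiation before configuring this interface
LINKDELAY=10
```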

Just a couple of hard lessons from the trenches; hopefully this will save someone else some time.

USB 3.0 on Linux

Take a look at this Intel developer’s blog:

http://sarah.thesharps.us/2008-12-07-13-35.cherry

The video shows a USB flash drive transferring video at 125MB/s!  To give you an idea of this speed, it is likely more than your hard disk can put out, which is probably around 80MB/s.  The rest of the article gives a good overview of USB 3.0 and she states that the bus should have about 400MB/s bandwidth in the real world.  This is a breath of fresh air for external devices of all sorts, and I can’t wait to see this on new computers.
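To put those rates in perspective, here is a quick back-of-the-envelope calculation.  The 4000 MB file size is just an example I picked (roughly a DVD image), and the function name is made up.

```shell
# Seconds needed to move SIZE_MB megabytes at RATE_MBPS megabytes per second.
transfer_seconds() {
    awk -v size="$1" -v rate="$2" 'BEGIN { printf "%.1f\n", size / rate }'
}

transfer_seconds 4000 80    # typical hard disk figure from above
transfer_seconds 4000 125   # the flash drive in the video
transfer_seconds 4000 400   # projected real-world bus bandwidth
```

That works out to 50, 32, and 10 seconds respectively, assuming the device on the other end can keep up.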

The Linux drivers are currently under development, with the subsystem patches going into review.  The xHCI driver will have to wait until Intel finalizes and releases the specs.  Anyways, it’s safe to say that Linux should have USB 3.0 support as soon as products hit the market in the middle of 2009.

Retrocomputing for Fun and Profit

  1. Buy Old Computers
  2. ???
  3. Profit

What is retrocomputing?

I define retrocomputing [wikipedia] as the collecting and use of old computers.  Why might one do this?  Well, for one, enterprises cycle out machines fairly frequently.  Two- to five-year-old systems are often sent out to scrappers in droves despite still being plenty useful.  Systems that were top of the line for large companies often have more than enough power for small and medium sized ones, at pennies on the dollar compared to new hardware.  These machines are likely complete overkill for home use, but nonetheless are very useful for fun and learning.

IBM mainframe ops in the 1980s

Why?!

A lot of what I know about computers has been learned on old machines.  Hooking up a couple of servers and desktops and trying to make something useful out of them is a great exercise for the aspiring system administrator.  With open source software, it can all be done freely and easily.

Yes, you can run Linux, BSD, and Solaris from the comfort of your Windows desktop in a virtual machine (weak sauce…).  Yet there is something much different when you cluster several high technology servers together, tethered to a Fibre Channel storage array, and have them share a single distributed file system.  The knowledge of setup, installation, and troubleshooting I’ve gained from mock scenarios like this is beyond that of anyone else I’ve ever met.  Breaking things here usually means digging deep and fixing it.  If you were to screw something up at work like some of the things I’ve gotten into, it would probably cost you your job.

BENCHNET - where I rip into computers that cost as much as a house and my "production" rack

Retrocomputing is also fun.  I am personally into old IBM hardware, though old UNIX workstations of all sorts are interesting to me.  You can see my collection of IBM PS/2 and RS/6000 knowledge here: http://ps-2.kev009.com:8081/.  There is a particular thrill to booting up a machine that cost between $20,000 and $50,000 10 years ago.  Knowing that these same machine models were used to design the Boeing 777, composed the famous Deep Blue machine, and were used in the largest automotive and shipbuilding firms, not to mention some of the most important spacecraft to date, also brings a sense of power and nostalgia.  In some ways it’s similar to having a classic car, but different.  Maybe if that classic car was a big ass bulldozer, tank, jet or some other well engineered piece of equipment :-P.

Some old systems I had at one time or another.  Left to right: IBM PS/2e (first "green" environmental PC), RS/6000 43p (7043-140), Apple PowerMac 7100/80, RS/6000 7006-42W, RS/6000 7012-397, HP Visualize c360 (PA-RISC)


Nostalgia is one of the biggest things I get out of using particularly old hardware.  I missed the mainframe days, the minicomputer days, the PC and DOS days, the Apple II days (well, actually I used these a bit at a very young age), and to a degree the early Windows days.  Just like a history class, studying these old machines gives me insight as to why things are done the way they are today.  It gives me appreciation for modern systems and makes me write clean and well optimized code.  The old computer games that captivated me as a child (Sim City, Sim Tower, Sim Ant, Sim Farm, Gizmos and Gadgets, The Incredible Machine, Oregon Trail etc.) implanted a high degree of logic and understanding at a young age and it is heartwarming to revisit these.  I grew up a Mac user as well, so seeing what I was (or was not :>) missing on PCs is also interesting.

Old MIPS UNIX server booting and logging in

Some of the benefits of retrocomputing:

  • Enterprise class hardware
  • Cheap, possibly even free
  • Different design philosophies – not everything is x86 – a lot of this gear is quite different.  For example, UNIX workstations integrated most of what we enjoy on our PCs years before it became available to consumers.  SGI machines were doing A/V and 3D in the early 90s.  IBM midrange AS/400s have an advanced integrated database, programming languages, and environment that make PCs look like a joke for business programming.  WinFS, Object Storage Devices, etc are just now being talked about for PCs.  The channel philosophy from mainframes is still pretty new to PC servers (fibre channel), not to mention virtualization.
  • If you break it, you can fix it and learn from it or toss it
  • The engineering and craftsmanship in some of these systems is downright astonishing
  • Old computers are works of art: they give you a window into the technology and culture of times past
  • You should never trust a computer you can’t lift

It is interesting that we as humans produce such elaborate machines, only to discard them as scrap a few years later.  It is humbling and shows you the incredible progress we are making.

How?

eBay is your friend, but also look for local scrapyards or businesses doing overhauls.

If you are faint of heart, plenty of good abandonware sites exist for games and operating systems that can be run on emulators or VMs.  Check out this IBM mainframe emulator, Hercules.  Some of the original IBM OSes are public domain.

If you don’t want old PCs and big iron overtaking your house, there is plenty of good material on YouTube as well.  The Computer Museum is a good start.  Some of the consoles, offices, and outfits are hilarious.

Old SGI tech demo – pretty impressive!

Phoronix Benchmarking.. Statistically Significant? and Other Performance Concerns

Phoronix has been cranking out a slew of benchmarks recently, pitting various different Linux distros against each other and even different operating systems with their own automated test suite.

  1. Ubuntu 7.04 to 8.10 Benchmarks
  2. Mac OS 10.5 vs. Ubuntu 8.10 Benchmarks
  3. Ubuntu vs. OpenSolaris vs. FreeBSD benchmarks
  4. Fedora 10 vs Ubuntu 8.10 Benchmarks
  5. “Real World” Benchmarks of the [sic x2...]EXT4 File-System
  6. OpenSolaris 2008.05 vs. 2008.11 Benchmarks

What I would like to know is… are they bullshit?  I’m no statistician, yet the proximity of the numbers and lack of error bars raise my own bullshit detection meter.  See this URL for some background on statistical significance and error bars: http://www.graphpad.com/articles/errorbars.htm.
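To make the error bar complaint concrete, here is a minimal sketch of the kind of summary I would want to see next to each number: run the benchmark several times, then report the mean, the sample standard deviation, and the standard error.  The function name and the three sample values in the usage line are made up for illustration, not taken from Phoronix.

```shell
# Read one benchmark result per line on stdin.
# Prints: number of runs, mean, sample standard deviation, standard error.
bench_stats() {
    awk '
        NF { n++; sum += $1; sumsq += $1 * $1 }
        END {
            if (n < 2) { print "need at least two runs"; exit 1 }
            mean = sum / n
            var  = (sumsq - n * mean * mean) / (n - 1)   # sample variance
            if (var < 0) var = 0                          # guard against rounding
            sd = sqrt(var)
            printf "%d %.2f %.2f %.2f\n", n, mean, sd, sd / sqrt(n)
        }
    '
}

# Usage with three hypothetical run times in seconds:
#   printf '10\n12\n14\n' | bench_stats
```

If the gap between two distros is smaller than the spread within one distro's own runs, I would treat the "winner" as noise.  That is a rough first check, not a formal significance test.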

I spot plenty of outlandish things, such as FPS benchmarks in file system tests and JDK version changes through product life cycles, not to mention the unfairness of comparing crappy binary graphics drivers across versions.  JDK 1.6.10 is SUPPOSED to be faster; results from benchmarking against it are insignificant unless it is run across all versions.  Yes, GCC, glibc, and the kernel change between releases as well, but these are not typically components that a user can swap out as easily as a JVM, which should probably be bumped for security reasons anyways.

Furthermore, I realize all benchmarking should be taken with a grain of salt – one particular set of hardware and software will never map correctly to another set of hardware or software, but it should be possible to set up tests to gain some useful intelligence.

Can this kind of macro/micro benchmarking (depending on how you look at it) help weed out regressions?  GCC 4.0 was noticeably slower on x86 than 3.x (see: http://www.coyotegulch.com/reviews/gcc4/index.html, http://people.redhat.com/bkoz/benchmarks/).  At the same time I think PowerPC saw significant improvement due to auto-vectorization and use of AltiVec/VMX.  GCC also seems to be improving over time; I’ve heard 4.4 is supposed to be much better with a new register allocator (IRA).  The compiler is probably the most important component of modern open source operating systems, so some of the blame might be placed here if the numbers have meaning.

All of this makes LLVM look more and more appealing.  LLVM is able to do not only compile time, but also link and run time optimization.  This is very appealing for commercial software where you are given a binary blob by the manufacturer that will likely not change through its lifetime.  It also reminds me of Java and speedup through JVM upgrades, except this should work on any language.

“LLVM is… designed to enable effective program optimization across the entire lifetime of a program. LLVM supports effective optimization at compile time, link-time (particularly interprocedural), run-time and offline (i.e., after software is installed), while remaining transparent to developers”

One thing the Phoronix numbers do show is that things seemed to go downhill coincident with CFS (Completely Fair Scheduler), dyntick, and SLUB merging as well.

Evgeniy Polyakov of POHMELFS fame raised the alarm with some fairly significant networking regressions – how financial crisis affects tbench performance – that seem to support a general slowdown between 2.6.22 and 2.6.27.  This resulted in noise on LKML and hopefully we will see improvements soon.

I guess what I am getting at is that compute power is so cheap that it seems stupid to not have automated tests against such things these days.  Diego Petteno of Gentoo fame has been doing such things recently with Gentoo’s excellent build system.  I have set up Hudson, a Java Continuous Integration system, before to track commit regressions and such a system seems ideal for all modern software testing.
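The core of such an automated check can be tiny.  As a sketch, assuming your test job produces a single "lower is better" number and you keep a baseline from a known good build, a CI step could call something like the hypothetical helper below and fail the build on a non-zero exit.  The function name and the 5% threshold in the usage line are my own choices.

```shell
# Compare a new result against a baseline (both "lower is better").
# Prints the percentage change; exits non-zero if NEW is more than
# PCT percent slower than BASE.
check_regression() {
    awk -v base="$1" -v new="$2" -v pct="$3" 'BEGIN {
        delta = (new - base) / base * 100
        if (delta > pct) { printf "REGRESSION %+.1f%%\n", delta; exit 1 }
        printf "ok %+.1f%%\n", delta
    }'
}

# Usage: baseline 100 s, new build 110 s, tolerate up to 5% slowdown
#   check_regression 100 110 5
```

In practice you would feed it averaged results from several runs rather than a single noisy measurement.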

Anyways, I am interested in hearing your thoughts on benchmarking, software testing, and automation and how it can be used to improve modern software.

More Linux File Systems

It seems I caught the wave of interest in Linux file systems.  Here are some articles worth checking out:

Of course, there are some other file systems that I haven’t talked about that came up in the comments of my last post.  Most of these are special purpose, still on the fringes, or different in scope than the first article which was about local storage.

  • The native flash file systems: UBIFS and LogFS.  In theory file systems like these would be ideal for SSDs, but we need manufacturers to stop putting FAT/hard disk emulation and wear leveling into their drives.  Windows is the culprit for this.  These file systems also seem keyed toward embedded device flash memories at the moment and not general purpose storage.  Neither is upstream yet.  This could be positive, as the disk format and code can be readily changed to make them compatible with future SSDs if there is a change in manufacturing.  Val Henson, a Linux file system authority, has some interesting thoughts.
  • The Log file systems: LogFS (flash centric, see above), NILFS.  I believe ZFS and Tux3 share design philosophy from these.  The idea has been around for a long time but none have ever really succeeded.
  • The shared disk and distributed parallel file systems: OCFS2, Lustre, GFS, PVFS.  There is a laundry list of these.  Some implement entire disk file systems, while others add clustering or distributing properties to other file systems.  SUSE [link] and I think Red Hat even considered using one as the default file system but booting (Grub) is an issue.
  • Network file systems: these add networking to other disk file systems. POHMELFS and CRFS (distributed extension for Btrfs) are interesting new ones here.  Of course there is a laundry list of network file systems for Linux.  NFS, AFS, and CIFS are the old timers.
  • The others: These are more experimental or research oriented.  chunkfs, spadfs, and many more.  Many others just don’t have steam behind them yet or are dead in the water.

My overview is clearly Linux oriented, though I mentioned ZFS in passing because I think it spurred a lot of these recent developments.  That isn’t to say the BSDs are sitting still: DragonFly BSD has HAMMER, FreeBSD is keeping UFS2 moving forward, and NetBSD has LFS, a log file system.

P.S.:

I plan on performing some benchmarking soon after 2.6.28 goes stable.  The list will include ext2/3/4, JFS, XFS, Reiser3 and Btrfs.  The setup will include single and multi-disk configs.  If you have any requests or suggestions for setup, please contact me.

My thoughts on software and complexity

My thoughts on the growth of the Linux kernel and the status quo of using and developing software..

Prompted by discussion of this article: 1986 Mac Plus Vs. 2007 AMD DualCore. You Won’t Believe Who Wins

[Ed: My response to accusations of Linux Kernel bloat]

The [Linux] kernel never has really been the problem. In 1 to 2 MB of compressed/compiled code on my computer (gentoo-sources + my custom .config and a couple of patch sets from future merges), there is some of the most advanced file system, networking, protocol, hardware and scheduling code ever conceived. Indeed, there are many areas that need work and are constantly being updated, but find me a kernel that supports NUMA, scales quite linearly with SMP, implements fair queuing of IO and CPU scheduling, has NO tick interval, offers virtualization, and supports a wide gamut of platforms and hardware. It runs on systems as small as a microcontroller and as large as BlueGene/L. Did I mention it is free and I can learn from and hack on it?

The kernel isn’t really expanding at a rate to be concerned with, because only a small subset ends up being needed for most users and systems. No, the problem really lies in user space on UNIX systems. Modern UNIX userland involves many many layers of programs interacting and building on top of each other. I really don’t see it getting better in the future either. As higher and higher levels of programming languages are being used, more and more layers are added to the onion. This can make a programmer’s life easier and allows more complex systems to be designed, but there are many drawbacks as well. Bug creep, feature creep, usability, complexity, and resource usage all come to mind.

Do I know the answer? Not at all. I don’t think there is one. Software will develop organically in the wake of hardware progress for the foreseeable future. If and when this progress slows, perhaps things will change course. A sea change of compiler optimization, small is beautiful engineering, and an emphasis on efficiency..