Phoronix Benchmarking: Statistically Significant? And Other Performance Concerns

Phoronix has been cranking out a slew of benchmarks recently, using its own automated test suite to pit various Linux distros, and even different operating systems, against each other.

What I would like to know is... are they bullshit? I'm no statistician, yet the proximity of the numbers and the lack of error bars set off my own bullshit detector. See this URL for some background on statistical significance and error bars: http://www.graphpad.com/articles/errorbars.htm.
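To make the error-bar complaint concrete, here is a rough Python sketch of what I would want to see for any pair of results: the mean with its standard error, plus an exact permutation test on the difference of means. The timings are made-up numbers for illustration, not Phoronix data, and the function names are my own. It assumes each benchmark was run several times and that the runs are independent.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, stdev


def summarize(runs):
    """Return (mean, standard error of the mean) for repeated runs."""
    return mean(runs), stdev(runs) / sqrt(len(runs))


def permutation_p_value(a, b):
    """Exact two-sided permutation test on the difference of means.

    Enumerates every way of relabeling the pooled runs into groups of
    the original sizes and counts how often the relabeled difference
    is at least as large as the observed one.
    """
    pooled = list(a) + list(b)
    n_a, n_b = len(a), len(b)
    total = sum(pooled)
    observed = abs(mean(a) - mean(b))
    hits = 0
    count = 0
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        count += 1
        # Small epsilon so float rounding does not drop exact ties.
        if diff >= observed - 1e-12:
            hits += 1
    return hits / count


if __name__ == "__main__":
    # Hypothetical timings in seconds, five runs per distro.
    distro_a = [10.0, 10.1, 9.9, 10.2, 9.8]
    distro_b = [10.1, 9.9, 10.3, 10.0, 10.2]
    for name, runs in (("A", distro_a), ("B", distro_b)):
        m, se = summarize(runs)
        print(f"distro {name}: {m:.2f} +/- {se:.2f} s")
    print("p-value:", permutation_p_value(distro_a, distro_b))
```

With only a handful of runs the full enumeration is cheap (252 relabelings for five runs each). For larger samples a random subset of relabelings, or a Welch t-test, would be the usual substitute.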

I spot plenty of outlandish things, such as FPS benchmarks for file system tests and JDK version changes through product life cycles, not to mention the somewhat unfair comparison of crappy binary graphics drivers across versions. JDK 6 Update 10 (1.6.0_10) is SUPPOSED to be faster; results from benchmarking against it are meaningless unless the same JDK is run across all the versions being compared. Yes, GCC, glibc, and the kernel change between releases as well, but these are not typically components that a user can swap out as easily as a JVM, which should probably be bumped for security reasons anyways.

Furthermore, I realize all benchmarking should be taken with a grain of salt - one particular set of hardware and software will never map correctly to another set of hardware or software, but it should be possible to set up tests to gain some useful intelligence.

Can this kind of macro/micro benchmarking (depending on how you look at it) help weed out regressions? GCC 4.0 was noticeably slower on x86 than 3.x (see http://www.coyotegulch.com/reviews/gcc4/index.html and http://people.redhat.com/bkoz/benchmarks/). At the same time, I think PowerPC saw significant improvement due to auto-vectorization and use of AltiVec/VMX. GCC also seems to be improving over time: I've heard 4.4 is supposed to be much better thanks to a new register allocator (IRA). The compiler is probably the most important component of modern open source operating systems, so some of the blame might be placed here if the numbers have meaning.

All of this makes LLVM look more and more appealing. LLVM can optimize not only at compile time, but also at link time and run time. This is very attractive for commercial software, where the manufacturer hands you a binary blob that will likely not change through its lifetime. It also reminds me of Java and the speedups that come through JVM upgrades, except this should work for any language.

"LLVM is... designed to enable effective program optimization across the entire lifetime of a program. LLVM supports effective optimization at compile time, link-time (particularly interprocedural), run-time and offline (i.e., after software is installed), while remaining transparent to developers"

One thing the Phoronix numbers do show is that things seemed to go downhill around the time CFS (Completely Fair Scheduler), dyntick, and SLUB were merged.

Evgeniy Polyakov of POHMELFS fame raised the alarm over some fairly significant networking regressions ("How financial crisis affects tbench performance") that seem to support a general slowdown between 2.6.22 and 2.6.27. This resulted in noise on LKML, and hopefully we will see improvements soon.

I guess what I am getting at is that compute power is now so cheap that it seems stupid not to run automated tests against such things. Diego Petteno of Gentoo fame has been doing this recently with Gentoo's excellent build system. I have previously set up Hudson, a Java continuous integration system, to track regressions per commit, and such a system seems ideal for all modern software testing.
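As a rough idea of what such an automated check could look like, here is a minimal Python sketch that a CI job might run after each commit. Everything in it is an assumption for illustration: the function names, the 5% tolerance, and the use of the median to dampen outliers. It is not how Hudson or the Gentoo tooling actually works.

```python
import time
from statistics import median


def time_runs(workload, repeats=5):
    """Call `workload` several times; return wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        durations.append(time.perf_counter() - start)
    return durations


def relative_slowdown(baseline_runs, candidate_runs):
    """Fractional change in median time (0.10 means 10% slower)."""
    base = median(baseline_runs)
    return (median(candidate_runs) - base) / base


def is_regression(baseline_runs, candidate_runs, tolerance=0.05):
    """Flag the candidate when its median is more than `tolerance` slower."""
    return relative_slowdown(baseline_runs, candidate_runs) > tolerance


if __name__ == "__main__":
    # Hypothetical workload standing in for a real benchmark.
    def workload():
        sum(i * i for i in range(100_000))

    baseline = time_runs(workload)
    candidate = time_runs(workload)
    print("slowdown:", relative_slowdown(baseline, candidate))
    print("regression:", is_regression(baseline, candidate))
```

A fixed tolerance is crude. In practice the threshold would need to account for run-to-run noise on the test machine, which is exactly where the error bars come back in.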

Anyways, I am interested in hearing your thoughts on benchmarking, software testing, and automation, and how they can be used to improve modern software.
