Phoronix Benchmarking.. Statistically Significant? and Other Performance Concerns

Phoronix has been cranking out a slew of benchmarks recently with its own automated test suite, pitting various Linux distros, and even different operating systems, against each other.

  1. Ubuntu 7.04 to 8.10 Benchmarks
  2. Mac OS 10.5 vs. Ubuntu 8.10 Benchmarks
  3. Ubuntu vs. OpenSolaris vs. FreeBSD benchmarks
  4. Fedora 10 vs Ubuntu 8.10 Benchmarks
  5. “Real World” Benchmarks of the [sic x2...]EXT4 File-System
  6. OpenSolaris 2008.05 vs. 2008.11 Benchmarks

What I would like to know is… are they bullshit?  I’m no statistician, yet the proximity of the numbers and the lack of error bars set off my bullshit detector.  See this URL for some background on statistical significance and error bars: http://www.graphpad.com/articles/errorbars.htm.
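To make the error bar concern concrete, here is a minimal Python sketch of how repeated runs could be summarized as a mean with a standard error, and how two result sets could be screened for overlapping error bars. The function names and the timings are made up for illustration and are not Phoronix data. Overlap of standard error bars is only a rough screen, not a proper significance test, and three runs per system is a very small sample.

```python
import math
import statistics


def summarize(runs):
    """Return (mean, standard error of the mean) for repeated benchmark runs."""
    mean = statistics.mean(runs)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    sem = statistics.stdev(runs) / math.sqrt(len(runs))
    return mean, sem


def error_bars_overlap(runs_a, runs_b):
    """True when the mean +/- SEM intervals of the two result sets overlap."""
    mean_a, sem_a = summarize(runs_a)
    mean_b, sem_b = summarize(runs_b)
    return abs(mean_a - mean_b) <= sem_a + sem_b


# Hypothetical timings in seconds, invented for illustration only.
distro_a = [41.2, 40.8, 42.1]
distro_b = [41.9, 40.5, 42.6]
distro_c = [35.0, 35.4, 34.9]

for name, runs in (("A", distro_a), ("B", distro_b), ("C", distro_c)):
    mean, sem = summarize(runs)
    print(f"distro {name}: {mean:.2f} s +/- {sem:.2f}")

print("A vs B bars overlap:", error_bars_overlap(distro_a, distro_b))
print("A vs C bars overlap:", error_bars_overlap(distro_a, distro_c))
```

With numbers like these, A and B would be too close to call from three runs each, while A and C would at least deserve a closer look. A chart without the bars hides that distinction.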

I spot plenty of outlandish things, such as FPS benchmarks for file system tests and JDK version changes through product life cycles, not to mention the unfairness of comparing crappy binary graphics drivers across versions.  JDK 1.6.10 is SUPPOSED to be faster; results from benchmarking against it are meaningless unless the same JDK is run across all versions.  Yes, GCC, glibc, and the kernel change between releases as well, but these are not components that a user can typically swap out as easily as a JVM, which should probably be bumped for security reasons anyway.

Furthermore, I realize all benchmarking should be taken with a grain of salt – one particular set of hardware and software will never map correctly to another set of hardware or software, but it should be possible to set up tests to gain some useful intelligence.

Can this kind of macro/micro benchmarking (depending on how you look at it) help weed out regressions?  GCC 4.0 was noticeably slower on x86 than 3.x (see: http://www.coyotegulch.com/reviews/gcc4/index.html, http://people.redhat.com/bkoz/benchmarks/).  At the same time, I think PowerPC saw significant improvement due to auto-vectorization and use of AltiVec/VMX.  GCC also seems to be improving over time; I’ve heard 4.4 is supposed to be much better with a new register allocator (IRA).  The compiler is probably the most important component of modern open source operating systems, so some of the blame might be placed here if the numbers have meaning.

All of this makes LLVM look more and more appealing.  LLVM is able to do not only compile time, but also link and run time optimization.  This is very appealing for commercial software where you are given a binary blob by the manufacturer that likely will not change through its lifetime.  It also reminds me of Java and speedup through JVM upgrades, except this should work on any language.

“LLVM is… designed to enable effective program optimization across the entire lifetime of a program. LLVM supports effective optimization at compile time, link-time (particularly interprocedural), run-time and offline (i.e., after software is installed), while remaining transparent to developers”

One thing the Phoronix numbers do show is that things seemed to go downhill around the time CFS (Completely Fair Scheduler), dyntick, and SLUB were merged.

Evgeniy Polyakov of POHMELFS fame raised the alarm with some fairly significant networking regressions, described in “how financial crisis affects tbench performance”, that seem to support a general slowdown between 2.6.22 and 2.6.27.  This resulted in noise on LKML and hopefully we will see improvements soon.

I guess what I am getting at is that compute power is so cheap that it seems stupid to not have automated tests against such things these days.  Diego Petteno of Gentoo fame has been doing such things recently with Gentoo’s excellent build system.  I have set up Hudson, a Java Continuous Integration system, before to track commit regressions and such a system seems ideal for all modern software testing.
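As a rough sketch of what such an automated check could look like, the Python below compares the median of a new set of timings against a stored baseline and turns the result into an exit code that a CI job could use as pass/fail. The function names, the 5% tolerance, and the timings are all assumptions chosen for illustration; a real setup would need a threshold tuned to the noise of the particular benchmark.

```python
import statistics


def is_regression(baseline_runs, candidate_runs, tolerance=0.05):
    """Flag a slowdown when the candidate median exceeds the baseline
    median by more than `tolerance` (timings: lower is better)."""
    baseline = statistics.median(baseline_runs)
    candidate = statistics.median(candidate_runs)
    return candidate > baseline * (1.0 + tolerance)


def exit_code(baseline_runs, candidate_runs, tolerance=0.05):
    """Map the check to a process exit status: 0 passes, 1 fails."""
    return 1 if is_regression(baseline_runs, candidate_runs, tolerance) else 0


# Hypothetical timings in seconds, invented for illustration only.
baseline = [10.0, 10.2, 9.9]
same_speed = [10.1, 10.3, 10.0]
slower = [11.0, 11.4, 11.1]

print("same_speed exit code:", exit_code(baseline, same_speed))
print("slower exit code:", exit_code(baseline, slower))
```

A build step that ends with a non-zero exit status is normally reported as a failed build, so a script along these lines could run after every commit and flag the one that made things slower.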

Anyways, I am interested in hearing your thoughts on benchmarking, software testing, and automation and how it can be used to improve modern software.

7 thoughts on “Phoronix Benchmarking.. Statistically Significant? and Other Performance Concerns”

  1. Hi Kevin,

    As an FYI, with all Phoronix test results (and any other Phoronix Test Suite driven tests you see on Phoronix Global or elsewhere), if they were generated following Phoronix Test Suite 1.6.0 Alpha 2, it’s possible to generate the error bars yourself.

    To do this, find the corresponding Phoronix Global entry, such as http://global.phoronix-test-suite.com/?k=profile&u=michael-366-4219-4691

    With the Phoronix Test Suite client you would just run:

    phoronix-test-suite clone michael-366-4219-4691
    phoronix-test-suite analyze-all-runs michael-366-4219-4691

    And from there it will show you a candlestick chart with information from all runs, etc. It’s possible to also run other commands and look at the results in different ways.

    Michael

  2. I share your skepticism of the Phoronix benchmarks. None of them ever seem to provide any useful data and they are far from scientific.

  3. I agree all benchmarks are to be taken with a grain of salt. I’m sure the metrics used in any benchmark can be improved. However, I have never seen a benchmark methodology void of criticism. The question is, if all benchmarks are flawed, then are they useful? Generally yes. Especially when they are countered with other well thought out benchmark tests.

  4. I am a psychologist so I am very well versed in statistics; I also use various linux distros as a primary desktop. Many of the benchmarking competitions you have listed are quite interesting. You are correct though in allowing your “bullshit detector” to take over. However, the small sample sizes are problematic when conducting any parametric statistical analysis in these cases. If Phoronix would conduct the same tests multiple times for each distro and post his data, then determining statistical significance would also be more meaningful and accurate.

    But, there is one thing to keep in mind: What does it mean to obtain statistical significance when comparing two or more OSes? That is, what does a significant finding really mean to us? Not much to be honest. All of these distros are so similar that everyone still decides based on their own preferences and personal biases.

  5. @David, I think you and I come to the same conclusion. Cross platform benchmarks mean little to the average user. Encoding an MP3 one second faster will never be noticed.

    Yet, if deleting an 8GB DVD image takes 45 seconds longer on one file system compared to another, it would factor in which FS to use on a media drive for instance.

    Showing Ubuntu 7.04 as quite a bit faster than later releases seems a bit comical without analysis. Analysis of the whole stack isn’t trivial though. Hardware performance counters will probably be helpful – http://lkml.org/lkml/2008/12/4/401 – funny Linux is just now getting this feature

    @David and NG
    With careful analysis as the tbench regressions showed, we can at least catch performance regressions and possibly other problems. git bisect seems to be excellent here.

    There are also one-off cases where an administrator might have two equal solutions for a problem. For example, let’s say Linux and FreeBSD both support all of your routing requirements. Doing some synthetic transfer comparisons to see which can pass more packets per second (pps) would have meaning here for both hardware sizing and OS selection.

    @Michael
    Thanks for the clarification. How come you don’t publish these charts as the default?

  6. kev009:

    They aren’t published by default since there haven’t been enough requests (at least not yet) for having error bars shown within each Phoronix article.

    As another clarification, the Phoronix Test Suite normally runs each test between three and five times (which are all archived within the XML results file) before averaging the final results.

  7. In my view Phoronix looks in the wrong place most of the time. Without wanting to say that they don’t do a great job running benchmarks, the benchmarks most of the time show misleading things, and that applies to most of the industry. In many cases they compare updated software, so the graphics driver, compiler improvements, task scheduler, and disk I/O performance are all benchmarked at once. But those give the wrong picture of the software. I care only a little that Mac OS X 10.6 adds some speed improvements; by and large I just want all software to work as expected. A quick boot time does more to make a system feel fast to the user than a 15% difference in overall system speed.
    Raw numbers mean close to nothing in any field that is not a specific server workload, in my view, and this is where Phoronix largely fails. I’m a developer in the .NET world, and the biggest concern there is how to make the application “behave fast” rather than reach mind-blowing raw speeds.
