Archive for the ‘Programming’ Category

boo2pdf Update

Friday, November 27th, 2009

I did some minor updates to boo2pdf. Graphics should now be within the page margins.  Please let me know if there are any other common formatting mistakes.

Unfortunately, IBM’s “transmogrifier” utility doesn’t work very well in Wine, so you should preprocess older books in Windows before running them through the boo2pdf web service (download is on that page).

Source code is now available from boo2pdf gitweb.

Announcing boo2pdf

Tuesday, October 20th, 2009

I’ve just uploaded a beta of boo2pdf, an IBM BookManager to PDF conversion app & web service. I’m currently experimenting with the HTML to PDF backends and would like feedback with book files I haven’t tried. Once the code is cleaned up, I will dump it on my site.

Motivation

I have a large collection of old IBM machines and documentation. I want this documentation indexed by my own search facilities and Google for easy retrieval. PDF is widely read, while BookManager requires proprietary software and no search engines I know of parse it. This will probably be useful to Mainframers as well.

Take the web service for a spin here:

http://ps-2.kev009.com:8081/boo2pdf/.

Java: The Good Parts

Wednesday, October 14th, 2009

A while back, a book entitled JavaScript: The Good Parts made waves on the internet, especially on social networking sites.  This book purported to show the inner beauty of a language long considered second or third rate, now coming of age.  With the advent of toolkits like jQuery, JavaScript/AJAX development has become easy and even fun.

I aim to do the same by showing “Java: The Good Parts” here at a high level.

When I was younger, I used to despise Java for political reasons and bad memories of early applets and applications.  I suspect many users and developers (especially Libre software devs!) are in the same boat.  By the end of this article, I hope to have swayed your opinion or at least caused you to reevaluate your bias.  I also wish to encourage further discussion about these points and ways we can improve any deficiencies.

Rough and Tumble Upbringing

When Java first started gaining popularity, it was loudly hyped as the be-all and end-all language.  It was expected that Java would take the “rich client” by storm, and applets would be the go-to solution for enhancing web pages.  What happened was a bit different.  Java floundered and struggled to find a niche.  On the client side, AWT apps looked horrendous despite using native widgets.  Then Swing came about and, despite easing development, it looked equally bad on all platforms (by default).  Applets were basically a stillbirth.  The ugly gray box, load times sometimes measured in minutes, and no coordination with the DOM and web browser made the average user hate Java.

One area where Java was able to develop and secure a foundation, however, was the back end of large web applications.  The Virtual Machine approach provided a marked advantage over the CGI and interpreted scripts of the day.  Java’s rich networking libraries, clean Object Oriented design, and safety made this the language du jour for large web applications.

Open Source Matters

In my opinion, the open sourcing of Java during its early infancy would have had little impact on most of the teething issues.  The Virtual Machine, JIT, and Garbage Collection required many years of tuning to get acceptable performance and Sun did an acceptable job keeping it under wing.  The relatively limited CPU and RAM of the mid ’90s also made these concepts a bit ahead of their time.  Somewhere in the 1999-2002 time frame, though, Sun really dropped the ball.  An Open Source Java would have led to ubiquity on the booming Linux platform and a chance for all sorts of cross-platform software.

Open Source matters, and not just for the source code.  Open Source projects naturally bring about very pragmatic and intelligent developers.  These are the folks that thoroughly enjoy their hobby, work, and tools.  The marketing guys and pointy-haired bosses have much less pull here.  On one hand, a vibrant community built itself with the many Apache Software Foundation projects.  However, most of these were squarely focused on web applications or low level things such as build tools, testing frameworks, and message buses.

Due to the void, interpreted languages such as Python rose to the challenge while C and C++ remained the mainstay for applications programming.  Microsoft started dominating Windows development with their .NET CLR languages.  The glib/Gtk+ and Qt toolkits brought about a renaissance in cross-platform development with C and C++ respectively [though not limited].

It wasn’t until the open-sourcing of SWT that GUI development in Java became attractive.  The obvious killer apps here were the Eclipse IDE and the Azureus (now Vuze) bittorrent client.

Sun’s closed grip of Java really stagnated any chance of abundant expansion in these middle years (2001-2006).  Microsoft leveraged this weakness to create the excellent .NET platform and associated languages to maintain their closed platform and market dominance.  The counterbalance that would have been Java was thus left playing catchup.

We are just beginning to see the fruits of this labor from 2006 through today.  The OpenJDK project is now distributed with popular Linux distributions such as Ubuntu and Fedora.  We finally have decent browser plugins and Java Web Start applications across 32 and 64-bit machines.  The Java deployment problem will slowly fade from memory.

Application Development

SWT made Java apps beautiful.  OpenJDK should make them ubiquitous.  We finally have an Open Source platform that is widely deployed.  The strong built in standard library and clean OO design patterns of Java make it a very pleasant host for developing rich client apps.  Obvious areas for improvement here include better layout/form design tools and closer integration with upstream Linux distributors.

Somewhere along the line, Trolltech/Qt Software (now owned by Nokia) released Jambi (the complete Qt bindings, GUI framework, and incredibly rich library) for Java.  Oddly, this bombshell received little of the community attention and fanfare I thought it would or deserves.  Indeed, Qt Software demoted Jambi from their tier-1 platforms and hopes the community will pick it up.  I hope this project isn’t allowed to stagnate as there is a lot of potential here.

Web Apps

Along the “Enterprise Web Application” lineage of Java, we wound up with some disgustingly overcomplicated and bloated frameworks for building web apps.  Ruby on Rails and Python Django came about and put a new spin on the development of rapid and robust web apps.  The learning curve of these frameworks is much less than Java EE and I will go as far as saying they are more capable because of it.

By using Java, JSP, and Servlets directly on top of a light Model-View-Controller, I believe Java is just as compelling as some of the more popular scripting languages.  Developers need to know they can trim the fat and that there are many advantages to developing in Java, namely because of the next topic…

Dynamic Languages, it’s all about the VM

It’s all about the JVM, stupid!  One of the best features of Java and .NET is the underlying Virtual Machine.  By using JIT compiled VMs, Java code has a distinct advantage over the common interpreted languages such as Perl, Python, and PHP.  In the case of Java, the result is even naturally cross-platform.

The really interesting developments here focus on extending the JVM to syntax and paradigms other than the statically typed C++ lookalike.  Clojure and Scala deliver innovative new techniques while Jython and JRuby bring these excellent languages to the Java software platform and virtual machine.

In short, Java provides everyone with a counter to Microsoft’s .NET CLR.  The Java VM has been around the block and tuned by giants such as Sun, IBM, Oracle, SAP and more.

I call on the community to discuss how we can encourage use of the JVM for languages other than Java and build this into a de facto runtime.  Continued tuning and integration with Windows, Mac OS X, and Gnome/KDE *NIX systems is paramount.  Research for easy multi-core development is also worthwhile.  Meanwhile, distributions need to continue packaging the JRE and make it a default.  Individual developers need to be made aware of “Java: The Good Parts”, and the myths need to be debunked.

Applets, Rich Media, Native Code!?

With the release of JavaFX, widespread deployment of the JRE, and better browser integration, Java has set the stage for a comeback to its roots.

During the late ’90s and 2000s, Adobe Flash became the tool of choice for web animation and interactive pages.  It really exploded with the advent of YouTube and other internet streaming sites making use of the Flash video format.  Unfortunately, Flash Player is notoriously insecure, resource intensive, and crash-prone.  It is also not widely available for the millions of smartphones that have become more accessible than computers.

Luckily, there seems to be a shift back to the browser with new developments in AJAX, JavaScript, and HTML.  The <video> tag will hopefully make video as easy and portable as graphics are today in the browser.  Clean JavaScript libraries and fast JIT JS engines make it practical to use this paradigm for many domains.

Yet one must acknowledge that manipulating a DOM/markup language with a scripting language isn’t the most effective development platform for everything.  Google even thinks it worthwhile to run x86 machine code in a sandboxed environment in your browser.  I personally fail to see the logic behind this.  Java provides a well evolved, cross-platform solution.  Java can run on your ARM powered Android.  Requiring an x86 CPU just seems like the wrong track in this modern age.  Hopefully JavaFX will pick up the slack and return Java to its roots.  I would love to see the demise of the terrible Flash plugin.

Future and Conclusion

It’s time we considered Java for The Good Parts.

I hope some of my points caused you to reevaluate any bad preconceptions or past experiences you may have had with Java.  Java has undergone great change since its birth and I think it is capable of becoming the premier development platform for applications programming of all types.  Particularly interesting are some of the new languages such as Scala and Clojure.  Java has long been a staple in web development, but has traditionally scared away amateur coders.  If you cut the fat, Servlets and JSP are not much harder to set up than commonplace scripting languages.  Frameworks such as Grails bring it to parity with Rails or Django.

Java underwent a sea change in 2006 with the releasing of the source code and opening of the development process.  Java and the JVM should be championed by Libre software developers and users alike!


Mobile phones spur cross-platform applications. Open Source is mainstream.

Thursday, September 24th, 2009

I had an interesting day today.  At school, we had a social event with an industry governing board and several local software companies in the Charleston, South Carolina region.  Aside from meeting a lot of new people, I was able to ask some of the industry leaders present about the platform and languages they used.

Perhaps the most interesting was that many Microsoft shops are moving away from fat client apps to web apps, and not simply because it is a buzzworthy thing to do.  The primary driver is the proliferation of advanced mobile devices, namely iPhone, Palm Pre, and Blackberry.  Two of the companies I talked to were VB.NET or C# shops and used to do traditional fat client software.  Due partially to the smartphone craze, they are moving to SaaS.  They also mentioned the Mac’s rising popularity, but no Linux or netbooks or anything like that.  It’s funny how venerable HTML has become: it is now the medium of choice for displaying cross-platform applications.  I doubt anybody ever imagined just how important HTML and JavaScript would become during their infancy.

Surprising and delightful to me was the talk of Open Source in the enterprise.  I spoke with two gentlemen from different large defense contractors and they were spot on with their assertions that Open Source software is superior in many ways.  Both were large Java EE shops and mentioned how they could check and verify FOSS for security much better than any proprietary software would allow.  They mentioned that the US Government is one of the largest purchasers of software, but even then working with COTS vendors is difficult, and FOSS solves many of these problems by allowing them to commit back changes.  The thing that made me the most happy was when one rep said that active participation in an Open Source project was a surefire way to boost a resume to the top of a stack.

Another interesting tidbit was that virtualization is synonymous with VMware here (everyone specified this by name) even today.  I’d go as far as to say that I’d probably have received strange looks if I had mentioned KVM or even a heavyweight like Xen or MS Virtual Server.  Aside from Windows, the defense guys talked a lot about Solaris.  I didn’t get much reaction when mentioning Linux to anybody, sadly.  Somewhat comically, the non-profits were largely the 100% Microsoft shops.

El Reg Humor and Java in free software

Friday, May 8th, 2009

The Register has a good article on Sphinx search with some entertaining potshots at Java and “enterprise software” that got a rise out of me:

Solr is popular with the enterprise crowd, who love its Java. Being a Java program, Solr includes no shortage of technology whose acronyms contain the letters J and X.

This tickles the enterprise pink, because these sorts of developers love nothing more than hanging out around a whiteboard drawing boxes and arrows and, from time to time, writing XML to make it look like they’re doing real work. Solr thrives in this environment, being an Apache Foundation project, the Apache Foundation, of course, widely known as a cruel experiment to see what happens when bureaucrats do open source.

Having a bit of experience with Java from academia and a few open source projects I make use of, I can’t help but laugh at how comically and concisely the editor summed it up.

By and large, successful open source projects tend to be written in languages other than Java. The entire GNU/Linux OS stack is primarily C, with some components using C++ like KDE, OpenOffice and Firefox.  On the ever popular web front, PHP, Ruby, and Python lead the pack.

I think it turned out this way for a multitude of reasons.  When working on the OS stack, the power and control of C and C++ are hard to beat.  The plethora of libraries and raw speed of these compiled languages set the bar high for any newcomers.  Java exists as a kludge, mildly useful for desktop apps and mildly useful for web apps while historically having a lot of problems.  Native look and feel have long been the layman’s complaint, though SWT has done a pretty good job there.  Of course, omnipresent Java in the Linux world is relatively new.  I think Java would have been the darling language of client apps had it been open sourced sooner, but this came about 7 years too late to have a large impact on shaping the common FOSS userland.

It is interesting how the open source projects built with Java tend to be highly bureaucratic and abstract.  I think the bottom line is that FOSS programmers do what they do because it is fun, and they demand pragmatism.  The “enterprise software” attitude/baggage that many Java apps and libraries carry is a big turn-off to pragmatism and the hacking culture.  The barrier to entry for Java web programming is also much higher than for its “scripting language” competitors, which carry light and simple frameworks that focus on results, not procedure.

Java itself isn’t that bad of a language.  I actually enjoy working with it in school (…though I think it really isn’t appropriate as an introductory teaching language, shielding important concepts from students.  Maybe a future post?..).  When it comes time for real work though, I consider Python, C, or C++ more pragmatic depending on the job at hand.  That, and the fact that most of the common scripting languages are gaining JIT compilers, may accelerate Java toward status as a legacy language.

Your thoughts?

One Small Step for QT, One Giant Leap for Free Software

Friday, January 16th, 2009

QT Software, under the graces of Nokia, has released the superb QT cross-platform toolkit under the LGPL.

This. is. HUGE.

For the libre software purist, this still benefits you, if indirectly.  Companies that make changes to the toolkit must still submit patches.  More influential, GPL incompatible software may now readily use QT for free.  This will likely foster more QT centric developers, boost adoption of the underlying stack (Linux, etc), and lower the barrier for vendors to release cross-platform tools.

From a Nokia business perspective, it makes perfect sense and makes the whole thing that much more beautiful.  “QT Everywhere” is really a possibility now.  And, it’s beneficial to Nokia as well as the ecosystem they are enriching.  The more QT developers, the bigger the talent pool for Nokia software.  The more contributors, the better the toolkit.  Win.  A small company like Trolltech could not afford to do this, but to a big dog like Nokia, the revenue from commercial licensing is insignificant and unimportant compared to device sales.

I know the company I work for, Analog Rails, will be able to take advantage of the license switch.  As previous commercial QT customers, we found it expensive to juggle around machines to maintain compliance.  For companies like VMware that deploy cross-platform software and maintain their own cross-platform extensions, this surely must be compelling.  I say, the more the merrier!

What a great day for free software, computing, and life in general :-) .

Ars Technica has outstanding coverage of the news: http://arstechnica.com/news.ars/post/20090114-nokia-qt-lgpl-switch-huge-win-for-cross-platform-development.html

I dream of pervasive virtualization…

Friday, January 2nd, 2009

I dream of a day where virtualization is pervasive.

Instead of thinking about services in terms of servers, CPUs or directly mapped resources, I should be able to add virtual machines in terms of guaranteed throughput rate over a whole grid.  Scaling out should be as easy as adding a blade or racking another server.

At the low level, I should have the option of running N+N redundancy.  That is, the VM should run in lockstep across multiple machines – so if it is running on 2 vcpus, 4 in total would be used.  This would allow for any node to fail.  And the VM should be an aggregate of the low level hardware – e.g. a VM grid across 4 8-core servers should scale near-linearly when a single OS instance is running 32 processes.

Current solutions only attempt to do some of the tasks above, and most fail miserably.  IBM mainframes have been doing it for ages.

If I had the time, I know I could build software to do this better than anyone else.  All the puzzle pieces are there, especially the tough ones like hypervisors and Infiniband.  This could have been done at least 3 years ago.  I bet it will take the industry 3-4 years yet to get anywhere close.

This is a real virtual datacenter.

How to upgrade to ext4 in place

Wednesday, December 24th, 2008

Here’s how you upgrade to ext4.  The process is pretty easy, but requires an fsck which means unmounting or rebooting if the file system is in use.

Make sure you are using at least e2fsprogs 1.41.3 and kernel 2.6.28 (or a vendor kernel with latest ext4 patches applied)!  Also, it’s probably a good idea to have proper backups (really!).  ext4 has just been declared stable, but what that really means is that the battle hardening has just begun.  I’ve converted several heavily used systems without fault so far though, so it’s probably good enough for your desktop.

WARNING: DON’T CONVERT YOUR /boot PARTITION. Right now, there is no stable version of grub with ext4 support.  Even if there were, it really wouldn’t gain you anything  :-) .

Run tune2fs, e.g.:

tune2fs -I 256 -O sparse_super,filetype,resize_inode,dir_index,ext_attr,has_journal,\
extents,huge_file,flex_bg,uninit_bg,dir_nlink,extra_isize /dev/sd[x][n]

Those are the default options for an ext4 file system if you were to create it with mkfs.ext4 (e2fsprogs 1.41.3 – see /etc/mke2fs.conf).  I’m getting pretty damn good performance with this!  The ‘-I 256’ option sets 256-byte inodes, which most recent ext3 file systems use already.  If this is the case, and you get a message telling you so, remove this option.  Note that extents will make the FS backwards INCOMPATIBLE with ext3.

Next, edit /etc/fstab, e.g.:

/dev/vg/home /home ext4 defaults 0 0

Either unmount and mount or reboot your system.  tune2fs marks the file system as dirty so that an fsck and the conversion are performed.

NOTICE: on distros that use an initrd, the initrd may need to be regenerated or you won’t be able to mount your root file system.  In Fedora (replace kernel version with your own):

cd /boot
mv initrd-2.6.27.7-134.fc10.i686.img initrd-2.6.27.7-134.fc10.i686.img.old
mkinitrd initrd-2.6.27.7-134.fc10.i686.img 2.6.27.7-134.fc10.i686

That’s all there is to it.  Stay tuned for future ext4 developments like online defragmentation.

Also, ext{2,3,4} reserve 5% of space for root in case the drive fills up.  On large modern drives, this can be excessive (e.g. 50GB on a 1TB disk).  Consider running ‘tune2fs -m 1 /dev/sd[x][n]’ to reduce this to 1%.
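The arithmetic behind that 50GB figure is easy to check.  Here is a minimal sketch in Python, assuming a decimal 1TB disk (10^12 bytes, the way drive vendors count); the function name reserved_bytes is made up for illustration and has nothing to do with e2fsprogs itself:

```python
def reserved_bytes(fs_size_bytes, reserved_percent):
    """Space held back for root at a given reserved-blocks percentage."""
    return fs_size_bytes * reserved_percent // 100


TB = 10**12  # assumption: "1TB" as drive vendors count it (decimal)
GB = 10**9

for percent in (5, 1):
    print(f"{percent}% of 1TB = {reserved_bytes(TB, percent) // GB}GB")
# prints:
# 5% of 1TB = 50GB
# 1% of 1TB = 10GB
```

Dropping from 5% to 1% hands roughly 40GB back to regular users on a disk of that size.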

For more information and tweaking:

  1. Documentation/filesystems/ext4.txt from the latest kernel sources
  2. http://ext4.wiki.kernel.org/index.php/Main_Page
  3. man tune2fs
  4. http://e2fsprogs.sourceforge.net/

Bulletproof your server to survive Digg/Slashdot

Saturday, December 13th, 2008

implementing scale up for web 2.0 sites with current practices

This blog was recently featured on Slashdot over the Thanksgiving holiday in the US.  It was the perfect storm: commercial news organizations were mostly dormant creating a slow news day, and geeks like me were at home eager to get the latest technology scoop.  What surprised me is how this relatively modest box, a Linode 540MB Xen Virtual Machine, withstood up to 100+ requests a second without even breaking a sweat.  Furthermore, I had only performed some of the tuning I detail below.  It scales to over 1100 requests a second after following my guide below!

I will detail how to tune your server for optimum capacity, or what I will call free scale up (as opposed to scale up by adding hardware or scale out – adding machines, database servers, application servers, load balancing – which may come in a future article depending on interest).  Most of the ideas here are platform neutral – both OS and application server – assuming you are using a UNIX style OS.

The only tutorials I’ve found were dated and don’t detail the latest practices like varnish or Passenger, so read on for a fresh look.

Audience

The intended audience for this article is anyone running a web site.  Running your own web server gives much greater flexibility in choice of development environment.  A dedicated server and certain virtual private server providers give much more predictable performance and won’t cancel your service on a whim (I’m looking at you, MediaTemple… google for horror stories).  A Linode VDS is much more flexible and very powerful for around the same cost.

Web Server

Most people use Apache.  According to Netcraft, over 50% of hosts were running it as of November.  There is good reason for this: Apache has proven stability, scalability, and security.  Some folks are quick to rip out Apache due to poor configuration and tuning.  I personally find it to be an excellent choice for most sites because of the aforementioned traits and first-rate extensions.  With proper setup, you will likely max your transfer or tax your application server before it ever becomes the bottleneck.

Apache Tuning

The key to tuning Apache is to minimize RAM usage, especially on a limited machine like my 512MB Linode.  Memory swapping of applications to disk is almost entirely unacceptable on modern servers.  Disk I/O is very expensive and the biggest bottleneck on modern computers, which is why swap is so unappealing.

Therefore, you need to:

  1. Limit overall Apache memory usage
  2. Minimize per thread/process memory usage
  3. Minimize disk I/O

Limit overall memory usage

Step one is very important.  If your server begins swapping heavily, it can be very difficult to even log on and perform administration.  You need to develop an idea of the RAM an average Apache process is using via top, ps, or another monitoring framework.  Make sure you are looking at the RES column in top, since shared libraries will be used between all processes.  Divide the amount of available RAM by this number.  Available RAM should take into account RAM used by other processes, including your database, when under the same load.  Set the MaxClients directive to a number close to the result, and tune accordingly with benchmarks (see Benchmarking section).
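The MaxClients estimate comes down to one division: the RAM left over for Apache divided by the resident size of one Apache child.  A rough sketch in Python follows; the function name and the sample numbers (200MB for everything else, 12MB per child) are hypothetical, so plug in what top shows on your own box:

```python
def estimate_max_clients(total_ram_mb, other_processes_mb, apache_child_mb):
    """Rough MaxClients estimate: RAM left for Apache divided by RAM per child."""
    available_mb = total_ram_mb - other_processes_mb
    if available_mb <= 0 or apache_child_mb <= 0:
        raise ValueError("no RAM left for Apache, or bad per-child size")
    return int(available_mb // apache_child_mb)


# Hypothetical numbers: a 512MB box, 200MB for the database and everything
# else under load, and 12MB resident per Apache child as seen in top.
print(estimate_max_clients(512, 200, 12))  # -> 26
```

Treat the result as a starting point only and confirm it with benchmarks, since per-child memory tends to grow under real traffic.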

Minimize memory usage

Step two determines how many child processes you can handle.  This is important because the more children, the more in flight requests, and lower end user latency.  This is also a lot more environment dependent than step one.

A good way to reduce memory consumption is to unload unneeded modules.  Most server operating systems default Apache with a wide range of modules that are probably not used on your site including several basic authentication methods.  Using shared objects rather than static modules will help memory usage as well, and most distributions ship this way.

If you use an Apache module for your application server (mod_php, mod_perl, mod_python, Passenger aka mod_rails), each child process will consume the memory of that module regardless of whether or not it is serving a static asset (images, css, etc.) or an application page.  Mitigate this by using a proxy (see next section) or moving application serving to its own processes via FastCGI (PHP, most others), AJP (Java, Python), WSGI (newer Python), proxy (Ruby, all).

Disable logging

I should take a moment to step back and hit on an important topic.  Hard disks have improved very little in regard to performance in recent years.  Disk I/O is an expensive task and therefore the primary bottleneck you wish to avoid.

When Apache logging is enabled, a write operation must occur for every hit.  If possible, consider completely disabling access logging.  You can outsource web statistics to Google Analytics.  If you require logging, make sure HostnameLookups is disabled (network I/O is even more expensive than disk!) and batch look-ups on another machine or during idle periods with a log analyzer.  As your setup grows (scale-out), log files will become more cumbersome and you will probably be logging to a database or a central server anyway.  Varnish, a proxy/HTTP accelerator detailed below, has an optimized design for logging.
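To give an idea of what batching the look-ups means, here is a small Python sketch that resolves each distinct client address in a log once, offline, instead of once per hit.  It assumes the client IP is the first field on each line (as in the common log format); the function name, the sample log lines, and the stub resolver are all made up for illustration:

```python
import socket


def resolve_log_hosts(log_lines, resolver=None):
    """Map each distinct client IP in an access log to a hostname, once."""
    if resolver is None:
        # Real reverse DNS; slow, which is why it runs offline in a batch.
        def resolver(ip):
            try:
                return socket.gethostbyaddr(ip)[0]
            except OSError:
                return ip
    names = {}
    for line in log_lines:
        if not line.strip():
            continue
        ip = line.split()[0]          # assumption: client IP is the first field
        if ip not in names:
            names[ip] = resolver(ip)  # one lookup per distinct address
    return names


# Made-up log lines and a stub resolver so the example runs without a network.
sample = [
    '192.0.2.10 - - [13/Dec/2008:10:00:00 +0000] "GET / HTTP/1.1" 200 512',
    '192.0.2.10 - - [13/Dec/2008:10:00:01 +0000] "GET /a.css HTTP/1.1" 200 90',
    '198.51.100.7 - - [13/Dec/2008:10:00:02 +0000] "GET / HTTP/1.1" 200 512',
]
fake_dns = {"192.0.2.10": "client.example.org"}
print(resolve_log_hosts(sample, lambda ip: fake_dns.get(ip, ip)))
```

Three hits cost two look-ups here; on a real log with many repeat visitors the savings are far larger, and none of it happens while a visitor is waiting.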

mod_cache

Apache has an integrated cache module that will keep frequently hit static assets in memory.  For larger sites, forgo this and use a proxy which will be more flexible and allow easier scale-out.

MPMs

Apache makes use of MPMs, or Multi-Processing Modules, for its core functionality.  The default on UNIX is prefork, which makes a separate process for each request.  By switching to a threading MPM such as worker or event, you can cut down overhead and memory use.  Some modules do not play well with threading (PHP), so you should research before changing MPMs.  prefork works well for one and two core servers.

Alternative Web Servers

Lighttpd is the leading alternative FOSS web server.  Users include A-list web sites such as YouTube and Wikipedia.  Benchmarks show impressive performance.  Keep in mind Apache is by no means slow nor resource intensive and links on that page show that it is faster on some workloads.

When making comparisons, keep in mind that by design you will probably be using a FastCGI application server and most of the optimizations above will hold true for Lighty.

For sites with long connection times (download servers, AJAX keep-alive) or static content servers, I would definitely lean toward it (scale-out).

Nginx has also been picking up steam (pun intended) and is being used by large sites like WordPress.com.  I would consider it in the same class as Lighty.

Reverse Proxying

A reverse proxy is very useful for modern web serving.  Even with just one server, a reverse proxy will keep common pages in memory – greatly reducing disk I/O.  It will also keep static requests from using potentially heavy application server HTTPd processes.  Reverse proxies are often very fast at basic HTTP since they are not concerned with all the features of a web server.  When it comes time to scale out, the proxy can be moved to a separate server.  Proxies can direct traffic to different backend servers.  Proxies can even be placed in geographically dispersed areas (think CDNs: Akamai, Limelight – Youtube, Google).  Logging, compression, and SSL can be offloaded to the proxy.  In short, you want a proxy even on a single server (or at least mod_cache).
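The “keep common pages in memory” part is the core idea, and a toy version fits in a few lines of Python.  This is only a sketch of the concept, not how Varnish, Squid, or mod_cache are actually built; the class name, the 60 second lifetime, and the fake backend are assumptions for the sake of the example:

```python
import time


class PageCache:
    """Toy in-memory page cache sitting in front of a slower backend."""

    def __init__(self, backend, ttl_seconds=60.0, clock=time.monotonic):
        self.backend = backend          # callable: path -> response body
        self.ttl = ttl_seconds
        self.clock = clock
        self.store = {}                 # path -> (expires_at, body)
        self.hits = 0
        self.misses = 0

    def get(self, path):
        now = self.clock()
        entry = self.store.get(path)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]             # served from memory
        self.misses += 1
        body = self.backend(path)       # fall through to the app server
        self.store[path] = (now + self.ttl, body)
        return body


def slow_backend(path):
    # Stand-in for the application server rendering a page.
    return f"<html>rendered {path}</html>"


cache = PageCache(slow_backend, ttl_seconds=60.0)
for _ in range(100):
    cache.get("/index.html")
print(cache.hits, cache.misses)  # -> 99 1
```

Out of 100 requests for the same page, the backend is bothered once.  A real proxy also has to respect cache headers, cookies, and memory limits, which is exactly why you use an existing one.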

Varnish

Varnish bills itself as an HTTP accelerator.  It was written from the ground up to perform reverse proxying, and this it does well.  The Varnish design philosophy is enlightened and leaves a lot of the work like memory management to modern advanced operating systems.  Logging is performed in a separate process and is optimized.  If you need an advanced proxy and accelerator, this is likely the way to go.

Squid

Squid has traditionally been used as the de facto FOSS forward and reverse proxy.  Many large sites such as Wikipedia are extensive users.

Apache and Lighttpd

Both Apache and Lighttpd have modules that will allow them to cache and reverse proxy.  For single server setups, it would probably be worth reusing the components of your web server (think: shared memory) if your application server is external.  mod_proxy is very useful for forwarding Ruby requests to a Ruby web server like mongrel or thin.

Application Server

The application server is where most of the magic happens in today’s web 2.0 sites.  Gone are the days of static HTML files.  Most sites are now dynamically generated every visit, and customized per visitor.  This is an order of magnitude more complex, and a lot of CPU time is spent on page generation.  Therefore, tuning here is often one of the best things you can do to improve site scalability.

PHP

PHP is the most widely deployed language on the web.  Many extremely popular applications are written in PHP, including MediaWiki, WordPress, Drupal, and phpBB.

Opcode Cache

By default, PHP breaks a script down into opcodes every time it is called.  Opcode translation is necessary to simplify programs so they can easily be executed by the Zend Engine.  It is unnecessary for this to be done every time a script is called since the source code will rarely change once deployed.  Luckily, a cache can be added that will eliminate this step.  The net performance gain can be a factor of 2 to 10, very impressive for a simple install!
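The same idea can be shown by analogy in Python, whose built-in compile() also turns source text into a code object before running it.  The sketch below compiles a script once and reuses the result until the file’s modification time changes.  It is an illustration of the caching idea only, not how the Zend Engine or any PHP opcode cache is implemented, and the class name is made up:

```python
import os
import tempfile


class CodeCache:
    """Compile each script once and reuse the result until the file changes."""

    def __init__(self):
        self.cache = {}        # path -> (mtime_ns, code object)
        self.compiles = 0

    def load(self, path):
        mtime = os.stat(path).st_mtime_ns
        entry = self.cache.get(path)
        if entry is not None and entry[0] == mtime:
            return entry[1]                    # skip the compile step
        with open(path, encoding="utf-8") as f:
            code = compile(f.read(), path, "exec")
        self.compiles += 1
        self.cache[path] = (mtime, code)
        return code

    def run(self, path):
        scope = {}
        exec(self.load(path), scope)
        return scope


with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "page.py")
    with open(script, "w", encoding="utf-8") as f:
        f.write("result = sum(range(10))\n")

    cache = CodeCache()
    for _ in range(1000):
        scope = cache.run(script)
    print(scope["result"], cache.compiles)  # -> 45 1
```

A thousand “requests” cost one compile.  Checking the modification time on every call still touches the disk, which is the kind of trade-off a real cache lets you tune.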

These days, you should choose APC – The Alternative PHP Cache.  Once upon a time, there were several choices here.  Turck MMCache was notably fast, beating even the commercial Zend Suite, but mysteriously died out (the original author is now a Zend employee. hmm.. coincidence??).  Others have tried to revive it in the form of eAccelerator, but it is neither stable nor active.  Any other arguments are moot since APC will be part of PHP6 core and has PHP’s founder as a developer.

Modules

Just as with Apache, removing unused extensions in PHP will help reduce memory usage.  These can be commented out in php.ini.

Rails

Rails has gained a lot of steam (okay I’m wearing that one out) and is a favorite among many Web 2.0 startups including Twitter.

A lot of Rails scalability problems are due to the underlying Ruby language.  The garbage collector, threading, and memory allocator have been pinpointed as particularly bad.  Work is underway to fix these in Ruby 1.9 (bytecode) and 2.0 (threading).  In the meantime, consider Ruby Enterprise Edition in tandem with Passenger.  Personally, I’d rather avoid Ruby and all you kool-aid drinkers (but I’ve done a large deployment of Passenger).  Go Python :).

Python

Python is just a plain good language.  With that out of the way, CPython already compiles to bytecode, but like the other scripting languages it is still waiting on a mainstream JIT compiler.  Psyco can yield an average 4x performance improvement and is available now.  PyPy should be here sooner rather than later.

Java

Java source is compiled to bytecode ahead of time and then JIT (Just-In-Time) compiled by the JVM, so you don’t have the per-request compilation problem that the dynamic languages above do.

Java web apps are immensely complex.  Aside from running the latest JDK (1.6.0_10), your container will play a big role in speed.  Jetty and Tomcat are always good choices.

Databases and Database Caching

A large portion of modern web applications are database driven.  To keep your site running, this point of contention must be addressed.  MySQL is ubiquitous and known for its speed.  PostgreSQL offers some advanced features and is known as the DBA’s FOSS database.  If you need extreme scalability, consider DB2 but prepare to pay dearly :-).

MySQL

MySQL comes configured fairly well out of the box in most distributions.  MySQL Performance Blog sums it up better than I can, so head that way for basic tuning info.

Probably one of the easiest things you can do is enable the integrated query cache.  The good news is your application doesn’t need to do anything to take advantage of this.

In my.cnf:

query_cache_size = 64M

For single server web workloads, this simple change can work miracles and prevent dreaded MySQL connection errors.  This is especially true since web apps are primarily read oriented.  The query cache isn’t perfect in all situations, and in larger sites memcached is more appropriate but has its own disadvantages (see memcached section).

PostgreSQL

PostgreSQL should also be set up fairly well by your distribution.  shared_buffers should probably be tuned, as well as max_connections.  See the PostgreSQL wiki on tuning for a good overview.

There is nothing strictly akin to the MySQL query cache, for better or worse.

Applications and Application Caching

This is potentially the hardest step to implement, yet can also yield the greatest reward.  Caching common database queries, objects, modules, or even writing static HTML versions of a page can cut server load to nothing.  If you are using a common FOSS (free) or COTS (commercial) product, chances are the software already implements some of these options and they may just need to be activated or downloaded as an extension.

Keep in mind not all things are effectively cached, and you may need to perform a major rework to implement aggressive caching like this.

Generic Data Caching – memcached and APC

Many common applications contain backends for caching against memcached or APC.  MediaWiki is a prime example and integrates nicely with either.  If you are writing your own apps, using a memory cache can greatly reduce dependency on the database.

memcached

Realizing that databases have a lot of constraints, the folks at LiveJournal.com wrote a generic caching framework called memcached.  Many large sites, such as Facebook, Wikipedia, and Slashdot, use it.

The bad news is you have to port your application to store and check against memcached.  Database queries are a prime target, but just about anything can be stored here.

It is also handy for scale-out because you can add dedicated cache servers.
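The store-and-check pattern described above can be sketched roughly as follows.  This is only an illustration: DictCache is a made-up in-process stand-in for a memcached client (a real client talks to a memcached server over the network), and cached_query and run_query are hypothetical names.

```python
import time


class DictCache:
    """Hypothetical in-process stand-in for a memcached client (get/set with TTL)."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        # Treat expired entries as misses and drop them.
        if expires_at is not None and time.monotonic() >= expires_at:
            del self._store[key]
            return None
        return value

    def set(self, key, value, ttl=None):
        expires_at = time.monotonic() + ttl if ttl else None
        self._store[key] = (value, expires_at)


def cached_query(cache, key, run_query, ttl=60):
    """Check the cache first; on a miss, run the query and store the result."""
    value = cache.get(key)
    if value is None:
        value = run_query()          # the expensive database call
        cache.set(key, value, ttl)   # keep it for later requests
    return value
```

Note that this sketch treats a stored None as a miss, so queries that legitimately return None are re-run every time.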

APC user cache

PHP APC users can manually store information in APC’s shared memory.  This is ideal for single server solutions.  Take a look at this performance comparison vs memcached and files.

Application Caching

Although most pages are dynamically generated these days, a lot are needlessly so.  For example, a content management system might include a header, content, comments and a footer.  This output can be rendered once and written out as a static HTML page whenever an author updates it.  The static page is then served until a user comments on the article, which triggers a cache invalidation, and the page is rendered and stored again.  The output of generated menus, columns, and other objects can be stored in cache form as well.
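A minimal sketch of that render, serve, and invalidate cycle might look like this.  It keeps pages in memory rather than writing HTML files, and PageCache and its render callback are hypothetical names, not part of any particular CMS.

```python
class PageCache:
    """Hypothetical full-page cache: render once, serve the stored copy until invalidated."""

    def __init__(self, render):
        self._render = render  # function: page_id -> HTML string
        self._pages = {}

    def get(self, page_id):
        # Serve the stored page if present; otherwise render and store it.
        if page_id not in self._pages:
            self._pages[page_id] = self._render(page_id)
        return self._pages[page_id]

    def invalidate(self, page_id):
        # Call this when an author edits the page or a user posts a comment.
        self._pages.pop(page_id, None)
```

The next call to get() after invalidate() re-renders the page, which matches the flow described above.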

WordPress Cache Plugins

WordPress has a couple of plugins that are mandatory for large sites.

WP Super Cache will generate static HTML files of posts on your blog.  They are automatically served via some mod_rewrite magic, and will expire and update automatically.  This can reduce load to almost nothing, because cached pages are served without any database access or PHP execution.

WP Widget Cache is a nice addition that will cache output of widgets (sidebar elements such as menus) that don’t commonly change.

Benchmarking

It is important to benchmark your site after making changes to see if it meets performance expectations.  ab is a common tool for this task.

The following will issue 3000 requests in total, 10 at a time, against localhost:

ab -c10 -n3000 http://localhost/

Be very careful when benchmarking a live site.  You could effectively Denial of Service your server while it is processing all those requests.
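If you want to see roughly what a tool like ab measures, here is a small sketch.  It is not a replacement for ab; run_benchmark and its fetch callback are hypothetical names, and the same warning about live sites applies.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_benchmark(fetch, total=3000, concurrency=10):
    """Call fetch() `total` times using `concurrency` worker threads and time it."""

    def timed(_):
        start = time.perf_counter()
        fetch()
        return time.perf_counter() - start

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed, range(total)))
    wall = time.perf_counter() - wall_start

    return {
        "requests": len(latencies),
        "requests_per_second": len(latencies) / wall,
        "mean_latency_s": sum(latencies) / len(latencies),
    }
```

For example, passing fetch=lambda: urllib.request.urlopen("http://localhost/").read() (after importing urllib.request) would time real HTTP requests against a local server.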

What do you think?

I’d be happy to hear your stories from the trenches.  Please share your tuning advice!

Retrocomputing for Fun and Profit

Sunday, December 7th, 2008
  1. Buy Old Computers
  2. ???
  3. Profit

What is retrocomputing?

I define retrocomputing [wikipedia] as the collecting and use of old computers.  Why might one do this?  Well, for one, enterprises cycle out machines fairly frequently.  Two- to five-year-old systems are often sent out to scrappers in droves despite still being plenty useful.  Top of the line systems for large companies often have more than enough power for small and medium sized ones at pennies on the dollar compared to new hardware.  These machines are likely complete overkill for home use, but nonetheless are very useful for fun and learning.

IBM mainframe ops in the 1980s

Why?!

A lot of what I know about computers has been learned on old machines.  Hooking up a couple of servers and desktops and trying to make something useful out of them is a great exercise for the aspiring system administrator.  With open source software, it can all be done freely and easily.

Yes, you can run Linux, BSD, and Solaris from the comfort of your Windows desktop in a virtual machine (weak sauce…).  Yet there is something much different when you cluster several high technology servers together, tethered to a Fibre Channel storage array, and have them share a single distributed file system.  The setup, installation, and troubleshooting knowledge I’ve gained from mock scenarios like this is unlike that of anyone else I’ve ever met.  Breaking things here usually means digging deep and fixing them.  If you were to screw something up at work like some of the things I’ve gotten into, it would probably cost you your job.

BENCHNET - where I rip into computers that cost as much as a house and my "production" rack

Retrocomputing is also fun.  I am personally into old IBM hardware, though old UNIX workstations of all sorts are interesting to me.  You can see my collection of IBM PS/2 and RS/6000 knowledge here: http://ps-2.kev009.com:8081/.  There is a particular thrill to booting up a machine that cost between $20,000 and $50,000 10 years ago.  Knowing that these same machine models were used to design the Boeing 777, made up the famous Deep Blue machine, and were used in the largest automotive and shipbuilding firms, not to mention some of the most important spacecraft to date, also brings a sense of power and nostalgia.  In some ways it’s similar to having a classic car, but different.  Maybe if that classic car was a big ass bulldozer, tank, jet or some other well engineered piece of equipment :-P.

Some old systems I had at one time or another.  Left to right: IBM PS/2e (first "green" environmental PC), RS/6000 43p (7043-140), Apple PowerMac 7100/80, RS/6000 7006-42W, RS/6000 7012-397, HP Visualize c360 (PA-RISC)


Nostalgia is one of the biggest things I get out of using particularly old hardware.  I missed the mainframe days, the minicomputer days, the PC and DOS days, the Apple II days (well, actually I used these a bit at a very young age), and to a degree the early Windows days.  Just like a history class, studying these old machines gives me insight as to why things are done the way they are today.  It gives me appreciation for modern systems and makes me write clean and well optimized code.  The old computer games that captivated me as a child (Sim City, Sim Tower, Sim Ant, Sim Farm, Gizmos and Gadgets, The Incredible Machine, Oregon Trail, etc.) implanted a high degree of logic and understanding at a young age, and it is heartwarming to revisit these.  I grew up a Mac user as well, so seeing what I was (or: was not :>) missing on PCs is also interesting.

Old MIPS UNIX server booting and logging in

Some of the benefits of retrocomputing:

  • Enterprise class hardware
  • Cheap, possibly even free
  • Different design philosophies – not everything is x86 – a lot of this gear is quite different.  For example, UNIX workstations integrated most of what we enjoy on our PCs years before it became available to consumers.  SGI machines were doing A/V and 3D in the early 90s.  IBM midrange AS/400s have an advanced integrated database, programming languages, and environment that make PCs look like a joke for business programming.  WinFS, Object Storage Devices, etc are just now being talked about for PCs.  The channel philosophy from mainframes is still pretty new to PC servers (fibre channel), not to mention virtualization.
  • If you break it, you can fix it and learn from it or toss it
  • The engineering and craftsmanship in some of these systems is downright astonishing
  • Old computers are works of art: they give you a window into the technology and culture of times past
  • You should never trust a computer you can’t lift

It is interesting that we as humans produce such elaborate machines, only to discard them as scrap a few years later.  It is humbling and shows you the incredible progress we are making.

How?

eBay is your friend, but also look for local scrapyards or businesses doing overhauls.

If you are faint of heart, plenty of good abandonware sites exist for games and operating systems that can be run on emulators or VMs.  Check out this IBM mainframe emulator, Hercules.  Some of the original IBM OSes are public domain.

If you don’t want old PCs and big iron overtaking your house, there is plenty of good material on YouTube as well.  The Computer Museum is a good start.  Some of the consoles, offices, and outfits are hilarious.

Old SGI tech demo – pretty impressive!