June 2007


High availability means redundancy, right? If a piece of hardware or software fails, a backup kicks in immediately to carry the load so that users do not observe the failure. At worst, the client sees a transient error condition. This, at least, is the conventional wisdom now, in an era of hardware with short MTBF (mean time between failures).

My wife and I experienced the leftovers of the mainframe era at the airport on Friday. Trying to get from Chicago to Philadelphia, we discovered that not only was our flight cancelled, but so were all flights to the east coast. The weather was fine across the map, so it didn’t make a lot of sense. Later on, my aerospace buddy Matt clued me in that the FAA computers on the east coast had crashed. Another friend told me that three servers handle the FAA routing for the entire country, so the loss of a single server kept hundreds of people on the ground.

I don’t argue with the policy of not launching planes when their positions can’t be adequately tracked by air traffic control, but it’s a bit of an anachronism in this day and age to see a critical industry taken out by the loss of a single server. Big iron (mainframes) should all be dead and retired by now specifically to avoid this kind of situation. User requests should be automatically failed over to backup hardware. This is a solved problem. Telecom and Web 2.0 companies alike do this many times over every day.
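To make "automatically failed over" concrete, here is a minimal Python sketch of client-side failover. It has nothing to do with how the FAA systems actually work; the backend functions, the `AllBackendsFailed` exception, and the `call_with_failover` helper are all hypothetical names made up for illustration.

```python
from typing import Callable, Sequence


class AllBackendsFailed(Exception):
    """Raised when no replica could serve the request (hypothetical name)."""


def call_with_failover(backends: Sequence[Callable[[str], str]], request: str) -> str:
    """Try each backend in order and return the first successful response."""
    errors = []
    for backend in backends:
        try:
            return backend(request)
        except Exception as exc:
            # Record the failure and fall through to the next replica.
            errors.append(exc)
    raise AllBackendsFailed(f"{len(errors)} backend(s) failed for {request!r}")


def crashed_primary(request: str) -> str:
    # Stand-in for a server that has gone down.
    raise ConnectionError("primary is down")


def healthy_backup(request: str) -> str:
    # Stand-in for a redundant server that is still up.
    return f"routed {request}"


# The caller never sees the primary's failure; the backup answers instead.
result = call_with_failover([crashed_primary, healthy_backup], "ORD->PHL")
print(result)
```

A real deployment would more likely put this logic in a load balancer with health checks than in every client, but the principle is the same: one dead server should cost a retry, not a cancelled flight.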

I understand that a rewrite of the 1960s-era software occurred at some point and that the air traffic controllers rejected the solution (with good reason). I think that, rather than attempting a complete replacement next time, they should try the Strangler Pattern. The failure-prone backend infrastructure could be steadily replaced without necessarily also rewriting the front-end code. By maintaining distinct conceptual layers, the availability of the system can be improved without forcing an all-or-nothing total system rewrite numbering in the billions of dollars.
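For anyone who hasn't run into the Strangler Pattern, here is a rough Python sketch of the idea, under the assumption that callers talk to a thin facade that sits in front of the old system. The `StranglerFacade` class, the operation names, and both handlers are hypothetical and purely illustrative.

```python
from typing import Callable, Dict

Handler = Callable[[str], str]


class StranglerFacade:
    """Routes each operation to a new handler if one exists, else to legacy."""

    def __init__(self, legacy: Handler) -> None:
        self._legacy = legacy
        self._migrated: Dict[str, Handler] = {}

    def migrate(self, operation: str, handler: Handler) -> None:
        # Move one operation off the legacy system; everything else is untouched.
        self._migrated[operation] = handler

    def handle(self, operation: str, payload: str) -> str:
        handler = self._migrated.get(operation, self._legacy)
        return handler(payload)


def legacy_system(payload: str) -> str:
    # Stand-in for the old backend.
    return f"legacy:{payload}"


def new_flight_plan_service(payload: str) -> str:
    # Stand-in for a redundant replacement service.
    return f"new:{payload}"


facade = StranglerFacade(legacy_system)
before = facade.handle("flight_plan", "ORD->PHL")   # still served by legacy
facade.migrate("flight_plan", new_flight_plan_service)
after = facade.handle("flight_plan", "ORD->PHL")    # now served by the new service
untouched = facade.handle("radar", "sector 7")      # not migrated yet, still legacy
```

The front end keeps calling `handle` the whole time, so the backend can be swapped out one operation at a time while the controllers see the same interface.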

Big-bang software projects rarely succeed. Break big problems into manageable little problems! Ask your users what will improve their experience. And last but not least, have a better story for the poor saps at the airport than, “Your flight has been cancelled. Would you like to go on Sunday instead?”

I had a recruiter cold call me at work today. I told him I was busy at the moment, but that he could email me at my cornell.edu address. I like to give a cursory glance at the pitches by head hunters on the theory that eventually one of them will be interesting. This is the approximate conversation that followed:

Recruiter: “Oh, did you go to Cornell?”

Me: “Yes”

Recruiter: “I went to Dartmouth. Ivy League for the win!”

Me: “Are you a World of Warcraft player?”

Recruiter: “Yeah, I play on the Bronzebeard server.”

Me: “Nice. I’m over on Garona.”

I’d say that qualifies as the oddest conversation I’ve had with a cold caller, but the interesting thing is the extent to which MMORPG and WoW vocabulary is now carrying over into RL (real life). Also, to be perfectly honest, I’m now far more inclined to see what the guy has to say… Hilarious.