High availability means redundancy, right? If a piece of hardware or software fails, a backup kicks in immediately to carry the load so that users do not observe the failure. At worst, the client sees a transient error. This, at least, is the conventional wisdom in today's era of hardware with short MTBF (mean time between failures).

My wife and I experienced the leftovers of the mainframe era at the airport on Friday. Trying to get from Chicago to Philadelphia, we discovered that not only was our flight cancelled, but so was every other flight to the East Coast. The weather was fine across the map, so it didn’t make a lot of sense. Later on, my aerospace buddy Matt clued me in that the FAA computers on the East Coast had crashed. Another friend told me that three servers handle the FAA routing for the entire country, so the loss of a single server kept hundreds of people on the ground.

I don’t argue with the policy of not launching planes when their positions can’t be adequately tracked by air traffic control, but it’s a bit of an anachronism in this day and age to see a critical industry taken out by the loss of a single server. Big iron (mainframes) should all be dead and retired by now specifically to avoid this kind of situation. User requests should be automatically failed over to backup hardware. This is a solved problem. Telecom and Web 2.0 companies alike do this many times over every day.
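To make "automatically failed over" concrete, here is a minimal sketch in Python. Everything in it is hypothetical: `call_with_failover`, `crashed_primary`, and `healthy_backup` are names invented for illustration, and the backends are plain functions standing in for real servers. It is not a description of how the FAA or any telecom system actually works, only the basic idea of trying the next replica when one fails.

```python
from typing import Callable, List, Sequence


class AllBackendsFailed(Exception):
    """Raised when every backend in the pool has failed for a request."""


def call_with_failover(backends: Sequence[Callable[[str], str]], request: str) -> str:
    """Try each backend in order and return the first successful response."""
    errors: List[Exception] = []
    for backend in backends:
        try:
            return backend(request)
        except Exception as exc:  # a real system would catch narrower error types
            errors.append(exc)    # remember the failure and move on to the next replica
    raise AllBackendsFailed(errors)


# Hypothetical backends: the primary has crashed, the backup is healthy.
def crashed_primary(request: str) -> str:
    raise ConnectionError("primary is down")


def healthy_backup(request: str) -> str:
    return f"routed:{request}"


result = call_with_failover([crashed_primary, healthy_backup], "flight-plan")
```

The caller never learns that the primary was down. Only when every replica fails does the error surface, which is the "at worst, a transient error" case.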

I understand that a rewrite of the 1960s-era software was attempted at some point, and that the air traffic controllers rejected the result (with good reason). I think that, rather than attempting a complete replacement next time, they should try the Strangler Pattern. The failure-prone backend infrastructure could be steadily replaced without necessarily rewriting the front-end code as well. By maintaining distinct conceptual layers, the availability of the system can be improved without forcing an all-or-nothing total system rewrite numbering in the billions of dollars.
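For anyone who hasn't run into the Strangler Pattern, the idea can be sketched in a few lines of Python. The names here (`StranglerFacade`, `legacy_system`, `new_track_flight`, and the `"track"` and `"file_plan"` operations) are all made up for illustration and assume nothing about the real air traffic control software. The front end keeps talking to one facade, and operations are moved to new code one at a time while everything else still goes to the old system.

```python
from typing import Callable, Dict

LegacyHandler = Callable[[str, str], str]  # (operation, payload) -> response
NewHandler = Callable[[str], str]          # payload -> response


class StranglerFacade:
    """Single entry point that routes each operation to new or legacy code."""

    def __init__(self, legacy: LegacyHandler) -> None:
        self._legacy = legacy
        self._migrated: Dict[str, NewHandler] = {}

    def migrate(self, operation: str, handler: NewHandler) -> None:
        """Mark one operation as replaced by a new implementation."""
        self._migrated[operation] = handler

    def handle(self, operation: str, payload: str) -> str:
        handler = self._migrated.get(operation)
        if handler is not None:
            return handler(payload)               # already replaced: use the new code
        return self._legacy(operation, payload)   # not yet replaced: old system


# Hypothetical stand-ins for the old backend and one rewritten piece.
def legacy_system(operation: str, payload: str) -> str:
    return f"legacy:{operation}:{payload}"


def new_track_flight(payload: str) -> str:
    return f"new:track:{payload}"


facade = StranglerFacade(legacy_system)
facade.migrate("track", new_track_flight)

tracked = facade.handle("track", "UA123")      # served by the new code
filed = facade.handle("file_plan", "UA123")    # still served by the legacy system
```

Each call to `migrate` is a small, reversible step. If a new piece misbehaves, removing its entry sends that operation back to the legacy system without touching the front end.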

Big-bang software projects rarely succeed. Break big problems into manageable little problems! Ask your users what will improve their experience. And last but not least, at least have a better story for the poor saps at the airport than, “Your flight has been cancelled. Would you like to go on Sunday instead?”