Mon 11 Jun 2007
High availability means redundancy, right? If a piece of hardware or software fails, a backup kicks in immediately to carry the load so that users do not observe the failure. At worst, the client sees a transient error. This, at least, is the conventional wisdom now in an era of hardware with short MTBF (mean time between failures).
My wife and I experienced the leftovers of the mainframe era at the airport on Friday. Trying to get from Chicago to Philadelphia, we discovered that not only was our flight cancelled, but so were all flights to the east coast. The weather conditions were fine across the map, so it didn’t make a lot of sense. Later on my aerospace buddy Matt clued me in that the FAA computers on the east coast had crashed. Another friend told me that 3 servers handle the FAA routing for the entire country, so the loss of a single server kept hundreds of people on the ground.
I don’t argue with the policy of not launching planes when their positions can’t be adequately tracked by air traffic control, but it’s a bit of an anachronism in this day and age to see a critical industry taken out by the loss of a single server. Big iron (mainframes) should all be dead and retired by now specifically to avoid this kind of situation. User requests should be automatically failed over to backup hardware. This is a solved problem. Telecom and Web 2.0 companies alike do this many times over every day.
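To make the failover idea concrete, here is a minimal sketch in Python. It is only an illustration, not how the FAA or any telecom actually does it: the names call_with_failover, primary, and backup are made up for this example, and a real system would add health checks, timeouts, and state replication.

```python
from typing import Callable, Sequence


class AllBackendsFailed(Exception):
    """Raised when every backend in the list has failed."""


def call_with_failover(backends: Sequence[Callable[[str], str]], request: str) -> str:
    # Try each backend in order and return the first successful answer.
    errors = []
    for backend in backends:
        try:
            return backend(request)
        except Exception as exc:  # any failure means "try the next one"
            errors.append(exc)
    raise AllBackendsFailed(f"{len(errors)} backends failed for {request!r}")


# Hypothetical backends: the primary is down, the backup is healthy.
def primary(request: str) -> str:
    raise ConnectionError("primary is down")


def backup(request: str) -> str:
    return f"routed {request} via backup"


result = call_with_failover([primary, backup], "ORD-PHL")
print(result)
```

The caller never sees the primary's failure. It only sees an error if every backend in the list is down.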
I understand that a rewrite of the 1960s-era software was attempted at some point and the air traffic controllers rejected the solution (with good reason). I think that, rather than attempting a complete replacement next time, they should try the Strangler Pattern. The failure-prone backend infrastructure could be steadily replaced without necessarily also rewriting the front end code. By maintaining different conceptual layers, the availability of the system can be improved without forcing an all-or-nothing total system rewrite numbering in the billions of dollars.
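A rough sketch of the Strangler Pattern in Python might look like the following. Everything here is hypothetical (the class names, the operation names, the string results). The point is only the shape: a facade sits in front of the old backend, and operations are moved to the new backend one at a time while callers keep using the same interface.

```python
class LegacyBackend:
    """Stand-in for the old system that is being replaced."""

    def handle(self, operation: str, payload: str) -> str:
        return f"legacy:{operation}:{payload}"


class NewBackend:
    """Stand-in for the replacement, built one operation at a time."""

    def handle(self, operation: str, payload: str) -> str:
        return f"new:{operation}:{payload}"


class StranglerFacade:
    """Routes each operation to the new backend once it has been migrated."""

    def __init__(self, legacy: LegacyBackend, replacement: NewBackend) -> None:
        self._legacy = legacy
        self._replacement = replacement
        self._migrated: set[str] = set()

    def migrate(self, operation: str) -> None:
        # Flip a single operation over to the new backend.
        self._migrated.add(operation)

    def handle(self, operation: str, payload: str) -> str:
        backend = self._replacement if operation in self._migrated else self._legacy
        return backend.handle(operation, payload)


facade = StranglerFacade(LegacyBackend(), NewBackend())
before = facade.handle("file_flight_plan", "ORD-PHL")
facade.migrate("file_flight_plan")
after = facade.handle("file_flight_plan", "ORD-PHL")
untouched = facade.handle("track_position", "UA123")
print(before, after, untouched, sep="\n")
```

Front end code only ever talks to the facade, so it does not need to change as pieces of the backend are swapped out, and a migrated operation can be sent back to the legacy system if the replacement misbehaves.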
Big-bang software projects rarely succeed. Break big problems into manageable little problems! Ask your users what will improve their experience. And last but not least, at least have a better story for the poor saps at the airport than, “Your flight has been cancelled. Would you like to go on Sunday instead?”
June 11th, 2007 at 7:55 pm
The ancient state of many critical federal systems is really appalling. Rewriting them isn’t easy, either.
One of the reasons the FBI can’t catch Osama’s cohorts is that they (the FBI) still use paper case files in many offices while al Qaeda is plotting on their own wiki, for all we know. The original contract for a new FBI system was years behind schedule and hundreds of millions of dollars over budget before they pulled the plug and gave it to an even bigger contractor to try.
By the time they finish - and I’ve already seen this on satellite ground control systems - the software and hardware will be obsolete and due for another rewrite. That’s good for federal IT contractors, but bad for everyone else.
There is some hope for air travel, though. Instead of relying on outmoded ground computers and flying inefficient routes to stay in contact with radio beacons, there are plans for planes to navigate using GPS as a primary navigation standard and become more independent of various congestion factors.
August 2nd, 2007 at 3:54 pm
The Strangler Pattern? Is that where you try not to strangle the software architect on your team who spouts off about design patterns without actually delivering any code?
For as much as some of these patterns and testing frameworks seem self-indulgent, I am starting to see their benefit for critical production systems. When they become easy enough to use on even quick, one-off projects, then I’ll really become a believer.